Engineering Scientific Serendipity

Space expands. We know this from our study of general relativity and cosmology. I think an alternate phrasing is also correct. Noting that there is more than one kind of space, one might say: spaces expand. Each space, independently of every other, has a dimension, or more accurately one or more degrees of freedom, to expand into, and these sometimes interact with other spaces' degrees of freedom. We just have to be observant of this and grow harmoniously into the expanded spaces.

To scientists of all callings, the problems they encounter may seem extremely hard before their questions are phrased correctly, while their solutions often find uses far outside the contexts in which they were invented. Who would have predicted that Einstein's general relativity would be used to make GPS work? Certainly not Einstein himself.

Our engineering skills lived up to the serendipity offered by Einstein's discovery. This is what I loved about Edward Boyden's talk on Engineering Serendipity, which I embed here. Please watch it.

The 20th century's foundational inventions and discoveries were mostly in sociology, computation, mathematics, physics, biology and chemistry. This is not an exhaustive list; astrobiology and geology belong on it as well.

So our use cases for these inventions, probably completely unprecedented from the inventors' point of view, now lie in figuring out geography, climate, energy and personal space.

A non-exhaustive, unordered list of our future's subjects of study, and my predictions for their uses:

- Geography, for living space.

- Climate, for the food, water, the cool breeze and breathing space for all life.

- Energy, to unlock the productivity space for everyone on earth. See Eric Drexler's The Antiparallel Structure of Science and Engineering, and Exploratory Engineering, for a discussion of this. He talked about this 7 years ago at the Future of Humanity Institute's inauguration of their program on the impact of future technology:

- Biology, for the freedom to expand the space for personal growth, by investing in longevity research and development.

- Art, for the expansion of the space of meaningful conversations.

COCOHub: A crowdsourced dataset builder and community for NLP in underrepresented languages

Because it is so hard to find appropriately structured datasets when learning/researching NLP (specifically machine translation and image captioning) for underrepresented languages, I decided to create a crowdsourced site that translates MS-COCO 2015 to create two kinds of dataset:

1. Machine translation for any two languages in COCOHub. The language list is open and append-only: a new language project can be added whenever someone asks for it.

2. Image captioning. This is very useful from an education and accessibility point of view. Whether or not the datasets alone turn out to be sufficient to solve the task, they are certainly necessary as a starting point.

Goals: to create a collection of open datasets offering novel language pairs for machine translation and captions for images, and to use existing open source infrastructure to support the evaluation of competitions that advance SoTA in both translation and image captioning for underrepresented languages.

The beneficiaries of this project include students and researchers first, then end users who will eventually get translators and image captioning systems working for their local languages.

1. Image captioning

The MS-COCO 2015 Image Captioning challenge published nearly one million sentences, with five sentences attached to each of around 330,000 images. Because it is an image captioning dataset, translating its captions directly gives us captions in the underrepresented languages that we select. Each language is its own project, managed by a team of verifiers and champions, feeding a data pipeline that publishes the same data structure as the original, except in each of the target languages. Each image ID has 5 sentences independently provided by professionals (see paper). At the end of it all, there will be a simple web page that lets people search for a language and download a compressed JSON file, split into training, validation and test sets, for their language of choice. This opens up image captioning as a competition-level research opportunity for many languages for which it previously wasn't available. Further, building on the linked paper, an analysis framework can be built, and every data preprocessing task required for each language will now have sample data to test with.
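For concreteness, here is a minimal sketch of what the pipeline's output step could look like, assuming the standard COCO captions annotation layout (an "images" array plus an "annotations" array keyed by caption and image IDs). The file paths and the `translate` placeholder are illustrative, not part of COCOHub's actual implementation:

```python
import json

# Minimal sketch: wrap translated captions in the same structure as the
# original COCO captions file, so existing COCO tooling keeps working.
# The `translate` function stands in for the crowdsourced translation lookup.

def translate(text: str, target_lang: str) -> str:
    """Placeholder for a crowdsourced (or machine) translation lookup."""
    raise NotImplementedError

def build_translated_split(coco_captions_path: str, target_lang: str, out_path: str) -> None:
    with open(coco_captions_path, encoding="utf-8") as f:
        coco = json.load(f)  # standard keys: "info", "images", "annotations"

    translated = {
        "info": coco.get("info", {}),
        "images": coco["images"],            # image metadata is unchanged
        "annotations": [
            {
                "id": ann["id"],             # keep the original caption ID
                "image_id": ann["image_id"], # keep the link to the image
                "caption": translate(ann["caption"], target_lang),
            }
            for ann in coco["annotations"]
        ],
    }

    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(translated, f, ensure_ascii=False)
```

Keeping the caption and image IDs intact is what lets the translated splits plug into the same evaluation tooling as the original dataset.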

2. Machine translation

There will probably be more than 50 language targets in COCOHub. The five independently sourced descriptions attached to each unique image ID will each have 5-10 translations contributed, both to mitigate the negative effects of spam and to surface interesting translation variants that the community can vote on. The dataset also assigns a unique integer ID to each sentence, which means that, given a single source (English) sentence, each integer maps to a translation in as many languages as there are volunteers to help complete the project. So the translations don't just point back to English - they point to the other languages as well. This has nice linguistic properties, as it reflects how a lot of Africans actually speak: by code-switching between languages.
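As a rough illustration of why the shared integer IDs matter, here is a sketch that pairs any two completed language projects into a parallel corpus. The per-language file layout (a JSON object mapping caption ID to translated sentence) is an assumption made for the example:

```python
import json

# Minimal sketch of how shared caption IDs turn any two completed language
# projects into a parallel corpus, without going through English at all.

def load_translations(path: str) -> dict[int, str]:
    """Load an assumed {caption_id: translated_sentence} JSON file."""
    with open(path, encoding="utf-8") as f:
        return {int(k): v for k, v in json.load(f).items()}

def build_parallel_corpus(lang_a_path: str, lang_b_path: str) -> list[tuple[str, str]]:
    a = load_translations(lang_a_path)   # e.g. Luhya translations
    b = load_translations(lang_b_path)   # e.g. Yoruba translations
    shared_ids = sorted(a.keys() & b.keys())
    return [(a[i], b[i]) for i in shared_ids]

# Any two completed projects can be paired, for example:
# pairs = build_parallel_corpus("luhya.json", "yoruba.json")
```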

COCOHub's crowdsourcing tools will support voting and (eventually) statistical verification, letting people vote on the best of many contributed translations for a single sentence. This will ensure that when a project is completed, i.e. when all the English sentences have been translated into a language and the community has spent time voting and verifying, the highest quality sentences get published and used by students and researchers who want to work in those languages.
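A hypothetical sketch of that final selection step, assuming each contributed translation is stored as a (caption ID, translation, vote count) record; the real COCOHub schema may differ:

```python
# Minimal sketch of publishing the highest-voted translation per caption.
# The (caption_id, translation, votes) records are an assumed schema.

def select_best_translations(records: list[tuple[int, str, int]]) -> dict[int, str]:
    """Keep the translation with the most votes for each caption ID."""
    best: dict[int, tuple[str, int]] = {}
    for caption_id, translation, votes in records:
        if caption_id not in best or votes > best[caption_id][1]:
            best[caption_id] = (translation, votes)
    return {cid: text for cid, (text, _) in best.items()}

# Example: caption 42 keeps "translation B", since 3 votes beats 1.
# select_best_translations([(42, "translation A", 1), (42, "translation B", 3)])
```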

A bonus of this approach is that each completed language project can be paired with another language that would otherwise never have been a translation target for it, so people can mix and match languages in very interesting ways. Fundamentally, we'll be creating the same data structure in 50+ different languages. Imagine someone deciding to translate Luhya to Yoruba as a research/study project just because the option is there to play around with.

There are various preprocessing tasks needed for the resultant technology to be practically achievable, from SentencePiece tokenization and lemmatization to web scraping of unstructured documents. Another way to find monolingual corpora is to use the Common Crawl dataset, writing language detection systems with fastText models that let us extract the relevant sentences from it (see the sketch below). This is computationally expensive, as Common Crawl holds terabytes of data, so it is important to publish the extracted data as well, to help people down the road. A further step is to use the scraped data to build fastText word vectors, which provide very useful word-level context for word use. An example of using fastText to bootstrap an English-French translation model - a form of transfer learning for machine translation - is here.
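As a sketch of that Common Crawl filtering step, the snippet below uses fastText's pretrained language-identification model (lid.176.bin, downloadable from the fastText site). The probability threshold, file paths and the example language code are assumptions for illustration:

```python
import fasttext  # pip install fasttext

# Minimal sketch: filter extracted Common Crawl text down to one language
# using fastText's pretrained language-identification model.
LID_MODEL = fasttext.load_model("lid.176.bin")

def keep_language(line: str, lang: str = "sw", min_prob: float = 0.8) -> bool:
    """Return True if the line is predicted to be in `lang` (a fastText label code, e.g. "sw")."""
    labels, probs = LID_MODEL.predict(line.strip().replace("\n", " "), k=1)
    return labels[0] == f"__label__{lang}" and probs[0] >= min_prob

def extract_monolingual(in_path: str, out_path: str, lang: str) -> None:
    """Write only the lines predicted to be in `lang` to a new corpus file."""
    with open(in_path, encoding="utf-8", errors="ignore") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            if line.strip() and keep_language(line, lang):
                dst.write(line)
```

The resulting monolingual corpus can then feed SentencePiece training and fastText word vectors (e.g. fasttext.train_unsupervised on the output file) for the downstream translation work.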


The original MS-COCO captions dataset is released under a Creative Commons Attribution 4.0 license, which gives leeway to make the resultant derivative datasets open by default. This means that there will be a need for a team of verifiers for each language pair, probably professional linguists. University faculty can plug in nicely to this part of the project.