Field of Science

Magic Singlish

A number of non-native English speakers get "Singaporean" as the top guess for their native language. You can actually see that by playing around in our dialect navigator. Here's a screenshot of a particularly illuminating view:

As you can see, "Singaporean" is connected to a big bundle of non-native dialects. Most of the other native dialects are off in a chain in the bottom right. Here is another view with a slightly weaker filter on connectedness:

Again, you can see that most of the non-native dialects cluster together. Most of the native dialects do not connect directly to that cluster but rather connect to Singaporean. Again, you can see Standard American and AAVE off in their own cluster.

Of course, this view just tells you what is connected to what. It's possible that Swedish is actually more similar to Irish than to Singaporean, even though the chain of connections is farther for Swedish and Irish. If you click on one of the dialects, the panel on the left will show you how closely related that dialect is to all others:
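To make the difference between "connected" and "similar" concrete, here is a toy sketch in Python. The similarity scores, the 0.5 cutoff, and the helper names `build_graph` and `hops` are all invented for illustration; they are not taken from the navigator or its data. The sketch draws a link only when two dialects pass the cutoff, then counts links along the shortest chain.

```python
from collections import deque

# Hypothetical similarity scores (0 to 1), invented purely for illustration.
SIMILARITY = {
    frozenset({"Swedish", "German"}): 0.8,
    frozenset({"German", "Singaporean"}): 0.7,
    frozenset({"Singaporean", "Australian"}): 0.6,
    frozenset({"Australian", "Irish"}): 0.9,
    frozenset({"Swedish", "Singaporean"}): 0.3,
    frozenset({"Swedish", "Irish"}): 0.4,
}


def build_graph(similarity, threshold):
    """Keep a link between two dialects only if their score passes the filter."""
    graph = {}
    for pair, score in similarity.items():
        a, b = sorted(pair)
        graph.setdefault(a, set())
        graph.setdefault(b, set())
        if score >= threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def hops(graph, start, goal):
    """Breadth-first search: number of links on the shortest chain, or None."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


graph = build_graph(SIMILARITY, threshold=0.5)
print(hops(graph, "Swedish", "Singaporean"))  # chain length to Singaporean
print(hops(graph, "Swedish", "Irish"))        # chain length to Irish
print(SIMILARITY[frozenset({"Swedish", "Irish"})]
      > SIMILARITY[frozenset({"Swedish", "Singaporean"})])
```

In this made-up example the chain from Swedish to Irish is longer than the chain from Swedish to Singaporean, even though the direct Swedish-Irish score is the higher of the two. Neither direct score passes the filter, so neither shows up as a link.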

We're working on a browser that will let you see *why* different dialects are more or less related -- that is, what answers in the quiz are typical of which dialects. I'm hoping it will be ready soon. In the meantime, enjoy the dialect browser.

Updated results on the relationship between English dialects

I've updated the interactive visualization of the relationships between the Englishes of the world to include a couple dozen additional native languages. Check it out.

Forums find GamesWithWords

A number of forums have picked up the WhichEnglish quiz and have produced some really intelligent and insightful conversation. I recommend in particular this conversation on MetaFilter. There is also an extensive conversation at Hacker News and a somewhat older discussion at Reddit. And there is a lot of discussion in Finnish and Hungarian, but I have no idea what they are saying...

Handling viral traffic

Around 4am EST on May 28, we started getting *a lot* of traffic to the website. This very quickly overloaded the server, resulting in the website running very slowly. We did some optimization. Things sped up, and our reward was more traffic. So we switched to a more powerful server. And so on.

Things are finally under control. At least for the moment, anyway. You can see that we've managed to get the average page load time down to a reasonable length of time for the last day or so, without any large spikes:

Of course, an overwhelming amount of traffic is a good problem to have, and I won't complain if things overheat again.

Which English: The Science, Part 1

I've gotten a number of questions about the science behind our WhichEnglish quiz. Actually, I had intended to post a more detailed discussion days ago, but I got distracted by other matters.

In this project, we are looking at three interrelated issues:

1. How does the age at which you start learning a language affect how well you learn that language?
2. How is learning a foreign language affected by the language you already know?
3. How are the grammars of different English dialects related?

And of course, we train an algorithm to predict participants' native language and dialect of English based on their answers. I return to that at the end.

Age of Acquisition

Although WhichEnglish has a few scientific targets, age-of-acquisition effects were the original inspiration. Everybody knows that the older you are when you start learning a foreign language, the harder it is to learn. One possibility is that there is a critical period: Up to some age, you can learn a language like a native. After that age, you will never learn it perfectly. The other possibility is that there is no specific period for language learning; rather, language-learning simply gets a little harder every day.
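The two possibilities predict differently shaped curves, which a toy sketch in Python can make concrete. Everything here is an assumption made for illustration: the function names, the cutoff age of 10, the later plateau of 0.8, and the loss of 0.01 per year are invented, and real data need not follow either simple shape.

```python
def critical_period(age, cutoff=10.0, later_level=0.8):
    """Native-like attainment (1.0) up to the cutoff age, a lower plateau after."""
    return 1.0 if age <= cutoff else later_level


def steady_decline(age, loss_per_year=0.01):
    """Attainment drops by the same small amount for every year of delay."""
    return max(0.0, 1.0 - loss_per_year * age)


# Compare the two hypothetical curves at a few starting ages.
for age in (0, 5, 10, 15, 20):
    print(age, critical_period(age), round(steady_decline(age), 2))
```

Under the critical-period sketch, starting at 5 and starting at 10 give the same outcome, and the drop comes all at once. Under the steady-decline sketch, every extra year costs a little, with no special age.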

The evidence is unclear. Ideally, you would compare people who started learning some language (say, English) from birth with people who started as 1-year-olds and people who started as 2-year-olds, etc. Or maybe you would want something even finer-grained. The problem is that you need a decent number of people at each age (50 would be a good number), and it quickly becomes infeasible.

One study that came close to this ideal used census data. The authors -- led by Kenji Hakuta -- realized that the US census asks foreign-born residents to rate their own English ability. The authors compared this measure of English ability with the year of immigration (an approximation for the age at which the person started learning English). Their results showed a steady decline, rather than a critical period.

We are trying to build on this work in a few ways. For one, it would be nice to confirm (or disconfirm) the previous results with a more sensitive measure of English ability. So rather than just ask people how good their English is, we have them take a test. Also, we are getting more precise information about when the participant started learning English and in what contexts.

Also, there is good reason to suspect that age of acquisition affects different aspects of language differently. Studies have shown that even people who began learning a language as toddlers have detectable -- if very subtle -- accents. However, people who start learning foreign languages as adults usually report that learning vocabulary isn't so hard. Grammar seems to be somewhere in between. The Hakuta study didn't distinguish these different aspects of language.

WhichEnglish focuses on grammar. We also have a vocabulary quiz to look at vocabulary. A pronunciation test is in the works.

First language effects

When we started thinking about studying age-of-acquisition effects, we quickly realized a problem. We needed questions that would be difficult for someone who learned English as a second language. But which aspects of grammar are difficult seems to depend on your first language. I personally have difficulty with aspect in Russian because the English aspect system is much less complex. However, dealing with tense in Russian is relatively straightforward, since the Russian tense system is much less complex than English's.

Since we didn't know for sure what the language backgrounds of our participants would be, we wanted a range of questions that covered the different kinds of problems people with different backgrounds might have.

As we combed the literature, we realized that it was pretty fragmented. One study might say that grammar rule x is difficult for Japanese-speakers and grammar rule y is difficult for German-speakers, but there would be no information on how Japanese-speakers fare with grammar rule y and how German-speakers manage with grammar rule x. This actually makes sense: Most studies look at speakers of one or at most a handful of language backgrounds. This is partly a matter of research interest (the researchers are usually interested in some particular language) and partly a matter of feasibility (in a lab setting, you can only test so many participants). We realized that our study, by virtue of being on the Internet and recruiting people from a wide array of backgrounds, would provide an opportunity to get more systematic data across a large number of languages.

This is pretty exploratory. We don't have strong hypotheses. But as data comes in, we will be analyzing it to see what we get, and we will report it here.

The Grammars of English

In designing our age-of-acquisition study, we realized a second problem. Correct English grammar varies across different dialects. In Newfoundland, you can say "Throw me down the stairs the hammer," but in most places, you can't. (I have heard that this is said in parts of Rhode Island, too, but only anecdotally.) We don't want to count a late-learner of English who says "Throw me down the stairs the hammer" as not knowing English if in fact she lives in Newfoundland!

So what we really wanted were questions for which the correct answer is the same in all English dialects. But we didn't know what those were. Again, the literature was only partly helpful here. For obvious reasons, researchers tend to be interested in understanding peculiar constructions specific to certain dialects, rather than recording what is the same everywhere (boring).

We picked out a lot of grammar rules that we at least had no reason to believe varied across dialect. But we also realized that there was an opportunity here to study differences across dialects. So we included a subset of items that we thought probably would be different across dialects so that we can explore relationships across dialects.

The algorithm

When you take the quiz, at the end we give you our best guess as to what your native language is and what dialect of English you speak. How is that related to the three issues I just discussed?

It's deeply related. The best way of proving that you understand how people's understanding of grammar is affected by the age at which they started learning, their first language (if any), and the dialect of English they speak, is to show that you can actually distinguish people based on their grammar. In fact, training an algorithm to make just that distinction is a common way of analyzing and exploring data.

There are also obvious practical applications for an algorithm that can guess someone's language background based on their grammar (for education, localization of websites, and so on).

But an important reason we included the algorithm's predictions in the quiz itself was to present the results of the study to participants in the study as the study goes on. Certainly, you can read this and other blog posts I've written about the project as well. But it probably took you as long to read this post as to do the quiz. The algorithm and its predictions boil down the essence of the study in a compelling way. Based on the (numerous) emails I have gotten, it has inspired a lot of people to think more about language. Which is great. The best Web-based studies are a two-way street, where the participants get something out of the experience, too.

We chose the particular algorithm we use because it runs quickly and could be trained on very little data. You can read more about it by clicking on "how it works" in our data visualization. We are testing out more sophisticated algorithms as well, which are likely to do much better. Developing algorithms for detecting underlying patterns is actually a specialty of my laboratory, and this will be a fantastic dataset to work with. These algorithms mostly run too slowly to use as part of the quiz (nobody wants to wait 10 minutes for their results), but the plan is to describe those results in future posts and/or in future data visualizations.
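For readers who want a feel for how guessing a dialect from quiz answers can work at all, here is a minimal sketch in Python. This post does not say which algorithm the quiz uses, so the sketch substitutes a naive Bayes classifier as a stand-in, simply because it is fast and can be trained on very little data. The three quiz items, the training answers, and the function names `train` and `predict` are all invented for illustration.

```python
import math
from collections import Counter

# Made-up training data: (dialect, answers), where each answer is 1 if the
# participant accepted a hypothetical quiz sentence and 0 if they rejected it.
TRAINING = [
    ("Newfoundland", (1, 1, 0)),
    ("Newfoundland", (1, 1, 1)),
    ("Standard American", (0, 1, 0)),
    ("Standard American", (0, 1, 1)),
    ("Singaporean", (0, 0, 1)),
    ("Singaporean", (0, 0, 0)),
]


def train(examples):
    """Count, per label, how often each quiz item was accepted."""
    label_counts = Counter()
    accept_counts = {}
    for label, answers in examples:
        label_counts[label] += 1
        counts = accept_counts.setdefault(label, [0] * len(answers))
        for i, answer in enumerate(answers):
            counts[i] += answer
    return label_counts, accept_counts


def predict(model, answers):
    """Return the label with the highest log-probability for these answers."""
    label_counts, accept_counts = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        score = math.log(n / total)  # prior: how common the label is
        for i, answer in enumerate(answers):
            # Laplace smoothing keeps each probability strictly between 0 and 1.
            p_accept = (accept_counts[label][i] + 1) / (n + 2)
            score += math.log(p_accept if answer else 1 - p_accept)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(TRAINING)
print(predict(model, (1, 1, 0)))
print(predict(model, (0, 0, 1)))
```

The classifier treats each answer as independent evidence about the participant's background and picks whichever label makes the full set of answers most probable. A real system would need far more items and participants than this toy example has.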

In conclusion

If you have any questions about this work, please ask in the comments below or shoot me an email at