Field of Science

Which English: The Science, Part 1

I've gotten a number of questions about the science behind our WhichEnglish quiz. Actually, I had intended to post a more detailed discussion days ago, but I got distracted by other matters.

In this project, we are looking at three interrelated issues:

1. How does the age at which you start learning a language affect how well you learn that language?
2. How is learning a foreign language affected by the language you already know?
3. How are the grammars of different English dialects related?

And of course, we train an algorithm to predict participants' native language and dialect of English based on their answers. I return to that at the end.

Age of Acquisition

Although WhichEnglish has a few scientific targets, age-of-acquisition effects were the original inspiration. Everybody knows that the older you are when you start learning a foreign language, the harder it is to learn. One possibility is that there is a critical period: Up to some age, you can learn a language like a native. After that age, you will never learn it perfectly. The other possibility is that there is no specific period for language learning; rather, language-learning simply gets a little harder every day.

The evidence is unclear. Ideally, you would compare people who started learning some language (say, English) from birth with people who started as 1-year-olds and people who started as 2-year-olds, etc. Or maybe you would want something even finer-grained. The problem is that you need a decent number of people at each age (50 would be a good number), and it quickly becomes infeasible.

One study that came close to this ideal used census data. The authors -- led by Kenji Hakuta -- realized that the US census asks foreign-born residents to rate their own English ability. The authors compared this measure of English ability with the year of immigration (an approximation for the age at which the person started learning English). Their results showed a steady decline, rather than a critical period.

We are trying to build on this work in a few ways. For one, it would be nice to confirm (or disconfirm) the previous results with a more sensitive measure of English ability. So rather than just ask people how good their English is, we have them take a test. Also, we are getting more precise information about when the participant started learning English and in what contexts.

Also, there is good reason to suspect that age-of-acquisition affects different aspects of language differently. Studies have shown that even people who began learning a language as toddlers have detectable -- if very subtle -- accents. However, people who start learning foreign languages as adults usually report that learning vocabulary isn't so hard. Grammar seems to be somewhere in between. The Hakuta study didn't distinguish these different aspects of language.

WhichEnglish focuses on grammar. We also have a vocabulary quiz to look at vocabulary. A pronunciation test is in the works.

First language effects

When we started thinking about studying age-of-acquisition effects, we quickly realized a problem. We needed questions that would be difficult for someone who learned English as a second language. But which aspects of grammar are difficult seems to depend on your first language. I personally have difficulty with aspect in Russian because the English aspect system is much less complex. However, dealing with tense in Russian is relatively straightforward, since the Russian tense system is much less complex than English's.

Since we didn't know for sure what the language backgrounds of our participants would be, we wanted a range of questions that covered the different kinds of problems people with different backgrounds might have.

As we combed the literature, we realized that it was pretty fragmented. One study might say that grammar rule x is difficult for Japanese-speakers and grammar rule y is difficult for German-speakers, but there would be no information on how Japanese-speakers fare with grammar rule y and how German-speakers manage with grammar rule x. This actually makes sense: Most studies look at speakers of one or at most a handful of language backgrounds. This is partly a matter of research interest (the researchers are usually interested in some particular language) and partly a matter of feasibility (in a lab setting, you can only test so many participants). We realized that our study, by virtue of being on the Internet and recruiting people from a wide array of backgrounds, would provide an opportunity to get more systematic data across a large number of languages.

This is pretty exploratory. We don't have strong hypotheses. But as data comes in, we will analyze it to see what we get, and we will report it here.

The Grammars of English

In designing our age-of-acquisition study, we realized a second problem. Correct English grammar varies across different dialects. In Newfoundland, you can say "Throw me down the stairs the hammer," but in most places, you can't. (I have heard that this is said in parts of Rhode Island, too, but only anecdotally.) We don't want to count a late-learner of English who says "Throw me down the stairs the hammer" as not knowing English if in fact she lives in Newfoundland!

So what we really wanted were questions for which the correct answer is the same in all English dialects. But we didn't know what those were. Again, the literature was only partly helpful here. For obvious reasons, researchers tend to be interested in understanding peculiar constructions specific to certain dialects, rather than recording what is the same everywhere (boring).

We picked out a lot of grammar rules that we at least had no reason to believe varied across dialect. But we also realized that there was an opportunity here to study differences across dialects. So we included a subset of items that we thought probably would be different across dialects so that we can explore relationships across dialects.

The algorithm

When you take the quiz, at the end we give you our best guess as to what your native language is and what dialect of English you speak. How is that related to the three issues I just discussed?

It's deeply related. The best way of proving that you understand how people's understanding of grammar is affected by the age at which they started learning, their first language (if any), and the dialect of English they speak, is to show that you can actually distinguish people based on their grammar. In fact, training an algorithm to make just that distinction is a common way of analyzing and exploring data.

There are also obvious practical applications for an algorithm that can guess someone's language background based on their grammar (for education, localization of websites, and so on).

But an important reason we included the algorithm's predictions in the quiz itself was to present the results of the study to participants in the study as the study goes on. Certainly, you can read this and other blog posts I've written about the project as well. But it probably took you as long to read this post as to do the quiz. The algorithm and its predictions boil down the essence of the study in a compelling way. Based on the (numerous) emails I have gotten, it has inspired a lot of people to think more about language. Which is great. The best Web-based studies are a two-way street, where the participants get something out of the experience, too.

We chose the particular algorithm we use because it runs quickly and could be trained on very little data. You can read more about it by clicking on "how it works" in our data visualization. We are testing out more sophisticated algorithms as well, which are likely to do much better. Algorithms for detecting underlying patterns are actually a specialty of my laboratory, and this will be a fantastic dataset to work with. These algorithms mostly run too slowly to use as part of the quiz (nobody wants to wait 10 minutes for their results), but the plan is to describe those results in future posts and/or in future data visualizations.

In conclusion

If you have any questions about this work, please ask in the comments below or shoot me an email at gameswithwords@gmail.com.

34 comments:

Peggy said...

How are you balancing age of acquisition with amount of time (eg. A 20 year old who began learning a lg at age 8 vs a 40 year old who began at age 28 vs a 40 year old who began at age 8)?

GamesWithWords said...

Excellent question. In theory, you'd like to be able to isolate the age-of-acquisition variable, controlling for current age and years of learning. Of course, you can't do that!

But these data allow us to do the next best thing. Over 200,000 people have done the quiz so far. This means that we should be able to compare people of the same age who started learning at different ages and also compare people who started learning at different ages but who have been learning for the same amount of time. That way, we can get a sense of the relative contribution of number of years of study and the age at which you started.
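A toy sketch of that kind of comparison, in Python. The records, the field order, and the scores below are invented purely for illustration and are not data from the quiz; the actual analysis may look quite different.

```python
from collections import defaultdict
from statistics import mean

# Invented example records: (current_age, start_age, quiz_score).
# These numbers are made up for illustration only.
records = [
    (40, 8, 0.95),
    (40, 28, 0.80),
    (20, 8, 0.90),
    (30, 18, 0.85),
    (40, 8, 0.93),
]


def mean_score_by(rows, key):
    """Group rows by key(row) and return the mean score of each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row[2])
    return {k: mean(v) for k, v in groups.items()}


# Same current age, different starting ages: key is (current_age, start_age).
same_age = mean_score_by(records, key=lambda r: (r[0], r[1]))

# Same number of years of learning, different starting ages:
# key is (years_of_learning, start_age).
same_years = mean_score_by(records, key=lambda r: (r[0] - r[1], r[1]))
```

Holding the first element of each key fixed and reading across the second gives the two comparisons described above: `same_age` varies starting age among people of one current age, and `same_years` varies starting age among people with the same number of years of learning.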

Also, there's probably a plateau. At first, you learn rapidly, but at a certain point, you aren't learning that much anymore. From our data, we'll be able to figure that out. So let's say in your example that the 40 year old who started at 8 and the 40 year old who started at 28 have both reached peak. Then it doesn't really matter that the first person has had longer to learn -- neither of them is getting better at this point, anyway.

BTW For these analyses 200,000 is still not that much. I'd like to get at least 200,000 more. There are still a lot of people taking the quiz, so this seems doable.

Helmer Aslaksen said...

I'm Norwegian, but I've lived 22 years in Singapore and 8 years in the US. You guessed

Our top three guesses for your English dialect:
1. Singaporean
2. American (Standard)
3. New Zealand
Our top three guesses for your native (first) language:
1. German
2. Norwegian
3. English

Which sentences marked me as Singaporean or Norwegian?

Thanks!

Anonymous said...

Will you - at one point - release the data set? That would be awesome for further studies!

Anonymous said...

typo: "but there would be no information on how Japanese-speakers fair with grammar rule y" fair -> fare

GamesWithWords said...

Yes, we definitely plan on releasing an anonymized version of the dataset.

(Yes, I know that we don't ask for names. But sometimes people identify themselves in the comments, etc. So we'll need to do things like remove the comments and maybe also dates and time in order to make sure that individuals can't be identified.)

Michael Collins said...

I'm curious - you're clearly aware that Newfoundland English is a distinct dialect family, but it's not included in your results. Why is this? The island's population is only 500,000, but there's a huge Newfoundland diaspora who would also have some of the dialect quirks. I realize it's probably not worthwhile to include a lot of small dialects (I wouldn't expect you to include the English spoken by Pitcairn Islanders, for example), but it's kind of frustrating to see Newfoundland English used as an example but then not included as a result.

GamesWithWords said...

@Michael -- The algorithm can only make guesses when it has enough experience with people who speak that dialect. We update the algorithm as often as possible, though right now most of our time is taken up just keeping the web server from crashing and responding to emails (I currently get 20-30/hr).

GamesWithWords said...

@Michael -- to be more clear: Last time we updated, we didn't have enough data on Newfoundland English. There are a few other regional dialects we'd like to include as well (I get a lot of requests to add Cornish).

Our latest update (from last night) was to include around a dozen new foreign languages. Currently, I am working on updating our visualization. Adding new regional dialects may come next, but it depends largely on the volume of email and how well the server holds up!

Michael Collins said...

Thank you so much for the quick and informative response! I'll watch how your project develops. I work on Newfoundland literature and language from a cultural studies / humanities perspective, so I have something of a vested interest in seeing Newfoundland English (or Englishes, more correctly - the example you use in your post would only make sense in parts of the island) represented / seeing how it relates to other dialects. Good luck and thanks again! :)

Jim Campbell said...

Glad I found my way through to this blog otherwise your promise of "We will explain the purpose of the experiment at the end of the experiment, along with any potential implications of the research." would have gone unfulfilled.

GamesWithWords said...

So...the 'done' page for that study was getting really cluttered, and when things are too cluttered, people don't read anything.

About 25% go on to the 'findings' page or the 'visualization' page, which seems pretty good, especially given that over 33% just return to the first page again. I'm not sure how many people click on the "algorithm" button or the two help buttons.

But this is still a work in progress.

Ben said...

There was some recent research that using different measures than L2 (the Euclidean distance) gave better clustering for high dimensional spaces.

Specifically, using *fractional* exponents, so e.g.
(|x0 - y0|^N + |x1 - y1|^N + ...)^(1/N) gave better results for fractional N, with higher dimensions leading to lower fractions. I can't remember the numbers, but you might want to try using 0.6 or 0.7...
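
A minimal Python sketch of that distance, for anyone who wants to try it. The function name and the sample vectors are made up for illustration; nothing here is taken from the quiz's actual code. Note that the absolute values matter once N is fractional, and that for N below 1 the result is no longer a true metric (the triangle inequality can fail), though it can still be used to rank neighbors.

```python
def minkowski_distance(x, y, p):
    """Minkowski-style distance between two equal-length vectors.

    p = 2 gives the Euclidean (L2) distance and p = 1 the Manhattan
    distance. Fractional values such as p = 0.6 are allowed, but then
    the result is not a true metric.
    """
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    if p <= 0:
        raise ValueError("p must be positive")
    # Absolute differences keep the fractional power well defined.
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


# Illustrative use with invented vectors.
d_euclidean = minkowski_distance([0, 0], [3, 4], 2)
d_manhattan = minkowski_distance([0, 0], [3, 4], 1)
d_fractional = minkowski_distance([0, 0], [1, 1], 0.5)
```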

Laura Hale said...

Every single guess for me was pretty much wrong.

1. New Zealand
2. US Black Vernacular / Ebonics
3. Welsh (UK)

Prior to turning about 22, my instruction was completely in English in the United States with some high school Spanish classes. I spent three months interning at a British school in Slovakia. When I turned 30, I moved to Australia and started a PhD program there. I've spent about 2 months total in New Zealand, and a weekend in Wales. I'm at a complete loss because three different continents, three variants and none that I would describe myself as being a language user of.

The other one was just as "Huh."

1. Norwegian
2. English
3. Dutch

I've never learned either Norwegian or Dutch. English is my native language.

I'm wondering if education possibly plays a role in this (I thought of Gwendolyn Brooks poetry when answering some) or moving from one English speaking dialect area to another at some stage. (That wasn't asked about.)

GamesWithWords said...

Your question about education is a good point. Education almost certainly plays a role, but we haven't yet incorporated it into the predictions the algorithm makes.

As said in the intro to the quiz, we're using the data we collect to train the algorithm, and with time, it does better. But it's still far from perfect. If you filled out the post-quiz questionnaire about your background, we'll be using that to improve the algorithm.

Probably no quiz that takes 5-10 minutes is ever going to be accurate 100% of the time. Part of the research question is seeing how close we can get!

qvd said...

My results are definitely awesome: dialect - English,Australian,US; First language - Polish,Turkish,Italian. That makes tiny bit of a weird mix. Can you explain how can the algorithm go for 3 most common dialects and at the same time give 3 totally not connected native language answers?

GamesWithWords said...

@qvd When the algorithm tries to guess your dialect, it only considers "native" dialects, like Australian, Canadian, etc. When it considers native language, it considers those dialects *as well as* the "dialects" spoken by people whose native language is something else.

If your answers are closest to a native dialect, it guesses that your native language is English. If your answers are closest to, say, the typical response of an Italian speaker, it guesses your native language is Italian.
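
The two-step logic above can be sketched as a nearest-prototype lookup. This is only an illustration under stated assumptions: the group profiles, the three-item answer vectors, the Euclidean distance, and the function names are all hypothetical, and the quiz's real algorithm may differ.

```python
import math

# Hypothetical "typical response" profiles: for each group, the share of
# speakers who accept each of three made-up quiz sentences.
# Keys are (native_language, native_dialect); dialect is None for groups
# whose native language is not English.
prototypes = {
    ("English", "Australian"): [0.9, 0.1, 0.8],
    ("English", "Canadian"): [0.9, 0.6, 0.2],
    ("Italian", None): [0.4, 0.5, 0.5],
}


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_groups(answers, candidates):
    """Return group keys ordered from closest to farthest profile."""
    return sorted(candidates, key=lambda g: euclidean(answers, candidates[g]))


def guess(answers):
    """Return (ranked dialects, ranked native languages) for one participant."""
    # Dialect guess: only native English dialects are candidates.
    native = {g: v for g, v in prototypes.items() if g[1] is not None}
    dialects = [g[1] for g in rank_groups(answers, native)]
    # Language guess: every group competes; a native dialect maps to English.
    languages = []
    for language, _dialect in rank_groups(answers, prototypes):
        if language not in languages:
            languages.append(language)
    return dialects, languages


# A participant whose answers sit closest to the hypothetical Italian profile
# still gets native dialects in the dialect list.
dialects, languages = guess([0, 1, 0])
```

Because the dialect list is drawn only from native dialects while the language list is drawn from every group, the two rankings can look unrelated to each other, which is the pattern asked about above.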

Check out our updated interactive visualization. As you can see, in most cases, the typical grammars of non-native English speakers are more closely related to one another than to any specific dialect of English.

Was that explanation clear?

Anonymous said...

What's the point with the passive constructions and the pictures in the beginning? How can giving the wrong answer to those possibly say anything else than weak proficiency in English? Why did you add those? Also, I understand you will improve the algorithm based on analysis of the results. Will you improve the input questions as well? Right now, they don't seem very sophisticated at all...

Pam B said...

Hi, just letting you know there's a grammatical error (at least to my Australian English grammar!) in the third sentence on your 'findings' page - the word are is missing.
Interesting quiz, though. My results were:
1. New Zealandish (as an Australian I'd call it Kiwi, but we tend to like shorter slang terms)
2. Australian
3. Welsh
The first two I understand; the Welsh is an interesting guess. I'd love to know a bit more about the particularities of each dialect you've examined.

Anonymous said...

Hi, Pam. I'm Aussie, too, and my results were the same as yours: NZ, Aus & Welsh. I wonder how we were flagged as Kiwi first. (Disclaimer, I've lived in NZ for the past 12 years, but I'm sure my language is still the fully-sick, choice, eh bro, version I brought with me.) Interesting...

Bob Hale said...

I took the quiz, as did many of my friends (some UK, some US-all native English speakers). Most, though not all, of us got Hungarian suggested as the second possibility for our native language (after English). I was wondering which feature of our dialects is flagging Hungarian as none of us speak any Hungarian and, on the face of it, the languages have little similarity.

Anonymous said...

Wait. How does a person from USA or Britain say "Throw me down the stairs the hammer"?

Bob Hale said...

Well in my dialect it's the word order - "Throw me the hammer down the stairs." or "Throw the hammer down the stairs to me."

Timo M. Partanen said...

Quite lot of us North Europeans seem to get Singaporean dialect as the top guess. See e.g. Helmer's post above. (Although he has lived long in Singapore.)

I find this bit surprising as in general there shouldn't be particularly strong connection between Scandinavia and South-East Asia.

So I am wondering what might be a reason for this? Isn't Singaporean dialect some kind of mixture of British and American dialects?

We in Scandinavia are non-native English-speakers but still quite fluent in it and we also get lot of influence from both sides of Atlantic. Thus I would guess we speak neither pure standard American English nor standard British English but we tend to mix features from both of those dialects. Thus could it be that we get Singaporean English as a proxy for "mixed British and American English"?

GamesWithWords said...

@Timo - I've just put up a new post that goes part of the way towards explaining what's going on. As discussed in the blog post, there's still a lot of analysis to be done before we fully understand what is going on.

AlanL said...

I just took the test and was correctly recognised as a native speaker of (English) English. Well done.

One point occurred to me: I have lived abroad for ten years and spent a lot of time with native speakers of other English dialects. Also, like most English speakers these days, I am exposed to a lot of material in American.

Consequently, on the "which sentences are grammatical" questions I clicked ok on quite a few things that I was pretty sure would be fine in (e.g.) Irish or some American dialects, even though I personally would never say them. Perhaps you need to tighten up the phrasing of the question?

Anonymous said...

Thanks for a fascinating study. A couple of questions:

1) It would be interesting to see the effects of other languages learned. I'm a native speaker of American English, but have studied German (and learned a lot about English grammar in the process.) "Who whom kissed" is a construction I'd never use (and would take some time to parse if someone said it to me), but if I see it written, I recognize the grammatical logic.

2) Any differences between how we understand/process written versus spoken language?

3) A lot of the constructions in the survey/game seemed to be missing key helping verbs or prepositions that would make meaning clear. Have you seen any indication that people supply them, even if they are missing from the text of the examples?

GamesWithWords said...

@AlanL - I don't know. That sounds like an effect of your experience on your understanding of grammar. Seems like something we actually want to pick up on in the quiz, right? If you hadn't been exposed to those dialects, you'd find those sentences bad.

@Sam -- There are definitely differences between written and spoken language. It's harder to study spoken language, though, and not just on the Internet. It's much harder to analyze the results. So we know a lot more about written communication.

Hsu said...

I just want to share an interesting finding..

My first try:
Our top three guesses for your English dialect:
1. Singaporean
2. US Black Vernacular / Ebonics
3. Welsh (UK)
Our top three guesses for your native (first) language:
1. Russian
2. Finnish
3. Spanish -

I speak Mandarin and my first teacher, at 10, was Canadian (for about 2 years). Then, turning 13, I became very interested in American hip hop, and the passion lasted for 3 or 4 years. After going to university at 19 I studied English literature and Spanish language and literature for 4 years. I learned a bit of Russian but was never able to pick it up. I, however, had a very close friend whose native language is Bulgarian and Russian, and she's been living in Finland for 7 years or so and speaks Finnish. We communicate in English and it seems like she might have affected my English grammar somehow, assuming the test is accurate, even though I have lived in Taiwan all my life. I am now 24 and I met her only 4 years ago.

Out of curiosity I retook the test for 3 times; the results were largely different. I admit that each time the answers were a bit different because I was not very sure on a few options or because I realised I overlooked a few grammar mistakes. I also left some questions blank, something I did not do the first time. Not sure how that might affect the results.

On a final note, I am glad that you reminded me about the declining learning curve. I really need to step up the game and go for the languages I want to learn before it's too late! Thank you for finishing the long paragraphs and I wish you success on your research.

Ioan ap Iago said...

A note about your use of English in the quiz. The United Kingdom is made up of four countries, NOT four regions. I am Welsh and if asked which country I come from/live in, I will always answer Wales. I will never answer the United Kingdom. I compromised this principle when doing the quiz, but under protest.

Anonymous said...

I see that other people have gotten "Singaporean" as a result for English dialect. However, I'm native Singaporean and didn't get that result... Although it did guess that English is my first language.

My results:
Our top three guesses for your English dialect:
1. Welsh (UK)
2. English (England)
3. South African

Our top three guesses for your native (first) language:
1. English
2. Norwegian
3. Swedish

Devin Denis said...

"Yes, we definitely plan on releasing an anonymized version of the dataset. "

Any update on this?

Anonymous said...

I took the quiz on an Android phone and was not able to get past the question about what countries I have lived in. No matter if I picked the countries from the menu or typed their names, they wouldn't stick. I also tried to enter only one country, but it still wouldn't accept it, so I never managed to get to the next page.

GamesWithWords said...

@Devin - The paper is still under review. Once it has been published, we'll be able to release the data.