Field of Science


Nature, Nurture, and Bayes

I generally have very little good to say about the grant application process, but it does force me to catch up on my reading. I just finished several papers by Amy Perfors, who I think does some of the more interesting computational models of language out there.*

A strange sociological fact about language research is that people generally come in two camps: a) those who don't (really) believe language is properly characterized by hierarchical phrase structure and also don't believe in much innate structure but do believe in powerful innate learning mechanisms, and b) those who believe language is properly characterized by *innate* hierarchical phrase structure and who don't put much emphasis on learning mechanisms. But there's no logically necessary connection between being a Nativist and believing in hierarchical phrase structure or being an Empiricist and believing in relatively simple syntactic forms. In the last few years, Perfors has been staking out some of that (largely) unclaimed territory where hierarchical phrase structure and Empiricism meet.

In "The learnability of abstract syntactic principles," she and her colleagues consider the claim by (some) Nativists that children must have an innate expectation that language be something like a hierarchical context-free grammar because there isn't enough data in the input to rule out alternative grammars. (Empiricists often buck the whole question by saying language is no such thing.) Perfors et al. show that, in fact, with some relatively simple assumptions and a powerful (Bayesian) learning device, the learner would conclude that the most likely representation of English is a hierarchical context-free grammar, based on relatively little input (reproducing what happened in linguistics, where linguists came to the same conclusion). You do have to assume that children have the innate capacity to represent such grammars, but you don't need to assume that they prefer such grammars.

"Joint acquisition of word order and word reference" presents some interesting data bearing on a number of questions, but following the theme above, she notes that her model does not require very much data to conclude that the typical word-order in English is subject-verb-object. She and her colleagues note: "The fact that word order can be acquired quickly from so [little data] despite the lack of bias [for a particular word order] may suggest no need to hypothesize that children are born with strong innate constraints on word ordering to explain their rapid acquisition."

I'm sympathetic to all these points, and I think they bring an important perspective to the question of language learning (one that is not, I should say, unique to Perfors, but certainly a minority perspective). What I can't help wondering is this: she (and others) show that you could learn the structure of language based on the input without (certain) innate assumptions that the input will be of a particular sort. Fine. But why is the input of that particular sort across (most? all?) languages? One thing the Nativist positions Perfors argues against have going for them is that they give a (more or less) principled explanation. Empiricists (typically) do not. (I am aware that some try to give explanations in terms of optimal information structure. What I have seen of this work has not struck me as overwhelmingly convincing, but I admit I haven't read enough of it and that I am willing to be convinced, though my prior on this line of argumentation is fairly low).


*My quasi-journalistic training always makes me want to disclose when I know personally the people I am writing about. But psycholinguistics is a small world. It would be safe for the reader to assume that I know *all* of the people I write about to one degree or another.

*********
Perfors, A., Tenenbaum, J. B., & Regier, T. (2010). The learnability of abstract syntactic principles. Cognition. PMID: 21186021

Maurits, L., Perfors, A., & Navarro, D. (2009). Joint acquisition of word order and word reference. Proceedings of the 31st Annual Conference of the Cognitive Science Society, 1728-1733.

Cognitive Science, March 2010

In my continuing series on the past year in Cognitive Science: March, 2010.

Once again, the discussion of some of these papers will be technical.

March


Baroni, Murphy, Barbu, Poesio. Strudel: A corpus-based semantic model based on properties and types.

You are who your friends are. A number of computational linguists have been interested in just how much you can learn about a word based on the other words it tends to appear with. Interestingly, if you take a word (e.g., dog) and look at the words it tends to co-occur with (e.g., cat), those other words often describe properties or synonyms of the target word. A number of researchers have suggested that this might be part of how we learn the meanings of words.

Baroni et al. are sympathetic to that literature, but they point out that such models only learn that dog and cat are somehow related. So they don't actually tell you what the word dog means. Moreover, dog is also related to leash, but not in the same way it's related to cat, which is something those models ignore. Their paper covers a new model, Strudel, which attempts to close some of that gap.

The model also keeps track of what words co-occur with a target word. It additionally tracks how those words are related (e.g., dogs and cats is considered to be different from dogs chase cats). The more different types of constructions that connect the target word and a given "friend", the more important that friend is thought to be.
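Roughly, the idea is something like the following sketch (my own schematic illustration with a made-up toy corpus, not the actual Strudel implementation): count the distinct patterns that link the target word to each co-occurring word, and treat pattern-diverse friends as more important.

```python
from collections import defaultdict

# Rough sketch of the general idea (not the actual Strudel code): for each
# target word, record not just which words it co-occurs with but the lexical
# pattern connecting them, then score a "friend" by how many distinct
# patterns link it to the target.

def collect_patterns(sentences, target, window=3):
    patterns = defaultdict(set)           # friend -> set of connecting patterns
    for sent in sentences:
        words = sent.lower().split()
        for i, w in enumerate(words):
            if w != target:
                continue
            for j in range(max(0, i - window), min(len(words), i + window + 1)):
                if j == i:
                    continue
                friend = words[j]
                gap = " ".join(words[min(i, j) + 1:max(i, j)])  # words in between
                patterns[friend].add(gap)
    return patterns

sentences = ["the dog chased the cat", "the dog was on a leash",
             "a dog is like a cat", "the dog has four legs"]
patterns = collect_patterns(sentences, "dog")
# Friends linked by more distinct patterns are treated as more important.
ranked = sorted(patterns, key=lambda f: len(patterns[f]), reverse=True)
print(ranked[:5])
```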

This model ends up doing a better job than some older models at finding semantic associates of target words. It also can cluster different words (e.g., apple, banana, dog, cat) into categories (fruit, animal) with some success. Moreover, with some additional statistical tricks, they were able to clump the various "friends" into different groups based on the type of constructions they appear in. Properties, for instance, often appear in constructions involving X has Y. Conceptually-similar words appear in other types of constructions (e.g., X is like Y).

This presents some clear advantages over previous attempts, but it has some of the same limitations as well. The model discovers different types of features of a target word (properties, conceptually-similar words, etc.), but the label "property" has to be assigned by the researchers. The model doesn't know that has four legs is a property of dog and that like to bark is not -- it only knows that the two facts are of different sorts.

Perruchet & Tillman. Exploiting multiple sources of information in learning an artificial language: human data and modeling. 

Over the last 15 years, a number of researchers have looked at statistically-based word segmentation. After listening to a few minutes of speech in an unknown language, people can guess which sequences of phonemes are more likely to be words in that language.
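One classic way to formalize that intuition is with transitional probabilities between adjacent syllables, positing word boundaries where the probability dips. The sketch below is mine, uses a made-up syllable stream, and is not the authors' PARSER model (which instead builds and strengthens chunks), but it shows the flavor of the statistics involved.

```python
from collections import Counter

# Minimal sketch of transitional-probability segmentation (not PARSER).
stream = "golabupabikututibubabupugolabubabupu"          # made-up syllable stream
syllables = [stream[i:i + 2] for i in range(0, len(stream), 2)]

pair_counts = Counter(zip(syllables, syllables[1:]))
first_counts = Counter(syllables[:-1])

def transitional_prob(a, b):
    """P(next syllable = b | current syllable = a)."""
    return pair_counts[(a, b)] / first_counts[a]

# Posit a word boundary wherever the forward transitional probability dips
# below that of its neighbors (a crude local-minimum rule).
tps = [transitional_prob(a, b) for a, b in zip(syllables, syllables[1:])]
boundaries = [i + 1 for i in range(1, len(tps) - 1)
              if tps[i] < tps[i - 1] and tps[i] < tps[i + 1]]
print(boundaries)
```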

It turns out that some sequences of phonemes just sound more like words, independent of any learning. The authors check to see whether that matters. Participants were assigned to learn one of two languages: a language in which half of the words a priori sounded like words, and a language in which half the words a priori sounded particularly not like words. Not only did participants do better in the first condition on the words that sound like words, they did better on the "normal" words, too -- even though those were the same as the "normal" words in the second condition. The authors argue that this is consistent with the idea that already knowing some words helps you identify other words.

They also find that the fact that some words a priori sound more like they are words is easy to implement in their previously-proposed PARSER model, which then produces data somewhat like the human data from the experiment.

Gildea & Temperley. Do grammars minimize dependency length?

Words in a sentence are dependent on other words. In secondary school, we usually used the term "modify" rather than "depend on." So in The angry butcher yelled at the troublesome child, "the angry butcher" and "at the troublesome child" both modify/depend on yelled. Similarly, "the angry" modifies/depends on butcher. Etc.

This paper explores the hypothesis that people try to keep words close to the words they depend on. They worked through the Wall Street Journal corpus and calculated both the actual dependency lengths in each sentence (for each word in the sentence, count all the words that are between that word and the word it depends on, and sum) and also the shortest possible dependency length for the same sentence. They found that actual dependency lengths were much closer to the optimum in both the WSJ corpus and the Brown corpus than would be expected by chance. However, when they looked at two corpora in German, while dependency lengths were still shorter than would be expected by chance, the effect was noticeably smaller. The authors speculate this may be because German has relatively free word order, because it has some verb-final constructions, or for some combination of these and other reasons.
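The measure itself is easy to state in code. Here is a minimal sketch of my own (the head indices for the example sentence are my rough guess at a parse, not the paper's):

```python
# Sketch of the dependency-length measure described above: for each word,
# count the words lying between it and its head, then sum.

def total_dependency_length(heads):
    """heads[i] is the index of word i's head, or None for the root verb."""
    return sum(abs(i - h) - 1 for i, h in enumerate(heads) if h is not None)

# "The angry butcher yelled at the troublesome child"
#   0    1      2      3     4   5      6        7
heads = [2, 2, 3, None, 3, 7, 7, 4]       # my rough guess at the dependencies
print(total_dependency_length(heads))     # 4 under this parse
```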

Mueller, Bahlmann & Friederici. Learnability of embedded syntactic structures depends on prosodic cues. 

Center-embedded structures are hard to process and also difficult to teach people in artificial grammar learning studies that don't provide feedback. The authors exposed participants to A1A2B1B2 structures with or without prosodic cues. Participants largely failed to learn the grammar without prosodic cues. However, if a falling contour set off each 4-syllable phrase (A1A2B1B2) from the next, participants learned much more. They did even better if a pause between 4-syllable phrases was added on top of the falling contour. Adding a further pause between the As and Bs (in order to accentuate the difference between them) did not provide any additional benefit.

Cognitive Science, January 2010

In my continuing series on the past year in Cognitive Science: January, 2010.

Once again, the discussion of some of these papers will be technical.

January


Lee & Sarnecka. A model of knower-level behavior in number concept development.

Children learn the full meanings of number words slowly, one word at a time. The authors present a Bayesian model of number word acquisition -- or, more specifically, of performance on the famous Give-A-Number task. The model assumes that each child has a certain baseline preference to give certain numbers of items more than others. It also assumes that the child knows certain number words and not others. If the child, say, knows one and two, the child will give that number of items when asked and not when asked about a different number word (e.g., three), even if the child doesn't know what that other number word means.
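As a rough sketch of that generative assumption (mine, not the authors' code, which works in the other direction, inferring these quantities from children's responses):

```python
import random

# Sketch of the assumed response process in the Give-A-Number task: a child at
# knower-level k answers known number words exactly and otherwise falls back on
# a baseline preference over set sizes, avoiding sizes it knows to be wrong.

def give_a_number(requested, knower_level, baseline_weights, rng=random):
    """Simulate one Give-A-Number trial."""
    if requested <= knower_level:
        return requested                   # word is known: give exactly that many
    sizes = [n for n in range(1, len(baseline_weights) + 1) if n > knower_level]
    weights = [baseline_weights[n - 1] for n in sizes]
    return rng.choices(sizes, weights=weights)[0]

# Baseline preference (made up): children tend to grab a handful or give everything.
baseline = [5, 4, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 6]   # set sizes 1..15
print([give_a_number(3, knower_level=2, baseline_weights=baseline) for _ in range(5)])
```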

The model was then fed data on the actual performance of a set of actual children, and it estimated which words each child knew and what each child's baseline preferences were. The model learned that children prefer to either give a handful of items or all the available items, which accords well with what has been seen over the years. It also seemed to do a reasonable job on several other aspects of the data.

None of this was necessarily surprising, in the sense that the model modeled well-known data correctly. That said, psychological theories are often complex. Theorists (often) state them in loose terms and then make claims about what predictions the theory makes in terms of behavior in different tasks. Without specifying the theory in a formal model, though, it's not always clear that those are in fact the predictions the theory makes. This paper represents, among other things, an attempt to take a well-known theory and show that it does in fact account for the observed data. To the extent it gets things wrong, the model presents a starting point for further refinement.

There has been a movement in some quarters to make and test more explicit models. This is undoubtedly a good thing. The question is whether there are many behaviors that we understand sufficiently well to produce reasonable models ... that aren't so simplistic that the formal model itself doesn't really tell us anything we don't know. That seems to be a point one could argue. One thing I like about this particular model is that the authors attempt to capture fine-grained aspects of individual subjects' performances, which is something we ultimately want to be able to do.

Estigarribia. Facilitation by variation: Right-to-left learning of English yes/no questions

The syntax of questions has played a key role in the development of modern linguistics. In particular, a great deal of ink has been spilled about auxiliary inversion. Compare That is a soccer ball with Is that a soccer ball. Well-known theories of English posit that the auxiliary is is generated in normal declarative position (that is...) and must be moved to the front of the sentence to form a question (is that...).

Estigarribia argues that many theories have assumed parents model auxiliary-inverted questions for their children. A (smallish) corpus analysis reveals that in fact ~20% of parental yes/no questions with auxiliaries are non-auxiliary-initial (that is a soccer ball?). Of all yes/no questions, canonical auxiliary-first questions make up less than half, with sentence fragments being quite common (soccer ball?).

Looking again at the same corpus, which follows six young children, Estigarribia finds that the children begin by producing the simplest, fragment questions (a soccer ball?). Next, they begin producing what Estigarribia calls subject-predicate questions (that a soccer ball?). Full-on auxiliary-inverted questions appear relatively late (is that a soccer ball?). Estigarribia finds this consistent with a learning mechanism in which children learn the ends of sentences better than the beginnings of sentences, similar to the MOSAIC model.

One limitation is that children have difficulty producing long sentences, and the data are consistent with children simply producing shorter sentences first and progressively longer sentences later. Estigarribia shows that he finds the same order of acquisition even in children who have somewhat longer MLUs at the beginning of the study (that is, who produce longer sentences), but one can still worry. The fact that children selectively produce the ends of sentences rather than the beginnings could be because the end of a question (a soccer ball?) is a lot more informative than the beginning (is that a?).

It might be somewhat more impressive if children produced non-inverted questions (that is a soccer ball?) before inverted questions, but Estigarribia does not analyze those types of sentences. What I find most compelling about this study is in fact the adult data. As Estigarribia points out, we don't want to think of language acquisition as a process in which children ultimately eliminate non-canonical questions (that is, those without inverted auxiliaries), since in fact adults produce many such sentences.

Nakatani & Gibson. An on-line study of Japanese nesting complexity.

Mary met the senator who attacked the reporter who ignored the president is easier to understand than The reporter who the senator who Mary met attacked ignored the president, even though the latter sentence is grammatical (of a sort) and means the same thing. Why this is the case has been a focus of study in psycholinguistics for many years.

The authors lay out a couple of hypotheses. On one, the second sentence is harder to interpret because the relevant nouns are far from their verbs, making it harder to integrate ignored with the reporter. On other hypotheses, all the nested relative clauses (who...) generate expectations about what verbs are coming up. The more expectations, the more has to be kept in memory, and the harder the sentence is.

These hypotheses (and a similar surprisal hypothesis) are tested using the self-paced reading methodology in Japanese, a language with a few nice properties like relatively free word order, which makes controlling the stimuli slightly easier than it is in English. The results ultimately support the expectancy hypotheses over the retrieval hypotheses.

One of the interesting things about this paper is that one well-known retrieval hypothesis is actually Gibson's. So is one of the expectancy hypotheses, which he developed after he (apparently) decided the original theory was probably wrong. The willingness to abandon a cherished theoretical position in the face of new evidence is a trait more prized than seen in academia, and it's something to be admired -- and something very typical of Gibson.

Mirman, Strauss, Dixon & Magnuson. Effect of representational distance between meanings on recognition of ambiguous spoken words.

The authors looked at word recognition using two different paradigms (lexical decision and eye-tracking). All the words could be nouns. Some had only a strong noun meaning (acorn, lobster). Some were homophones with two common noun meanings (chest -- a chest of drawers or a person's chest) and some were homophones with a common noun and a common verb meaning (bark -- the dog barked or the tree's bark).

Participants were fastest to interpret the unambiguous words (acorn, lobster), next fastest at recognizing the noun-verb words (bark), and slowest at the noun-noun words (chest). The authors take this in the context of previous research showing that words with two closely related meanings are faster to interpret than words with two very different meanings. In this study, the semantic relatedness of the two meanings of the noun-verb homophones was no closer than that of the noun-noun homophones. So the authors suggest that syntactic distance matters as well -- two meanings of the same syntactic type (e.g., noun) interfere with one another more than two meanings of different types (e.g., noun and verb).

An alternative explanation of these data is one of priming. 2/3 of the stimuli in this study were unambiguously nouns. This may have primed the noun meanings of the noun-verb homophones and helped automatically suppress the verb meaning. Thus, participants processed the noun-verb homophones more like unambiguous, non-homophonic words. The way to test this, of course, would be to run a similar study with unambiguous verbs, verb-verb homophones, and the same noun-verb homophones.

Cognitive Science, April 2010

This week I was tasked by the lab to check the last year's worth (or so) of issues of Cognitive Science and see what papers might be of interest to folks in the lab (other people are covering other journals). There are of course many good papers not on the list below; I focused largely on the psycholinguistics articles. There are a lot of articles, so I'm going to be breaking up issues into separate posts.

Fair warning: my discussion of these articles is brief and so somewhat technical.

April 2010


Szymanik & Zajenkowski. Comprehension of simple quantifiers: empirical evaluation of a computational model.

Different quantifiers seem to require different amounts of computation. Formal logic suggests that checking the truth of Some of the cars are blue simply requires checking whether at least one car is blue (or failing to find any). Most of the cars are blue probably requires something like finding out how many cars would be half of the cars and then checking whether more than that many are blue. That's harder.
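A minimal sketch of the difference between the two verification procedures (my illustration, not the authors' computational model):

```python
# Sketch of two verification procedures: "some" needs only a single witness,
# while "most" requires counting everything and comparing against half.

cars = ["blue", "red", "blue", "green", "blue"]

def verify_some(items, prop):
    # Stop as soon as one witness is found.
    return any(x == prop for x in items)

def verify_most(items, prop):
    # Count all matching items and compare to half the total.
    return sum(x == prop for x in items) > len(items) / 2

print(verify_some(cars, "blue"))   # True after finding one blue car
print(verify_most(cars, "blue"))   # True: 3 of the 5 cars are blue
```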

S&Z had people evaluate the truth value of sentences like those in the examples. People were slower for the "harder" quantifiers. This suggests people are actually running through something like the formal math theorists use to describe quantifiers.

The only odd thing about the results is a ton of research (e.g., Bott & Noveck) has suggested that evaluating sentences with some can be very slow, presumably because it involves a scalar implicature, whereas in the study some was one of the fastest quantifiers. This either suggests that for some reason people weren't computing implicatures in their study or that the other quantifiers were really slow (or that Polish, the language they used, is just different).

Matthews & Bannard, Children's production of unfamiliar word sequences is predicted by positional variability and latent classes in a large sample of child-directed speech.

Two- and three-year-olds were asked to repeat back four-word sequences. Several things were varied, such as how predictable the final word was given the first three in the sequence (e.g., jelly presumably appears very often after peanut butter and ...) and whether the words that do commonly appear as the fourth word in such a sequence are semantically related (e.g., pretty much everything following I drive a ... is going to be some kind of vehicle).

Importantly, in the actual sequences presented to the children, the final word was one that hardly ever appears in that sequence (e.g., I hate green boxes). Kids were better at repeating the sequences when (1) entropy on the 4th word was high (e.g., many different words commonly follow the first three in the sequence, as in I drive a rather than peanut butter and), and (2) when most words that typically appear in that 4th position are semantically related (I drive a truck/car/bus/Toyota/Ford).
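For concreteness, here is roughly what the entropy measure amounts to, computed from made-up counts of fourth words following each three-word frame (a sketch of mine, not the authors' code):

```python
import math

# Entropy of the fourth-word slot for each three-word frame. Counts are made up.

def entropy(counts):
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values() if c)

fourth_word_counts = {
    "peanut butter and": {"jelly": 95, "toast": 3, "honey": 2},
    "I drive a":         {"car": 30, "truck": 25, "bus": 20, "tractor": 15, "Toyota": 10},
}

for frame, counts in fourth_word_counts.items():
    print(f"{frame} ...  entropy = {entropy(counts):.2f} bits")
```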

The authors (tentatively) suggest that such results are predicted by theories on which young children's grammars are very item-specific, involving many narrow sentence templates (I drive a + Vehicle), rather than theories on which young children's grammars involve broad abstract categories (e.g., Noun + Verb).

However, my labmate and collaborator Timothy O'Donnell has been working on a computational model that involves abstract grammatical categories but nonetheless stores high-frequency constructions (something allowed for but not specifically explained by many grammatical theories, such as Pinker's Words & Rules theory, that are the traditional alternatives to item-based theories). One consequence of his model is that if a particular construction appears very frequently with little variation (peanut butter and jelly; 653,000 hits on Google), the model finds slight alternatives to that construction (peanut butter and toast; 120,000 hits on Google) extremely unlikely.

Casasanto, Fotakopoulou & Boroditsky. Space and time in the child's mind: Evidence for a cross-dimensional asymmetry.

4-5 year-olds and 9-10 year-olds watched movies of two animals traveling along parallel paths for different distances or durations and judged which one went longer temporally or spatially. As has previously been shown in adults, the children's judgments of temporal length were affected by spatial length (e.g., if animal A went farther than B but in a shorter amount of time, children sometimes erroneously said A took more time) more than their judgments of spatial length were affected by temporal length (e.g., if animal A went farther than B but in less time, children were not as likely to be confused about which animal went the farthest).

One obvious confound, which the authors consider, is that the stimuli stayed on the screen until the children responded, which meant that information about physical distance was available at response time, but children had to remember the duration information. The authors point to a previous study with adults that controlled for this confound and got the same results, but they have not yet run that version with children (since I haven't read the study they refer to and the method isn't described, I can't really comment).

These results are taken as evidence for a theory on which our notion of time is dependent on our representations of space, but not vice versa.

Fay, Garrod, Roberts & Swoboda. The interactive evolution of human communication systems.

People played what amounted to repeated games of Pictionary. The games were played over and over with the same words, and the question was how the pictorial representations changed over repeated games. Participants were assigned either in pairs or communities of 8. The pairs played against each other only. In the communities, games were still played in pairs, but each person played first against one member of the community, then against another, and so on until they had played with everyone.

The people in the pairs condition rapidly created abstract visual symbols for the different target words, as has happened in previous research. What was interesting was that in the communities condition, participants also created similarly abstract symbols that were rapidly shared throughout the community, such that participants who had never played against one another could communicate with abstract symbols that they had each learned from others.

The study is meant to be a model of vocabulary development and dispersal, and it certainly made for a good read (I've been a fan of Garrod's work in other contexts as well). I don't know much about theories of language evolution, so it's difficult for me to say what the theoretical impact of this work is. One obvious question is whether it matters that people in a single community know they are in a single community. That is, did they continue to use the abstract symbols they'd learned because they reasonably thought the other person might know it, or was it instinct after having used that symbol many times?

Baseball Models


The Red Sox season opener was delayed yesterday by rain. In honor of Opening Day 2.0 (this afternoon), I point you to an interesting piece in the Times about statistical simulations in baseball. According to the article, the top simulator available to the public is Diamond Mind.

Is language just statistics?

Many years ago, I attended a talk in which a researcher (in retrospect, probably a graduate student) was talking about some work she was doing on modeling learning. She mentioned that a colleague was very proud of a model he had put together in which he had a model world populated by model creatures which learned to avoid predators and find food.

She reported that he said, "Look, they are able to learn this without *any* input from the programmer. It's all nurture, not nature." She argued with him at length to point out that he had programmed into his model creatures the structures that allowed them to learn. Change any of those parameters, and they ceased to learn.

There are a number of researchers in the field of language who, impressed by the success of statistical-learning models, argue that much or all of language learning can be accomplished by simply noticing statistical patterns in language. For instance, there is a class of words in English that tend to follow the word "the." A traditional grammarian might call these "nouns," but this becomes unnecessary when using statistics.
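The statistical point is easy to illustrate with a toy example (mine, with a made-up corpus): just tallying what follows "the" recovers a noun-like class without ever labeling it.

```python
from collections import Counter

# Toy illustration of the distributional claim above: collect the words that
# follow "the" in a tiny made-up corpus. No category label is assigned; a
# noun-like class simply falls out of the co-occurrence statistics.

corpus = ("the dog chased the cat . the child threw the ball . "
          "a dog bit the mailman . the cat slept .")

words = corpus.split()
follows_the = Counter(b for a, b in zip(words, words[1:]) if a == "the")
print(follows_the.most_common())   # dog, cat, child, ball, mailman
```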

There are many variants of this approach, some more successful than others. Some are more careful in their claims than others (one paper, I recall, stated strongly that the described model did away with not only grammatical rules, but words themselves).

While I am impressed by much of the work that has come out of this approach, I don't think it can ever do away with complex (possibly innate) structure. The anecdote above is an argument by analogy. Here is a great extended quote from Language Learnability and Language Development, Steven Pinker's original, 1984 foray into book writing:

As I argued in Pinker (1979), in most distributional learning procedures there are vast numbers of properties that a learner could record, and since the child is looking for correlations among these properties, he or she faces a combinatorial explosion of possibilities. For example, he or she could record of a given word that it occurs in the first (or second, or third, or nth) position in a sentence, that it is to the left (or right) of word X or word Y or ..., or that it is to the left of the word sequence WXYZ, or that it occurs in the same sentence with word X (or words X, Y, Z, or some subset of them), and so on. Adding semantic and inflectional information to the space of possibilities only makes the explosion more explosive. To be sure, the inappropriate properties will correlate with no others and hence will eventually be ignored, leaving only the appropriate grammatical properties, but only after astronomical amounts of memory space, computation, or both.

In any case, most of these properties should be eliminated by an astute learner as being inappropriate to learning a human language in the first place. For example, there is no linguistic phenomenon in any language that is contingent upon a word's occupying the third serial position in a sentence, so why bother testing for one? Testing for correlations among irrelevant properties is not only wasteful but potentially dangerous, since many spurious correlations will arise in local samples of the input. For example, the child could hear the sentences John eats meat, John eats slowly, and the meat is good and then conclude that the slowly is good is a possible English sentence.

Ultimately, a pure-statistics model still has to decide what regularities to keep track of and what to ignore, and that requires at least some innate structure. It probably also requires fairly complex grammatical structures, whether learned or innate.

How to win at baseball (Do managers really matter?)

It's a standard observation that when a team does poorly, the coach -- or in the case of baseball, the manager -- is fired, even though it wasn't the manager dropping balls, throwing the wrong direction or striking out.

Of course, there are purported examples of team leaders that seem to produce teams better than the sum of the parts that make them up. Bill Belichick seems to be one, even modulo the cheating scandals. Cito Gaston is credited with transforming the Blue Jays from a sub-.500 team into a powerhouse not once but twice, his best claim to excellence being this season, in which he took over halfway through the year.

But what is it they do that matters?

Even if one accepts that managers matter, the question remains: how do they matter? They don't actually play the game. Perhaps some give very good pep talks, but one would hope that the world's best players would already be trying their hardest pep talk or no.

In baseball, one thing the manager controls is the lineup: who plays, and the order in which they bat. While managers have their own different strategies, most lineups follow a basic pattern, the core of which is to put one's best players first.

There are two reasons I can think of for doing this. First, players at the top of the lineup tend to bat more times during a game, so it makes sense to have your best players there. The other reason is to string hits together.

The downside of this strategy is that innings in which the bottom of the lineup bats tend to be very boring. Wouldn't it make sense to spread out the best hitters so that in any given inning, there was a decent chance of getting some hits?

How can we answer this question?

To answer this question, I put together a simple model. I created a team of four .300 hitters and five .250 hitters. At every at-bat, a player's chance of reaching base was exactly their batting average (a .300 hitter reached base 30% of the time). All hits were singles. Base-runners always moved up two bases on a hit.

I tested two lineups: one with the best players at the top, and one with the best players alternating with the poorer hitters.

This model ignores many issues, such as base-stealing, double plays, walks, etc. It also ignores the obvious fact that you'd rather have your best power hitters batting behind people who get on base, making their home runs count for more. But I think that if batting order has a strong effect on team performance, it would still show up in the model.
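For those who want to play along, here is a minimal sketch of the simulation as described above. It is a reconstruction under stated assumptions (nine innings per game, three outs per inning, only singles, runners advance exactly two bases, batting order carries over between innings), not necessarily identical to the code actually used.

```python
import random

# Minimal sketch of the lineup simulation described in the post.

def simulate_game(lineup):
    """Return runs scored in one nine-inning game for a lineup of batting averages."""
    runs, batter = 0, 0
    for _ in range(9):                                # nine innings
        outs, bases = 0, [False, False, False]        # first, second, third
        while outs < 3:
            average = lineup[batter % len(lineup)]
            batter += 1
            if random.random() < average:             # the batter singles
                if bases[2]: runs += 1                # runner on third scores
                if bases[1]: runs += 1                # runner on second scores (moves up two)
                bases = [True, False, bases[0]]       # batter to first, first to third
            else:
                outs += 1
    return runs

def average_season_runs(lineup, seasons=20, games=162):
    totals = [sum(simulate_game(lineup) for _ in range(games)) for _ in range(seasons)]
    return sum(totals) / seasons

stacked     = [.300, .300, .300, .300, .250, .250, .250, .250, .250]
alternating = [.300, .250, .300, .250, .300, .250, .300, .250, .250]

print("best hitters stacked at the top:", average_season_runs(stacked))
print("best hitters interspersed:      ", average_season_runs(alternating))
```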

Question Answered

I ran the model on each of the lineups for twenty full 162-game seasons. The results surprised me. The lineup with the best players interspersed scored nearly as many runs in the average season (302 1/4) as the lineup with the best players stacked at the top of the order (309 1/2). Some may note that the traditional lineup did score about 7 more runs per season on average, but the difference was not statistically significant, meaning that the two lineups were in a statistical tie.

Thus, it doesn't appear that stringing hits together is any better than spacing them out.

One prediction did come true, however. Putting your best hitters at the front of the lineup is better than putting them at the end (291 1/2 runs per season), presumably because the front end of the lineup bats more times in a season. Although the difference was statistically significant, it still amounted to only 1 run every 9 games, which is less than I would have guessed.

Thus, the decisions a manager makes about the lineup do matter, but perhaps not very much.

Parting thoughts

This was a rather simple model. I'm considering putting together one that does incorporate walks, steals and extra-base hits in time for the World Series in order to pick the best lineup for the Red Sox (still not sure how to handle sacrifice flies or double-plays, though). This brings up an obvious question: do real managers rely on instinct, or do they hire consultants to program models like the one I used here?

In the pre-Billy Beane/Bill James world, I would have said "no chance." But these days management is getting much more sophisticated.

Why are humans risk averse?

After my first foray into computational simulations successfully predicted that losses should loom larger than gains, at least when the stakes are high, I decided to take on an even more complicated phenomenon in psychology and economics: risk aversion.

Daniel Kahneman, one of the few psychologists to win a Nobel Prize -- largely because there is no Nobel Prize for psychology -- along with Amos Tversky achieved at least some of his fame by demonstrating that humans are risk-averse for gains but risk-seeking for losses.

The most famous demonstration of this came from the following experiment (known as the "Asian Flu problem"):

The Center for Disease Control discovers there is an outbreak of Asian Flu in the United States. If nothing is done, they predict that 600 people will die. Two courses of action have been suggested. If program A is adopted, 200 people will be saved. If program B is adopted, there is a one-third probability that 600 people will be saved and a two-thirds probability that no people will be saved. Which of the two programs do you favor?

Most people, it turns out, pick program A, the sure bet. This is interesting, because, on average, the two programs are exactly the same. That is, on average, program B will save 200 people, just like program A. The difference is that program B is more risky.

This would seem to demonstrate that people are risk-averse. However, you can reword the problem just slightly and get a very different response:

Two courses of action have been suggested. If program A is adopted, 400 will die. If program B is adopted, there is a one-third probability that nobody will die and a two-thirds probability that 600 people will die. Which of the two programs do you favor?

Given this scenario, most people pick program B. However, notice that this is the exact same program as in the first version of the problem! It turns out that if people think about the issue in terms of lives saved, they are risk-averse. If they think about the issue in terms of lives lost, they are risk-seeking (they choose the riskier option).

There is no right or wrong answer according to logic, because logic and statistics tell us that program A and B are essentially identical.

In my last simulation, I suggested that it actually makes sense for losses to loom larger than gains, even though statistics and logic don't predict this. Maybe the same is true for being risk averse for gains and risk-seeking for losses. Maybe that is actually adaptive.

Here's how my simulation worked: Each of my simulated "creatures" played the following game: They could either take a sure gain of 10 units of food, or they could take a risky gain: a 50% chance of 5 units or a 50% chance of 15 units. Again, the two choices are statistically identical -- on average, you get 10 units of food either way. Some of the creatures were risk-averse and always took the sure bet; some were risk-seeking and always took the risky bet.

The creatures also played the same game for losses: they either had a guaranteed loss of 10 units or a 50% chance of losing 5 and a 50% chance of losing 15. Again, some were risk-seeking and some were risk averse.

Each creature played both games (the gain game and the loss game) 1000 times. There were 1000 creatures who were, like humans, risk-averse for gains and risk-seeking for losses. There were 1000 creatures who were risk-seeking for gains and risk-averse for losses (the opposite of humans). There were also 1000 creatures who were risk-seeking for both gains and losses.

The creatures all started with 100 pieces of food.
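Here is a minimal sketch of that setup (a reconstruction under the assumptions noted in the comments, not necessarily the original code):

```python
import random

# Sketch of the gain/loss gamble simulation described above. Assumptions:
# the gain game and the loss game are each played once per round, and a
# creature whose food reaches zero dies and stops playing.

def run_creature(risk_seeking_gains, risk_seeking_losses, rounds=1000, start=100):
    food = start
    for _ in range(rounds):
        # Gain game: sure +10, or a 50/50 chance of +5 or +15.
        food += random.choice([5, 15]) if risk_seeking_gains else 10
        # Loss game: sure -10, or a 50/50 chance of -5 or -15.
        food -= random.choice([5, 15]) if risk_seeking_losses else 10
        if food <= 0:
            return None                               # the creature starved
    return food

def report(label, risk_gains, risk_losses, n=1000):
    outcomes = [run_creature(risk_gains, risk_losses) for _ in range(n)]
    survivors = [f for f in outcomes if f is not None]
    died = 100 * (n - len(survivors)) / n
    avg = sum(survivors) / len(survivors) if survivors else 0
    print(f"{label}: {died:.0f}% died, {avg:.0f} average units of food")

report("risk-averse gains / risk-seeking losses", False, True)    # the human pattern
report("risk-seeking gains / risk-averse losses", True, False)
report("risk-seeking for both gains and losses", True, True)
```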

Risk-averse for gains/Risk-seeking for losses:
52% died
98 = average units of food at end of simulation

Risk-seeking for gains/Risk-averse for losses:
54% died
92 = average units of food left at end of simulation

Risk-seeking for gains & losses:
68% died
94 = average units of food left at end of simulation


While this simulation suggested that being risk-seeking across the board is not a good thing, it did not suggest that being risk-seeking for gains and risk-averse for losses was any better than the other way around. This could be because the size of the gains and losses was too large or too small relative to the starting endowment of food. I tried both larger endowments of food (200 units) and smaller (50 units), but the pattern of results was the same.

Again, this was a very simple simulation, so it is possible that it does not include the crucial factors that make the human strategy an adaptive one. It is also possible that the human strategy is not adaptive. Hopefully I will come across some papers in the near future that report better simulations that will shed some light onto this subject.


-----
(Note that as far as I can tell, being risk-seeking for losses should prevent people from buying insurance, yet people do. I'm not sure why this is, or how Kahneman's theory explains this.)




Tversky, A., & Kahneman, D. (1981). The framing of decisions and the psychology of choice. Science, 211, 453-458.

Why is losing $10 worse than winning $10 is good?

Losses loom larger than gains.

This useful mnemonic describes an odd experimental finding: if you have people rate on a scale of 1 to 10 how unhappy they would be to lose $100, that rating will be higher than if you ask them how happy they would be to win $100. Similarly, people tend to be reluctant to gamble when the odds are even (50% chance of winning $100, 50% chance of losing $100). Generally, if odds are even, people aren't likely to bet unless the potential prize is greater than the potential loss.

This is a well-known phenomenon in psychology and economics. It is particularly surprising, because simple statistical analysis would suggest that losses and gains should be treated equally. That is, if you have a 50% chance of winning $100 and a 50% chance of losing $100, on average you will break even. So why not gamble?

(Yes, it is true that people play slot machines or buy lottery tickets, in which, on average, you lose money. That's a different phenomenon that I don't completely understand. When/if I do, I'll write about it.)

A question that came up recently in a conversation is: why aren't people more rational? Why don't they just go with the statistics?

I imagine there have been papers written on the subject, and I'd love to get some comments referring me to them. Unfortunately, nobody involved in this conversation knew of said papers, so I actually did some quick-and-dirty simulations to investigate this problem.

Here is how the simulation works: each "creature" in my simulation is going to play a series of games in which they have a 50% chance of winning food and a 50% chance of losing food. If they run out of food, they die. The size of the gain and the size of the loss are each chosen randomly. If the ratio of gain to loss is large enough, the creature will play.

For some of the creatures, losses loom larger than gains. That is, they won't play unless the gain is more than 1.5 times larger than the loss (50% chance of winning 15.1 units of food, 50% chance of losing 10). Some of the creatures treat gains and losses roughly equally, meaning they will play as long as the gain is at least a sliver larger than the loss (50% chance of winning 10.1 units of food, 50% chance of losing 10). Some of the creatures weigh gains higher than losses and will accept any gamble as long as the gain is at least half the size of the loss (50% chance of winning 5.1 units of food, 50% chance of losing 10).

(Careful observers will note that all of these creatures follow the same kind of rule: each accepts a gamble whenever the gain is a large enough multiple of the loss. There is always some bet so bad that a creature won't take it, and any sufficiently good bet will be accepted; the creatures differ only in where they set the threshold.)

Each creature plays the game 1000 times, and there are 1000 creatures. They all start with 100 units of food.
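Here is a minimal sketch of that setup (again a reconstruction under the assumptions noted in the comments, not necessarily the original code); changing the cap parameter reruns the higher-stakes versions reported below.

```python
import random

# Sketch of the gamble simulation described above. Assumptions: gains and
# losses are drawn uniformly between 0 and the cap, a creature refuses any
# gamble whose gain-to-loss ratio falls below its threshold, and it dies
# once its food reaches zero.

def run_creature(threshold, rounds=1000, start=100, cap=10):
    food = start
    for _ in range(rounds):
        gain = random.uniform(0, cap)
        loss = random.uniform(0, cap)
        if gain < threshold * loss:
            continue                          # the gamble is refused
        food += gain if random.random() < 0.5 else -loss
        if food <= 0:
            return None                       # the creature starved
    return food

for label, threshold in [("losses loom larger than gains", 1.5),
                         ("losses roughly equal to gains", 1.0),
                         ("gains loom larger than losses", 0.5)]:
    outcomes = [run_creature(threshold) for _ in range(1000)]
    survivors = [f for f in outcomes if f is not None]
    print(f"{label}: {100 * (1000 - len(survivors)) / 1000:.0f}% died,",
          f"{sum(survivors) / max(len(survivors), 1):.0f} average food")
```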

In the first simulation, the losses and gains were capped at 10 units of food, or 10% of the creature's starting endowment, with an average of 5 units. Here's how the creatures fared:

Losses loom larger than gains:
0% died.
807 = average amount of food at end of simulation.

Losses roughly equal to gains:
0% died.
926 = average amount of food at end of simulation.

Gains loom larger than losses:
2% died.
707 = average amount of food at end of simulation.


So this actually suggests that the best strategy in this scenario would be to treat losses and gains similarly (that is, act like a statistician -- something humans don't do). However, the average loss and gain was only 5 units of food (5% of the starting endowment), and the maximum was 10 units of food. So none of these gambles were particularly risky, and maybe that has something to do with it. So I ran a second simulation with losses and gains capped at 25 units of food, or 25% of the starting endowment:

Losses loom larger than gains:
0% died
1920 = average amount of food at end of simulation

Losses roughly equal to gains:
1% died
2171 = average amount of food at end of simulation

Gains loom larger than losses:
14% died
1459 = average amount of food at end of simulation


Now, we see that the statistician's approach still leads to more food on average, but there is some chance of starving to death, making weighing losses greater than gains seem like the safest option. You might not get as rich, but you won't die, either.

This is even more apparent if you up the potential losses and gains to a maximum of 50 units of food each (50% of the starting endowment), and an average of 25 units:

Losses loom larger than gains:
1% died.
3711 = average amount of food at end of simulation

Losses equal to gains
9% died
3941 = average amount of food at end of simulation

Gains loom larger than losses
35% died.
2205 = average amount of food at end of simulation


Now, weighing losses greater than gains really seems like the best strategy. Playing the statistician will net you 6% more food on average, but it also increases your chance of dying roughly ninefold (from 1% to 9%). (The reason the statistician ends up with more food on average is probably that the conservative losses-loom-larger-than-gains creatures don't take as many gambles and thus have fewer opportunities to win.)

So what does this simulation suggest? It suggests that when the stakes are high, it is better to be conservative and measure what you might win by what you might lose. If the stakes are low, this is less necessary. Given that humans tend to value losses higher than gains, this suggests that we evolved mainly to think about risks with high stakes.

Of course, that's all according to what is a very, very rough simulation. I'm sure there are better ones in the literature, but it was useful to play around with the parameters myself.

Can computers talk? (The Chinese Room)

Can computers talk? Right now, no. Natural Language Processing -- the field of Artificial Intelligence & Linguistics that deals with computer language (computers using language, not C++ or BASIC) -- has made strides in the last decade, but the best programs still frankly suck.

Will computers ever be able to talk? And I don't mean Alex the Parrot talk. I mean speak, listen and understand just as well as humans. Ideally, we'd like something like a formal proof one way or another, such as the proof that it is impossible to write a computer program that will definitively determine whether another computer program will ever get stuck in an infinite loop (the halting problem). That sort of program has been proven to be impossible. How about a program to emulate human language?

One of the most famous thought experiments to deal with this question is the Chinese Room, created by John Searle back in 1980. The thought experiment is meant to be a refutation of the idea that a computer program, even in theory, could be intelligent. It goes like this:

Suppose you have a computer in a room. The computer is fed a question in Chinese, and it matches the question against a database in order to find a response. The computer program is very good, and its responses are indistinguishable from that of a human Chinese speaker. Can you say that this computer understands Chinese?

Searle says, "No." To make it even more clear, suppose the computer was replaced by you and a look-up table. Occasionally, sentences in Chinese come in through a slot in the wall. You can't read Chinese, but you were given a rule book for manipulating the Chinese symbols into an output that you push out the "out" slot in the wall. You are so good at using these rules that your responses are as good as those of a native Chinese speaker. Is it reasonable to say that you know Chinese?

The answer is, of course, that you don't know Chinese. Searle believes that this demonstrates that computers cannot understand language and, scaling the argument up, cannot be conscious, have beliefs or do anything else interesting and mentalistic.

One common rebuttal to this argument is that the system which is the room (input, human, look-up table) knows Chinese, even though the parts do not. This is attractive, since in some sense that is true of our brains -- the only systems we know do in fact understand language. The individual parts (neurons, neuron clusters, etc.) do not understand language, but the brain as a whole does.

It's an attractive rebuttal, but I think there is a bigger problem with Searle's argument. The thought experiment rests on the presupposition that the Chinese Room would produce good Chinese. Is that plausible?

If the human in the room only had a dictionary, it's clearly not reasonable. Trying to translate based on dictionaries produces terrible language. Of course, Searle's Chinese Room does not use a dictionary. The computer version of it uses a database. If this is a simple database with two columns, one for input and one for output, it would have to be infinitely large to perform as well as a human Chinese speaker. As Chomsky famously demonstrated long ago, the number of sentences in any language is infinite. (The computer program could be more complicated, it is true. At an AI conference I attended several years ago, template-based language systems were all the rage. These systems try to fit all input into one of many template sentences. Responses, similarly, are created out of templates. These systems work much better than earlier computerized efforts, but they are still very restricted.)
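A toy illustration of the look-up-table idea (entirely mine, with a two-entry table) makes the coverage problem concrete: anything not literally anticipated falls through.

```python
# Toy look-up-table "Chinese Room": fluent for the handful of inputs it
# anticipates, useless for everything else.

lookup_table = {
    "你好吗": "我很好，谢谢。",          # "How are you?" -> "I'm fine, thanks."
    "你叫什么名字": "我叫小明。",        # "What's your name?" -> "My name is Xiaoming."
}

def chinese_room(message):
    return lookup_table.get(message, "？")   # anything unanticipated falls through

print(chinese_room("你好吗"))                # looks fluent
print(chinese_room("你昨天晚饭吃了什么"))     # "What did you eat last night?" -> fails
```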

The human version of the Chinese Room Searle gives us is a little bit different. In that one, the human user has a set of rules to apply to the input to achieve an output. In Minds, Brains and Science, which contains the version of this argument that I'm working from, he isn't very explicit as to how this would work, but I'm assuming it is something like a grammar for Chinese. Even supposing that using grammar rules without knowing the meanings of the words would work, the fact is that after decades of research, linguists still haven't worked out a complete grammatical description of any living language.

The Chinese Room would require a much, much more sophisticated system than what Searle grants. In fact, it requires something so complicated that nobody even knows what it would look like. The only existing machine that can process human language as well as a human is the human brain. Searle's conceit was that we could have a "dumb" algorithm -- essentially a look-up table -- that processed language. We don't have one. Maybe we never will. Maybe in order to process human language at the same level of sophistication as a human, the "system" must be intelligent, must actually understand what it's talking about.

This brings us to the flip side of Searle's thought experiment: Turing's. Turing proposed to test the intelligence of computers this way: once a computer can compete effectively in parlor games, it's reasonable to assume it's as intelligent as a human. The parlor game in question isn't important: what's important is the flexibility it requires. Modern versions of the Turing Test focus on the computer being able to carry on a normal human conversation -- essentially, to do what the Chinese Room would be required to do. The Turing assumption is that the simplest possible method of producing human-like language requires cognitive machinery on par with a human's.

If anybody wants to watch a dramatization of these arguments, I suggest the current re-imagining of Battlestar Galactica. The story follows a war between humans and intelligent robots. The robots clearly demonstrate emotions, intelligence, pain and suffering, but the humans are largely unwilling to believe any of it is real. "You have software, not feelings," is the usual refrain. Some of the humans begin to realize that the robots are just as "real" to them as the other humans. The truth is that our only evidence that other humans really have feelings, emotions, consciousness, etc., is through their behavior.

Since we don't yet have a mathematical proof one way or another, I'll have to leave it at that. In the meantime, having spent a lot of time struggling with languages myself, the Turing view seems much more plausible than Searle's.