Field of Science

Another problem with statistical translation

In the process of writing my latest article for Scientific American Mind, I spent a lot of time testing out automatic translators like Google Translate. As I discuss in the article, these programs have gotten a lot better in recent years, but on the whole they are still not very good.

I was curious what the Italian name of one of my favorite arias meant, so I typed O Soave Fanciulla into Google Translate. Programs like Google Translate are trained by comparing bilingual documents and noting, for a given word in one language, what word typically appears in the same place in the other language. Not surprisingly, Google Translate translated O Soave Fanciulla as O Soave Fanciulla -- no doubt because, in the bilingual corpora GT was trained on, sentences containing the phrase o soave fanciulla in Italian also had o soave fanciulla in English.
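To make the training idea concrete, here is a toy sketch in Python. It is not how Google Translate is actually built. The five-sentence corpus is invented for illustration, and the names train, translate_word and translate are hypothetical. It simply counts which English word appears in the same position as each Italian word, and it also remembers whole phrases it has seen, which is my guess at why an opera title that is left in Italian in English-language text would come back unchanged.

```python
from collections import Counter, defaultdict

# Invented toy parallel corpus: (Italian, English) pairs.
# The last pair mimics English text that quotes the aria title untranslated.
CORPUS = [
    ("la fanciulla canta", "the girl sings"),
    ("la fanciulla dorme", "the girl sleeps"),
    ("soave melodia", "sweet melody"),
    ("soave profumo", "sweet scent"),
    ("o soave fanciulla", "o soave fanciulla"),
]


def train(corpus):
    """Count whole-phrase pairings and same-position word pairings."""
    phrase_table = defaultdict(Counter)
    word_table = defaultdict(Counter)
    for source, target in corpus:
        phrase_table[source][target] += 1
        # Pair up words that occupy the same position in both sentences.
        for src_word, tgt_word in zip(source.split(), target.split()):
            word_table[src_word][tgt_word] += 1
    return phrase_table, word_table


def translate_word(word, word_table):
    """Return the most frequent same-position partner, or the word itself."""
    if word in word_table:
        return word_table[word].most_common(1)[0][0]
    return word  # unseen words pass through untranslated


def translate(text, phrase_table, word_table):
    """Prefer a phrase seen in training; otherwise go word by word."""
    text = text.lower()
    if text in phrase_table:
        return phrase_table[text].most_common(1)[0][0]
    return " ".join(translate_word(w, word_table) for w in text.split())


phrase_table, word_table = train(CORPUS)
print(translate("O Soave Fanciulla", phrase_table, word_table))
print(translate("soave fanciulla", phrase_table, word_table))
```

In this toy, the full title matches a phrase seen in training and is echoed back in Italian, while the shorter input "soave fanciulla" falls through to word-by-word lookup and comes out as "sweet girl". A real system uses far larger corpora and proper alignment models rather than raw position matching.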

I was reduced to translating the words one at a time: soave -> sweet, fanciulla -> girl. GT thinks o means or, but I expect that's the wrong reading in this context ("or sweet girl"?).


Panos Ipeirotis said...

O = oh :-)

The translation is "oh lovely girl" or "oh sweet girl".

But the overall point is indeed interesting...

Shadab_Troikaa said...

Nice observation. I have read somewhere that all of these free translation programs work on mathematical algorithms and give results accordingly.

Homer Mershon said...

I tried your trick with Google Translate. "O soave fanciulla" came back, as you said, with the Italian repeated. Taking the words separately rather than as a whole phrase did give "sweet" and "girl", but "or" for "o"; "o!", however, gave back "o!". Putting them together gave the translation "O! sweet girl". It seems to me that mechanical translation tries to operate like looking words up in a dictionary, with the same kind of result, which yields the French translation of "a big movie fan" as "un gros ventilateur de cinéma".
As you say in the SciAm Mind article, context is the problem. The question, AFAIK, is whether the binomial, linear system of computer reference can deal effectively with a language system based on the planetary, core-orbitally organized structure of language with four units of syntax and 12 parts of speech which creates meaningful units of varying lengths rather than bigram and trigram units. It seems to me that resolution of the problem of homophones and polysemous words is more likely to be found in grammatical and syntactical analysis rather than in linear segmentation.
You say, p. 47 S.A.M. "The problem for the talking robot's engineer is that we do not know precisely [how people quickly resolve homophones by using context]." The problem is that the human speaker is precisely a context machine, and until computers can duplicate that activity, neither fully accurate computer translation nor talking robots will be more than marginally successful. If the solution is to lie in effective syntactical and grammatical analysis, call me. I have a system!