How does Google predict typing language regardless of the input language?

Question

I assume it works like this:

1. It continuously spell-checks the input until it finds that most of it doesn't make sense.
2. At that point the input is somehow "transkeyed" into every known language (or simply into the user's installed languages).
3. Those candidates are spell-checked until one of them makes sense in a specific language, and that phrase is then searched.
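
The three steps above can be sketched roughly as follows. This is only an illustration of the idea in the question, not how Google actually does it. The key table is a partial mapping assumed from the common Arabic (101) layout and should be checked against the real layout; `ENGLISH_WORDS` is a tiny stand-in for a real dictionary, and the names `transkey`, `plausibility` and `best_reading` are made up for this sketch.

```python
# Partial US-QWERTY -> Arabic (101) key table (an assumption, verify before use).
# The "b" key is left out on purpose: it produces two characters at once.
QWERTY_TO_ARABIC = {
    "q": "ض", "w": "ص", "e": "ث", "r": "ق", "t": "ف", "y": "غ",
    "u": "ع", "i": "ه", "o": "خ", "p": "ح",
    "a": "ش", "s": "س", "d": "ي", "f": "ب", "g": "ل", "h": "ا",
    "j": "ت", "k": "ن", "l": "م",
    "z": "ئ", "x": "ء", "c": "ؤ", "v": "ر", "n": "ى", "m": "ة",
}
# Reverse table: which key would have produced each Arabic character.
ARABIC_TO_QWERTY = {out: key for key, out in QWERTY_TO_ARABIC.items()}

# Stand-in for a real spell-check dictionary.
ENGLISH_WORDS = {"google", "is", "able", "to", "do", "it", "easily"}


def transkey(text, table):
    """Re-map every character through a layout table; unknown ones pass through."""
    return "".join(table.get(ch, ch) for ch in text)


def plausibility(text, words):
    """Fraction of whitespace-separated tokens found in the word list."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(token in words for token in tokens) / len(tokens)


def best_reading(typed, tables, words):
    """Steps 1-3: score the raw input and every transkeyed candidate, keep the best."""
    candidates = [typed] + [transkey(typed, table) for table in tables]
    return max(candidates, key=lambda c: plausibility(c, words))


# Simulate typing English while the Arabic layout is active, then recover it.
typed = transkey("google is easily", QWERTY_TO_ARABIC)
recovered = best_reading(typed, [ARABIC_TO_QWERTY], ENGLISH_WORDS)
```

A real system would need one table per installed layout and a proper dictionary or language model per language instead of a single word set.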

http://imageshack.us/photo/my-images/89/capturece.jpg/
In this image I am typing "google is able to do it easily" with the input language set to Arabic (101), and Google detects that what I typed was meant to be English. It works with key combinations as well, not only single keys.

(An ordinary US keyboard has 101 keys. To keep the problem from getting more complicated, we could at first exclude languages whose layouts use more keys, if that makes it easier to find a method, and worry about them later.)

Posted 17-Jul-11 18:52pm

Hesham_h4

2 solutions

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Abhinav S · Answer 1 · 2011-07-17T19:03:00

Solution 1

They are not going to tell us how they did it. If they did, it would no longer be unique. :)

This[^] would be a simple approach for getting started with language detection, but there is really a lot more to it.

Posted 17-Jul-11 19:03pm

Abhinav S

Comments

Sergey Alexandrovich Kryukov 18-Jul-11 2:10am

Agree, but the very first step should be pretty simple -- please see my answer. :-)
--SA

Hesham_h4 18-Jul-11 12:34pm

I think this method is close to the one used in Google Translate (which is not accurate at all, though). It detects the language of a text written in that language, not of a text typed with a different input language, which is our topic here.

Sergey Alexandrovich Kryukov · Answer 2 · 2011-07-17T20:09:00

Solution 2

Ask Google, but don't expect the answer :-).

Now, the start would be much simpler than you suggested. You can find the dominant subset of Unicode code points used in the input. Most often it is characterized by the high-order bits of the code point's integer value, which identify the Unicode block (for characters in the Basic Multilingual Plane, the high byte of the 16-bit value). For example, Cyrillic will cover several Slavic languages and some Asian languages, the Perso-Arabic script will cover Persian, Arabic and a few Indian languages, and so on. This alone will greatly narrow down the search. After that comes the dictionary lookup, with all its complex techniques.
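
A minimal sketch of that first step, assuming the high byte of each code point is enough to tell the scripts apart. The table `HIGH_BYTE_TO_SCRIPT` is a tiny hand-picked subset (U+00xx Latin, U+04xx Cyrillic, U+06xx Arabic), and both it and the function name `dominant_script` are made up for this illustration. A real implementation would use complete Unicode script data.

```python
from collections import Counter

# High byte of the code point -> script name (tiny illustrative subset).
HIGH_BYTE_TO_SCRIPT = {
    0x00: "Latin",     # Basic Latin and Latin-1 Supplement
    0x04: "Cyrillic",  # U+0400..U+04FF
    0x06: "Arabic",    # U+0600..U+06FF
}


def dominant_script(text):
    """Return the most frequent script among the letters of `text`, or None."""
    counts = Counter(
        HIGH_BYTE_TO_SCRIPT.get(ord(ch) >> 8, "Other")
        for ch in text
        if ch.isalpha()  # ignore digits, spaces and punctuation
    )
    return counts.most_common(1)[0][0] if counts else None
```

Note that this only identifies the script. As the answer says, Cyrillic or Arabic script still leaves several candidate languages, so dictionaries are needed afterwards.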

SA

Posted 17-Jul-11 20:09pm

Sergey Alexandrovich Kryukov

Comments

Abhinav S 18-Jul-11 2:10am

Yes, an interesting approach. My 5.

Sergey Alexandrovich Kryukov 18-Jul-11 2:13am

Thank you, Abhinav. Just a first simple step. Everything else is way more complex. OP gave one idea about spell-checking.
--SA

Hesham_h4 18-Jul-11 12:52pm

I think I will ask them, and if they answer me (5% probability) I'll mark that as the solution :).
The funny thing is that I asked this question on the Microsoft forums, forgetting that Bing would already be using it if they knew :)
About your approach: I think detecting the input language isn't a big deal in a .NET application, so the spell check would be appropriate there. "Transkeying", as I called it, is still the big problem here. We need to know which keys were pressed; let's assume we do, because we are working on user input rather than on copy-pasted text. Then we need to simulate each of these scripts to get a probable text (remember that some characters can be entered in multiple ways), and only then does your method come in, to detect which script is used and so narrow down the candidate languages.
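
The remark that some characters can be entered in multiple ways can be illustrated with a small sketch. It assumes that on the Arabic (101) layout the "b" key emits lam followed by alef as two code points, while "g" and "h" emit lam and alef separately, so the same text can come from different key sequences. The table `KEY_OUTPUTS` is a hand-picked fragment and `key_sequences` is a hypothetical helper, not part of any library.

```python
# Fragment of a key -> output table (assumed from the Arabic (101) layout).
KEY_OUTPUTS = {
    "g": "\u0644",        # lam
    "h": "\u0627",        # alef
    "b": "\u0644\u0627",  # lam + alef from a single key press
    "a": "\u0634",        # sheen
    "l": "\u0645",        # meem
    "e": "\u062b",        # theh
}


def key_sequences(text, key_outputs):
    """Enumerate every key sequence that could have produced `text`."""
    if not text:
        return [""]
    results = []
    for key, output in key_outputs.items():
        if text.startswith(output):
            # Consume this key's output and recurse on the remainder.
            rest = key_sequences(text[len(output):], key_outputs)
            results.extend(key + tail for tail in rest)
    return results


# The text produced by typing "able" on the Arabic layout.
typed = "\u0634\u0644\u0627\u0645\u062b"
candidates = sorted(key_sequences(typed, KEY_OUTPUTS))
```

Here `candidates` contains both "able" and "aghle", and a dictionary check would then prefer "able". If the actual key presses are known, as assumed in the comment above, this ambiguity does not arise.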