How can we check the language of text entered in a TextBox in ASP.NET C#?
Comments
Sergey Alexandrovich Kryukov 2-Oct-12 22:48pm    
Doesn't your common sense tell you the answer? It's pretty obvious... :-)
--SA

1 solution

This is theoretically impossible: not every text can be classified as belonging to a particular language (for example, text made up of just digits or punctuation), and too many texts are written in more than one language.

However, let's see if you can get some limited results.

If we set aside very sophisticated linguistic methods and methods that simply rely on huge databases of linguistic information (developing such approaches belongs to computational linguistics and can take a lifetime), you can only determine the language in a limited number of situations and for a limited number of languages. For example, if you know in advance that only 15 previously known languages could be used, you can draw one of 17 conclusions: one of the 15 languages, "no specific language", or "failed to determine".

Having made these assumptions, which languages are not suitable for such simple analysis? I assume we are not using dictionaries (which are unreliable anyway; please see below) and are trying to determine the language "of the majority of words" by collecting some statistics. All we can determine is the Unicode subset each character belongs to. All Unicode code points are classified into subsets, each for a certain application (such as punctuation), and most of these subsets represent a "writing system".
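As a rough illustration of "collecting some statistics" per writing system, here is a minimal sketch. It is written in Python rather than C# only for brevity. The function names are made up for this example, and taking the first word of the Unicode character name as a script label is a heuristic assumed here, not an official Unicode script lookup.

```python
import unicodedata
from collections import Counter

def script_of(ch):
    """Return a rough script label for a letter, or None for non-letters."""
    if not unicodedata.category(ch).startswith("L"):
        return None  # digits, punctuation and spaces say nothing about the script
    # Heuristic (an assumption of this sketch): the first word of the
    # Unicode character name, e.g. "LATIN", "CYRILLIC", "GEORGIAN".
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else None

def script_statistics(text):
    """Count the letters of the text per script label."""
    return Counter(s for s in map(script_of, text) if s is not None)

def dominant_script(text):
    """Return the most frequent script label, or None if there are no letters."""
    stats = script_statistics(text)
    if not stats:
        return None
    return stats.most_common(1)[0][0]
```

A text of only digits and punctuation yields None, which corresponds to the "no specific language" outcome. Note that the result is a writing system, not a language.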

So, let's see. If a writing system is used by just one language, you can determine the language of a word from its Unicode subset. How good is this method? Right now, I can remember only two languages that could be clearly determined by the writing system: Georgian (which still has dialects that some consider different languages) and Armenian. I checked the Tamil writing system (Tamil script), but it turned out to be used by some languages other than Tamil. I'm sure people from different countries will give more examples.
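For the two scripts named above, the idea could be sketched like this (Python, for illustration only). The table and function names are hypothetical. The ranges are the main Armenian and Georgian Unicode blocks; the supplementary Georgian blocks are left out to keep the sketch short.

```python
# Main Unicode blocks of the two scripts treated as "one script, one language".
UNIQUE_SCRIPT_RANGES = {
    "Armenian": (0x0530, 0x058F),
    "Georgian": (0x10A0, 0x10FF),
}

def language_from_unique_script(text):
    """Return "Armenian" or "Georgian" if every letter belongs to that one
    script; return None when the language cannot be determined this way."""
    found = set()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, whitespace
        for language, (lo, hi) in UNIQUE_SCRIPT_RANGES.items():
            if lo <= ord(ch) <= hi:
                found.add(language)
                break
        else:
            found.add(None)  # a letter from some other script
    return found.pop() if len(found) == 1 else None
```

Any text that mixes scripts, or that uses a script shared by several languages, falls through to None, which is the "failed to determine" outcome.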

The Chinese writing system is also used in Korea, Japan and elsewhere; and even within China, the languages using the common writing system are considered different. If you want to call the whole writing system a "language", it will be incorrect, but it will still give you some classification.

The simplest typical case is Cyrillic and Latin. Each of these writing systems is used by different languages, sometimes very different ones. For example, Cyrillic is used by modern Mongolian, a language of the proposed Altaic family (which, in its broadest form, groups Turkic, Mongolic, Tungusic, Japonic and Korean), and many Turkic languages use either Latin or Cyrillic, as well as the Arabo-Persian system. How to go about that?

I don't want to discuss the many more complex but fairly usual situations. For example, have you seen artificially "invented" words spelled using a mix of writing systems (usually two; most people don't know more) to create a humorous or catchy (advertising) effect?
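Such mixed-script words can at least be spotted. A minimal Python sketch, with hypothetical function names and again assuming that the first word of the Unicode character name is a good enough script label:

```python
import unicodedata

def scripts_in_word(word):
    """Collect rough script labels of all letters in a word."""
    scripts = set()
    for ch in word:
        if ch.isalpha():
            # Heuristic label (an assumption): first word of the character name.
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split(" ")[0])
    return scripts

def mixed_script_words(text):
    """Return the whitespace-separated words that use more than one script."""
    return [w for w in text.split() if len(scripts_in_word(w)) > 1]
```

For a word that starts with the Latin letters "Web" and continues in Cyrillic, scripts_in_word reports both labels, so no single language can be assigned to it from the script alone.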

And finally, let's imagine that you have access to dictionaries for all the world's languages, and that the dictionaries work infinitely fast. Will it work? Not so simple. I already mentioned that many texts are written in mixed languages. For example, take the speech of non-English-speaking software developers. Another example: many Ukrainians and people from southern Russian regions use dialects that mix many Ukrainian and Russian words. Besides, there are many, many words spelled in the same writing system, totally identical in spelling, but belonging to different languages. Moreover, they may have completely different meanings in these languages; there are numerous jokes around funny coincidences in different languages.
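A toy Python sketch of the dictionary problem. The word lists are tiny hand-made samples invented for this example (not real dictionaries), and the names are hypothetical.

```python
# Tiny hand-made word lists; real dictionaries would be vastly larger.
TOY_DICTIONARIES = {
    "English": {"gift", "pain", "chef", "the", "is"},
    "German": {"gift", "chef", "ist", "das"},
    "French": {"pain", "chef", "le", "est"},
}

def candidate_languages(word):
    """Return every language whose toy dictionary contains the word."""
    w = word.lower()
    return {lang for lang, words in TOY_DICTIONARIES.items() if w in words}
```

Here candidate_languages("gift") returns both English and German, where the word means "present" in one and "poison" in the other. So even a perfect and instant dictionary lookup gives a set of candidates rather than one answer.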

In conclusion, you should admit that the general problem is prohibitively difficult, and not really for technical reasons. The very notion of the "language of the text" is not really correct, even though it can be applied to a relatively limited set of cases.

—SA
 

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


