This article is about going beyond localization and delivering software to a truly global audience. Most developers think this can be accomplished simply by translating and localizing the text, but this is not true. This article does not contain source code and is not about the technical aspect of localizing your software. Rather it is about the process and how to prepare your team and software for the necessary changes. Globalizing your software includes adapting for language, dialect, customs, cultural issues, monetary issues, times, dates, formatting, and measurement standards.
Even the word localization in English suffers from localization issues. In American English, it is spelled with a Z, localiZation, while in British English, it is spelled with an S, LocaliSation. Localization only has two variants, but other aspects of English vary between the many forms of English which include Canadian, American, British, Australian, New Zealand, Caribbean, and more. Granted, the differences are close enough that speakers of any form of English can understand the other variants except in corner situations, however there are issues beyond spelling which can cause mistakes and misunderstandings. Additionally, if you plan to target specific countries in the English speaking world, localizing to that edition of English and customs can give you a competitive edge over your competitors. In some cases, regulatory issues can make certain changes mandatory. Imagine making software in English that only supports the $ sign for money. This will work just fine for the USA, Canada, Australia, and even New Zealand. But your software will be problematic in the United Kingdom and Ireland who use the Pound and Euro, respectively. And then there is the metric system which all English speaking countries except the US and a few Caribbean countries have standardized on.
It amazes me how often developers tell me that automated translations can be used to localize their software, web pages, and documentation. Each time, they confidently tell me the technology is ready. As of this writing, it is not. Automated translations are good for when the end user has no other choice and wants to translate something themselves, or can be useful in providing a base for manual translation.
Furthermore, not all languages are equal. It is much easier to translate between the romance languages of Western Europe, for example Spanish to Italian, than say from Italian to Russian. But even in the simpler cases, the automated translators are just not up to par yet. I noticed that Google Translate, for example, can translate English to Russian, but the Microsoft offering cannot. When I inquired why, I was told that the developers at Microsoft said Russian was simply too complex to get it right, and they were not happy with the results they were able to produce. Google does in fact translate it and it is useful in a pinch, but the translations are down right odd and a constant source of amusement. I often use Google to translate Chinese into English with even funnier results.
In fact, the translation engines cannot even translate English to English.
If the automated translators are anywhere near as good as I've been told, they should be able to go from English to Russian, and back with at least reasonable results. Let's give it a try. Here is some text from a quote from one of my websites regarding one of my conference sessions.
Let's translate it to Russian.
And now let's take the text it gave us and come back to English.
Google Translate yields us:
Parts of it are understandable, but obviously strange. Other parts such as the last sentence are incomprehensible, even to me the original author.
Automated translators have serious problems with words that have several meanings. In Russian, the word for both "dear" and "expensive" are the same. So translating "My Dear Love," from Russian to English very often is translated to "Me Expensive Love," which of course can be rather insulting rather than endearing. Although in many cases it may add an unspeakable truth.
Another example follows. The English is correct, but the French quite possibly was translated from the English, which was probably translated from what I'm guessing is Chinese. In English, "oil" can refer to cooking and flavoring oils, but also motor oil which we put in our cars. The French translation should use "L'huile", but instead uses "Le Petrole", which is the petroleum form of oil. So the French labeling translates roughly to "Pure Sesame Motor Oil". Yum!
It is also critical to check the output of any translations. Look at the example below. Someone obviously put the text into a translator and got back English. But since they did not know English, they had no idea what it said and assumed it was correct.
I once had a similar problem with Arabic. I was working on a project where the code page of the client was different than the server, and it was returning gibberish. But since I could not read Arabic, it was several weeks before someone pointed out that the program was displaying gibberish. I could recognize Arabic, but had no way myself to validate it, and simply assumed it was correct. Instead of "It's all Greek to me", it became "It's all Arabic to me."
It is also important to check that all words have been translated. Automated translations can leave behind words that it cannot handle.
Use a Native
I have also seen many translations that were performed by a native speaker of the source language, but not the target language. While this can be successful in some cases, it usually has issues unless the person has been speaking the target language for many years and has lived in countries speaking that language. Generally, it is preferable to use someone who is a native speaker of the target language. Such a person is more likely to ask questions if need be about the source language than a person would be about the target language.
Spell checking is very important, and automated tools are very good as are grammar checking tools. However, they cannot and will not catch everything so do not over rely on them.
Text in Images
Automated tools can be used to extract text from applications and web pages for translation. However, such tools miss text embedded in images. Here are some examples from the Microsoft Arabia web site where text in images has been missed.
Free 60 day trial is certainly something that should have been translated.
Despite the attention paid to detail of finding all texts previously, there are some cases where text should not be translated. Brands are a common such exception. Some brands are so well known that it is better to just leave them alone.
Technical terms are another such exception. Technical terms often require new words and are created in English. Other languages then simply import the Anglicized version as a new word to the language.
In this Russian example, we see several samples and variations.
- IP-Сети - This is a mermaid word, half English, half Russian. Or as I often call it, Runglish. IP is English, but Сети (Network) is Russian. So the end result is IP Network.
- веб-интерфейсов - Web Interface. веб itself is not a true Russian word, but instead is the phonetic equivalent of the English word web. It has now been accepted directly into Russian, but is only used for computer terminology. E-mail is a word that is similarly imported into many languages phonetically.
- Plug-In - While English proficiency in the Russian speaking world is rather low, familiarity with the letters and sounds of English letters is wide spread. Some words are often left in their original English spelling. In cases like Plug-In, it is common to find it both ways. In English lettering, or in Russian.
This example also includes many items (Powered by, Worldwide, etc.) that were missed and should have been translated.
Brands can be translated as well though, and usually are done phonetically. It really is a decision that the owner of the brand needs to make.
Not all languages support the same sounds so localizing brand names can pose some challenges. For example, Russian has letters and thus sounds that have no equivalent in English, and also vice versa. Russian has no W sound, and Greek has no P sound. These sounds can be approximated by combining letters in Russian and Greek, but they are not the same. Russian substitutes a V sound when a W is necessary. So "web" becomes "veb".
If your brand is small or not well known, it is probably better to translate it phonetically. However, just because it is big and even possibly universally known is not necessarily a reason to not translate it. Subway (food chain) for example typically does use the localized languages in their brand. Most big brands do not do this, including Microsoft, Oracle, IBM, etc. But there is at least one notable exception who does translate their brand name.
One would think that selecting a language would be very easy. But unfortunately, it is not, and often it can be very hard to switch to your own language. I even once encountered a website that had translated all the language names into the local languages. Language selectors absolutely should not be translated. Imagine ending up on an Arabic web page and needing to find الإنجليزية to change to English! الإنجليزية is in fact the Arabic word for English, but of course English speakers will not know that.
Most users know the name of their language in English, but not always. Here is an older version of the Apple website.
Here is the language selector from one of my websites. Despite this being on an English page, each language is written in its own language for quick recognition by readers wishing to change languages.
Dual language listings are also acceptable.
Some say that the use of flags is better than words. The idea is that flags are easier to find than a bunch of words; however, generally the use of flags is bad idea.
In Switzerland, there are four official languages, and Canada has two. So what language does a Swiss or Canadian flag represent? And why should a British person click on an American flag for English, or a Spaniard click on a Mexican flag? While not major offenses, these create uncomfortable cultural feelings in many cases.
Flags should be reserved for local content, not language selection. For example, if you have a website that discerns between The UK and The US for marketing and contact purposes, then flags are in fact perfect. Apple's updated website uses flags in an acceptable manner. Apple targets specific countries and has multiple flags for languages.
Notice that the flags of Canada, Switzerland, and Puerto Rico appear more than once. This allows not only local content selection, but language selection. Also note that the languages are in fact written in themselves, i.e., Spanish is not listed as Spanish, but is in fact Español.
Another option is to display the ISO codes of languages, such as [en] [ru]. However, I do not recommend these because if your audience includes non-technical types, these are not as well understood.
Many developers have only dealt with European languages which go from left to right. Sure Russian and Greek might look funny, but we know they are just different looking letters and otherwise the rules are the same. But toss Arabic and East Asian languages in, and there is a lot to consider. Most developers simply have no idea how to handle Asian languages at all, and assume that Arabic is the same as Russian (i.e., different characters) but that it goes right to left instead of left to right. If it were really that simple, things would be pretty easy. But it is not quite that simple.
Everyone is familiar with left to right (LTR) which European and other languages use.
Right to left (RTL) is used by Arabic, Farsi, Urdu, and Hebrew. RTL is essentially what it says and the opposite of LTR.
Many East Asian languages such as Chinese use top to bottom (TTB). However, most assume that after TTB it goes LTR, as shown here.
This is not correct, and TTB languages actually are firstly TTB, secondly RTL.
So now you understand all there is to know about directions of written languages, right? Not quite. It actually gets a lot more complex when languages or characters from languages are mixed. Because many words from English such as brands do not get translated, mixing languages is very common especially in the technical world.
Take a look at this example. It is mostly Arabic, but the word kudzu has been left in English. Notice the placement and direction of punctuation marks such as the ? in the header.
If this text is only displayed, there is no problem. However if the text is editable, or a caret is possible for selecting and movement, something unexpected happens. The cursor will move left through the Arabic text, and to move left, the right arrow key is actually used. When the caret hits the word kudzu, it will jump over the word to the k, and then start to move to the right until it hits the u. After the u, the caret will jump again over the word, but instead of to the k, it will move to the Arabic letter preceding the k.
If text is mixed, it can usually be automatically detected that an LTR section is embedded in an RTL language. However, it also affects punctuation and alignment. Consider this LTR text that exists on an Arabic page.
Note that it is right aligned, and that punctuation appears on the left rather than the right.
It also plays havoc with the quote character. When the quote character is surrounded by LTR text, it appears correctly, but when the quote is at the beginning or ending of a line, it follows RTL rules instead.
And the word ".NET" becomes "NET."
To avoid such problems, each section of LTR text must be specifically marked so that it can be handled the way that we expect. For WPF and HTML, spans and other surrounding tags can be used. For technologies such as WinForms, it becomes a lot more complicated.
Latin sets are the most familiar because of the number of speakers of European languages as well as the dominance of English. English consists of 26 letters, but languages such as German have a few letters that English does not. But these extra letters are only a few and generally easy to comprehend for English speakers.
Latin sets are quite widespread as shown in this map.
Let's take a look at some common screens in Windows for comparison of two Latin based languages. The languages are English and Polish.
Did you notice something different? It is something that developers very frequently forget when localizing between Latin sets. Words in other languages may be shorter or longer and so extra space or flow adjustment must be handled. Just look at "Help and Support" for example in Polish. It is quite a bit longer than the English equivalent. Windows seems to handle this in many places by just adding whitespace for non-English languages, as can be seen by the large amount of whitespace in the Polish screens.
Cyrillic is the character set used by Russian, and Ukrainian. Generally, the mechanics are the same as the Latin sets, but the characters look different. Some characters exist in both Cyrillic and Latin though. For example, both English and Russian have a letter P, but they have different sounds. In Russian, the character P is equivalent to an English R in sound.
Before Unicode, there were several code pages for each of the Cyrillic languages, and there were often issues from this. Generally, today we live in a Unicode age and handling Cyrillic languages has become a lot easier.
Cyrillic, while not as widespread as Latin, is far more widespread than one would expect, mostly because of the physical size of Russia and the influence of the former Soviet Union.
The Russian alphabet looks like this and has 33 letters. The first and third rows are printed letters, and the second and fourth rows are cursive.
Note that there is a letter that looks like two English letters, bI. It is actually one letter in Russian.
Let's take a look at the same screens we saw earlier in English and Polish, but this time in Russian.
Greek is separate from Cyrillic although it shares a large number of letters with Cyrillic. This is because Cyrillic was based on Greek.
Both Greek and Cyrillic have characters that the other does not. Many shared letters have different sounds. Just as the character P differs in Russian and English, shared characters can differ in sound between Greek and Russian. In Russian, H sounds like an English N, but in Greek, it is something akin to an I or E.
The Greek alphabet looks like this and has 24 letters:
Because of the extreme similarities of Greek and Cyrillic, I have not captured the screens in Greek for comparison. No offense is intended. I'm proficient in the Greek alphabet and can read signs and menus because I spent many years in Cyprus. But I'm quasi-fluent in Russian, and I already had Russian installed on my machine.
So far things have been pretty easy. Now we enter the territory that most developers fear. East Asian is important though and growing more important. In terms of Gross Domestic Product, do you know who the top three countries are? You might know the first, but I bet you don't know the next two. For 2010, according to the IMF, in order, they are the United States, China, and Japan. Two of the three use East Asian languages, and are huge markets. Japan and China are pretty close, with some sources listing them swapped, but all major sources agree that these are the top three.
We already discussed the first major difference with East Asian. East Asian is typically top to bottom, followed by left to right.
Well, at least traditionally. In 1955, China officially changed to left to right, and Taiwan followed in 2004. These were the official dates however, and in Taiwan, left to right was common place before 2004.
East Asian characters are far more complex than other character sets. Instead of two to three dozen characters, there are many thousands. To see this detail, the characters must be larger. But since whole words are written using one or at most a group of a few characters, the width required to display East Asian is much less. However, more height must be reserved. Think of it this way.
Latin / Cyrillic / Greek: xxxx xxx xxxxxx xx xxxxxx xxxx
East Asian: XX XX XX XX
Where XX is a single East Asian character.
Let's take a look at Windows screens again with East Asian languages.
Japanese becomes even more complex. We all know that Japanese has an East Asian set like Chinese, and in fact the Japanese writing system is based on Chinese although the languages are independent. The same happened to Russian with respect to Greek. Both Japanese and Russian were originally only spoken languages, and their writing system were borrowed from other unrelated languages.
What will come as a surprise to most though is that Japanese actually has four written systems, and they are all commonly used. Furthermore, one of them is Latin based.
Kanji are the characters that most people think of when they think of Japanese. They were adapted from Chinese, and to someone who can understand neither, they probably appear indistinguishable. In fact, Kanji in Japanese literally means "Chinese Characters". The Japanese Kanji character for Kanji itself is the same in Japanese Kanji and in Chinese.
Hiragana and Katakana are very similar to each other. Romaji is based on Latin characters.
Typical Japanese is a mixture of Kanji, Hiragana, and Katakana. Romaji is the least used, but still important. Here is an actual headline from a Japanese newspaper:
In a single headline, Kanji, Hiragana, Katakana, and Romaji are all used.
Sorting is pretty straightforward when there are only a few dozen letters. But what to do when there are thousands of characters? Or when there are mixed character sets? Or mixed languages?
Some languages have context based sorting. This means that the sorting method depends on what the items are, or how they are being presented.
There are rules for such cases, but they are pretty complex. So let's leave it at this. Do not write your own sorting routines. When possible, use Unicode collation which is available in .NET, Java, or through Operating System APIs. For databases, rely on the sorting that the database provides.
The most common right to left languages are Farsi / Persian, Arabic, Urdu, and Hebrew. Here are some samples of each:
It is generally assumed that the writing simply needs to go "backwards", and that is all that needs to be done to localize to LTR languages. In fact, this is the bare minimum to make it usable, but to do it right, there is a lot more to be done. Let's take a look again at the Windows screens, this time in an LTR language.
Notice anything else different? You should. The start menu is on the right, as are the icons on the desktop.
Notice that the panes are reversed, as are the positions of icons relative to the text. For real fun, look at the forward and backward buttons in the top right. Guess which is which.
Notice the position of the minimize, maximize, and close buttons. Also notice the position of the scrollbar, and think about the functioning of horizontal scrollbars.
Notice the alignments of buttons, tabs, and the drop down arrows of the combobox.
Now let's take a look at a few more examples of LTR:
Notice that the cascading menus open to the left.
Here is an interesting example. This is the header from the English version of one of my websites:
Now look at the Arabic version:
Notice that to keep the proper look, I also had to reverse the leaf image.
And one final example:
Other LTR Languages
There are many other LTR languages beyond European ones. Hindi to many looks similar to the RTL scripts, but it is in fact an LTR language. Hindi also has several different scripts like Japanese, and is in fact #5 in spoken languages worldwide.
Hindi and Urdu are separate languages but close enough that spoken conversation between an Urdu and Hindi speaker is often possible. But Hindi is an LTR language, and Urdu is RTL.
There are many different paper sizes.
In daily use though, only letter and legal are used in the US. However, Europe and most of the rest of the world use A4. A4 might appear to be the same as US letter, but A4 is actually a little different.
A4 is a tiny bit wider, but a fair bit longer. This can affect any printed output and should be properly handled as well as defaulted based on region.
Here are languages by group. I realize the key is too small to read in the article, but the groupings can still be seen.
Here are the top ten spoken languages as ranked by number of speakers:
Many may be surprised to see that English is number four and that even Arabic beats it. Of course the percentage of people who have computers is a big factor, but as computers become cheaper and markets emerge, these languages will become much more important.
So what is the big deal? Pluralization is easy. 1 boy, 2 boys. Add an -s right? Oh sure, there are a few exceptions, but we can handle them. In most Western European languages pluralization is pretty easy, but things can become very complex in other languages.
In English, pluralization generally consists of adding -s, -es, or -ies to most words.
But we also have to think of zero. It seems odd, but in English, it can be any of the following:
- There is no cow.
- There are no cows.
- There are 0 cows.
And then there are of course the exceptions, especially with regards to animals. Fortunately, unless you are dealing in the agricultural software business, you probably will be saved from most of these.
But it's still not that simple even with the exceptions.
But what is it when you have two? Two meese? Two mooses? It's still moose. Two moose.
So we end up with the end result of this:
Given that the exceptions are not the rule in English, you can generally get away with some basic rules:
- Check exception list. 1 person, 0 / 2+ people.
- If word ends in -y, change the ending. 1 directory, 0 / 2+ directories.
- If word ends in -x, add -es. 1 box, 0 / 2+ boxes.
- If no match, add an -s. 1 cow, 0 / 2+ cows.
There are a few more rules, but you should get the idea. Many European languages can be similarly handled.
But there is one more item that is important to note. English only has two variants. There is a word for quantity of 1, and a word for more than one or zero.
Arabic distinguishes into three categories. So saying 3 cows is different than saying 2 cows.
In fact, such a distinction is common in many languages with quantity 2 being treated as you might think of "a pair" in English. So when building rules and look up tables, it is better to break these variants into categories:
Why? Why not just group them like Arabic into 1, 2, and 3+? For English, you can treat few and many the same, and for Arabic, you could simply say One = 1, Few = 2, and Many = 3+. But not all languages treat it that way. Some languages, for example, think of it this way:
- One = 1
- Few = 2 to 5
- Many = 5+
But actually, even what I've described so far is not quite accurate, and I've done it only to make it easier to bring you along step by step. It actually gets more complex and there are more than three variants.
Notice that many is 11 to 99, but that 103 to 110 becomes a few again. This is because the "primary" count is based on the ones digit.
In English, only nouns and sometimes verbs need to be pluralized. But that is not the case in other languages. Changing the quantity can affect adjectives, verbs, and even adverbs.
In English, it affects verbs in the present tense sometimes. Consider the following:
- Copying 1 file
- Copied 1 file
- 1 cow walks
Now consider it with an adjective:
In English, the adjective big remains unchanged. However, in many European romance and Slavic languages, the word adjectives must also change depending on the quantity. In some Slavic languages, it also sometimes affects adverbs.
In those cases, big, cow, and quickly would all need to change based on quantity.
Formatting of Plurals
Never concatenate, always format.
- "I copied " + x + " files in " + y + " folders."
The problem is that in different languages, the y may need to come before the x. The equivalent of "In 4 folders, I copied 8 files." In English, this second form sounds a bit odd, but it is acceptable. But in some languages, only one form may be available. Also, remember to account for variable length of text, as discussed previously.
In most cases, it is not worth the cost to localize to specific dialects of a language. However, doing so can provide you a competitive edge, and if you are specifically targeting a country, it can be important. Furthermore, if you include features like spell checking into your software, then you better support dialects.
Different dialects can make a user feel uncomfortable or foreign. Meanings can also be misconstrued and sometimes confusion can occur.
Let's start off with a simple example of American versus British English. Absolutely a Brit and an American can easily carry on a conversation, read each others' books, and watch each others' movies. After all, both dialects are English.
One of the most pronounced differences aside from pronunciation of words are the spellings. Here are a few examples.
Reading another dialect is very possible, but often feels like a "verbal itch" that you want to scratch, and can be distracting. Because of the predominance of American English, others though have become quite accustomed to reading American English. Because I've lived in several English speaking countries, I've learned to ignore it as well. But when writing, I produce quite a mixture and find that I absolutely must use a spell checker to keep myself consistent.
Often though, words have different meanings, sometimes very contrary.
In other cases, single items have different words.
HSBC bank has produced some quite funny commercials around the theme "HSBC: The World's Local Bank." For example, the word "tart" is very different between American and British English and was used in a commercial where a son brings home a scantily clad woman to meet his parents and the mother asks "Tart?" while handing her a plate of small jellied cookies. In modern American English, "tart" is not a common word except for in pop tarts, or to mean something akin to sour, but it can mean a type of cookie. But in British English, it's akin to a whore. (HSBC Commercial 1, HSBC Commercial 2)
English does not stop at just American and British. There are many other dialects including Canadian, Australian, New Zealand, Caribbean, Indian, Philippines, African, and South African. Even within the US, there are minor dialects that go beyond pronunciation. Although with cable TV and the internet they are diminishing, they still exist a little. A milkshake in New England can be different than in other parts. And a soda can be a pop or a Coke. In the American South, "a Coke" is often used to refer to any carbonated sweet beverage. And "tea" certainly has a different meaning in the American South.
Dialects of course are not limited to English. German, French, and Spanish all have dialects as well. Chinese has Mandarin and Cantonese, which some consider to be nearly a different language. Canadian French is quite different than the French spoken in France, and Spanish spoken in South America is different from that spoken in Spain (Castilian).
Let's look at a variety of ways of displaying November 24, 2008.
Or Canada, which has a real problem within its own borders.
But things are not always as simple as 1-2-3 in Canada or when dealing internationally.
Here are a few tips for dates:
- Always display four digit years, but allow entry in 2 digits.
- Display month names, or abbreviated month names rather than a number.
- Use the user's regional settings for input, and display.
Let us look at times. There are a variety of ways to display time.
And it gets even more confusing when midnight or noon is an option.
Now tell me. Am I returning the car at midnight, or noon? And if midnight, at the morning of August 15th, or the night of August 15th?
If you remember, Canada has an issue with conflicting date formats. Combine this with the midnight problem and the Canadians have invented time travel.
Start of Week
Many countries treat a different day as the start of the week. Some say Sunday is the start, others say it is Saturday or Monday. This is a simple user preference though in most software.
Did you know that Friday and Saturday are not the weekend everywhere?
- Most of the world: Sat + Sun
- Middle East: Fri + Sat
- Saudi Arabia: Thu + Fri
The calendar that most of us are familiar with is the Gregorian calendar. Generally supporting this is enough. Saudi Arabia, however, uses the Islamic calendar. The Saudi government and many businesses do use the Islamic calendar.
There are other calendars such as the Chinese and the Hebrew calendars, but they are only used for religious purposes and not for business or day to day usage.
"Hello" or “Hello”
Russian, French, Greek, Turkish:
And even primary and secondary quotes are not always consistent.
"Primary" and 'Secondary'
'Primary' and "Secondary"
Here is a larger list of language specific quotes, primary and secondary.
The « » characters are called angle quotes, or guillemets. But many people, including Adobe software and the Microsoft spell checker, call them guillemots. There is such a thing as a guillemot, but it looks like this:
Quickly look at this next picture and remember the first thing that comes to mind.
Symbols are often more powerful than words, and so symbols can be very powerful. But if the wrong or confusing symbols are chosen, the results can be troublesome.
Many interfaces end up looking like this to end users:
For most markets, the following picture is appropriate. However, if you are targeting the Middle East, you should consider that such a plunging neckline might not be perceived so neutrally as in other countries. And if you are specifically targeting the more conservative Muslim countries such as Saudi Arabia, you should consider not using this picture at all.
Localizing pictures to your audience can be a good thing. However it can also backfire. Consider the case of this Microsoft ad. This is how it appeared on the US site.
A good multi-racial mix. But in Poland and much of the former Eastern Bloc, unfortunately, there is still a lot of racism towards Black people. So the ad was Photoshop-ped.
Not very well though. Look at his hand. The change however was noticed, and in the US it created quite an outrage and embarrassment. Using a completely different picture probably would have been accepted, but the modified version became an insult.
This was most likely done by an external firm and not intentionally by Microsoft. Don't believe me? See the laptop in front of the poly-chromatic person? It's a MacBook.
Let's revisit the image shown earlier of the woman with the plunging neckline. This version would not be considered appropriate in Saudi Arabia, and also would create an International outcry:
Too many images are US centric. It's not that US images should never be used, but how would Americans feel if every time they saw a flag it was Chinese? And every passport was Chinese? And every photograph, address, and map was from China? If you are targeting a global audience, use a variety of symbols and examples rather than from just one country. This extends to sample data too.
Product and Brand Names
When George Eastman founded Kodak, he simply made the word up, but with intent. The letter K was a favorite letter of Eastman's, so he wanted K to play prominently in the word. About the letter K he said, "[it is a] strong, incisive sort of letter." He also wanted a short name that was easy to pronounce in all languages, and something that was unique and would not be confused with anything else. He and his mother came up with Kodak. In fact, Kodak was the name of their first camera, and it was so popular that they changed the company name to Kodak.
Not all brand and product names are that well thought out. Certainly the name of this bicycle should have been better thought out.
And there have been several car naming snafus. This is the Chevy Nova.
The Nova did not sell well in Spanish speaking countries. "Nova" in Spanish roughly means "no go".
More recently is the example of the popular Mitsubishi Pajero. In Spanish speaking countries, it had to be renamed the Mitsubishi Montero. It is the very same vehicle, but it has a different name. Why? Pajero means something in Spanish. In Spanish, it's slang and has varied meanings. At best, it is a tired or lazy man. At worst, it is a man who does certain activities alone. Montero instead means "Mountain Warrior".
Domain names often consist of more than one word, and domain names have no spaces. Companies also like to avoid dashes as they cause confusion. Also keep in mind that users don't usually use capitals, so your intended separate words can often be "lost". However, new unintentional words can be formed. Be sure to check your domain names under consideration.
Let me give you three real life examples of domain names. Two of these have now been discontinued and the owners have selected new domain names after finding out the hard way.
||www.GenItalia.com (Generator company in Italy, Italia is Italian for Italy)
There are two common numbering systems: European and Arabic-Indic.
Arabic-Indic numbers are not as frequently used, but are used in the Middle East. In the Middle East, you will see them on license plate numbers, signs, menus, and price tags. Arabic-Indic numbers function the same as European numbers but they have different symbols. They are both base 10.
There are also Eastern Arabic-Indic numbers and Tamil numbers. Eastern Arabic-Indic are almost identical to Hindi.
The good news is that Windows can handle these automatically for you according to the user's system settings. However, it is important to be aware of them, and could come into play in localizing images.
There are uncommon numbering systems used in Chinese and Japanese, but since these are uncommon, they are not expected to be supported in software.
Finally, there are novelty systems such as the Roman numeral system. Except for stylish effects, no support for Roman numerals are needed. Roman numbers have been popularly used for years in copyrights and movie credits.
12,000 vs. 12.000 - is that 12000 or 12? In the US, the comma is used for thousand separators and period is for the decimal part. However, for Europe and South America, the period is used for the thousands mark and the comma is used for the decimal place.
But at least the grouping is the same. The groups are separated by thousands, or every three digits. In China, the grouping is by ten thousands, or every four digits.
India is yet more complex.
Remember earlier when we talked about dialects and how words can have serious consequences? Billion is a word that can have a very serious effect. Billion has different meanings depending on whether the country is a short scale or a long scale country.
Long scale countries are listed below:
Now since the US is a short scale country, and Germany is long scale, Billion means two completely different things. Canada is short scale in English, and long scale in French.
The US has always been short scale, but many English speaking countries formerly were long scale. The UK changed officially to short scale in 1974.
Weights and Measures
Most of us are aware that the US still uses pounds and miles for measurements but that most of the rest of the world uses the metric system. The problem is that Americans typically do not think in metric. If you tell an American something weighs 10 kilograms, you might as well tell them it weighs 4 megahertz. The same is true of trying to speak to metric users using American weights.
Because of this, when dealing with weights and measures, it is essential to support both systems. It doesn't quite stop there though. Although the UK is officially metric, people still think and informally use pounds and other non-metric measurements. But sometimes they are not even the same. A British (Imperial) gallon is not the same as an American gallon. But both countries in their own countries just call them "gallons".
China also commonly uses a Chinese system for weights.
Handling currencies is more than just allowing different currency symbols. It's also important to understand that $ and other currency symbols can be used for multiple currencies, and that currency symbols are not always one character. Both Canada and the US use $ for their currencies. If your application handles both Canadian and US dollars, how will the user know the difference? QuickBooks is especially bad about this.
Each currency also has a three letter ISO code which can be used. For example, Canadian dollars is CAD, and US Dollars is USD. $ can also be used to denote AUD, SGD, XCD, and many other forms of dollars.
The placement of the currency symbol can vary as well. Some countries use it as a suffix rather than a prefix. The Euro symbol varies by country.
Some countries even put the currency symbol in the middle in place of the decimal. The Portuguese Escudo and French Franc both did this. Fortunately for us, both are now deprecated currencies.
50$00 - Portuguese Escudo
12₣34 - French Francs
There is more to currency than just currency symbols. There is also formatting. In addition to the comma versus period issue covered in numbers previously, there is also precision to consider. Most currencies extend two decimal points of precision. For example, $5.96. However, some currencies extend three or four digits. The Jordanian Dinar uses three decimal places so prices often appear as 5.965, and some currencies use four.
For storage and all math, four decimal places should be used. Float types should be avoided for currency calculations. Most languages support fixed or even currency types. In .NET, the decimal type should be used.
Be sure that the type you choose as well as input fields are large enough to accommodate larger numbers. Those of us who deal mostly with Dollars and Euros think that a million dollars is a lot. But it's not in all currencies.
This is a single bill in Zimbabwe. Many other countries have had periods of great inflation, not only Zimbabwe. Romania in the late 1990's issued million lei bills. Even countries such as Italy within the last decades had massive inflationary periods.
In fact, look what it took to buy a few eggs:
Do not hard code things such as "C:\Program Files", because it might actually be "C:\Archivos de programa". Not to mention the differences on 64 bit platforms. The actual localization of such folder names was remedied in Vista, but still affects XP. Vista stores it physically in English, but allows an alternate localized display name.
Everybody and Guest are also a common problem in installations. In Norwegian, for example, they are Alle brukere and Gjest, instead.
In older versions of Office, VBA commands were actually localized. This caused scripts written in English to fail on non-English installations of Office such as French. I personally had to deal with this while working for a large company in the late 90's.
Store as much as possible in external files. XML is a good format. By doing this, you allow users to contribute and even fix errors in their languages.
In the .NET world, WPF is far superior to WinForms for localization issues.
I've spent an enormous amount of time researching this subject, and both Google and Wikipedia have been a huge resource for me. A few of the charts are from Wikipedia. Most of the screenshots were created by myself using VMWare. Other images and pictures were sent to me by conference attendees wishing to enhance the presentation. A few images are of completely unknown sources as I have been presenting on this topic for many years.
Corrections and Additions
I have lived in a dozen countries and traveled to over 60. I'm a native English speaker and know Canadian, American, and British English all quite well having lived in countries that speak each. I speak and read quasi-fluent Russian, and basic Greek. I watch almost as much Russian television as I do English. I also know tidbits of Arabic, and Turkish. I can read reasonably well basic Bulgarian, French, Spanish, Italian, German, Dutch, and Romanian. I can also understand Ukrainian quite well because it is so close to Russian and we used to receive and watch Ukrainian television as well.
I have spent a lot of time researching these topics and have presented on this subject dozens of times at world wide conferences. However, I certainly welcome corrections and or additional information on languages that I am not a native speaker of.