
IDNA resolver client

Gisle Vanem, 30 Mar 2004
Using non-ASCII in DNS name lookups.

Introduction

This article presents code for an IDNA (Internationalising Domain Names in Applications) client.

As the Internet's influence on society grows, it becomes more important to present brand names and the like correctly in the native language. Most people find it more straightforward to access a web page or send an email using the correct name instead of some ASCII'fied version of it. And with the depletion of domain names in the various top-level domains (especially .COM), support for names using non-ASCII characters is needed more than ever.

Most TLDs are now open for registering domain-names with international characters. But you can most likely not register a name in, e.g. the Danish domain .dk containing Hungarian accented characters. The .com, .org, .net and .biz domains probably accept ISO-Latin and all East-Asian character sets. Check your local registrar for details.

For several reasons, the IETF designers wanted to keep UTF-8 and any other encoded strings away from the Domain Name protocol (port 53, RFC-1035), mainly to stay backward compatible with the existing protocol and DNS servers. No new software on the server side should be required to support IDNA. So pure US-ASCII was needed to represent national characters in IDNs (Internationalised Domain Names). Hence they came up with a pretty slick scheme called Punycode. The details are in RFC-3490 and RFC-3492.

Side-note: Windows-2000/XP does send a UTF-8 encoded query for IDNs (ref. RFC-2044 and RFC-2181). And Windows/2003-Server seems to support Stuart Kwan's draft. These methods will most likely not work, since very few DNS servers support UTF-8 directly. But there is a remedy for IE/OE users. See references below.

Domain name conversion

In order to query a name server for an IDN, the name must be converted to ACE (ASCII Compatible Encoding). Here is an example. Suppose you want to resolve the host-name www.blåbærsyltetøy.no (www.blueberryjam.no for English speakers. And yes, the name really exists).

The algorithm goes like this:

  • Split the name into components of its labels; "www", "blåbærsyltetøy" and "no".
  • For all labels do:
    • If the label contains non-ASCII characters (i.e. 0x80 ... 0xFF), convert the label to ACE form using Punycode.

      The result of this conversion is "blbrsyltety-y8ao3x".

      Prepend the "xn--" prefix. The label thus becomes "xn--blbrsyltety-y8ao3x".

  • Splice all the converted labels together and send a query for the resulting name.

    I.e. pass "www.xn--blbrsyltety-y8ao3x.no" to gethostbyname().
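
The Punycode step above can be sketched in C as follows. This is a minimal illustration written after the pseudocode in RFC-3492, not the punycode.c shipped with this article. The function name label_to_ace() is made up for the example, the label is assumed to be already converted to Unicode code points, and the overflow checks from the RFC are left out for brevity.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Punycode parameters from RFC 3492, section 5. */
enum { BASE = 36, TMIN = 1, TMAX = 26, SKEW = 38, DAMP = 700,
       INITIAL_BIAS = 72, INITIAL_N = 128 };

/* Map a digit value 0..35 to 'a'..'z', '0'..'9' (lowercase output only). */
static char encode_digit(uint32_t d)
{
    return (char)(d < 26 ? 'a' + d : '0' + (d - 26));
}

/* Bias adaptation, RFC 3492 section 6.1. */
static uint32_t adapt(uint32_t delta, uint32_t numpoints, int first)
{
    uint32_t k = 0;
    delta = first ? delta / DAMP : delta / 2;
    delta += delta / numpoints;
    for (; delta > ((BASE - TMIN) * TMAX) / 2; k += BASE)
        delta /= BASE - TMIN;
    return k + (BASE - TMIN + 1) * delta / (delta + SKEW);
}

/* Hypothetical helper: encode ONE label (Unicode code points) as "xn--...".
 * Only call it for labels that contain non-ASCII code points.
 * Returns 0 on success, -1 if `out` is too small. */
static int label_to_ace(const uint32_t *in, size_t len, char *out, size_t outsize)
{
    uint32_t n = INITIAL_N, delta = 0, bias = INITIAL_BIAS;
    size_t o = 4, h, b, j;

    if (outsize < 5) return -1;
    memcpy(out, "xn--", 4);

    /* Copy the ASCII code points first, in their original order. */
    for (j = 0; j < len; j++)
        if (in[j] < 0x80) {
            if (o + 1 >= outsize) return -1;
            out[o++] = (char)in[j];
        }
    h = b = o - 4;
    if (b > 0) {
        if (o + 1 >= outsize) return -1;
        out[o++] = '-';
    }

    /* Insert the remaining code points in increasing numeric order. */
    while (h < len) {
        uint32_t m = UINT32_MAX;
        for (j = 0; j < len; j++)
            if (in[j] >= n && in[j] < m) m = in[j];
        delta += (m - n) * (uint32_t)(h + 1);
        n = m;
        for (j = 0; j < len; j++) {
            if (in[j] < n) delta++;
            if (in[j] == n) {
                uint32_t q = delta, k;
                /* Emit delta as a variable-length base-36 integer. */
                for (k = BASE; ; k += BASE) {
                    uint32_t t = k <= bias ? TMIN : k >= bias + TMAX ? TMAX : k - bias;
                    if (q < t) break;
                    if (o + 1 >= outsize) return -1;
                    out[o++] = encode_digit(t + (q - t) % (BASE - t));
                    q = (q - t) / (BASE - t);
                }
                if (o + 1 >= outsize) return -1;
                out[o++] = encode_digit(q);
                bias = adapt(delta, (uint32_t)(h + 1), h == b);
                delta = 0;
                h++;
            }
        }
        delta++;
        n++;
    }
    out[o] = '\0';
    return 0;
}
```

Called with the code points of "blåbærsyltetøy" (å = 0xE5, æ = 0xE6, ø = 0xF8) and a large enough buffer, the sketch yields the ACE label shown in the list above. Splitting the full name into labels and converting from the local codepage to Unicode are deliberately left out.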

As can be seen, the converted name is longer than the original. My code uses fixed size buffers and just makes a conservative guess at the maximum size of the result. 2 times MAX_HOST_LEN (2*256) should hopefully be enough for most names. I don't know what would happen with East-Asian characters converted to ACE, e.g. Big5, GBK or Shift-JIS. I cannot easily test this since my Windows box does not support these encodings.

Using the code

The public interface to the conversion functions is in idna.h. The punycode.* files are only for internal use; they come straight from RFC-3492 with some adaptations for Windows. getopt.* are only needed by the demo source files to parse the command-line. I chose not to make this an all C++ implementation since that would exclude C users. The code is callable from any type of Windows program (MFC-based, console-mode etc.). Later versions may include an option to use libiconv in addition to Windows' NLS functions.

The main functions:

IDNA_init()

BOOL IDNA_init (WORD code_page);

Initialize the IDNA converter using the requested code_page. This can be 0 to use the system's default ANSI codepage (CP_ACP). If you use the C++ wrapper class, there's currently no way to specify other codepages.

IDNA_convert_to_ACE()

BOOL IDNA_convert_to_ACE (char *name, size_t *size);

Tries to convert name to ACE form using the codepage specified in IDNA_init(). name points to the buffer to convert. On input, *size must hold the maximum size of that buffer; on output, it holds the size of the ACE-converted result.

Note: if name contains only US-ASCII (below and including 0x7F), no conversion is done and name will remain unchanged.
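
The "only US-ASCII" test in the note can be pictured with a small sketch. The two helper names below are hypothetical and are not part of idna.h. The sketch assumes a single-byte codepage, so that every byte at or above 0x80 is a non-ASCII character.

```c
#include <stddef.h>
#include <string.h>

/* Return 1 if any byte in label[0..len) lies outside US-ASCII (>= 0x80). */
static int label_has_non_ascii(const char *label, size_t len)
{
    size_t i;
    for (i = 0; i < len; i++)
        if ((unsigned char)label[i] >= 0x80)
            return 1;
    return 0;
}

/* Count the dot-separated labels of `name` that would need ACE conversion.
 * A result of 0 corresponds to the "name remains unchanged" case. */
static size_t count_labels_needing_ace(const char *name)
{
    size_t count = 0;
    while (*name) {
        size_t len = strcspn(name, ".");   /* length of the current label */
        count += (size_t)label_has_non_ascii(name, len);
        name += len;
        if (*name == '.')
            name++;                        /* skip the separator */
    }
    return count;
}
```

For "www.tromsø.no" in a Latin-1 style codepage only the middle label is counted, while a plain ASCII name gives 0.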

IDNA_convert_from_ACE()

BOOL IDNA_convert_from_ACE (char *name, size_t *size);

Tries to convert name from ACE form to a string in the codepage specified in IDNA_init(). On input, *size must hold the maximum size of the buffer to convert; on output, it holds the size of the converted result.

Note: if name contains no labels with a "xn--" prefix, no conversion is done and name will remain unchanged.
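
For the reverse direction, here is a minimal sketch of a Punycode decoder for one label, again written after the pseudocode in RFC-3492 and not taken from the article's punycode.c. The names has_ace_prefix() and punycode_decode() are made up for the example. The output is Unicode code points; mapping them to the codepage chosen in IDNA_init() is not shown, and the RFC's overflow checks are omitted.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

enum { BASE = 36, TMIN = 1, TMAX = 26, SKEW = 38, DAMP = 700,
       INITIAL_BIAS = 72, INITIAL_N = 128 };

/* Bias adaptation, RFC 3492 section 6.1. */
static uint32_t adapt(uint32_t delta, uint32_t numpoints, int first)
{
    uint32_t k = 0;
    delta = first ? delta / DAMP : delta / 2;
    delta += delta / numpoints;
    for (; delta > ((BASE - TMIN) * TMAX) / 2; k += BASE)
        delta /= BASE - TMIN;
    return k + (BASE - TMIN + 1) * delta / (delta + SKEW);
}

/* Digit value of an ASCII character (case-insensitive), or BASE if invalid. */
static uint32_t decode_digit(char c)
{
    if (c >= '0' && c <= '9') return (uint32_t)(c - '0' + 26);
    if (c >= 'a' && c <= 'z') return (uint32_t)(c - 'a');
    if (c >= 'A' && c <= 'Z') return (uint32_t)(c - 'A');
    return BASE;
}

/* Return 1 if the label starts with "xn--" in any letter case. */
static int has_ace_prefix(const char *label)
{
    return (label[0] == 'x' || label[0] == 'X') &&
           (label[1] == 'n' || label[1] == 'N') &&
           label[2] == '-' && label[3] == '-';
}

/* Decode the part of a label AFTER "xn--" into Unicode code points.
 * Returns the number of code points, or -1 on bad input or a full buffer. */
static int punycode_decode(const char *in, uint32_t *out, size_t outmax)
{
    uint32_t n = INITIAL_N, i = 0, bias = INITIAL_BIAS;
    size_t len = strlen(in), pos = 0, count = 0, j;
    const char *dash = strrchr(in, '-');

    /* Everything before the last '-' is copied through as plain ASCII. */
    if (dash != NULL && dash > in) {
        size_t b = (size_t)(dash - in);
        if (b > outmax) return -1;
        for (j = 0; j < b; j++) {
            if ((unsigned char)in[j] >= 0x80) return -1;
            out[count++] = (unsigned char)in[j];
        }
        pos = b + 1;
    }

    /* Each variable-length integer encodes one insertion. */
    while (pos < len) {
        uint32_t oldi = i, w = 1, k;
        for (k = BASE; ; k += BASE) {
            uint32_t digit, t;
            if (pos >= len) return -1;
            digit = decode_digit(in[pos++]);
            if (digit >= BASE) return -1;
            i += digit * w;
            t = k <= bias ? TMIN : k >= bias + TMAX ? TMAX : k - bias;
            if (digit < t) break;
            w *= BASE - t;
        }
        if (count >= outmax) return -1;
        bias = adapt(i - oldi, (uint32_t)(count + 1), oldi == 0);
        n += i / (uint32_t)(count + 1);
        i %= (uint32_t)(count + 1);
        /* Insert code point n at position i. */
        memmove(out + i + 1, out + i, (count - i) * sizeof *out);
        out[i++] = n;
        count++;
    }
    return (int)count;
}
```

A caller would first check has_ace_prefix() on each label and pass label + 4 to punycode_decode(). Labels without the prefix are left alone, which matches the note above.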

IDNA_strerror()

Returns an error string for the supplied error number, which in most cases is _idna_errno.

Status codes

The above functions (except IDNA_strerror()) return TRUE on success or FALSE on any error (no surprise here). Use IDNA_strerror(_idna_errno) to find out why. If _idna_errno == IDNAERR_WINNLS, the error was in one of the WinNLS functions; use GetLastError() to obtain the error. Note that _idna_errno is not cleared on a subsequent successful call.

C++ wrapper class

The CIDNA_resolver class is a simple wrapper for the C-code implementation. Using it resembles the ::gethostbyname() function.

Minimal example:

CIDNA_resolver idna; 
struct hostent *he = idna.gethostbyname (name);

If you want to know the ACE name of your supplied name, extract it from he->h_aliases[0] (but only if he is non-NULL, of course). This overwrites the original alias (if one exists), but it was the best I could do at the moment.

CIDNA_resolver::gethostbyaddr() is provided to convert an IPv4 address into a name of your codepage. I haven't found any ACE domain-name with a PTR record, so this function was tested with the hosts file only.

The accompanying Makefile is for MSVC, MinGW and Cygwin. Issue one of these commands:

make msvc
make gcc

to build the demos using Visual C or MinGW/Cygwin respectively. The Makefile requires GNU make (I'm tired of the limitations in nmake), and I haven't provided a VC6/7 project file (I'm a MinGW addict). Using the code in your own project should be as easy as adding punycode.* and idna.* to it. The code has been tested on Win-XP only. If it doesn't work on your OS, I'd be happy to hear why. But first, run one of the demo programs with full debug, e.g. "demo-1-vc.exe -c850 -dddd www.tromsø.no", and study the printed trace output.

Considerations

Some protocols that expose hostnames in the application layer will have problems with IDNs, most noticeably HTTP 1.1. If you try to fetch the URL http://www.blåbærsyltetøy.no/some/document, and you're able to resolve it (via the hosts file etc.), this will be sent:

  GET /some/document HTTP/1.1
  User-Agent: whatever
  Host: www.blåbærsyltetøy.no

This will probably not work if the Web-server serves multiple domains, since it will not be able to match the "Host:" header-line against the domains it serves. Therefore, applications should send the ACE form of the hostname instead:

  GET /some/document HTTP/1.1
  User-Agent: whatever
  Host: www.xn--blbrsyltety-y8ao3x.no

At least Apache 1.3 returns different results depending on what Host header is sent. Using HTTP 1.0 could be a solution for these cases, but I've not tested this (and don't understand virtual servers that well either). Expect such problems to arise more often as IDNs become more popular.
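
Sending the ACE form in the "Host:" header can be sketched like this. The function build_get_request() is a hypothetical helper for illustration only. It assumes the caller has already converted the hostname to ACE (for instance with IDNA_convert_to_ACE()) and only formats the request text.

```c
#include <stdio.h>
#include <stddef.h>

/* Format a minimal HTTP/1.1 GET request whose Host header carries the
 * ACE form of the hostname.
 * Returns 0 on success, -1 if `buf` is too small or formatting fails. */
static int build_get_request(char *buf, size_t size,
                             const char *ace_host, const char *path)
{
    int n = snprintf(buf, size,
                     "GET %s HTTP/1.1\r\n"
                     "Host: %s\r\n"
                     "\r\n",
                     path, ace_host);
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

The name shown to the user can stay in its native form. Only the name put on the wire needs to be the ACE version.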

The DNS system is generally not case-sensitive when matching a normal domain-name. This does not automatically carry over to IDNs: Punycode encodes "ø" and "Ø" differently, so unless the name is case-folded before conversion (RFC-3490 prescribes a Nameprep step, RFC-3491, for this), the two spellings give different ACE names. A domain-name in the ACE-form is normally only stored for its lower case version. E.g. the DNS system has this entry for www.tromsø.no in the A record:

www.xn--troms-zua.no.   IN      A       195.159.151.136

To handle the uppercase version www.tromsØ.no without case folding, an A record for "www.xn--troms-ipa.no" would have to be stored too. There would be too many combinations of ACE-forms (and MX records), so this is not generally done.

Note: This code does nothing to display the converted name (from IDNA_convert_from_ACE) correctly. It just assumes you're using the correct font for whichever codepage is in use. demo-1.c has a codepage (-c) option. Use that to experiment, or call IDNA_init() with the proper codepage.

References

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.


About the Author

Gisle Vanem

Norway

Last Updated 31 Mar 2004
Article Copyright 2004 by Gisle Vanem