Introduction
I started the development of the GameWatch project in December 2002 with
the idea of combining both my need for a free (open-source) game browser
and experimenting with c# development. In abstract, GameWatch is an online game servers
browser for games like Half-Life or Unreal Tournament 2003 (in other words, it's not unlike
GameSpy or the All-Seeing-Eye).
One of its numerous internet-related feature is getting the country
code from the IP address of a server, so that the users can filter out
the servers displayed according to the country they live in (in the
hope of lowering the latency to the game server).
In this short article, I'll first introduce some features of IP
addresses, then discuss the algorithm and data structure used for the
implementation of the GameWatch's IP-to-CountryCode converter.
What is an IP address?
Nowadays, few people can claim they have never seen an IP address. This
four-number long string with dots actually represents a single 32-bit
unsigned integer, where each number in the four-number format is a byte
from this integer. In the common representation one can see daily, the
four bytes are extracted in network order from the integer. Network
order is equivalent to big-endian order, which basically means that an
IP address which 32-bit integer value is ABCD is mapped to A.B.C.D, and
not D.C.B.A as it is usually the case on intel-family computers.
So, the first interesting property we get from this, is that we can
handle those IP addresses as simply as unsigned 32-bit integers.
Another interesting property is that IP addresses have a semantic:
the left part of the address specifies the network number, the right
part of the address represents the local address inside this
network. The size of the left and right parts depends on the kind of
network. In IPV4, there are three kinds of network type:
- Class A: the network code is 8 bits and the local address is
24 bits.
- Class B: the network code is 16 bits and the local address is 16
bits.
- Class C: the network code is 24 bits and the local address is 8 bits.
To be very precise, there are not really 8, 16, and 24 bits for the
network codes, as a few bits are taken to help distinguish the network
types, but you got the idea, and this is not relevant for our purpose in
this article.
Well, what we learned here, is that we won't need all of the 32 bits
of the IP address to get the country, as we can omit the local address,
which value is always specific to its network. Thus, the significant
part of an IP address for our IP-to-Country conversion can vary from 8
to 24 bits.
The standard internet database
The IP addresses of the internet are not allocated by a single
organization, but by four Regional Internet Registries (RIRs), each one
being responsible for a region of the world.
- The RIPE "R�seaux IP Europ�ens", which name means in english Europeans IP
Networks.
- APNIC is the Asia Pacific Network Information Centre.
- ARIN is the American Registry for Internet Numbers.
- LACNIC is the Latin American and Caribbean Internet Addresses
Registry.
Each of these entities manage their IP allocation database, which are
publicly available on their web site in text format. Those databases
contain a list of lines of the form:
apnic|nz|ipv4|202.0.32.0|20|19930101|allocated
apnic|nz|ipv4|202.0.48.0|22|19930101|allocated
apnic|jp|ipv4|202.0.65.0|24|19930101|assigned
As you can see, the tokens are separated by '|' characters,
and the elements of particular interest for us are the country (second
token) and the network address (fourth token). The other tokens
represent the RIR responsible for the allocation, the version of the IP
protocol used (IPv4 is described in RFC 791), the number of possible
local address to be used with this network code, and the date of the
initial allocation.
The four databases can be FTP'ed from the ftp site of the four RIRs.
Now that we know how an IP address is composed, and that we have a
database to find the country of an IP address, what do we do next ?
Well, we must find a way to store the information and retrieve it as
fast as possible.
Finding the best data structure
The .Net framework provides a few data container and lookup class, like
ArrayList, HashTable, or Stack for instance, but none of these data
structure are suitable for our needs. Let's recapitulate our design
goals:
- We need a data structure that can store thousands of entries with a
low space complexity (our so expensive memory is used for other
tasks). There are currently more than 47000 networks defined in the RIRs
databases.
- The data structure must be able to match a variable prefix (8 to 24
bits of network code) for a 32-bit integer (we don't know in advance how
long is the network code in a given IP address).
- We need it to be fast (because we display thousands of country
codes in a user interface, which must stay reactive).
The Hashtable class does not fit, as it cannot match
keys with part of the key entries (the Hashtable algorithm would test
the IP address against each possible network code size, which is not
acceptable).
The arraylist (and all the flavours of Linked List one
can think about) does not fit either, as each IP address test would
require at most O(N) look-ups, with N the number of entries in the
database (and we have a lot of them).
If we dive in the world of the internet software programming, we can
easely find a domain that has a very similar issue: IP routing
table lookup. IP routing involves finding the best network mask for a
given target IP address, in order to find the IP address of a connected
gateway that can reach the target.
Well, we are not doing routing, but we have a similar goal: finding
the network code that matches our IP address. Note that there are a
number of differences as well: we are not interested in frequently
updating our table (adding or removing a network mask), and we have only
one possible matching network code, while IP routing usually have
several ones (this adds the constraint that only the longest-matching
network must be selected).
So, how are our friends from IP Routing doing ? In fact, many solutions
have been proposed in hundreds of papers. It goes from the simplest
binary-trees (with the space complexity drawbacks) to the most
sophisticated Peano Count Trees with partitionning based on prefix
length [HABIB].
However the most commonly used structures are taken in the family of
the Tries (pronounced like "tree", most urban legends say the name comes
from reTRIEval trees). Tries are tree-based structures where each node
represents a part of the item key. For instance, a character-based trie
would store the two words "eater" and "eating" as follows:

Figure 1: eater/eating strings in a Trie
There are several variants of the Trie data structure, one of the
most efficient being the PATRICIA (Practical Algorithm To Retrieve
Information Coded In Alphanumeric) Trie [MORRISON68].
The main
characteristic of the PATRICIA trie is the way it eliminates
unnecessary nodes by grouping the sequences of tokens whenever possible.
Using a PATRICIA trie, the previous example would be encoded
as:

Figure 2: eater/eating strings in a PATRICIA Trie
FreeBSD for instance uses a radix trie structure (very similar to
the PATRICIA one) to hold its ip routing table, and many software
routers use a PATRICIA trie to hold their ip tables (FutureSoft's IPV6
Routers, for instance).
The reader can easely find a lot of descriptions on the Patricia Trie
around the web. Basically, let me stress the following property of the
PATRICIA Trie: it is a compact data structure that can give you the
longest prefix of an entry key in O(N) steps (in the worst case), with N
the length of the longest prefix (see [PACKER2000] for a good
performance analysis).
Implementing a Patricia Trie to store IP Tables
First, we need a class to store the IP addresses and manage them at the
bit-level. Unfortunately, the BitArray class provided by the .Net
Framework lacks a few methods that we really need. The GameWatch
project provides a very simple BitVector class that implements the main
following methods:
-
LongestCommonPrefix() returns the longest common
prefix of the BitVector object and the one provided as argument.
-
AddData() and AddAscii() that add data to the bitvector using
standard data types.
The BitVectorTrie class implements a PATRICIA Trie using
BitVector objects as keys for the indexing of the data. It contains mainly three methods:
-
public void Add(BitVector key, object data)
-
public object Get(BitVector key)
-
public object GetBest(BitVector key)
The Add() method stores the "data" object using the
BitVector "key". Conversely, the Get() method retrieves an
object stored using a BitVector key.
GetBest() is the most interesting method for our
purpose of mapping an IP address to a country code: it walks the
nodes of the trie, following the path provided in the
BitVector key. The method walks as much of the trie as it
can, and return the value stored at the end of the path (in our case,
the BitVector object describes the path to follow).
For instance, the GetBest() method used with the
following BitVector {00110100101 1100} as
argument retrieves the object highlighted in the figure below.

Figure 3: Look-up of {00110100101 1100} in a PATRICIA Trie
In the example above, the trie look-up stops at the
00110100101 prefix, and returns the object stored in the
last node of the path. This node represents the longest prefix for the
BitVector given as argument.
As the source code provided is quite simple and documented, the
reader can refer to it for the details of the PATRICIA trie
implementation.
The IPToCountry class
Now we have the data needed for the mapping (provided by the RIRs)
as well as the structure and algorithm to hold it. So, let's build our
IP-to-CountryCode converter: The IPToCountry class.
This class basically contains a BitVectorTrie member, and
provides a method to load a standard RIR database as described at the
beginning of this article. The GetCountry() method takes a
string containing an IP address in the "aaa.bbb.ccc.ddd" format, and
return the country code as defined by the ISO3166 standard. The
Load() method is so simple I won't bother the reader with a
description of it.
Benchmark results
Benchmarks are always criticized, but here are the results I get on my
standard Athlon XP 2000:
f:\Documents and Settings\Rodrigo\Mes documents\projects\iptocountry>bin\itctest.exe
IP tables loaded in 00:00:02.2532400 (47203 networks)
IP list loaded in 00:00:00.0200288 (25000 ip addresses)
Testing ip look-up speed...
25000 addresses were translated in 00:00:01.2417856 (20132 ip/s)
Country code for 195.149.21.72: UK
Country code for 81.3.59.134: DE
Country code for 194.134.233.72: NL
The benchmark program loads a list of 25000 IP addresses that run a
game server (taken from the unreal tournament 2003 and Half-Life servers
lists) and performs a look-up for each of them.
The results show an average of 20'000 IP addresses processed per
second with a nice space complexity, which fits perfectly our
requirements.
A few notes on the provided source code
The code provided with this article mainly comes from the GameWatch project (an entry point with
some benchmarks was added though). It is released under the terms of the
GNU General Public License (see http://www.gnu.org/copyleft/gpl.html
for details). According to this license, you can use, modify, and
redistribute the software as long as you keep it under the GPL license.
Implementation Note: In the implemented algorithm it is not taken into account the fact
that the local addresses in a same network can be dispatched in several
countries. Indeed, international organisations can use their network as
they wish, and a company can dispatch its IP over several
countries. There is unfortunately nothing we can do against it, as this
is the private domain of the owner of the network code.
Bibliography
| You must Sign In to use this message board. |
|
|
 |
|
|
 |
|
|
 |
|
|
 |
|
 |
hi i am a beginer & i tried to run your program, for gamewatch.utils namespace i created a dll, which i gave a reference while creating the dll for gamewatch.utils.net but it shows error for following code:
private BitVector IpToBitVector(string ip) { string[] elements = ip.Split('.'); BitVector bv = new BitVector(); foreach(string e in elements) { int i = Int32.Parse(e); bv.AddData(i, 8); } return bv; }
error: The type or namespace name 'BitVector' could not be found (are you missing a using directive or an assembly reference?)
this message is shown for all statements where bitvector is used as a datatype
if you have understood my problem please help me.
|
| Sign In·View Thread·PermaLink | 5.00/5 |
|
|
|
 |
|
 |
I tried to access the registry ftp sites but was unable to access RIPE, LACNIC, and ARIN do you have the IP addesses in an existing database or the new ftp urls? Thanks greatly! I have to perform thousands of lokkups on a regular basis and I wish I knew about this site months earlier.
|
| Sign In·View Thread·PermaLink | |
|
|
|
 |
|
 |
if you want to know some thing about ip like 62.139.88.15
you type in your web browser: "http://www.senderbase.org/search/panels/Whois?searchString=62.139.88.15&rawWhois=1" you will get some useful data about this ip
|
| Sign In·View Thread·PermaLink | |
|
|
|
 |
|
|
 |
|
 |
Thanks for the thumbs up, Rodrigo... and for the inspiration! 
I ran across your excellent article while researching country lookups for a log analyzer application I'm writing.
Have a great day,
Jeffrey
Everything should be as simple as possible, but not simpler. -- Albert Einstein http://www.extremeoptimization.com/
|
| Sign In·View Thread·PermaLink | |
|
|
|
 |
|
 |
Seems that the article at www.extremeoptimization.com has gone commercial. It is not more (freely?) available there.
Leon[^] - Enterprise Anti-Spam Server
|
| Sign In·View Thread·PermaLink | |
|
|
|
 |
|
|
 |
|
|
 |
|
 |
The Patricia Tree explanation/implementation alone is worth a 5!
---------------------------------------- ----I said my name wasn't important ---------------------------SlartiBartFast
|
| Sign In·View Thread·PermaLink | |
|
|
|
 |
|
 |
I am a new webmaster trying to make some sense of our web logs - in particular the country codes of my site visitors. Can anyone tell me among 111.222.333.444 which digits are the country code. Also is there a master list of countries and their corresponding code, and where can it be found? I've looked briefly at the APNIC and LACNIC database listings and couldn't make much of them! Thanks.
|
| Sign In·View Thread·PermaLink | 1.33/5 |
|
|
|
 |
|
 |
EdBro wrote: Can anyone tell me among 111.222.333.444 which digits are the country code.
It's not one single group of digits that is the country code per se - the IP address blocks have been assigned to various countries - you'll need to consult this list / tree of IP address block assignments.
Also is there a master list of countries and their corresponding code, and where can it be found?
ISO 3166 is the main reference, I think - go to www.iso.org and search for it.
============================= Marc Scheuner, Berne, Switzerland m.scheuner - at - inova.ch
May The Source Be With You!
|
| Sign In·View Thread·PermaLink | 5.00/5 |
|
|
|
 |
|
 |
would patricia tries be a good way to find keywordsin a text?? (if you have say 2000 keywords)
eg , you scan from the first to the last char in a text , and for each char you get you try to walk the patricia trie from the root and down(or up , depending on you you see the tree) to see if the following chars matches a keyword ..
//Roger
|
| Sign In·View Thread·PermaLink | |
|
|
|
 |
|
 |
Hi Roger,
Although you'd better make a first tokenization pass of your text prior to checking for keywords, Tries are typically a nice solution for all the dictionary lookup problems. However, if you have only 2000 keywords, you may just go with the standard Hashtable (and some tweaking of the hash size though), given you have a good hashing function (so that items are correctly spread in the hashtable).
|
| Sign In·View Thread·PermaLink | |
|
|
|
 |
|
 |
Why is using Hash tables not prefered, when you can still search in them for the longest prefix of a key in O(N) where N is the length of the key? In this case it's only a few hash table lookups on average per search?
Regards, Orlin
|
| Sign In·View Thread·PermaLink | 1.00/5 |
|
|
|
 |
|
 |
Hi Orlin,
The .Net Hashtable implementation, just like the Java one, deals with collisions using a linked-list backend. The worst-case is then O(N) with N the number of items, which is far different from what you say. Even if you find a perfect hashing function for the IP Addresses, Hashtables are unable to provide the Best Matching Prefix (and this is the main reason why hashtables can't be used for an IP address lookup).
Regarding hashtables, http://www.acm.org/sigcomm/sigcomm97/papers/p182.pdf proposes a combination of hashtable and binary tree search, but unfortunately it needs backtracking, which is something I wanted to avoid. For my implementation, I preferred the classic Patricia Trie, which is very simple to develop and at the same time very efficient (its complexity is based on the key length, not on the number of items).
|
| Sign In·View Thread·PermaLink | |
|
|
|
 |
|
 |
I don't dispute that Patricia Tries are theoretically more appropriate. They definetely are. It's just that it seems that Patricia Tries' advantage over Hash tables becomes more important when the keys are long and the number of entries small so that the average tree depth is small. The keys in this case are relatively short so one can just loop a few times trying lookups of ever shorter keys against the Hash table. Anyway I think your article is cool and usefull.
|
| Sign In·View Thread·PermaLink | 1.00/5 |
|
|
|
 |
|
 |
Hi Orlink,
I can hardly agree with your statement : orlink wrote: Patricia Tries' advantage over Hash tables becomes more important when the keys are long and the number of entries small so that the average tree depth is small
Actually, let me stress again the fact: Hashtable don't have the "Best Matching Prefix" property that Trie structures have, so Hashtable just can't scale regarding prefix matching (and it's all a matter of prefix matching, here). If we were dealing with a few tens of Network masks, hashtables would be OK, but they just can't be an option for partial key matching (unless you are not interested in speed). The correct assertion would rather be "Patricia Tries' advantage over Hash tables becomes more important when the keys are long OR the number of entries big"
Please refer to the SIGCOMM'97 article Scalable High Speed IP Routing Lookups which discusses this and is already mentionned in the article. It's really worth the read.
|
| Sign In·View Thread·PermaLink | |
|
|
|
 |
|
 |
Truly an amazing article. I've always wondered how IP addresses were assigned, but never interested enough to read an RFC( ). Your article did an excellent job of explaining it.
And your explanation of the lookup was fantastic. I hope we will see more articles from you in the future. 
P.S. You earned my 5.
Hey, what can I say? I'm a chick magnet...a babe conductor...a logarithm for the ladies. -Strong Bad from HomeStarRunner.com
Essential Tips for Web Developers
|
| Sign In·View Thread·PermaLink | 2.00/5 |
|
|
|
 |
|
|
 |
|
|
 |
|
 |
I've been working on a growing project for over a year and have been needing a data structure just like the Trie. We studied them in class years back but - having never practically used them - I forgot about them! Thanks for the idea.
The article itself was interesting, too. I wasn't aware that IP address were allocated to certain countries. I figured there was probably some pattern, but your article was the first to exploit it. Nice job.
Reminiscent of my younger years... 10 LOAD "SCISSORS" 20 RUN
|
| Sign In·View Thread·PermaLink | |
|
|
|
 |
|
 |
System.IO.DirectoryNotFoundException:Could not find a part of the path
this error relates to the apnic.latest file.
Any suggestions ? (WinXP)
R.Bischoff | C++ .NET, Kommst du mit?
|
| Sign In·View Thread·PermaLink | |
|
|
|
 |
|
|