|
Advanced Search Add to IE Search |
|
|
|
||||||||||||||||
I started the development of the GameWatch project in December 2002 with the idea of combining both my need for a free (open-source) game browser and experimenting with c# development. In abstract, GameWatch is an online game servers browser for games like Half-Life or Unreal Tournament 2003 (in other words, it's not unlike GameSpy or the All-Seeing-Eye).
One of its numerous internet-related feature is getting the country code from the IP address of a server, so that the users can filter out the servers displayed according to the country they live in (in the hope of lowering the latency to the game server).
In this short article, I'll first introduce some features of IP addresses, then discuss the algorithm and data structure used for the implementation of the GameWatch's IP-to-CountryCode converter.
So, the first interesting property we get from this, is that we can handle those IP addresses as simply as unsigned 32-bit integers.
Another interesting property is that IP addresses have a semantic: the left part of the address specifies the network number, the right part of the address represents the local address inside this network. The size of the left and right parts depends on the kind of network. In IPV4, there are three kinds of network type:
To be very precise, there are not really 8, 16, and 24 bits for the network codes, as a few bits are taken to help distinguish the network types, but you got the idea, and this is not relevant for our purpose in this article.
Well, what we learned here, is that we won't need all of the 32 bits of the IP address to get the country, as we can omit the local address, which value is always specific to its network. Thus, the significant part of an IP address for our IP-to-Country conversion can vary from 8 to 24 bits.
Each of these entities manage their IP allocation database, which are publicly available on their web site in text format. Those databases contain a list of lines of the form:
apnic|nz|ipv4|202.0.32.0|20|19930101|allocated apnic|nz|ipv4|202.0.48.0|22|19930101|allocated apnic|jp|ipv4|202.0.65.0|24|19930101|assigned
As you can see, the tokens are separated by '|' characters,
and the elements of particular interest for us are the country (second
token) and the network address (fourth token). The other tokens
represent the RIR responsible for the allocation, the version of the IP
protocol used (IPv4 is described in RFC 791), the number of possible
local address to be used with this network code, and the date of the
initial allocation.
The four databases can be FTP'ed from the ftp site of the four RIRs.
Now that we know how an IP address is composed, and that we have a database to find the country of an IP address, what do we do next ? Well, we must find a way to store the information and retrieve it as fast as possible.
The Hashtable class does not fit, as it cannot match
keys with part of the key entries (the Hashtable algorithm would test
the IP address against each possible network code size, which is not
acceptable).
The arraylist (and all the flavours of Linked List one
can think about) does not fit either, as each IP address test would
require at most O(N) look-ups, with N the number of entries in the
database (and we have a lot of them).
If we dive in the world of the internet software programming, we can easely find a domain that has a very similar issue: IP routing table lookup. IP routing involves finding the best network mask for a given target IP address, in order to find the IP address of a connected gateway that can reach the target.
Well, we are not doing routing, but we have a similar goal: finding the network code that matches our IP address. Note that there are a number of differences as well: we are not interested in frequently updating our table (adding or removing a network mask), and we have only one possible matching network code, while IP routing usually have several ones (this adds the constraint that only the longest-matching network must be selected).
So, how are our friends from IP Routing doing ? In fact, many solutions have been proposed in hundreds of papers. It goes from the simplest binary-trees (with the space complexity drawbacks) to the most sophisticated Peano Count Trees with partitionning based on prefix length [HABIB].
However the most commonly used structures are taken in the family of the Tries (pronounced like "tree", most urban legends say the name comes from reTRIEval trees). Tries are tree-based structures where each node represents a part of the item key. For instance, a character-based trie would store the two words "eater" and "eating" as follows:

Figure 1: eater/eating strings in a Trie
There are several variants of the Trie data structure, one of the most efficient being the PATRICIA (Practical Algorithm To Retrieve Information Coded In Alphanumeric) Trie [MORRISON68].
The main characteristic of the PATRICIA trie is the way it eliminates unnecessary nodes by grouping the sequences of tokens whenever possible.
Using a PATRICIA trie, the previous example would be encoded as:

Figure 2: eater/eating strings in a PATRICIA Trie
FreeBSD for instance uses a radix trie structure (very similar to the PATRICIA one) to hold its ip routing table, and many software routers use a PATRICIA trie to hold their ip tables (FutureSoft's IPV6 Routers, for instance).
The reader can easely find a lot of descriptions on the Patricia Trie around the web. Basically, let me stress the following property of the PATRICIA Trie: it is a compact data structure that can give you the longest prefix of an entry key in O(N) steps (in the worst case), with N the length of the longest prefix (see [PACKER2000] for a good performance analysis).
LongestCommonPrefix() returns the longest common
prefix of the BitVector object and the one provided as argument.
AddData() and AddAscii() that add data to the bitvector using
standard data types.
The BitVectorTrie class implements a PATRICIA Trie using BitVector objects as keys for the indexing of the data. It contains mainly three methods:
public void Add(BitVector key, object data)
public object Get(BitVector key)
public object GetBest(BitVector key)
The Add() method stores the "data" object using the
BitVector "key". Conversely, the Get() method retrieves an
object stored using a BitVector key.
GetBest() is the most interesting method for our
purpose of mapping an IP address to a country code: it walks the
nodes of the trie, following the path provided in the
BitVector key. The method walks as much of the trie as it
can, and return the value stored at the end of the path (in our case,
the BitVector object describes the path to follow).
For instance, the GetBest() method used with the
following BitVector {00110100101 1100} as
argument retrieves the object highlighted in the figure below.

Figure 3: Look-up of {00110100101 1100} in a PATRICIA Trie
In the example above, the trie look-up stops at the
00110100101 prefix, and returns the object stored in the
last node of the path. This node represents the longest prefix for the
BitVector given as argument.
As the source code provided is quite simple and documented, the reader can refer to it for the details of the PATRICIA trie implementation.
BitVectorTrie member, and
provides a method to load a standard RIR database as described at the
beginning of this article. The GetCountry() method takes a
string containing an IP address in the "aaa.bbb.ccc.ddd" format, and
return the country code as defined by the ISO3166 standard. The
Load() method is so simple I won't bother the reader with a
description of it.
f:\Documents and Settings\Rodrigo\Mes documents\projects\iptocountry>bin\itctest.exe
IP tables loaded in 00:00:02.2532400 (47203 networks)
IP list loaded in 00:00:00.0200288 (25000 ip addresses)
Testing ip look-up speed...
25000 addresses were translated in 00:00:01.2417856 (20132 ip/s)
Country code for 195.149.21.72: UK
Country code for 81.3.59.134: DE
Country code for 194.134.233.72: NL
The benchmark program loads a list of 25000 IP addresses that run a game server (taken from the unreal tournament 2003 and Half-Life servers lists) and performs a look-up for each of them.
The results show an average of 20'000 IP addresses processed per second with a nice space complexity, which fits perfectly our requirements.
Implementation Note: In the implemented algorithm it is not taken into account the fact that the local addresses in a same network can be dispatched in several countries. Indeed, international organisations can use their network as they wish, and a company can dispatch its IP over several countries. There is unfortunately nothing we can do against it, as this is the private domain of the owner of the network code.
[GWATCH03] The GameWatch Project
[MORRISON68] PATRICIA-- Practical
Algorithm To Retrieve Information Coded in Alphanumeric (Donald
R. Morrison) [RFC791]
RFC 791: INTERNET PROTOCOL DARPA INTERNET PROGRAM PROTOCOL SPECIFICATION
[PACKER2000] Performance Analysis of IP Routing Lookup Algorithms :
Patricia Tree based vs. Hashing based
(Packer Ali, Dhamim)
[HABIB] Prefix
Lookups Algorithm for Fast IP Routing using P-Tree (Md. Ahsan Habib)
[WALDVOGEL&AL97]
Scalable
High Speed IP Routing Lookups (Marcel Waldvogely, George Varghesez, Jon Turnerz, Bernhard Plattnery)
General
News
Question
Answer
Joke
Rant
Admin
Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads.
|
PermaLink |
Privacy |
Terms of Use
Last Updated: 12 Feb 2003 Editor: Chris Maunder |
Copyright 2003 by R. Reyes Everything else Copyright © CodeProject, 1999-2010 Web22 | Advertise on the Code Project |