JSON vs. XML: Some Hard Numbers About Verbosity

Pragmateek


Jun 10, 2013

CPOL



Introduction

JSON is increasingly becoming the data interchange format of the web, and it is even starting to spread beyond it, replacing XML wherever it can. There are really good reasons for that. But people are often drawn to JSON for other reasons, not necessarily bad ones, but based on more anecdotal facts, like the so-called verbosity of XML. Indeed, this is the argument you’ll hear most often; e.g., just have a look at this nice comparison of the two formats: the first con listed is of course “verbosity”.

And it’s a factual argument: the size gains can be significant when your values are small, typically when representing business objects like customers, because the markup overhead (all the closing tags) becomes large relative to the information carried (e.g., the names and zip codes of your customers).

But you rarely send big chunks of data as raw XML or JSON text, because nowadays servers and clients (e.g., web browsers) support on-the-fly gzipping of payloads and use it transparently. So the size advantage of JSON over XML should shrink, because GZIP knows how to factor out redundant information like markup.
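To make this intuition concrete, here is a minimal sketch, separate from the benchmark itself, that gzips two small documents built from repeated records. The class name, the shortened record templates and the record count are assumptions chosen for illustration only.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class, not part of the original benchmark.
class GzipRedundancyDemo {
    // Returns the size in bytes of the GZIP-compressed text.
    static int gzippedSize(String text) throws IOException {
        ByteArrayOutputStream memory = new ByteArrayOutputStream();
        try (GZIPOutputStream zip = new GZIPOutputStream(memory)) {
            zip.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return memory.size();
    }

    // Concatenates `count` records; the template receives the record index.
    static String repeat(String template, int count) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < count; i++) {
            sb.append(String.format(template, i));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String xml = repeat("<user><id>%d</id></user>", 1000);
        String json = repeat("{id:%d}", 1000);
        System.out.printf("raw:  xml=%d json=%d%n", xml.length(), json.length());
        System.out.printf("gzip: xml=%d json=%d%n", gzippedSize(xml), gzippedSize(json));
    }
}
```

The repeated tags are exactly the kind of redundancy GZIP can replace with short back-references, so the gap between the two compressed sizes should be much narrower than the gap between the raw sizes.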

At least this seems a reasonable speculation, but while intuition is good, hard numbers are better for being truly convinced and for quantifying the impact. So I’ve written a small Java benchmark that I’ll present, along with its results, in this article.

The source code is available as a downloadable archive with the original article.

The Model

The benchmark reproduces a very common scenario in web development: serializing a large set of business data, here two million users.

Here are the different representations of the user entity.

Java:

public class User
{
    private int id;
    private String name;
    
    public int getId()
    {
        return id;
    }
    
    public String getName()
    {
        return name;
    }
    
    public User(int id, String name)
    {
        this.id = id;
        this.name = name;
    }
}

XML:

<user><id>%d</id><name>%s</name></user>

Note that I’ve used a verbose format to clearly illustrate the point; of course “id” and “name” could have been implemented as attributes, but sometimes you have no choice, e.g., when you have to conform to an ill-conceived XML schema.

And JSON:

{id:%d,name:"%s"}

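We can already see that the XML template is quite verbose compared to the JSON one. Note that the JSON template leaves the keys unquoted, as in a JavaScript object literal; strictly valid JSON requires “id” and “name” to be quoted, which would add four characters per record.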
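As a quick illustration of the difference, the following sketch fills both templates with one sample record and counts the characters that are pure markup. The class name, the helper method and the sample values (12345 and "abcdef") are assumptions made for the example; the two templates are the ones shown above.

```java
// Hypothetical helper class, not part of the original benchmark.
class TemplateOverhead {
    static final String XML_TEMPLATE = "<user><id>%d</id><name>%s</name></user>";
    static final String JSON_TEMPLATE = "{id:%d,name:\"%s\"}";

    // Number of characters in the formatted record that are not part of the values.
    static int markupChars(String template, int id, String name) {
        String record = String.format(template, id, name);
        int payload = String.valueOf(id).length() + name.length();
        return record.length() - payload;
    }

    public static void main(String[] args) {
        int id = 12345;          // sample value, 5 digits
        String name = "abcdef";  // sample value, 6 letters
        System.out.println("XML markup chars:  " + markupChars(XML_TEMPLATE, id, name));
        System.out.println("JSON markup chars: " + markupChars(JSON_TEMPLATE, id, name));
    }
}
```

For this sample record the XML template spends 35 characters on markup against 13 for the JSON one, while both carry the same 11 characters of actual data.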

Data Generation

All the data, i.e., the users’ ids and names, are randomly generated using some helper methods, to avoid any bias that could appear with fixed values:

private static final Random random = new Random();
private static final char[] letters = new char[26];
static
{
    for (int i = 0; i < 26; ++i)
    {
        letters[i] = (char) ('a' + i);
    }
}
private static int getId()
{
    return random.nextInt(99999);
}
private static String getName(int length)
{
    char[] chars = new char[length];
    
    for (int i = 0; i < length; ++i)
    {
        chars[i] = letters[random.nextInt(letters.length)];
    }
    
    return new String(chars);
}
private static User[] getUsers(int count)
{
    User[] users = new User[count];
    
    for (int i = 0; i < count; ++i)
    {
        users[i] = new User(getId(), getName(6));
    }
    
    return users;
}

As the benchmark compares the cost of the formatting overhead of each document format, the size of the values is kept small so that they don’t dominate: ids have at most 5 digits (random.nextInt(99999) returns values from 0 to 99998) and names are 6 characters long.
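One possible refinement, which the original benchmark does not use, is to seed the random generator so that two runs produce exactly the same data set and can be compared. The sketch below assumes a hypothetical class name and an arbitrary seed value; the name generation mirrors the helper shown above.

```java
import java.util.Random;

// Hypothetical variant of the data generator with a fixed seed.
class SeededNames {
    // Builds a random lowercase name of the given length from the supplied generator.
    static String name(Random random, int length) {
        char[] chars = new char[length];
        for (int i = 0; i < length; ++i) {
            chars[i] = (char) ('a' + random.nextInt(26));
        }
        return new String(chars);
    }

    public static void main(String[] args) {
        long seed = 42L; // arbitrary seed, an assumption for this example
        Random first = new Random(seed);
        Random second = new Random(seed);
        // Two generators created with the same seed yield the same sequence.
        System.out.println(name(first, 6).equals(name(second, 6)));
    }
}
```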

Data Compression

The compression step is based on the standard Java GZIP implementation and is as simple as this:

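// Requires: java.io.ByteArrayOutputStream, java.io.IOException,
//           java.nio.charset.StandardCharsets, java.util.zip.GZIPOutputStream
private static byte[] zip(String string) throws IOException
{
    ByteArrayOutputStream memory = new ByteArrayOutputStream();

    // try-with-resources closes the stream, which writes the GZIP trailer,
    // even if write() throws
    try (GZIPOutputStream zip = new GZIPOutputStream(memory))
    {
        // Explicit charset so the result does not depend on the platform default
        zip.write(string.getBytes(StandardCharsets.UTF_8));
    }

    return memory.toByteArray();
}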

The inputs are the text versions of the XML and JSON documents; the output is the raw binary representation of the gzipped content.
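As a sanity check that is not part of the original benchmark, the compressed bytes can be decompressed again and compared with the input. This is a minimal sketch: the class name and the unzip helper are assumptions, and the zip method is a standalone copy of the idea shown above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical round-trip check for the compression step.
class GzipRoundTrip {
    static byte[] zip(String string) throws IOException {
        ByteArrayOutputStream memory = new ByteArrayOutputStream();
        try (GZIPOutputStream zip = new GZIPOutputStream(memory)) {
            zip.write(string.getBytes(StandardCharsets.UTF_8));
        }
        return memory.toByteArray();
    }

    // Decompresses GZIP bytes back into a UTF-8 string.
    static String unzip(byte[] compressed) throws IOException {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        String original = "<user><id>1</id><name>abcdef</name></user>";
        System.out.println(original.equals(unzip(zip(original))));
    }
}
```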

The Benchmark

The benchmark proceeds in five steps:

1. Generate a set of random users
2. Generate the XML and JSON representations of this set
3. Compare the sizes of the text documents
4. Generate the gzipped versions of the XML and JSON documents
5. Compare the sizes of the gzipped documents and the time it took to compress them

Note that the benchmark also measures the time needed to gzip the documents because, as you’d guess, compression time depends on the size of the content, and CPU time is an important factor that can’t be ignored.

And the implementation:

public static void main(String[] args) throws Exception
{
    User[] users = getUsers(2000000);
    String xml = getXML(users);
    String json = getJSON(users);
    System.out.println(String.format("xml(%d)/json(%d): %f", 
      xml.length(), json.length(), 1.0 * xml.length()/json.length()));
    long t1 = System.currentTimeMillis();
    byte[] xmlZip = zip(xml);
    long t2 = System.currentTimeMillis();
    byte[] jsonZip = zip(json);
    long t3 = System.currentTimeMillis();
    System.out.println(String.format("xmlDuration(%d)/jsonDuration(%d): %f", 
      t2 - t1, t3 - t2, 1.0 * (t2 - t1) / (t3 - t2)));
    System.out.println(String.format("xmlZip(%d)/jsonZip(%d): %f", 
      xmlZip.length, jsonZip.length, 1.0 * xmlZip.length/jsonZip.length));
}

Not rocket science, but it should do the job.
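The getXML and getJSON helpers called in main are not shown in the article. A minimal sketch of what they might look like follows; the enclosing class, the simplified User type, the <users> wrapper element and the JSON array brackets and commas are all assumptions, and only the per-record templates come from the article.

```java
// Hypothetical reconstruction of the serialization helpers.
class SerializationSketch {
    // Simplified stand-in for the article's User class.
    static final class User {
        final int id;
        final String name;

        User(int id, String name) {
            this.id = id;
            this.name = name;
        }
    }

    // Wraps all user records in an assumed <users> root element.
    static String getXML(User[] users) {
        StringBuilder sb = new StringBuilder("<users>");
        for (User user : users) {
            sb.append(String.format("<user><id>%d</id><name>%s</name></user>", user.id, user.name));
        }
        return sb.append("</users>").toString();
    }

    // Emits the records as a comma-separated array (keys unquoted, as in the article's template).
    static String getJSON(User[] users) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < users.length; ++i) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append(String.format("{id:%d,name:\"%s\"}", users[i].id, users[i].name));
        }
        return sb.append(']').toString();
    }

    public static void main(String[] args) {
        User[] users = { new User(1, "ab"), new User(22, "cd") };
        System.out.println(getXML(users));
        System.out.println(getJSON(users));
    }
}
```

A StringBuilder avoids the quadratic cost of repeated string concatenation, which matters with two million records.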

And the Winner Is…

Enough suspense, here are the results:

Format          Text      Gzip      Gzip duration
XML             91.78M    18.74M    3.38s
JSON            49.78M    17.09M    2.78s
XML overhead    84.38%    9.62%     21.3%

As expected, XML is larger than JSON in both the text and gzipped versions. But while the overhead is substantial for the raw text (84%, almost twice the size), it shrinks to less than 10% once gzipped.

This reduction comes at the price of extra CPU time: gzipping the XML document takes more than 20% longer than gzipping the JSON document.
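The overhead figures are simply the XML/JSON ratio minus one, expressed as a percentage. The sketch below recomputes them from the rounded values in the results table; the class and method names are assumptions. The recomputed percentages differ slightly from the published ones, presumably because the latter were derived from the unrounded measurements.

```java
// Hypothetical helper that recomputes the overhead row of the results table.
class OverheadArithmetic {
    // Relative overhead of XML over JSON, in percent.
    static double overheadPercent(double xml, double json) {
        return (xml / json - 1.0) * 100.0;
    }

    public static void main(String[] args) {
        System.out.printf("text:     %.2f%%%n", overheadPercent(91.78, 49.78));
        System.out.printf("gzip:     %.2f%%%n", overheadPercent(18.74, 17.09));
        System.out.printf("duration: %.2f%%%n", overheadPercent(3.38, 2.78));
    }
}
```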

So depending on your use case, this may be completely acceptable or not at all. If your server does not handle many requests and is never overloaded, the 20% additional time is not an issue, because it buys a dramatic reduction in size. But if your server is already overloaded, 20% more CPU load could push latency into the red.

Conclusion

As you’ve seen, while the “angle bracket tax” of XML is real, it can be dramatically reduced to an acceptable level, but at the cost of some additional processing time. Keep in mind that, like any benchmark, this one is only worth so much: if the data format is critical in your situation, you should carry out your own study, inspired by this one but using your own data and technologies, because your mileage may vary.

In my humble opinion, what makes JSON the natural choice for a lot of applications is not its inherent qualities, real though they are, as demonstrated above, but its strong integration into the web ecosystem, because JSON is the native way of representing JavaScript object trees.
And as JavaScript is no longer limited to the client side, with the rise of JavaScript on the server with Node.js, JSON becomes the logical candidate for communication between the client and the server: JavaScript talking to JavaScript using some … JavaScript.

Moreover, some NoSQL databases such as MongoDB and CouchDB have chosen JSON as their data format, which is one more good reason to use it in order to build a uniform stack.
