JSON is increasingly becoming the data interchange format of the web, and it is even starting to leak outside of this world, replacing XML wherever it can. There are really good reasons for that. But people are often driven towards JSON for other reasons, not necessarily bad ones, but based on more anecdotal facts, like the so-called verbosity of XML. Indeed, this is the argument you’ll hear most often; e.g., just have a look at this nice comparison of the two formats: the first con is of course “verbosity”.
And it’s a factual argument: the size gains can be significant if your values are small, typically when representing business objects like customers, because the markup overhead (all the closing tags) becomes large relative to the information carried (e.g., the names and zip codes of your customers).
But you rarely send big chunks of data as raw text such as XML or JSON, because nowadays servers and clients (e.g., web browsers) support on-the-fly gzipping of payloads and use it transparently. So the size advantage of JSON over XML should shrink, because GZIP knows how to factor out redundant information such as markup.
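As a quick illustration of that intuition, here is a minimal sketch that gzips a document made of one XML-like record repeated many times. The class name `GzipRedundancyDemo`, its helper methods and the sample record are illustrative assumptions, not part of the benchmark; only `java.util.zip.GZIPOutputStream` is the standard API. Since the record repeats verbatim, the compressed output should come out far smaller than the raw text.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

class GzipRedundancyDemo {
    // Gzips the text in memory and returns the compressed size in bytes.
    static int gzippedSize(String text) {
        ByteArrayOutputStream memory = new ByteArrayOutputStream();
        try (GZIPOutputStream zip = new GZIPOutputStream(memory)) {
            zip.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The stream is closed at this point, so the GZIP trailer has been written.
        return memory.size();
    }

    // Builds a document made of the same record repeated `count` times.
    static String repeat(String record, int count) {
        StringBuilder sb = new StringBuilder(record.length() * count);
        for (int i = 0; i < count; ++i)
            sb.append(record);
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical sample record, chosen only to show redundant markup.
        String doc = repeat("<user><id>1</id><name>a</name></user>", 10000);
        System.out.println("raw: " + doc.length() + " chars, gzipped: " + gzippedSize(doc) + " bytes");
    }
}
```

This is an extreme case, since real documents carry varying values between the tags; it only shows that repeated markup is cheap for GZIP to encode.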
At least this seems a reasonable speculation, but while intuition is good, hard numbers are better for being truly convinced and for getting a numerical idea of the impact. So I’ve written a small Java benchmark that I’ll present, along with its results, in this article.
The source code is available in this archive:
The benchmark reproduces a very common scenario in web development: serializing a big bunch of business data, here a set of two million users.
Here are the different representations of the user entity.
public class User {
    private int id;
    private String name;

    public int getId() { return id; }
    public String getName() { return name; }

    public User(int id, String name) {
        this.id = id;
        this.name = name;
    }
}
Note that I’ve used a verbose XML format to clearly illustrate the point; of course “id” and “name” should have been implemented as attributes, but sometimes you have no choice, e.g., when you have to conform to an ill-conceived XML schema.
We already see that the XML template is quite verbose compared to the JSON one.
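To give an idea of what such serializers could look like, here is a rough sketch. The method names `getXML` and `getJSON` are the ones called by the benchmark, but the bodies, the element-based XML layout, the `<users>` wrapper, the JSON array layout and the stand-in `User` class are all assumptions; the templates actually used in the benchmark may differ.

```java
class SerializationSketch {
    // Minimal stand-in for the article's User class (assumption, kept local to stay self-contained).
    static final class User {
        final int id;
        final String name;
        User(int id, String name) { this.id = id; this.name = name; }
    }

    // Element-based (verbose) XML: every value is wrapped in an opening and a closing tag.
    static String getXML(User[] users) {
        StringBuilder sb = new StringBuilder("<users>");
        for (User u : users)
            sb.append("<user><id>").append(u.id).append("</id><name>")
              .append(u.name).append("</name></user>");
        return sb.append("</users>").toString();
    }

    // JSON array of objects. No escaping is done: names are assumed to be plain lowercase letters.
    static String getJSON(User[] users) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < users.length; ++i) {
            if (i > 0) sb.append(',');
            sb.append("{\"id\":").append(users[i].id)
              .append(",\"name\":\"").append(users[i].name).append("\"}");
        }
        return sb.append(']').toString();
    }

    public static void main(String[] args) {
        User[] users = { new User(123456, "abcdef") };
        System.out.println(getXML(users));
        System.out.println(getJSON(users));
    }
}
```

With this layout each XML record repeats every field name twice, once in the opening tag and once in the closing tag, while the JSON record names it once.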
All the data, i.e., the users’ ids and names, is randomly generated using some helper methods, to avoid any bias that could appear when choosing fixed values:
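private static final Random random = new Random();
private static final char[] letters = new char[26];

static {
    // Fill the alphabet used to build random names.
    for (int i = 0; i < 26; ++i)
        letters[i] = (char) ('a' + i);
}

// The body was lost from the listing; this assumes ids of at most 6 digits,
// consistent with the 6-character limit stated in the article.
private static int getId() {
    return random.nextInt(1000000);
}

private static String getName(int length) {
    char[] chars = new char[length];
    for (int i = 0; i < length; ++i)
        chars[i] = letters[random.nextInt(letters.length)];
    return new String(chars);
}

private static User[] getUsers(int count) {
    User[] users = new User[count];
    for (int i = 0; i < count; ++i)
        users[i] = new User(getId(), getName(6));
    return users;
}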
As the benchmark tries to compare the cost of the formatting overhead of each document format, the size of the values is limited so that they don’t dominate: ids and names have been limited to 6 characters.
The zipping process is based on the standard Java GZip implementation and is as simple as that:
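private static byte[] zip(String string) throws Exception {
    ByteArrayOutputStream memory = new ByteArrayOutputStream();
    // Assumed completion: the end of this method was lost from the listing.
    try (GZIPOutputStream zip = new GZIPOutputStream(memory)) { zip.write(string.getBytes("UTF-8")); }
    return memory.toByteArray();
}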
The input is the text version of the XML or JSON document; the output is the raw binary representation of the zipped content.
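One way to check that such a function really produces valid GZIP data is to decompress the bytes and compare the result with the input. The sketch below does that with the standard `GZIPInputStream`; the class name `GzipRoundTrip` and the `unzip` helper are hypothetical and not part of the benchmark, and `zip` is a local variant written to be self-contained.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GzipRoundTrip {
    // Compresses the text in memory and returns the raw GZIP bytes.
    static byte[] zip(String text) {
        ByteArrayOutputStream memory = new ByteArrayOutputStream();
        try (GZIPOutputStream zip = new GZIPOutputStream(memory)) {
            zip.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return memory.toByteArray();
    }

    // Hypothetical helper: decompresses GZIP bytes back into a string.
    static String unzip(byte[] zipped) {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(zipped))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1)
                out.write(buffer, 0, n);
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String original = "<user><id>123456</id><name>abcdef</name></user>";
        System.out.println("round trip ok: " + original.equals(unzip(zip(original))));
    }
}
```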
And here is the benchmark:
- Generate a set of random users
- Generate the XML and JSON representations of this set
- Compare the sizes of the text documents
- Generate the zipped versions of the XML and JSON documents
- Compare the sizes of the zipped documents and the time it took to compress
Note that the benchmark takes into account the time necessary to zip the documents because, as you’d guess, zipping duration depends on the size of the content, and CPU time is an important factor that can’t be ignored.
And the implementation:
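public static void main(String[] args) throws Exception {
    User[] users = getUsers(2000000);
    String xml = getXML(users);
    String json = getJSON(users);
    // The format strings were lost from the listing; the labels used here are placeholders.
    System.out.println(String.format("Text sizes: XML=%d JSON=%d ratio=%.4f",
        xml.length(), json.length(), 1.0 * xml.length()/json.length()));
    long t1 = System.currentTimeMillis();
    byte[] xmlZip = zip(xml);
    long t2 = System.currentTimeMillis();
    byte[] jsonZip = zip(json);
    long t3 = System.currentTimeMillis();
    System.out.println(String.format("Zip durations (ms): XML=%d JSON=%d ratio=%.4f",
        t2 - t1, t3 - t2, 1.0 * (t2 - t1) / (t3 - t2)));
    System.out.println(String.format("Zipped sizes: XML=%d JSON=%d ratio=%.4f",
        xmlZip.length, jsonZip.length, 1.0 * xmlZip.length/jsonZip.length));
}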
Not rocket science, but it should do the job.
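If more stable timings are wanted, one option is to repeat each measurement and keep the best run, since the first runs on a JVM include class loading and JIT warm-up. The sketch below is only an illustration of that idea, not what the benchmark does; the class name `TimingSketch`, the `bestMillis` and `gzip` helpers and the sample data are all hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

class TimingSketch {
    // Runs the task `runs` times and returns the shortest duration, in milliseconds.
    static double bestMillis(Runnable task, int runs) {
        if (runs < 1)
            throw new IllegalArgumentException("runs must be at least 1");
        long best = Long.MAX_VALUE;
        for (int i = 0; i < runs; ++i) {
            long start = System.nanoTime();
            task.run();
            best = Math.min(best, System.nanoTime() - start);
        }
        return best / 1_000_000.0;
    }

    // Gzips the data into a throwaway in-memory buffer.
    static void gzip(byte[] data) {
        try (GZIPOutputStream zip = new GZIPOutputStream(new ByteArrayOutputStream())) {
            zip.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample payload, much smaller than the benchmark's two million users.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 100000; ++i)
            sb.append("<user><id>").append(i).append("</id></user>");
        byte[] data = sb.toString().getBytes(StandardCharsets.UTF_8);
        System.out.printf("best of 5 runs: %.2f ms%n", bestMillis(() -> gzip(data), 5));
    }
}
```

`System.nanoTime()` is used here because it is meant for measuring elapsed time, whereas `System.currentTimeMillis()` reports wall-clock time.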
And the Winner Is…
Enough suspense, here are the results:
| | Text | Gzip | Zip duration |
|---|---|---|---|
| XML | 91.78M | 18.74M | 3.38s |
| JSON | 49.78M | 17.09M | 2.78s |
| XML overhead | 84.38% | 9.62% | 21.3% |
As expected, XML has a size overhead for both the text and zipped versions. But while this overhead is large for the text version (84%, almost twice as big), it becomes much less significant, less than 10%, when gzipped.
But to obtain this gain in size, we had to consume some additional CPU time: it takes more than 20% more time to gzip the XML document than the JSON document.
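For reference, each overhead figure is just the XML value divided by the JSON value, minus one, expressed as a percentage. The sketch below recomputes it from the rounded values of the table; the class and method names are illustrative. Because the table values are rounded, the recomputed percentages can differ slightly in the last digits from the published ones, which were presumably derived from the exact measurements.

```java
class OverheadSketch {
    // Relative overhead of XML over JSON, in percent: (xml / json - 1) * 100.
    static double overheadPercent(double xml, double json) {
        return (xml / json - 1.0) * 100.0;
    }

    public static void main(String[] args) {
        // Inputs are the rounded figures reported in the results table.
        System.out.printf("text: %.2f%%%n", overheadPercent(91.78, 49.78));
        System.out.printf("gzip: %.2f%%%n", overheadPercent(18.74, 17.09));
        System.out.printf("zip duration: %.2f%%%n", overheadPercent(3.38, 2.78));
    }
}
```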
So depending on your use case, this could be completely acceptable or not at all: if your server does not handle many requests and is never overloaded, the 20% additional time is not an issue because it buys a dramatic reduction in size; but if your server is already overloaded, 20% more CPU load could push the latency into the red.
As you’ve seen, while the “angle bracket tax” of XML is real, it can be dramatically reduced to an acceptable level, but at the cost of some additional processing time. Keep in mind that, like any benchmark, this one is worth what it’s worth; if the data format is critical in your situation, you should carry out your own study, inspired by this one but using your own data and technologies, because your mileage may vary.
Moreover, JSON has been chosen as their data format by some NoSQL databases like MongoDB or CouchDB, one more good reason to use it in order to build a uniform stack.