
XML optimization

By Stephane Rodriguez, 11 Sep 2002
 

Sample Image

What is XML optimization

XML optimization is a set of techniques for auditing the design metadata of any XML stream. Its purpose is to help XML producers minimize the side effects of using XML, such as size overhead or versioning lock-in.

For instance, a larger stream needs more network bandwidth to send or retrieve the same content, more memory to store the XML locally, and more time for the XML parser to process it.

XML optimization produces a report of relevant figures to work with (see screen capture above). With this report in hand, XML producers may either use dedicated XML automation tools to transform their XML streams according to defined rules or, if they find it more appropriate, redesign the whole XML metadata.

The figures displayed in the report were chosen because they are meaningful for almost any kind of XML stream, i.e. each can point to a substantial change in size or design. I tested over 50 XML files before settling on these figures.

XML optimization is new. Before writing this article, I browsed public web sites, newsgroups and quite a few research papers, and did not find a single one addressing the topic. Yet I believe it is of real-world interest: now that nearly every company in the high-tech industry uses XML in some way, it is as crucial as database tuning tools or network tuners. Why isn't this part of the leading XML tools (Xmlspy, Xmetal, Msxml, .NET API)? I don't know. Maybe developers are content enough with their use of XML and do not really see the impact of using XML instead of binary file formats and standard databases.

What is not XML optimization

XML optimization is not about compressing XML into a proprietary binary format. For that purpose, check out xmill (AT&T) and XMLppm (SourceForge). Their intent is to produce a binary format from XML by shrinking XML patterns. Such shrinking is very likely to pay off, for either of these reasons:
  • element and attribute names appear many times, and can thus be replaced with short tokens
  • lists of values may contain a lot of duplicated data, much like the records returned by a SQL join
Binary XML may be fine for some applications, but the XML immediately stops being human readable. That is why such tools are usually applied at the transport level, not at the application level. XML compression does not take interest away from XML optimization, since compression is the last resort when no smarter code or design principle can help; in other words, it is brute force. XML optimization, on the other hand, reveals best practices and caveats, and is thus bound to help XML producers learn about their own metadata.

A real world sample

Before going into details, I would like to point to an actual source XML stream and to the report obtained by applying the tool to it. In the HTML report, don't hesitate to click on the ? question marks for further information.

The remainder of this article is broken down into the following sections, which mirror the sections of the HTML report.

XML optimization : structure in general

General figures about the XML stream are simple numbers to begin with.

Though the meaning of nb lines, nb elements and nb comments is obvious, it is worth knowing what a high nb comments ratio does to an XML stream. XML producers usually add comments above, in, or below the actual XML elements to explain the hierarchy and the underlying design. What they often don't know is that a lot of "content management server (CMS)" software leaves the XML as is and sends it to clients without removing these unnecessary comments. As a result, the transported data is often 10% larger than it would be without comments. In this case, XML producers are strongly encouraged to slim down their XML. Note that nb CDATA sections and nb Process instructions play a role similar to nb comments.
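As a rough illustration of the comment figures, the sketch below measures the share of an XML string taken up by comments. It is not the tool's code: commentOverhead is a hypothetical helper, and its regular expression assumes that no comment-like text hides inside CDATA sections.

```javascript
// Hypothetical helper (not part of betterxml): counts the comments in an
// XML string and the share of characters they take.
// Assumption: no "<!-- ... -->" text appears inside CDATA sections.
function commentOverhead(xml) {
  var comments = xml.match(/<!--[\s\S]*?-->/g) || [];
  var commentChars = 0;
  for (var i = 0; i < comments.length; i++) {
    commentChars += comments[i].length;
  }
  return {
    nbComments: comments.length,
    commentChars: commentChars,
    ratio: xml.length === 0 ? 0 : commentChars / xml.length
  };
}

var sample = "<Bookstore><!--J&R Booksellers Database--><Book/></Bookstore>";
var report = commentOverhead(sample);
// report.ratio is the fraction of the stream that stripping comments would save
```

The same counting approach could be applied to CDATA sections and processing instructions by swapping the pattern.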

nb namespaces used is interesting because it reflects whether elements, attributes, and even the data itself use a lot of prefixes, which in turn may significantly increase the size of the XML stream. To make the report really useful, figures are often displayed both as absolute values and as percentages.

XML optimization : structure in details

Fasten your seatbelt, there are many topics here.

Structure pattern

This reverse engineers the XML stream hierarchy just by processing the stream (it never reads the DTD, if any), giving both the parent/children relationships and the datatypes when they are recognized (including floats, integers, currencies, dates, URLs, and emails).
What for? Reverse engineering the structure pattern is not only a unique feature, it also reveals whether the XML is designed "vertically" (lots of elements), "horizontally" (lots of attributes), or somewhat diagonally. The structure pattern is a preliminary block that must be displayed before the next topics because it makes the design easier to figure out.

Flattening the structure pattern

Distinct patterns tells whether there is more than one main pattern in the XML stream. Pattern occurrences, Pattern height (number of lines) and Pattern size (in bytes) show the key characteristics of the main structure pattern. These figures are worth mentioning by themselves, but they are also preliminary to the next one.

Now, what is flattening patterns? It is what you get by replacing child elements with attributes, where possible. Here is a sample before and after flattening:

Original XML :

 <person>
  <firstname>John</firstname>
  <lastname>Lepers</lastname>
 </person>

Modified XML :

<person firstname="John" lastname="Lepers"/>

Flattening the patterns makes use of what the W3C XML recommendation calls empty-element tags, i.e. tags such as <person .../> that need no separate end tag, which reduces the size by a significant amount. Flattening patterns has several interesting effects: (1) because the hierarchy is flat, parsing will be faster; (2) it is much easier to diff XML streams with flattened patterns.
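A minimal sketch of flattening, assuming the XML has already been loaded into a simple in-memory tree. The node shape ({ name, attrs, children, text }) and the helper names escapeXml, flatten and serialize are hypothetical and not taken from the tool. Only children that occur once and carry nothing but text are folded into attributes.

```javascript
// Hypothetical in-memory element: { name, attrs: {...}, children: [...], text }
function escapeXml(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/"/g, "&quot;");
}

// Fold every child that occurs exactly once, and has neither attributes nor
// children of its own, into an attribute of its parent. Other children are
// kept and flattened recursively.
function flatten(node) {
  var counts = Object.create(null);
  node.children.forEach(function (child) {
    counts[child.name] = (counts[child.name] || 0) + 1;
  });

  var kept = [];
  node.children.forEach(function (child) {
    var isLeaf = child.children.length === 0 &&
                 Object.keys(child.attrs).length === 0;
    var nameIsFree =
      !Object.prototype.hasOwnProperty.call(node.attrs, child.name);
    if (counts[child.name] === 1 && isLeaf && nameIsFree) {
      node.attrs[child.name] = child.text || "";
    } else {
      kept.push(flatten(child));
    }
  });
  node.children = kept;
  return node;
}

// Write the tree back as text, using an empty-element tag when possible.
function serialize(node) {
  var out = "<" + node.name;
  Object.keys(node.attrs).forEach(function (key) {
    out += " " + key + "=\"" + escapeXml(node.attrs[key]) + "\"";
  });
  if (node.children.length === 0 && !node.text) {
    return out + "/>";
  }
  out += ">" + escapeXml(node.text || "");
  node.children.forEach(function (child) { out += serialize(child); });
  return out + "</" + node.name + ">";
}

var person = {
  name: "person", attrs: {}, text: "",
  children: [
    { name: "firstname", attrs: {}, children: [], text: "John" },
    { name: "lastname",  attrs: {}, children: [], text: "Lepers" }
  ]
};
var flat = serialize(flatten(person));
// flat is '<person firstname="John" lastname="Lepers"/>'
```

Repeated children are left alone because an element cannot carry two attributes with the same name. Values containing line breaks are also poor candidates, since parsers normalize whitespace in attribute values.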

Structure depth

The depth we are talking about is the element depth in the hierarchy, i.e. "1" for the root element, "2" for its direct children, and so on. The measure comes with the usual figures: the minimum value over the whole XML stream, the maximum value, the average value, and the standard deviation. A large standard deviation means that the XML stream makes intensive use of indentation, <, > and end tags, which in turn increases the size.

To better reveal the depth, we also list the number of elements at each depth.
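The depth figures can be sketched with a small tag scanner. This is only an illustration under simplifying assumptions, not the tool's Expat-based implementation: depthStats is a hypothetical name, attribute values must not contain ">", and the DOCTYPE must not carry an internal subset.

```javascript
// Hypothetical helper: depth of every element in an XML string.
// Assumptions: attribute values contain no ">" and the DOCTYPE has no
// internal subset. Comments, CDATA sections, processing instructions and
// the DOCTYPE are skipped.
function depthStats(xml) {
  var token = /<!--[\s\S]*?-->|<!\[CDATA\[[\s\S]*?\]\]>|<\?[\s\S]*?\?>|<![^>]*>|<(\/?)[^>]*?(\/?)>/g;
  var depth = 0;
  var depths = [];
  var match;
  while ((match = token.exec(xml)) !== null) {
    if (match[1] === undefined) continue;         // not an element tag
    if (match[1] === "/") { depth--; continue; }  // end tag
    depth++;                                      // start tag
    depths.push(depth);
    if (match[2] === "/") depth--;                // empty-element tag
  }

  var counts = {};            // depth -> number of elements at that depth
  var sum = 0, min = Infinity, max = 0;
  depths.forEach(function (d) {
    counts[d] = (counts[d] || 0) + 1;
    sum += d;
    if (d < min) min = d;
    if (d > max) max = d;
  });
  var n = depths.length;
  var mean = n ? sum / n : 0;
  var variance = 0;
  depths.forEach(function (d) { variance += (d - mean) * (d - mean); });
  variance = n ? variance / n : 0;   // population variance

  return {
    counts: counts,
    min: n ? min : 0,
    max: max,
    mean: mean,
    stddev: Math.sqrt(variance)
  };
}

var stats = depthStats(
  '<?xml version="1.0"?><Bookstore><!--db-->' +
  '<Book Genre="Thriller"><Title>The Round Door</Title></Book>' +
  '<Book/></Bookstore>');
// stats.counts holds one element at depth 1, two at depth 2, one at depth 3
```

The per-depth counts are the kind of array that a bar chart routine expects as its data.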

The depth measure is displayed visually as a bar chart (numerical figures in a list often hide the trend). For those interested in how the chart is built, the JavaScript code follows:

// usage
var tabheight = 120;
var tabdata = new Array(1,15,83,159); // y-axis
var tabtips = new Array("01","02","03","04"); // x-axis
showChart_Max("<p class='m1'>Depth histogram chart</p>",tabheight,
              "#4488DD",tabdata,tabtips);

// chart library 1.0 - Stephane Rodriguez - free software
function showChart_Max(title, height, color, data, datatips)
{
  // nothing to draw without data; tips are optional, but when given
  // there must be one per data value
  if (data.length==0 || (datatips.length!=0 && data.length!=datatips.length))
    return;

  // calculate min, max and average
  var max = data[0];
  var min = data[0];
  var sum = 0;

  for (var i=0; i<data.length; i++)
  {
    var c = data[i];
    sum += c;
    if ( max<c )
      max = c;
    if ( min>c )
      min = c;
  }

  // arithmetic mean of the data, truncated to two decimals
  var average = sum/data.length;
  average = Math.floor(100*average)/100;

  // output table header
  document.writeln ("<table height='"+height+"' cellpadding='0' " + 
                    "cellspacing='0' border='0'>");

  document.writeln ("<tr><td valign='center'><font size='-1'>max=" + 
                    max+"</font></td>");

  // output data according to max
  for (i=0; i<data.length; i++)
  {

    // height of bar (zero when all values are zero)
    var dataportion = (max > 0) ? height * data[i] / max : 0;
    var voidportion = height - dataportion; // void between top of the bar 
                                            // and top of the table
    document.writeln ("<td height='129' width='15' rowspan='5'> </td>");
    document.writeln ("<td width='15' rowspan='5'>");
    document.writeln (" <table width='100%' cellpadding='0' " + 
                      "cellspacing='0' border='0'>");
    document.writeln ("  <tr><td height='"+voidportion+"'></td></tr>");
    document.writeln ("  <tr><td height='"+dataportion+"' " + 
                      "bgcolor='"+color+"'></td></tr></table>");
    document.writeln ("</td>");

  }
  document.writeln ("</tr>");

  // output min, max and average in first column (rowspan)
  document.writeln ("<tr><td> </td></tr>");
  document.writeln ("<tr><td><font size='-1'>avg="+average+
                   "</font></td></tr>");
  document.writeln ("<tr><td> </td></tr>");
  document.writeln ("<tr><td><font size='-1'>min="+min+"</font></td></tr>");

  document.writeln ("<tr><td valign='center'></td>");
  // output data according to max
  for (i=0; i<data.length; i++)
  {
    j=i+1;

    document.writeln ("<td width='15'> </td>");

    if (datatips.length==0)
      document.writeln ("<td width='15'><font size='-1'>"+j + 
                        "</font></td>");
    else
      document.writeln ("<td width='15'><font size='-1'>" + 
                        datatips[i]+"</font></td>");
  }
  document.writeln ("</tr>");

  if (title!="")
    document.writeln ("<caption valign='bottom'>"+title+"</caption>");

  document.writeln ("</table><br><br>");
}

Structure node naming strategy

Element and attribute names are usually chosen to be self-descriptive. While this looks like an advantage, it has a size overhead: even in English, the keywords enclosing the content statistically take significant space and contribute heavily to the overall stream size. This can be avoided by enforcing the naming strategy described below. An element or attribute name can be almost any combination of letters and digits, provided it starts with a letter or an underscore. With that in mind, why not make these names as short as possible? Let us take an example:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE Bookstore SYSTEM "bookshop.dtd">
<Bookstore>
  <!--J&R Booksellers Database-->
  <Book Genre="Thriller" In_Stock="Yes">
    <Title>The Round Door</Title>
  </Book>
</Bookstore>

Let's build a map of name pairs:

 Bookstore  becomes A
 Book       becomes B
 Genre      becomes C
 In_Stock   becomes D
 Title      becomes E

So we get the following equivalent XML document :

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE A SYSTEM "bookshop_A.dtd">
<A>
  <!-- J&R Booksellers Database -->
  <B C="Thriller" D="Yes">
    <E>The Round Door</E>
  </B>
</A>

As with depth, the node naming strategy is also shown visually as a bar chart, so the trend is visible.

The gain obtained by applying the smart node naming strategy to the XML stream is calculated. It is often 30% or more, which is very significant.
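The name map and the resulting gain can be sketched as follows. The helper names shortName, buildNameMap and savedChars are hypothetical, and the list of occurrences is typed in by hand from the Bookstore sample rather than extracted by a parser.

```javascript
// Hypothetical helper: 0 -> "A", 25 -> "Z", 26 -> "AA", 27 -> "AB", ...
function shortName(index) {
  var name = "";
  var n = index;
  do {
    name = String.fromCharCode(65 + (n % 26)) + name;
    n = Math.floor(n / 26) - 1;
  } while (n >= 0);
  return name;
}

// Map each distinct name to a short one, in order of first appearance.
function buildNameMap(names) {
  var map = Object.create(null);
  var next = 0;
  names.forEach(function (name) {
    if (map[name] === undefined) {
      map[name] = shortName(next++);
    }
  });
  return map;
}

// Characters saved when every occurrence is replaced by its short name.
function savedChars(occurrences, map) {
  var saved = 0;
  occurrences.forEach(function (name) {
    saved += name.length - map[name].length;
  });
  return saved;
}

// Name occurrences in the Bookstore sample, DOCTYPE line excluded; element
// names count once for the start tag and once for the end tag.
var occurrences = ["Bookstore", "Book", "Genre", "In_Stock",
                   "Title", "Title", "Book", "Bookstore"];
var nameMap = buildNameMap(occurrences);
var saved = savedChars(occurrences, nameMap);
// nameMap gives Bookstore=A, Book=B, Genre=C, In_Stock=D, Title=E; saved is 41
```

Dividing the saved characters by the total stream size gives the percentage gain reported for a stream.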

Structure attributes

The Structure attributes indicator reveals how uniformly attributes are distributed among elements. Besides the usual number of attributes per element (with min, max, mean and standard deviation), there is the disorder ratio. The disorder ratio attempts to show whether attributes are listed in the same order across the occurrences of an element. It is of course an average, because each element may have any number of attributes. According to the W3C XML recommendation, the order of attributes is not significant; it is simply a good habit to always list them in the same order.
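The article does not give the exact formula for the disorder ratio, so the sketch below uses one plausible definition as an assumption: the share of element occurrences whose attributes appear in a different relative order than in earlier occurrences of the same element. The function name and the input shape are hypothetical.

```javascript
// Assumed definition (not the tool's documented formula): share of element
// occurrences whose attributes come in a different relative order than in
// earlier occurrences of the same element name.
// Input: [{ name: "Book", attrs: ["Genre", "In_Stock"] }, ...]
function disorderRatio(occurrences) {
  var reference = Object.create(null);  // element name -> attribute order seen so far
  var disordered = 0;

  occurrences.forEach(function (occ) {
    var ref = reference[occ.name];
    if (ref === undefined) {
      // the first occurrence defines the reference order
      reference[occ.name] = occ.attrs.slice();
      return;
    }
    // compare only the attributes both lists have in common
    var seen = occ.attrs.filter(function (a) { return ref.indexOf(a) !== -1; });
    var expected = ref.filter(function (a) { return occ.attrs.indexOf(a) !== -1; });
    if (seen.join("\n") !== expected.join("\n")) {
      disordered++;
    }
    // attributes met for the first time are appended to the reference
    occ.attrs.forEach(function (a) {
      if (ref.indexOf(a) === -1) ref.push(a);
    });
  });

  return occurrences.length === 0 ? 0 : disordered / occurrences.length;
}

var ratio = disorderRatio([
  { name: "Book",  attrs: ["Genre", "In_Stock"] },
  { name: "Book",  attrs: ["In_Stock", "Genre"] },   // swapped order
  { name: "Book",  attrs: ["Genre"] },
  { name: "Title", attrs: [] }
]);
// ratio is 0.25: one occurrence out of four disagrees with the reference
```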

Structure namespaces

XML namespaces are declared with a special attribute of the form xmlns:supplier="http://www.namespaces.com/supplier", which refers to a set of element and attribute names with a dedicated semantic meaning. Elements and attributes belonging to a namespace are prefixed with the namespace prefix, for instance supplier:orderID. Namespaces are not required in XML streams, but they carry special meanings and may simplify data binding, as long as their real meanings are made public and available to everyone. Any number of namespaces can be used, not only one. A namespace prefix must be declared on the element where it is used or on one of its ancestors. The URL used in the declaration is a fake one here; it only serves as a globally unique identifier. Below is a sample for the supplier namespace:

<?xml version="1.0" encoding="ISO-8859-1"?>
<Orders xmlns:supplier="http://www.namespaces.com/supplier">
  <Order date="AA/45/10" supplier:id="UIYBAB47KDIU75">
    <Id>NBZYSJSGSIAUSYGHBXNBJDUIUYE</Id>
  </Order>
</Orders>

When namespaces are used, the report shows the ratio of namespace use and the list of namespaces.

Using namespaces or not strongly changes the underlying XML design; it also affects the node naming strategy and, in turn, the overall size of the XML stream.

Content itself

Even though the content itself is not part of the XML metadata, there are many ways for it to produce size overhead. The simplest, of course, is to dump data in XML format from a relational database system without factorizing duplicate values. It is easy to see that there is a lot to gain here.

Raw content

Content size in element or attribute values exhibits a trend that can be described using the minimum, maximum, average, and standard deviation.

In addition, the ratio of elements and attributes with no values is shown. If the ratio is high, it is fair to question whether the design of the metadata is good.

A somewhat odd indicator is the Ratio of multiple part values. Below are two samples of multiple part values for the <book> element, one where the value spans several lines and one where it is split around child elements:

<book>
   The name of this book is so inadequate for a general audience
   that it has been decided not to print it.
</book>

...

<book>The Round Door
  <year>1999</year>
  <price>20$</price>Part II
</book>

Content correlation

Content correlation is an in-depth examination of List Of Values that reveals valuable things. The first indicator relates to duplication, i.e. how often the same values appear again and again; it includes the max, average and standard deviation. The second indicator is a ranking: it shows the most frequently seen value across all List Of Values.
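Both indicators can be sketched over a plain array of values. The function name duplicationStats and the way values are collected are assumptions; the tool itself gathers them while parsing.

```javascript
// Hypothetical helper: how often each value repeats in a list of values,
// plus the most frequently seen value.
function duplicationStats(values) {
  var counts = Object.create(null);   // value -> number of occurrences
  values.forEach(function (v) {
    counts[v] = (counts[v] || 0) + 1;
  });

  var distinct = Object.keys(counts);
  if (distinct.length === 0) {
    return { distinct: 0, max: 0, average: 0, stddev: 0, mostSeen: null };
  }

  var sum = 0, max = 0, mostSeen = null;
  distinct.forEach(function (v) {
    sum += counts[v];
    if (counts[v] > max) {
      max = counts[v];
      mostSeen = v;
    }
  });

  var average = sum / distinct.length;   // mean occurrences per distinct value
  var variance = 0;
  distinct.forEach(function (v) {
    variance += (counts[v] - average) * (counts[v] - average);
  });
  variance /= distinct.length;           // population variance

  return {
    distinct: distinct.length,
    max: max,
    average: average,
    stddev: Math.sqrt(variance),
    mostSeen: mostSeen
  };
}

var dup = duplicationStats(["Thriller", "Yes", "Novel", "Yes", "Thriller", "Yes"]);
// dup.mostSeen is "Yes" (3 occurrences); dup.average is 2 occurrences per value
```

An average well above 1 suggests that values could be factorized, for example by referencing a shared definition instead of repeating it.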

Content spacing and indentation

Indentation is often used in XML streams, as they are often designed and read by humans. But indentation produces a significant increase in size. The report shows the new size of the XML stream without any indentation. The saving is often 30%.
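A rough way to estimate this figure is to drop the whitespace that sits between tags and compare sizes. The sketch below is an approximation with a hypothetical function name: it counts characters rather than bytes, and it would also touch whitespace inside comments, CDATA sections and mixed content, which a real tool has to leave alone.

```javascript
// Hypothetical helper: size of an XML string once whitespace-only runs
// between tags are removed.
// Assumption: no significant whitespace between tags (no mixed content, no
// ">  <" sequences inside comments or CDATA sections).
function indentationOverhead(xml) {
  var stripped = xml.replace(/>\s+</g, "><").replace(/^\s+|\s+$/g, "");
  return {
    originalChars: xml.length,
    strippedChars: stripped.length,
    gain: xml.length === 0 ? 0 : (xml.length - stripped.length) / xml.length,
    stripped: stripped
  };
}

var indented =
  "<person>\n" +
  "  <firstname>John</firstname>\n" +
  "  <lastname>Lepers</lastname>\n" +
  "</person>\n";
var result = indentationOverhead(indented);
// result.stripped is the same document on a single line, without indentation
```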

Summary of important measures

Out of the many figures in the HTML reports, several deserve some introductory explanation:

  • Flattening patterns: the design rule of replacing 1-cardinality elements with attributes. It sounds awful, but a lot of space is gained here.
  • Indentation and multiple spaces: beautifying your XML stream is OK as long as you are dealing with tiny streams. Indented XML streams can be, simply put, up to twice as large. Keep this in mind if your server-side component does not scale and you are wrecking the entire network bandwidth.
  • Disorder ratios: the kind of measure that by itself helps improve the schema design and, incidentally, may reduce XML bug fixing.
  • Correlation in content: statistically speaking, an XML stream has a lot of size overhead just because content is duplicated rather than factorized.

How to use the tool

Syntax :

  single file : betterxml <your file>
                betterxml bookshop.xml
                betterxml c:\mydir\bookshop.xml
                betterxml http://www.mysite.com/xml/bookshop.xml
			
  whole directory : betterxml -d <your directory>
                    betterxml -d c:\tmp\repository

Technical details

Technically, the tool is based on James Clark's Expat (a royalty-free SAX parser). The executable, a report generator built on top of a static library, can be divided into three parts:

  • betterxml.dsp (betterxml.exe), a report generator, contains mostly the HTMLWriter class, which is straightforward and reuses HTML templates stored in the .rc resource file. All strings are localized and ready for a foreign release, if anyone is interested. The HTML reports have a built-in chart library (limited to bar charts) that displays charts using JavaScript.

  • SaxAnalyzer.dsp (SaxAnalyzer.lib), an XML extraction library, with the following set of classes:
    • IXmlStats : API to expose measures. Inherits IUnknown.
    • AppLogic.cpp : callbacks from the XML parser, calculations of all measures
    • Element.cpp : element + attribute API.
    • HtmlParser.cpp : general purpose HTML parser, used to extract details that expat does not see.
    • XmlFileManager.cpp : manages XML stream reading, including async monikers for URL-based XML streams.


  • xmlparser.dsp (xmlparser.lib), the expat library itself. Both VC6 and VC7 workspaces are provided.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.


About the Author

Stephane Rodriguez.
France
Member
Addicted to reverse engineering. At work, I am developing business intelligence software in a team of smart people (independent software vendor).
 
Need a fast Excel generation component? Try xlsgen.
 

Comments and Discussions

 
Question | XML optimization | member Member 90719114 Aug '12 - 20:41 
Dear Stephane Rodriguez. Your article is very nice and informative, but I am unable to compile and run the betterxml_src.zip project source files in Visual Studio 2008. The error message is given below.
"The source control provider associated with the solution could not be found."
Question | XML optimization | member Member 90719113 Aug '12 - 23:47 
I am unable to run the betterxml_src.zip project in Visual Studio 2008. The following message appears.
 
The source control provider associated with the solution could not be found. Can anybody tell me how to compile and run this project in Visual Studio 2008?
Generalhello need help (urgent)membervikasmisralkw22 Apr '07 - 20:18 
hello sir,
i am a software professional.we are developing a web based application
in which we need to parse more than 10 mb xml files from diffent web sources.the problems we are facing are
1)how to parse xml file until its completely downloaded
2)will the sax help in this case as we need to get values as they appear
3)what will be the best way to parse these incomplete xml files
 
thanks in advance
General | XML binarization : BinXML | member FabriceExp | 21 Aug '03 - 4:02 
Hi again Stephane,
Here I am again, with comments, no silly conversion question this time ;)
Regarding your remarks on binary XML : xmill and xppm are very far in performance (speed, memory consumption, ...) and ease of use from more recent XML binarizers like BinXML (www.expway.com).
You're right, data are not human readable while binarized, but with BinXML libraries :
- decoding is made through standard APIs like SAX, DOM, making binarization transparent for user (except for manual handling of course)
- decoding process is *much* faster (up to several times, not only a few %) than textual parsers, like Xerces-c, libXML, MSXML, or Java textual parsers etc... Gain factor highly depends on XML structure and grammar.
- Compression rates are up to several times those of a classical textual compressor like zip -compression process uses Schema grammar knowledge-
- decoder size is much smaller than usual parsers, allowing embedded applications, etc...
 
To be honest : I am working at Expway. But all that is the truth and only the truth. Our XML binarization format (called BiM) is normalized and part of MPEG-7 standard. It is currently being adopted by worldwide major broadcast and telecom industry companies.
 
See you in Paris.

General | Re: XML binarization : BinXML | member Stephane Rodriguez. | 21 Aug '03 - 7:00 
Looks like it's a small world. ;) I heard about Expway a few months ago.
I think the technology has potential. To be honest, I don't quite see the connection with MPEG7: even though I imagine the big contribution of MPEG7 over MPEG4 is the addition of a lot of metadata, I think what actually sells is mostly the compression codecs for audio/video streams. In that respect, when you see the damage patents have done to the MPEG4 format, you have to wonder... Besides, Microsoft has recently taken a position: MPEG4 goes in the trash, from now on there is only WMA/DRM and the like...
A real waste.
 
Back to the matter at hand. It is pretty bold to republish a SAX/DOM API that hides the XML compression. There are very positive points, and others that are a bit less so, in my opinion.
On the positive side, it means that if it catches on, it's the jackpot. For that, wouldn't it be worth making the API open source, as RealNetworks did earlier this year with Helix? (Incidentally, the RealNetworks codecs remain hidden and proprietary.)
 
I can well see compression chains mostly embedded in hardware. After all, that is what RedLine already does for HTML (cf. TF1). And above all, compression at the transport level makes sense. Compressing at the application level can be good, but you need to find key applications that lock the technology in. Unfortunately, I get the impression that BinXml is generic for now and has no "industrial justification", unlike for example an ultra-specialized SAPBW connector.
 
Otherwise, one thing bugs me: I see no reference to web services on the site. Given that all the new Windows client and server products are heading in that direction, and that all server infrastructures, J2EE included, are adding month after month the protocols and tools to make web services fit for industrial use, I would be surprised if this were not the number one opportunity, well ahead of MPEG7. That said, maybe I am only seeing things through the keyhole...
 

When you see this article[^], you can't help thinking there is a problem. The market is today saturated with XML development studios (in the broad sense of the term) while XML is not yet a reality in end-user companies. What is scary is one thing all these solutions have in common: none of them starts from the assumption that XML transformations already exist in one form or another, and that in practice integration is not trivial because you must be able to take the inputs and outputs of other vendors' proprietary solutions. That is a fundamental problem. And unfortunately, for lack of articles that would make me believe otherwise, I get the impression that all the players doing XML think they are alone in their own little world, and so they will have the same problems as a software vendor releasing a product without anticipating file format versioning.
 
We'll see how it turns out.
 
Otherwise, a business idea: RSS feeds. That thing is moving a lot at the moment (RSS2, Atom, ...), so why not. Besides, the compression ratio must be poor, especially if the XML data is compressed with LZW-type dictionaries.
 
See you
 


 

-- modified at 9:16 Saturday 8th October, 2005
General | Re: XML binarization : BinXML | member FabriceExp | 21 Aug '03 - 23:09 
Thanks for your feedback, very interesting.
If you want to write to me later, so we can speak French without bothering anyone, remove the 2 NOSPAM:
 
fabrice.toNOleSPAMdano@expNOway.SPAMcom
 
I would be curious to know where you work, for instance; you know a great deal about XML. And I could answer you on the slightly touchy points (royalties, markets...)!
For the record, MPEG7 only handles metadata, hence our strong presence within this standard, as well as in TV-Anytime, a standard that builds on MPEG7 and is dedicated to TV broadcast.
 
See you
 
Fabrice

General | Article's source code | member .S.Rod. | 15 Nov '02 - 6:38 
CodeProject is changing. Now users are required to create an account and log on before they can download the zip files. Needless to say this is a shift in how authors relate article sharing and the passion to spend time on creating articles with source code about (hopefully) value content for others.
 
As author of this article, I haven't been given the opportunity to block this shift in the spirit of the site, although I am heavily against.
I have been told (through flames and misc kind of insults by some sectary Codeproject members) that the logon enforcement was a consequence of an action aimed to limit users from downloading the entire Codeproject site.
A lower profile solution should have been taken instead of that nasty annoying stinking one. Scripting techniques are available and should be a good and meaningful alternative.
 
As a consequence, as long as a safer Codeproject user policy is not back into place, I, as author of this article should not be taken liable for the absence or of source code along with it. Furthermore, I should not be by any mean liable for any damage resulting of the use of the binaries and source code in the zip files attached with this article.
On the other hand, until then the support of this article is discontinued.



General | Re: Article's source code | admin Chris Maunder | 15 Nov '02 - 8:44 
Hey Stephane
 
I understand you're upset about the change but I don't understand why you are posting such comments. Making a comment "I should not be by any mean liable for any damage resulting of the use of the binaries and source code in the zip files attached with this article" is a strange statement - it almost seems like you're setting the stage for something. We're an open forum with no barrier (apart from registering) to uploading and downloading source code and binaries. If you upload code or binaries that deliberately includes code that is damaging to others then you are liable for that code - it's the same as if you had emailed that code to an unsuspecting user yourself.
 
We are a free site and we intend to keep the site free to use. We have absolutely no hidden agenda to make it a pay-per-view site. The barrier to downloading code is minor. You don't need to recieve the newsletters, you don't need to post messages, you don't need to do anything but register. If you then never use your account for anything other than downloading files then that's perfectly OK.
 
cheers,
Chris Maunder
General | request..need help.. | member swapna | 4 Nov '02 - 15:33 
hai Rod,
I read your article and was amazed by it...I mean your ideas of XML optmization...I am having problems of running your tool...???can u help me on that...
thanks
Swapna
General | Structure node naming strategy | member Joao Morais | 19 Sep '02 - 4:26 
Hi Stephane,
 
Good article and tool.
 
I have a concern about your strategy on elements and attributes naming. As you wrote, those elements should be self-descriptive and human readable. I agree with that. However, if we start using a strategic naming code and a look-up table to find the right final name of an element, this will defy the purpose of XML document readability.
 
Further more, none only this strategy will make the XML source unreadable (for a human), but all stylesheets used to transform that source will also be unreadable.
 
I still do think your tool is very useful, in the way is provide us a lot of metrics about a XML document (nb of lines, nb of space characters, indentation usage, depth analysis). Thanks to have done and shared it.
 
I suppose the next step will be to provide a tool which analyses the different areas of XML document used by a stylesheet during a XSL-T.
This is another tool I am looking for (I do not think I will have the courage to write it ;)
 
Joao

Last Updated 12 Sep 2002
Article Copyright 2002 by Stephane Rodriguez.