Introduction
XML is a great technology with many useful, standardized, and well-supported extensions such as XML Schema, XPath, XQuery, and XSLT. However, XML has a reputation for being too verbose. This doesn't come as a surprise if we look at one of the design goals stated in the official W3C XML Recommendation:
"Terseness in XML markup is of minimal importance."
-- XML specification
An interesting question arises: "Couldn't we keep the technology, but just improve the syntax in order to make XML and HTML easier to read and write for humans?"
As shown in this article, the answer to this question is a resounding "Yes, we can!".
We'll have a look at a suggestion for a new syntax that is less verbose, easy to read and write, and works well for all kinds of XML documents, including HTML code.
Note
Readers of this article are assumed to have a basic knowledge of XML, HTML, and JSON.
Existing Alternatives
Before trying to invent anything new, we should of course first take a close look at what already exists.
This chapter answers the question: Is there any existing markup language with a more human-friendly syntax than XML/HTML, but also well suited for big, complex, and changing documents?
JSON
In recent years, JSON has overtaken XML in popularity.
To understand why (in the context of syntax), let's have a look at a simple data structure in JSON:
{
    "person": {
        "name": "Albert",
        "married": true,
        "address": {
            "street": "Kramgasse",
            "city": "Bern"
        },
        "phones": [ "123", "456"]
    }
}
In XML, the code could look like this:
<?xml version="1.0" encoding="UTF-8"?>
<person>
    <name>Albert</name>
    <married>true</married>
    <address>
        <street>Kramgasse</street>
        <city>Bern</city>
    </address>
    <phones>
        <phone>123</phone>
        <phone>456</phone>
    </phones>
</person>
Many people prefer the JSON syntax. It is easier to read and less verbose than the XML version. Not counting indentation spaces, the above JSON code requires 144 characters to type. The XML code has 276 characters. That's an increase of 92%!
Examples like the one above lead to an intriguing question:
Could we also use the JSON syntax to write markup, for example HTML documents?
Let's try.
Here is a simple HTML snippet:
<p>foo bar</p>
In JSON, we could express this as follows:
{ "p": "foo bar" }
Let's write foo in italics, and bar in bold.
HTML:
<p><i>foo</i> <b>bar</b></p>
JSON:
{ "p": [ { "i":"foo" }, " ", { "b": "bar" } ] }
Now we want to display everything in red:
HTML:
<p style="color:red;"><i>foo</i> <b>bar</b></p>
JSON:
{ "p": { "style": "color:red;", "content": [ { "i": "foo" }, " ", { "b": "bar" } ] } }
We can prettify to make the code easier to read:
{
    "p":{
        "style":"color:red;",
        "content":[
            {
                "i":"foo"
            },
            " ",
            {
                "b":"bar"
            }
        ]
    }
}
But now, the HTML one-liner has mutated into a 14-line monster with lots of horizontal and vertical whitespace.
Not quite what we are looking for.
Besides the obvious fact that the complexity of the JSON code increases quickly, there is another worrying observation:
- In the first example, the value of the p element was a string: "p": "..."
- In the second example, the value becomes a JSON array: "p": [...]
- In the last example, it mutates to a JSON object: "p": {...}
Such changes can easily lead to maintenance nightmares. Code that inspects the data structure must be updated each time the shape of the data changes. For example, if we wanted to extract the text of element p, we would need to write different code for each of the three cases.
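To make the three cases concrete, here is a minimal Java sketch. It assumes the JSON has already been parsed into plain Java objects (String for a JSON string, List for an array, Map for an object). The class name JsonTextExtractor and the "content" key convention are taken from the examples above or invented for illustration; they are not part of any library.

```java
import java.util.List;
import java.util.Map;

class JsonTextExtractor {

    // Returns the concatenated text found in a parsed JSON value.
    // Hypothetical helper: every shape used in the examples needs its own branch.
    public static String extractText(Object value) {
        if (value instanceof String) {
            // Case 1: "p": "foo bar"
            return (String) value;
        }
        if (value instanceof List<?>) {
            // Case 2: "p": [ {"i": "foo"}, " ", {"b": "bar"} ]
            StringBuilder sb = new StringBuilder();
            for (Object item : (List<?>) value) {
                sb.append(extractText(item));
            }
            return sb.toString();
        }
        if (value instanceof Map<?, ?>) {
            Map<?, ?> map = (Map<?, ?>) value;
            if (map.containsKey("content")) {
                // Case 3: "p": { "style": "...", "content": [...] }
                return extractText(map.get("content"));
            }
            // A child element such as {"i": "foo"}: descend into its value.
            StringBuilder sb = new StringBuilder();
            for (Object child : map.values()) {
                sb.append(extractText(child));
            }
            return sb.toString();
        }
        return "";
    }

    public static void main(String[] args) {
        Object asString = "foo bar";
        Object asArray = List.of(Map.of("i", "foo"), " ", Map.of("b", "bar"));
        Object asObject = Map.of("style", "color:red;", "content", asArray);

        System.out.println(extractText(asString)); // prints foo bar
        System.out.println(extractText(asArray));  // prints foo bar
        System.out.println(extractText(asObject)); // prints foo bar
    }
}
```

The same visible text requires three different code paths, and the sketch is still incomplete: an object that carries attributes but no "content" key would have its attribute values mistaken for text.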
XML doesn't have this problem. The content of p is always a list of child nodes (text and elements).
At this point, you hopefully agree that we can stop further investigation and move on. The JSON syntax is a bad fit for describing markup code like HTML documents in a human-friendly way. That doesn't mean of course that 'JSON is bad'. JSON is a good choice in many cases. It is a native part of JavaScript, well supported in most programming languages, and there are lots of libraries and tools available for JSON. However, in the context of our search for a better markup syntax, JSON (as well as all variations of it) is not an option. Later on, we'll have a look at a more complete HTML example that confirms our conclusion.
YAML
One way to minimize verbosity is to use indentation to define structure. YAML is probably the most popular language that uses this technique.
Here is a reprint of a JSON example we saw previously:
{
    "person": {
        "name": "Albert",
        "married": true,
        "address": {
            "street": "Kramgasse",
            "city": "Bern"
        },
        "phones": [ "123", "456"]
    }
}
In YAML, this becomes:
person:
  name: Albert
  married: true
  address:
    street: Kramgasse
    city: Bern
  phones:
    - 123
    - 456
Nice!
Easy to read and write.
At first sight, it might seem that we could use such a noise-less syntax for all kinds of data structures, including markup code.
It turns out that would be a very bad idea. The problem with YAML and all other languages that use indentation to define structure is this: It works well for small, simple structures (such as config files). But if we need to manage big documents with deeply nested structures then it quickly becomes error-prone and unmaintainable.
Moreover, while using indentation to define structure effectively reduces verbosity, it also leads to many more lines of code for certain types of documents. The reason is that each child element must be written on a new line.
To illustrate this, let's see how the simple HTML one-liner we used in the previous chapter would be written in YAML. Here is a reprint of the HTML:
<p style="color:red;"><i>foo</i> <b>bar</b></p>
In YAML, the code would look like this:
p:
  style: 'color:red;'
  content:
    - i: foo
    - ' '
    - b: bar
There are other arguments against whitespace-sensitive documents, such as the problems with mixing spaces and tabs, and code snippets that cannot be shared between different documents with different levels of indentation. These inconveniences are well known - there is no need for repetition here.
Finally, the whitespace-significant approach forces us to use whitespace according to the rules (which can get very complex). It takes away the freedom to use whitespace to make documents more appealing and understandable.
As with JSON, this doesn't mean that 'YAML is bad'. YAML is well suited in some cases. What I want to say is that the idea of using whitespace-sensitivity in a markup language like HTML is doomed to fail. It's understandable that, according to Wikipedia, the meaning of the acronym YAML was changed from "Yet Another Markup Language" to "YAML Ain't Markup Language".
XML/HTML ignores whitespace, and that's the right choice.
Other
A good number of other markup languages exist, but I am not aware of any syntax that would be well suited to replace the XML syntax. If you know of a good alternative, then please leave a comment.
There are also many tools and editor plugins aiming to alleviate the pain of writing XML code by hand. However, the gist of this article is not to alleviate the pain. We want to remove it.
History
pXML Predecessor
When I started to think about the new syntax, creating a better XML/HTML syntax was not on my mind at all. What I wanted was a new syntax for writing articles (published on a blog) and books. Initially I used DocBook, then Asciidoctor. I also tried Markdown and had a look at other syntaxes such as reStructuredText. To make a long story short: I was frustrated by some impracticalities of existing solutions, and finally decided to design a new syntax called Practical Markup Language (PML). If you want to know more about my motivation to create PML, you can read "We Need a New Document Markup Language - Here is Why" (published in March 2019). (Note: For readers still using a word processor, I also wrote "Advantages of Document Markup Languages vs WYSIWYG Editors".)
Nowadays, I write all my articles in PML (including this one). To publish them, I created a PML to HTML Converter which reads a PML file and creates an HTML file. You can have a look at the PML source of this article here, and you can see the original version of it here (i.e., the result produced by the PML to HTML Converter). Right-click on the original article, and click 'View Page Source' if you want to have a look at the HTML code produced by the PML to HTML Converter. The converter produces indented, clean, and simple HTML code that looks as if it were hand-coded. The PML to HTML Converter is open-sourced under the GPL2, and written in PPL (Practical Programming Language). The source code is on GitHub.
After creating PML, I suddenly realized that its syntax could also be used to write XML/HTML documents - a nice side effect. This article is my first step in making my idea public.
One could say that PML is to pXML what HTML is to XML. PML uses the pXML syntax, but only predefined, semantic PML tags are allowed.
Lenient Syntax in PML
An important aspect of PML is the parser's ability to work in lenient mode. This mode supports very targeted syntax simplifications, aiming to eliminate as much "noise" as possible. It should be easy to write articles and books in PML. Here is an example to illustrate the advantage of the lenient syntax:
This is the code of a simple PML document, written in strict pXML:
[doc (title=Test)
    [ch (title="An Unusual Surprise")
        [p Look at the following picture:]
        [image (source=images/strawberries.jpg)]
        [p Text of paragraph 2]
        [p Text of paragraph 3]
    ]
]
In lenient PML mode (always activated), the text can be shortened to:
[doc Test
    [ch An Unusual Surprise
        Look at the following picture:
        [image images/strawberries.jpg]
        Text of paragraph 2

        Text of paragraph 3
    ]
]
Let's briefly see how this works:
- doc (title=Test) becomes doc Test:
  Some elements (for example doc) have a default attribute. For that attribute, only the value needs to be specified: instead of writing (name=value) we can simply write value.
- [p Text of paragraph 2] becomes Text of paragraph 2:
  Free text not contained in an element is automatically embedded in a p (paragraph) element. Text separated by two new lines automatically creates a paragraph break.
If you want to try out the above code, you can proceed like this:
- Download the PML to HTML Converter
- Create file example.pml in any directory, with the PML code shown above (the strict pXML version will not work).
- Copy a picture to resources/images/strawberries.jpg
- Open a terminal in the directory of file example.pml and type:
  pmlc example.pml
- Open file output/example.html in your browser.
  The result looks like this:
  [Screenshot of the rendered page]
- Right-click on the text, and select 'View Page Source' if you want to see the HTML code produced by the PML to HTML Converter.
Implementation
After using the PML syntax for some time to create real articles (not just tests), I was somewhat confident that the pXML syntax should work well for XML documents too. However, to eliminate doubts, I wanted a proof of concept for pXML, before publishing this article. Therefore I created a parser that reads the pXML syntax presented in this article. The parser is written in Java and has no dependencies. I will open-source it.
The following features are currently implemented:
- Convert pXML into XML (pXML/XML escape rules are applied)
- Convert XML into pXML (pXML/XML escape rules are applied)
- Read a pXML document into an org.w3c.dom.Document Java object.
  This is the most powerful feature. Once we have a Java Document object we can use all of XML's related specifications with a pXML document. A few examples are:
  - validate a document with XML Schema (W3C), RELAX NG, or Schematron
  - programmatically traverse the document
  - insert, modify, and delete elements and attributes, and save the result as a new XML or pXML document
  - query the document (search for values, compute aggregates, etc.) with XQuery/XPath
  - convert the document using an XSL transformer (e.g. create a differently structured XML document, create a plain text document, etc.)
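To illustrate what becomes possible once a Document object exists, here is a minimal sketch that queries a DOM tree with XPath using only classes shipped with the JDK. The document is built from an XML string here, because the pXML parser's API has not been published yet. The assumption is that a Document produced from pXML could be passed to the query method in exactly the same way. The class name DomQueryExample and its helper methods are made up for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DomQueryExample {

    // Parses an XML string into a DOM Document with the JDK's built-in parser.
    public static Document parseXml(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Cannot parse XML", e);
        }
    }

    // Evaluates an XPath expression and returns the text content of every match.
    public static List<String> query(Document doc, String expression) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Cannot evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        Document doc = parseXml(
                "<person><name>Albert</name>"
                + "<phones><phone>123</phone><phone>456</phone></phones></person>");
        System.out.println(query(doc, "/person/name"));         // prints [Albert]
        System.out.println(query(doc, "/person/phones/phone")); // prints [123, 456]
    }
}
```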
Here is a "Hello World" example of a pXML to XML conversion:
- Suppose we created file hello.pxml with this content (an empty root element with name hello):
  [hello]
- The following Java code converts this pXML file into an XML file named hello.xml:
  PXMLToXMLConverter.PXMLFileToXMLFile ( new File("hello.pxml"), new File("hello.xml") );
- The resulting hello.xml file looks like this:
  <?xml version="1.0" encoding="UTF-8"?>
  <hello />
The opposite (i.e., converting an XML file to a pXML file) can be done with:
XMLToPXMLConverter.XMLFileToPXMLFile ( new File("hello.xml"), new File("hello.pxml") );
Once the pXML parser is ready to be open-sourced (planned for May 2021), I'll publish a dedicated article with more examples.
I'm also working on a dedicated pXML website with a syntax specification and the grammar expressed in EBNF and railroad diagrams. Everybody is very welcome to participate in an open-source project.
Examples
A picture is worth a thousand words. So let's look at two common real-life examples: a simple config file, and HTML code. We will compare code written in JSON, XML, pXML, and PML.
Simple Config File
A simple config file is just a (possibly nested) map of key/value pairs.
JSON
Here is an example in JSON:
{
    "size":"XL",
    "colors":{
        "background":"black",
        "foreground":"light green"
    },
    "transparent":true
}
Remarks:
The need for quoting names and values is a bit annoying.
Another inconvenience is the comma required at the end of each assignment, except the last one. Each time we add a parameter at the end of a list, there is a risk of forgetting to add a comma to the line that was previously the last one.
XML
The same config data look like this in XML:
<config>
    <size>XL</size>
    <colors>
        <background>black</background>
        <foreground>light green</foreground>
    </colors>
    <transparent>true</transparent>
</config>
Remark: The closing tags are noisy.
Alternative syntax, using attributes:
<config>
    <size>XL</size>
    <colors background="black" foreground="light green" />
    <transparent>true</transparent>
</config>
Remark: The two syntaxes are not API-compatible. Switching from elements to attributes requires an update of the code that accesses the values of colors.
pXML
The pXML version looks like this:
[config
    [size XL]
    [colors
        [background black]
        [foreground light green]
    ]
    [transparent true]
]
Alternative syntax, using attributes:
[config
    [size XL]
    [colors (background=black foreground="light green")]
    [transparent true]
]
Remark: The two syntaxes are API-compatible. Switching from elements to attributes does not require an update of the code that accesses the values of colors.
Verbosity
To compare the verbosity of the three syntaxes, let's consider the length of the markup code needed for one parameter (excluding whitespace):
Language | Markup | Length | Range | Remark |
JSON | "":"", | 6 | 3 to 6 | -2 for integer, boolean, and null values (because they are not quoted); -1 for the last parameter (because it doesn't have a trailing comma) |
XML element | <></size> | 9 | min. 6 | The length depends on the number of characters in the name |
XML attribute | ="" | 3 | always 3 | |
pXML element | [] | 2 | always 2 | |
pXML attribute | = or ="" | 1 or 3 | 1 or 3 | The length is 1 if the value doesn't need to be quoted |
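The rules in the table can be written down as small functions. The following sketch is only a restatement of the table rows in Java; the class and method names are invented for illustration.

```java
class MarkupOverhead {

    // JSON: "":"", is 6 characters. Unquoted values (numbers, booleans, null)
    // save the two value quotes; the last entry saves the trailing comma.
    public static int json(boolean quotedValue, boolean lastEntry) {
        return 6 - (quotedValue ? 0 : 2) - (lastEntry ? 1 : 0);
    }

    // XML element: <></name> is 5 fixed characters plus the repeated name.
    public static int xmlElement(String name) {
        return 5 + name.length();
    }

    // XML attribute: ="" is always 3 characters.
    public static int xmlAttribute() {
        return 3;
    }

    // pXML element: [] is always 2 characters.
    public static int pxmlElement() {
        return 2;
    }

    // pXML attribute: = alone, or ="" if the value must be quoted.
    public static int pxmlAttribute(boolean quotedValue) {
        return quotedValue ? 3 : 1;
    }

    public static void main(String[] args) {
        System.out.println(json(true, false));   // prints 6
        System.out.println(json(false, true));   // prints 3
        System.out.println(xmlElement("size"));  // prints 9
        System.out.println(xmlElement("a"));     // prints 6
        System.out.println(pxmlElement());       // prints 2
    }
}
```

Only the XML element overhead grows with the length of the name, because the name is repeated in the closing tag.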
Conclusion
The most verbose syntax is XML (especially in case of long parameter names). The least verbose one is pXML. Less noise implies 'easy to read and write for humans'.
HTML Code
Now we'll look at some HTML code - the most common use of XML. To keep the example short, we'll just look at an HTML snippet, leaving off the HTML header and footer.
HTML
The following code represents a chapter with three paragraphs and a picture:
<section>
    <h2>Harmonic States</h2>
    <p>The <i>initial</i> state looks like this:</p>
    <img src="images/state_1.png" />
    <p>After just a few <i><b>micro</b>seconds</i> the state changes.</p>
    <p>More text ...</p>
</section>
JSON
In a previous chapter we already saw that JSON is not a good fit for writing HTML-like code. Nevertheless, let's have a look at the JSON version of our HTML snippet - just to confirm our previous conclusion:
{ "section": [
    { "h2": "Harmonic States" },
    { "p": [ "The ", { "i": "initial" }, " state looks like this:" ] },
    { "img": { "src": "images/state_1.png" } },
    { "p": [ "After just a few ", { "i": [ { "b": "micro" }, "seconds" ] },
        " the state changes." ] },
    { "p": "More text ..." }
] }
If we prettify, the 7 lines of code turn into 38 (!) lines with more whitespace than text:
{
    "section":[
        {
            "h2":"Harmonic States"
        },
        {
            "p":[
                "The ",
                {
                    "i":"initial"
                },
                " state looks like this:"
            ]
        },
        {
            "img":{
                "src":"images/state_1.png"
            }
        },
        {
            "p":[
                "After just a few ",
                {
                    "i":[
                        {
                            "b":"micro"
                        },
                        "seconds"
                    ]
                },
                " the state changes."
            ]
        },
        {
            "p":"More text ..."
        }
    ]
}
Who would enjoy writing and maintaining code like this? Yet this is just a simple toy example. Imagine a code base with real-world, big and complex HTML code!
pXML
This is the pXML version:
[section
    [h2 Harmonic States]
    [p The [i initial] state looks like this:]
    [img (src=images/state_1.png)]
    [p After just a few [i [b micro]seconds] the state changes.]
    [p More text ...]
]
PML
As said already, PML has a lenient syntax mode that allows for very succinct markup code:
[ch Harmonic States
    The [i initial] state looks like this:
    [image images/state_1.png]
    After just a few [i [b micro]seconds] the state changes.

    More text ...
]
Note
If we embed the above code in a doc element (as shown before), save the code into file test.pml, and run the PML to HTML Converter with the OS command pmlc test.pml, a complete HTML file is created (with header and footer). Here is an excerpt of this file (CSS code removed):
<section id="ch__1">
    <h2>Harmonic States</h2>
    <p>The <i>initial</i> state looks like this:</p>
    <figure>
        <img src="images/state_1.png" />
    </figure>
    <p>After just a few <i><b>micro</b>seconds</i> the state changes.</p>
    <p>More text ...</p>
</section>
As can be seen, it is very similar to the initial HTML code we would write by hand.
Verbosity
Let's look at numbers. How much effort does it take to write the code in the four languages? If we extract the markup code (i.e., remove whitespace and text displayed in the browser) we get this, from worst to best:
JSON: {"section":[{"h2":""},{"p":["",{"i":""},""]},
{"img":{"src":""}},{"p":["",{"i":[{"b":""},""]},""]},{"p":""}]}
HTML: <section><h2></h2><p><i></i></p><imgsrc=""/><p><i><b></b></i></p><p></p></section>
pXML: [section[h2][p[i]][img(src=)][p[i[b]]][p]]
PML: [ch[i][image][i[b]]]
Counting the number of characters gives us the following table:
Language | Markup length | Percentage of HTML |
JSON | 108 | 132% |
HTML | 82 | 100% |
pXML | 42 | 51% |
PML | 20 | 24% |
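The figures in this table can be reproduced with a few lines of Java. The sketch below copies the four markup-only strings from the listing above and computes each length and its rounded percentage relative to HTML. The class name MarkupLengths is invented for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class MarkupLengths {

    // Markup-only strings copied from the listing above (text and whitespace removed).
    public static Map<String, String> markup() {
        Map<String, String> m = new LinkedHashMap<>();
        m.put("JSON",
                "{\"section\":["
                + "{\"h2\":\"\"},"
                + "{\"p\":[\"\",{\"i\":\"\"},\"\"]},"
                + "{\"img\":{\"src\":\"\"}},"
                + "{\"p\":[\"\",{\"i\":[{\"b\":\"\"},\"\"]},\"\"]},"
                + "{\"p\":\"\"}"
                + "]}");
        m.put("HTML",
                "<section><h2></h2><p><i></i></p><imgsrc=\"\"/>"
                + "<p><i><b></b></i></p><p></p></section>");
        m.put("pXML", "[section[h2][p[i]][img(src=)][p[i[b]]][p]]");
        m.put("PML", "[ch[i][image][i[b]]]");
        return m;
    }

    // Length as a percentage of the reference length, rounded to the nearest integer.
    public static long percentOf(int length, int reference) {
        return Math.round(100.0 * length / reference);
    }

    public static void main(String[] args) {
        Map<String, String> m = markup();
        int html = m.get("HTML").length();
        m.forEach((language, code) -> System.out.println(
                language + " | " + code.length() + " | "
                + percentOf(code.length(), html) + "%"));
    }
}
```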
[Figure: graph of the markup lengths listed in the table above]
Of course, this is not a representative result. Other HTML examples would lead to more or less different numbers. However, it clearly shows the impact of syntax. Syntax affects complexity, space and time, and usability. Succinct syntax makes it easier and more enjoyable to read and write code.
Syntax Comparison
Here is a brief comparison of the XML vs pXML syntax:
Empty element:
    XML:  <br />
    pXML: [br]
Element with text content:
    XML:  <summary>text</summary>
    pXML: [summary text]
Element with child elements:
    XML:  <ul>
              <li>
                  <div>A <i>friendly</i> dog</div>
              </li>
          </ul>
    pXML: [ul
              [li
                  [div A [i friendly] dog]
              ]
          ]
Attributes:
    XML:  <div id="unplug_warning" class="warning big-text">Unplug power cord before opening!</div>
    pXML: [div (id=unplug_warning class="warning big-text")Unplug power cord before opening!]
Escaping:
    XML:  <note>Watch out for &lt;, &gt;, &quot;, &apos;, &amp;, [, ], and \ characters</note>
    pXML: [note Watch out for <, >, ", ', &, \[, \], and \\ characters]
Comments:
    Single comment:
        XML:  <!-- text -->
        pXML: [- text -]
    Nested comments:
        XML:  not supported
        pXML: [- text [- nested -] -]
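The escaping row above suggests a very small rule set for text content: the characters [, ], and \ are prefixed with a backslash, and nothing else changes. Here is a minimal sketch of that rule in Java. It is derived only from the single example shown in this article, so the complete pXML escape rules (for instance inside attribute values) may well differ. The class PxmlTextEscaper is hypothetical and not part of the pXML parser.

```java
class PxmlTextEscaper {

    // Escapes text for use as pXML element content.
    // Assumption: only '[', ']' and '\' need a backslash prefix.
    public static String escapeText(String text) {
        StringBuilder sb = new StringBuilder(text.length() + 8);
        for (char c : text.toCharArray()) {
            if (c == '[' || c == ']' || c == '\\') {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    // Reverses escapeText: a backslash makes the next character literal.
    public static String unescapeText(String escaped) {
        StringBuilder sb = new StringBuilder(escaped.length());
        for (int i = 0; i < escaped.length(); i++) {
            char c = escaped.charAt(i);
            if (c == '\\' && i + 1 < escaped.length()) {
                c = escaped.charAt(++i); // take the escaped character literally
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "Watch out for <, >, \", ', &, [, ], and \\ characters";
        String escaped = escapeText(text);
        System.out.println(escaped);
        System.out.println(unescapeText(escaped).equals(text)); // prints true
    }
}
```

Under this rule the five characters that XML must replace with entities pass through untouched, which is why the pXML line stays close to the plain text.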
Summary and Conclusion
As demonstrated, it is possible to simplify the XML syntax and make it more accessible for humans.
The pXML syntax introduced in this article essentially suggests three changes:
- Replace the XML syntax:
  <name>value</name>
  ... with:
  [name value]
- Embed attributes in parentheses and allow unquoted values where possible.
  The XML code:
  name1="value" name2="value with spaces"
  ... becomes:
  (name1=value name2="value with spaces")
- Support nested comments (not supported in XML)
Although the pXML syntax is less verbose and different from XML, all great additions that are part of the XML ecosystem can still be used. Once a pXML document is parsed into an XML tree, documents can be validated, queried, modified, and transformed.
Well designed syntax increases productivity, reduces errors, eases maintenance, and improves space and time efficiency.
Syntax matters!
Article History
- 10th March, 2021
- 20th April, 2021
  - Added attributes syntax
  - Removed optional name prefix # (used to differentiate between data and metadata)