|
|
Comments and Discussions
|
|
 |

|
before:
<tst:Road attrname="123">
...
</tst:Road>
<tst:Road attrname="124" tst:id="1">
...
</tst:Road>
after:
<tst:Road >
...
</tst:Road>
<tst:Road tst:id="1">
...
</tst:Road>
expected:
<tst:Road>
...
</tst:Road>
<tst:Road tst:id="1">
...
</tst:Road>
Can anyone guide me on how to resolve this issue?
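The "after" and "expected" outputs above differ only by the space left behind in `<tst:Road >`. One possible cause, assuming the attribute bytes are deleted but the whitespace in front of them is not, can be sketched on a plain C string. The helper name `remove_span_with_leading_ws` is hypothetical and is not part of the VTD-XML API; it only illustrates widening the deleted range to the left.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical helper (not a VTD-XML function): delete the bytes
 * buf[start .. start+len) together with any whitespace directly
 * in front of them, so no stray space is left in the start tag. */
static void remove_span_with_leading_ws(char *buf, size_t start, size_t len)
{
    size_t total = strlen(buf);
    assert(start + len <= total);

    /* Widen the range to the left over whitespace. */
    size_t from = start;
    while (from > 0 && isspace((unsigned char)buf[from - 1]))
        from--;

    /* Move the tail, including the terminating NUL, over the gap. */
    memmove(buf + from, buf + start + len, total - (start + len) + 1);
}
```

Called with the offset and length of `attrname="123"`, this turns the first tag into `<tst:Road>` and leaves the single space before `tst:id` in the second tag intact.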
|
|
|
|

|
The C API could be better.
No developer likes to write many lines of code for a simple operation that could be done in one or a few lines.
Here is my contribution, showing how the C API could look:
XMLDOC xml;
XMLNODE node, node2;
char * ptr;
int i;
xml = OpenXML(zFilePath);
node = XMLGetNode(xml, "/customers/0");
i = XMLGetChildCount(node);
node2 = XMLGetRelatedNode(node, NODE_PARENT);
node2 = XMLGetRelatedNode(node, NODE_FIRSTCHILD);
node2 = XMLGetRelatedNode(node, NODE_NEXT);
ptr = XMLGetAttr(node, "city");
printf("city: %s", ptr);
free(ptr);
XMLSetAttr(node, "city", "My little town");
SaveXML(xml); SaveXMLTo(xml, zOtherpath);
ptr = GetXMLText(xml);
CloseXML(xml);
Another tip: there could be a single header file instead of many. We (developers) don't like having a lot of dependencies. What we like:
-having only one header file and one C file, or a DLL, to add to our project.
-doing things with one line of code.
modified on Friday, August 21, 2009 4:10 AM
|
|
|
|

|
Instead of doing this:
insertBeforeElement("<item/>\n");
it could support this:
XMLNODE node, node2, node3;
node = InsertNode("order", xml, NULL, NODE_FIRST);
if (node == NULL) { puts("failed!"); return; }
node2 = InsertNode("item", xml, node, NODE_CHILD);
if (node2 == NULL) { puts("failed!"); return; }
XMLSetAttr(node2, "price", "25.00");
node3 = InsertNode("item", xml, node2, NODE_AFTER);
if (node3 == NULL) { puts("failed!"); return; }
XMLSetAttr(node3, "price", "25.00");
It should add an element (node) by writing the < and /> characters itself, and when we add an attribute or a child element it should replace the />.
Then XMLDOC and XMLNODE can be structures containing the data needed by the functions.
If you can implement these things, this may become one of the best XML parsers in the world.
|
|
|
|

|
XMLNode is a concept that originated in DOM.
VTD-XML deals with bits/bytes/ints/longs/arrays, element fragments, and namespace-compensated fragments...
I can see where you are coming from on this... but I think that VTD-XML, while similar, differs from DOM in that it doesn't really operate on nodes...
|
|
|
|

|
Hi Jimmy!
I was writing a wrapper for your code, but I ran into some behaviours of your parser that made me realize it is far from complete.
Here are some suggestions that should be implemented if we want to have a decent parser:
When we remove elements, the new line is not removed. The parser should use a new line for each element. When we add a new element, the parser should add the new line by itself (we should not have to pass it to the parser).
When we insert a new attribute, the parser should add the left and right spaces, if needed. This is a job for the parser.
When we insert or update an attribute, our program should pass only the attribute name and the value to the parser; the equals sign and the quotes should be added by the parser.
We should be able to add child elements...
When we have an element and want to add a child element to it, the parser should remove the "/>" (if there is one), add the ">", write the inserted element, and then write the parent element's end tag.
Here are some ideas:
The parser could update the XML file on the fly. This can be done for files under 10 MB in size (the majority of them).
Here is an example: when we update an attribute value from 'value' to 'newvalue', the file grows by the difference in length between the two values, in this case 3 bytes.
       0         1         2         3         4
        1234567890123456789012345678901234567890
Before: <root><a name='value'>test</a></root>
After : <root><a name='newvalue'>test</a></root>
Step 1: the parser moves the bytes starting at position 21 to position 24,
working from the end to the start (copy 37 to 40, 36 to 39, and so on):
       0         1         2         3         4
        1234567890123456789012345678901234567890
Before: <root><a name='value'>test</a></root>
Step 1: <root><a name='value___'>test</a></root>
Step 2: the parser copies the new value to position 16:
       0         1         2         3         4
        1234567890123456789012345678901234567890
After : <root><a name='newvalue'>test</a></root>
When we insert an attribute, the parser should grow the file, shift the following data toward the end of the file, and write the attribute data into the gap, including the leading space.
<root><a>test</a></root>
<root><a name='value'>test</a></root>
When we remove an attribute, the parser should move the following data back over the place where the attribute was, overwriting it, and then truncate the file.
The same applies when we add or remove an element.
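The shift-then-copy steps above can be sketched in C on an in-memory buffer instead of a file. This is only an illustration under stated assumptions: `replace_span` is a hypothetical helper, offsets are 0-based (so position 16 in the ruler is offset 15), and the document is a heap-allocated, NUL-terminated string.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: replace old_len bytes at offset pos with new_val.
 * doc must come from malloc/realloc. Returns the (possibly moved) buffer,
 * or NULL on allocation failure, in which case doc is left untouched. */
static char *replace_span(char *doc, size_t pos, size_t old_len,
                          const char *new_val)
{
    size_t doc_len = strlen(doc);
    size_t new_len = strlen(new_val);
    assert(pos + old_len <= doc_len);

    /* Bytes after the replaced span, including the terminating NUL. */
    size_t tail_len = doc_len - (pos + old_len) + 1;

    if (new_len > old_len) {
        char *grown = realloc(doc, doc_len + (new_len - old_len) + 1);
        if (grown == NULL)
            return NULL;
        doc = grown;
    }

    /* Step 1: shift the tail; memmove copes with overlapping ranges. */
    memmove(doc + pos + new_len, doc + pos + old_len, tail_len);
    /* Step 2: copy the new value into the gap. */
    memcpy(doc + pos, new_val, new_len);
    return doc;
}
```

The same helper covers the other two cases in the post: inserting an attribute is a replacement of zero bytes (`old_len == 0`), and removing one is a replacement with an empty string. A file-based version would follow the same order of operations and then truncate the file when the document shrinks.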
Well, as your parser stands today, it is incomplete. I hope you can continue the work on it.
Best regards!
|
|
|
|

|
When removing an element from the document, the parser should remove:
-from the left: all the spaces and tabs (characters 0x20 and 0x09).
-from the right: the spaces, tabs, carriage returns and line feeds (characters 0x20, 0x09, 0x0D and 0x0A).
When adding a new element, the parser should add:
-to the left: a number of spaces related to its depth in the hierarchy.
-to the right: a new line ("\n", or CR and LF, i.e. 0x0D and 0x0A).
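The removal rule above could look roughly like this in C. The helper `widen_removal_range` is hypothetical and works on a plain NUL-terminated string. One deliberate difference from the rule as posted: on the right it stops after a single line break, so that the indentation of the following line is preserved.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: widen the half-open range [*from, *to) that covers
 * an element so that deleting it also deletes the element's indentation
 * and the line break that follows it. */
static void widen_removal_range(const char *doc, size_t *from, size_t *to)
{
    size_t len = strlen(doc);
    assert(*from <= *to && *to <= len);

    /* Left: spaces (0x20) and tabs (0x09). */
    while (*from > 0 && (doc[*from - 1] == ' ' || doc[*from - 1] == '\t'))
        (*from)--;

    /* Right: spaces and tabs, then one line break (CR LF, LF or CR).
     * Reading doc[*to] at the end of the string sees the NUL and stops. */
    while (doc[*to] == ' ' || doc[*to] == '\t')
        (*to)++;
    if (doc[*to] == '\r')
        (*to)++;
    if (doc[*to] == '\n')
        (*to)++;
}
```

For the document `"<r>\n  <a/>\n  <b/>\n</r>"`, the range of `<a/>` grows to include its two leading spaces and the newline after it, so removing that range leaves `<b/>` correctly indented on its own line.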
|
|
|
|

|
Hey, that is a very useful suggestion. I suggest that you join the vtd-xml-users mailing list so we can discuss this in more detail..
Cheers,
Jimmy Zhang
|
|
|
|

|
Here is the code I was working on.
If the C interface could be like this, it would be much better to work with, at least for simple XML processing.
I created a single structure that contains the VTD-XML structures. This way programmers don't need to know about them.
If you implement the other features that I posted before, I can continue this interface.
Note: this code is not finished.
Bye!
#include "everything.h"
typedef struct {
VTDGen *vg;
VTDNav *vn;
AutoPilot *ap;
XMLModifier *xm;
} XMLDoc;
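void XMLClose(XMLDoc* xml); /* defined below */
XMLDoc* XMLOpenFile(char * pzFilePath) {
XMLDoc* xml;
xml = (XMLDoc*) malloc(sizeof(XMLDoc));
if (xml == NULL) return NULL;
/* xml is a pointer, so its members are accessed with -> */
xml->vg = NULL;
xml->vn = NULL;
xml->ap = NULL;
xml->xm = NULL;
xml->vg = createVTDGen();
/* assumes parseFile returns TRUE on success */
if (parseFile(xml->vg, TRUE, pzFilePath)) {
return xml;
} else {
XMLClose(xml);
return NULL;
}
}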
void XMLClose(XMLDoc* xml) {
freeXMLModifier(xml->xm);
freeAutoPilot(xml->ap);
freeVTDNav(xml->vn);
free(xml->vg->XMLDoc);
freeVTDGen(xml->vg);
free(xml);
}
BOOL XMLMoveToElement(XMLDoc* xml, int location) {
if (xml->vn == NULL) {
xml->vn = getNav(xml->vg);
}
if (toElement(xml->vn, location)) {
return TRUE;
} else {
return FALSE;
}
}
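/* Returns BOOL because main() below checks the result against FALSE. */
BOOL XMLPrepareXPath(XMLDoc* xml, char *pzXPath) {
if (xml->vn == NULL) {
xml->vn = getNav(xml->vg);
}
if (xml->ap != NULL) {
freeAutoPilot(xml->ap);
}
xml->ap = createAutoPilot2();
if (xml->ap == NULL) return FALSE;
selectXPath(xml->ap, pzXPath);
bind(xml->ap, xml->vn);
return TRUE;
}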
int XMLFindNext(XMLDoc* xml) {
int i;
if (xml->ap == NULL) {
return -1;
}
i = evalXPath(xml->ap);
if (i == -1) {
freeAutoPilot(xml->ap);
xml->ap = NULL;
}
return i;
}
char * XMLGetCurElementText(XMLDoc* xml) {
char * ptr;
if (xml->vn == NULL) {
return NULL;
}
int i = getText(xml->vn);
if (i != -1) {
ptr = toString(xml->vn, i);
return ptr;
} else {
return NULL;
}
}
int main() {
XMLDoc *xml;
char *pzFilePath, *pzFilePath2, *pzXPath, *ptr;
pzFilePath = xxx;
pzXPath = xxx;
pzFilePath2 = strdup(pzFilePath);
pzFilePath2[0] = '2';
xml = XMLOpenFile(pzFilePath);
if (xml == NULL) {
puts("Cannot open the XML file.");
return 1;
}
if (XMLPrepareXPath(xml, pzXPath) == FALSE) {
puts("Cannot prepare the specified XPath.");
goto getout;
}
while (XMLFindNext(xml) != -1) { /* XMLFindNext returns -1 when there are no more matches */
ptr = XMLGetCurElementText(xml);
if (ptr != NULL) {
printf("The element text is: %s", ptr);
free(ptr);
} else {
puts("The element text is blank");
}
if (XMLSetElementText(xml, "This is the new content") == FALSE) {
puts("Cannot set the element new text.");
goto getout;
}
ptr = XMLGetAttrib(xml, "price");
if (ptr != NULL) {
printf("The attribute value is: %s", ptr);
free(ptr);
} else {
puts("The attribute value is blank");
}
if (XMLSetAttrib(xml, "price", "1.23") == FALSE) {
puts("Cannot set the first attribute value.");
goto getout;
}
if (XMLSetAttrib(xml, "qty", "5") == FALSE) {
puts("Cannot set the second attribute value.");
goto getout;
}
}
if (XMLPrepareXPath(xml, pzXPath) == FALSE) {
puts("Cannot prepare the specified XPath.");
goto getout;
}
while (XMLFindNext(xml) != -1) {
if (XMLInsertChildElement(xml, "item") == FALSE) {
puts("Cannot add a new child element.");
goto getout;
}
if (XMLSetAttrib(xml, "name", "first") == FALSE) {
puts("Cannot set the first attribute value.");
goto getout;
}
if (XMLInsertElementAfter(xml, "item") == FALSE) {
puts("Cannot add a new element.");
goto getout;
}
if (XMLSetAttrib(xml, "name", "second") == FALSE) {
puts("Cannot set the second attribute value.");
goto getout;
}
}
ptr = XMLToString(xml);
if (ptr != NULL) {
printf("The new xml raw data is:\n %s", ptr);
free(ptr);
} else {
puts("Cannot get the XML raw data.");
}
if (XMLSaveToFile(xml, pzFilePath2)) {
puts("The new xml was saved to the file.");
} else {
puts("Cannot save the new xml to the file.");
}
getout:
free(pzFilePath2); /* allocated by strdup above */
XMLClose(xml);
return 0;
}
|
|
|
|

|
Sounds interesting. We need to study this a bit more and
possibly include it as a second API for VTD..
and we may need a nice name for this wrapper...
|
|
|
|

|
I have some ideas on how the parser could manage editing of the XML document:
(I don't know if this is already implemented.)
-When only reading the XML file, the parser can read directly from the file, without copying it into memory (this is good for big files).
-When the application starts to add and remove elements and attributes, if the file is small these operations can be done in RAM.
If the file is medium-sized to big, then we could have an in-memory "slice keeper" (we can use another name) to keep track of the XML slices, or segments, that form the XML document being modified.
Example:
Suppose that an application opens a 10 MB XML file and wants to make a lot of updates to it: adding elements, child elements and attributes, updating attributes, and removing other attributes and elements...
While the application has not yet modified the XML document, there is only one "slice", or none at all, since the document is entirely in the file.
Once the application adds a new element, we would have 3 slices:
1. The first part of the document, which is in the file.
2. The added elements, already in raw XML format, which are in memory.
3. The remaining part of the document, which is in the file.
If the application wants to walk through the XML document, the parser starts on the first slice (in the file), then continues on the next slice (in RAM), and then goes back to the file.
Each "slice" can be stored in a structure containing:
-The slice location (file or memory)
-The slice offset (for the file) or the memory address (for the RAM)
-The slice size
Then it could be like this:
typedef struct {
int location;
int offset;
int size;
} XMLDocSlice;
The slices could be stored in an array, or could be allocated separately, each with a pointer to the next one. The first option uses only one memory block allocation (so it is faster).
As the application makes new changes to the document, more slices are created. (This happens even when the updated data has the same size as the previous data, because the XML file is not updated on the fly, only at the end of all the work, and it can be saved to a different file.)
If an element is removed, the document is cut into 2 slices: one with the preceding data, and another that begins right after the removed element ends.
When the application requests the whole raw XML data from the parser, the parser rebuilds the document by joining the slices together, either by simply concatenating the strings (if the application wants the XML document in memory) or by copying one slice at a time to the destination file (if the application wants to save the XML document to a file).
And importantly: all of this should be transparent to the application (and to the programmers who code it, too).
This implementation would be useful when we want to work on big XML files, and even on small ones.
Advantages:
-It is an easy way to make an XML updater (or XML writer).
-No need to have the whole XML file loaded in memory.
-No need to have an object in memory for each element or attribute.
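The rebuild step described above (joining the slices into one string) might be sketched as follows. This is a minimal illustration and not VTD-XML code: the `Slice` type is a variant of the `XMLDocSlice` struct from the post with assumed field types (a separate pointer for in-memory data), and `slices_to_string` is a hypothetical name.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

enum { SLICE_IN_FILE, SLICE_IN_MEMORY };

/* Assumed variant of XMLDocSlice: one entry per piece of the document. */
typedef struct {
    int location;     /* SLICE_IN_FILE or SLICE_IN_MEMORY */
    long offset;      /* byte offset, used when the slice is in the file */
    const char *mem;  /* data pointer, used when the slice is in memory */
    size_t size;      /* slice length in bytes */
} Slice;

/* Join all slices, in order, into one newly allocated NUL-terminated
 * string. Returns NULL on allocation or read failure. */
static char *slices_to_string(FILE *src, const Slice *s, size_t count)
{
    size_t total = 0, pos = 0, i;

    for (i = 0; i < count; i++)
        total += s[i].size;

    char *out = malloc(total + 1);
    if (out == NULL)
        return NULL;

    for (i = 0; i < count; i++) {
        if (s[i].location == SLICE_IN_MEMORY) {
            memcpy(out + pos, s[i].mem, s[i].size);
        } else if (fseek(src, s[i].offset, SEEK_SET) != 0 ||
                   fread(out + pos, 1, s[i].size, src) != s[i].size) {
            free(out);
            return NULL;
        }
        pos += s[i].size;
    }
    out[pos] = '\0';
    return out;
}
```

With a file containing `<root><a/></root>` and the three slices (file bytes 0..9, the in-memory string `<b/>`, file bytes 10..16), the result is `<root><a/><b/></root>`. Saving to a file would follow the same loop, writing each slice to the destination instead of copying it into `out`.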
|
|
|
|

|
Thanks for the suggestion...
As for the API, I believe that, used properly, VTD-XML's interface
is pretty well designed... it might be a bit different from DOM or SAX, but it has its own characteristics to get used to.
Regarding the makefile and DLL, we are working on that front. I can put you in touch with our C developer so we can learn your perspective on that...
What do you think?
|
|
|
|

|
It is reported that VTD-XML outperforms SAX in the benchmark conducted. In that benchmark, SAX simply parsed the XML document without any other processing, while VTD-XML parsed the document and built the corresponding LC entries.
My question is: since SAX did nothing besides parsing the XML document (and reporting every event) during the benchmark, how did VTD-XML outperform SAX when it had to parse and construct the LC entries at the same time?
I hope the question is clear.
|
|
|
|

|
SAX allocates a lot of small objects, which are slow to create and, furthermore, need to be recycled..
VTD uses integers, which are stored in large memory buffers... which are a lot faster to allocate... and consume less memory.
Make sense?
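As a rough illustration of the "integers in large buffers" idea, the sketch below stores one token per 64-bit integer in a single allocated array, instead of allocating one object per token. The bit layout is hypothetical (32 bits of offset, 32 bits of length) and is not claimed to be VTD-XML's actual record format; the function names are made up for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical record layout, for illustration only:
 * high 32 bits = token offset in the document, low 32 bits = length. */
static uint64_t pack_token(uint32_t offset, uint32_t length)
{
    return ((uint64_t)offset << 32) | length;
}

static uint32_t token_offset(uint64_t rec)
{
    return (uint32_t)(rec >> 32);
}

static uint32_t token_length(uint64_t rec)
{
    return (uint32_t)(rec & 0xFFFFFFFFu);
}

/* One allocation holds every record, and one free() releases them all,
 * in contrast to one allocation (and later cleanup) per token object. */
static uint64_t *alloc_token_buffer(size_t capacity)
{
    assert(capacity > 0);
    return malloc(capacity * sizeof(uint64_t));
}
```

A parser built this way would append one record per token to the buffer and hand out offsets into the original document rather than copies of the text.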
|
|
|
|

|
That seems to make sense... sorry, I am not so familiar with this kind of programming.
I am actually studying dedicated hardware implementations of XML parsers. If implemented in dedicated hardware or on a chip, would SAX then perform faster than VTD-XML?
|
|
|
|

|
SAX on chip doesn't make sense...so there is no meaningful comparison here.
|
|
|
|

|
Hi,
we started to play with your parser in a fairly large, XML-heavy project. It is a data extraction system which can mirror a database without direct access to it (just using the user interface, like a web frontend).
we mainly use xml to normalize data, then we import it into a database. we have file sizes from 10 - 100 MB. i wrote a small xml viewer with your system and i can load and display our biggest XML file in just a matter of a few seconds, that's brilliant man. i have not seen ANY xml application that even comes close to that speed. opening that same xml file, let's say with firefox, takes AGES...
i can say that the performance really blew me away. it took me a while to sort out a few unicode problems, but aside from that your parser works flawlessly.
the only beef i have is the error handling. i have seen lots of empty catch blocks in the code, which makes it hard to pin down problems...
but aside from that, it's a really smart system, kudos...
i hope you continue to work on it. i'm sure you can easily turn this into a commercial product if you work on the details a bit more...
greetings,
markus
|
|
|
|

|
Thanks for the comment...
Yes, we are continuing to work on VTD-XML, and your suggestions are very welcome!
|
|
|
|

|
Why does the file size increase by a margin of 1.3x to 1.5x? There is another protocol called BXML that decreases the file size and supposedly has better performance than XML. How does VTD rate alongside BXML? Also, why is the size of the file so inflated with VTD? How can this be alleviated?
I'm not going to get buy-off on using VTD unless I can show that the file size is going to be smaller.
thanks.
|
|
|
|

|
I don't think there is a perfect solution for everything. If you want performance, VTD is the way to go; if you want small files, use BXML...
|
|
|
|

|
Not often do you see people go against traditional OO and embrace a bloat-free way to freedom for the edge cases.
I've known about your impl for some time now off some protocol specific lists (we have a common interest I guess) and I am happy it is receiving a warm welcome here. My main focus shifted from this topic into specific scenarios so I left it all behind, including playing with VTD.
You are facing an uphill struggle from here though, as the world we're in is about selling, runtime if at all possible, bloatware first and foremost, especially in SOA land.
All the best.
|
|
|
|

|
Thanks for the note.
I actually think that we are on the verge of a turning point at which people increasingly realize the weaknesses of OO... I am not against OO per se; instead I am promoting a more balanced approach to app development in which sensibility is placed ahead of formality.
I like that it is going uphill; at least it isn't flat or going downhill.
modified on Thursday, May 8, 2008 2:25 AM
|
|
|
|

|
I've read all your articles, and I agree with you on the disadvantages and advantages of the various methods of parsing and its effect on memory, performance, redundancy, etc.
However, OO programming is not the most suitable method for writing super fast and memory-efficient applications in the first place. C# and .NET do not use memory the way C++ or C do, so from the get-go there was a huge elephant in the room.
That being said, C# has quickly become the language of choice for many, even Java developers, because of its elegance, ease, and simplicity for human understanding.
In my opinion, OO can never be compared to XML; they are like apples and oranges. XML is the underlying data, while OO is an extension of that data: it GIVES THE DATA LIFE, it gives the data functionality, etc.
The problem here is not in OO, it is in .NET's implementation of XML Serialization (which applies to many areas including web services etc).
Anyhow, I love your idea. Now here is my idea and what I think you should do to make this idea of yours successful.
Create a persistent, generated OO layer that can be bound directly to the memory allocated for a deserialized version of an XML document. Then the business objects can remain object-oriented without needing to be serialized at all.
For example, when you modify a property in memory, you are actually modifying that XML node's value, and when you read from memory, you are retrieving a new substring copy of the correct node of the XML document. To do this, it is important to manage the mapping between XML memory locations and the pointers referenced from the properties of an object.
Comments?
modified on Tuesday, April 29, 2008 4:05 PM
|
|
|
|

|
Hmm... interesting suggestions...
The point I am trying to make with this article is that document-centric XML processing is a more effective XML processing model than the object-oriented one... so the goal is not to make VTD-XML compatible with DOM... but to offer an option that goes beyond DOM...
Notice that it is not necessarily about the programming language... VTD-XML is available in C, C# and Java..
The limitation (of C#'s memory management) is inherent to any programming language... because objects are small memory blocks (which incur overhead). VTD-XML gets around that by using big memory blocks..
|
|
|
|

|
my suggestion for a DOM implementation is not to use private variables to store the data, but rather pointers to segments in an underlying xml stream.
ex.
..
public string FirstName
{
get {
return this._xml.xpath("./FirstName");
}
set {
this._xml.modify("./FirstName", value);
}
}
..
public string GetXML()
{
return this._xml.ToString();
}
..
public Person(string xml)
{
}
..
This way there is no heavy performance hit, and fewer small memory blocks are used.
it also eliminates serialization and deserialization.
i think i'm going to write an article on this, maybe it is worth exploring.
|
|
|
|

|
look forward to it
|
|
|
|
 |
|
|
|
Reveal XML processing issue #1 and explain why document-centric XML processing is the future.
| Type | Article |
| Licence | CPOL |
| First Posted | 13 Mar 2008 |
| Views | 54,780 |
| Downloads | 108 |
| Bookmarked | 47 times |
|
|