Introduction
Change detection algorithms are used to find the differences between two documents and to construct a delta representation of these differences. It can be used for archiving systems, versioning, or just for a small requirement in your general application. We have two XML files and we will generate a delta file to save the changes. There is no standard for delta files.
I have been working on change detection in XML files, and I said to myself, why not use LINQ? This article will show you how to detect delete, insert, and move operations in XML files with LINQ. This article was written just to give an idea; it is not a complete diff algorithm.
Using the code
I am giving examples by using XML files like this:
="1.0"="utf-8"
<employees>
<employee salaried="no">
<name>Test test</name>
<hire_date>7/31/1996</hire_date>
</employee>
<employee salaried="yes">
<name>ZEYNEP PEHLIVAN</name>
<hire_date>1/1/2009</hire_date>
</employee>
</employees>
I will detect the changes at the level of employee tag, but you can change it for your needs and depending on your XML files' structures.
We have these steps:
- Initialization
In this step, a hash value for each node is generated and added to each node with the ID
attribute. The POS
attribute is also added. We need the ID
and POS
attributes to be able to make joins in LINQ.
POS
= Parent node’s ancestors count + "-" + Parent node’s right siblings count + Node’s ancestors count + "-" + Node’s right siblings count.
- Detecting changes
To start with, we get all the elements. BAsicXName
is the “employee” in this example.
var source = from o in Version1.DescendantsAndSelf(BasicXName)
where o.Attribute("ID") != null && o.Name == BasicXName
select new
{
ID = o.Attribute("ID").Value,
POS = o.Attribute("POS").Value,
Element = o
};
var version = from o in Version2.DescendantsAndSelf(BasicXName)
where o.Attribute("ID") != null && o.Name == BasicXName
select new
{
ID = o.Attribute("ID").Value,
POS = o.Attribute("POS").Value,
Element = o
};
MOVE(x,y,z)
: Moves a child node x from node y to node z. It can be considered as a delete and insert operation. Which means, node x is first deleted and then it is inserted as a child of z. If a node changes its order without changing its parent, it can be considered as a move operation in ordered trees, as shown in the figure below.
If two nodes have the same ID but different POS, the node is considered as moved from POS1 to POS2. To detect this operation, an inner join on the ID
attribute is used by filtering for the same POS attribute.
var Moved = from o in version
join t in source on o.ID equals t.ID
where o.POS != t.POS
select new
{
IDNEW=o.ID,
IDOLD = t.ID,
REFNEW = o.POS,
REFOLD = t.POS
};
DELETE
: Deletes a node. There are two methods to detect this operation. One is a left join between two versions, like this:
var deleted = from o in source
join block1 in version on o.ID equals block1.ID
into elements
from t in elements.DefaultIfEmpty(null)
where !MovedIDs.Contains(o.ID)
select new
{
Element = o.Element
};
As mentioned, the move operation can be considered as a sequence of delete and insert operations. That’s why moved elements are also detected as deleted in this step. We remove them with where !MovedIDs.Contains(o.ID)
. We can detect an insert operation the same way but with a right join. The second way of detecting a delete operation is explained in the Insert section.
INSERT
: inserts a node. We can get all the elements in the second version which are not in the first version, by using:
var nodenotinsource = (from block2 in version
select new { ID = block2.ID }).Except(from o in source select new { ID = o.ID });
To get the element node, we need another join because nodenotinsource
just returns IDs.
var inserted = from o in version
join block1 in nodenotinsource on o.ID equals block1.ID
select new
{
Element = o.Element
};
At the end, we can remove the attributes that we added; I prefer to keep POS, but you can remove it, too.
DeletedElements.Descendants().Attributes("ID").Remove();
InsertedElements.Descendants().Attributes("ID").Remove();
Here is the felta file with a delete, insert, and move operation:
="1.0"="utf-8"
<Delta From="Version1.xml" To="Version2.xml">
<Delete>
<employee salaried="yes" POS="0-0-1-3">
<name POS="1-3-2-1">TEST DELETED</name>
<hire_date POS="1-3-2-0">2/6/1998</hire_date>
</employee>
</Delete>
<Insert>
<employee salaried="yes" POS="0-0-1-0">
<name POS="1-0-2-1">ZEYNEP - Inserted</name>
<hire_date POS="1-0-2-0">1/1/2009</hire_date>
</employee>
</Insert>
<Move>
<employee From="0-0-1-2" To="0-0-1-3" />
<employee From="0-0-1-1" To="0-0-1-2" />
<employee From="0-0-1-0" To="0-0-1-1" />
</Move>
</Delta>
The POS value is used for the move operation, but you can use another way for it; in fact, for all your delta file structure.
When it comes to complexity, it is always quadratic with jointures of LINQ. So it is O(n2).
LINQDiff.zip:
- Hash.cs: to create and add the
ID
and POS
attributes. - Performance.cs: I took this from MSDN to test the execution time. I removed the iteration.
- Diff.cs: to detect changes.
- Two XML files, versions 1 and 2.
- Delta file
Points of interest
If you look at the code, I have detected all the operations in the same function, which is not really cool. If you want to write move, delete, insert functions, you can't pass the source
and version
as parameters. First, you should create an object, for example, the TagElement
object, and then you can use LINQ to create the instances.
var source = from o in source.Descendants(Root)
select new TagElement
{
ID = o.Attribute("ID").Value,
POS = o.Attribute("POS").Value,
ELEMENT = o
};
You can pass these results with IEnumerable<TagElement> source
to another function.
Conclusion
I hope it was or it will be useful for someone out there :).