Click here to Skip to main content
65,938 articles
CodeProject is changing. Read more.
Articles / Languages / XML

Diff in XML files with LINQ

4.63/5 (9 votes)
27 Nov 2009CPOL3 min read 64.1K   2.3K  
Change detection in XML files with LINQ.

Introduction

Change detection algorithms are used to find the differences between two documents and to construct a delta representation of these differences. It can be used for archiving systems, versioning, or just for a small requirement in your general application. We have two XML files and we will generate a delta file to save the changes. There is no standard for delta files.

I have been working on change detection in XML files, and I said to myself, why not use LINQ? This article will show you how to detect delete, insert, and move operations in XML files with LINQ. This article was written just to give an idea; it is not a complete diff algorithm.

Using the code

I am giving examples by using XML files like this:

XML
<?xml version="1.0" encoding="utf-8" ?>
<employees>
  <employee  salaried="no">
    <name>Test test</name>
    <hire_date>7/31/1996</hire_date>
  </employee>
  <employee  salaried="yes">
    <name>ZEYNEP PEHLIVAN</name>
    <hire_date>1/1/2009</hire_date>
  </employee>
</employees>

I will detect the changes at the level of employee tag, but you can change it for your needs and depending on your XML files' structures.

We have these steps:

  • Initialization
  • In this step, a hash value for each node is generated and added to each node with the ID attribute. The POS attribute is also added. We need the ID and POS attributes to be able to make joins in LINQ.

    POS = Parent node’s ancestors count + "-" + Parent node’s right siblings count + Node’s ancestors count + "-" + Node’s right siblings count.

    Image 1

  • Detecting changes

To start with, we get all the elements. BAsicXName is the “employee” in this example.

C#
var source = from o in Version1.DescendantsAndSelf(BasicXName)
             where o.Attribute("ID") != null && o.Name == BasicXName
             select new
             {
                  ID = o.Attribute("ID").Value,
                  POS = o.Attribute("POS").Value,
                  Element = o

             };
var version = from o in Version2.DescendantsAndSelf(BasicXName)
              where o.Attribute("ID") != null && o.Name == BasicXName
              select new
              {
                   ID = o.Attribute("ID").Value,
                   POS = o.Attribute("POS").Value,
                   Element = o
               };

MOVE(x,y,z): Moves a child node x from node y to node z. It can be considered as a delete and insert operation. Which means, node x is first deleted and then it is inserted as a child of z. If a node changes its order without changing its parent, it can be considered as a move operation in ordered trees, as shown in the figure below.

Image 2

If two nodes have the same ID but different POS, the node is considered as moved from POS1 to POS2. To detect this operation, an inner join on the ID attribute is used by filtering for the same POS attribute.

C#
var Moved = from o in version
                  join t in source on o.ID equals t.ID
                  where o.POS != t.POS
                  select new
                  {
                      IDNEW=o.ID,
                      IDOLD = t.ID,
                      REFNEW = o.POS,
                      REFOLD = t.POS
                  };

DELETE: Deletes a node. There are two methods to detect this operation. One is a left join between two versions, like this:

C#
var deleted = from o in source
                   join block1 in version on o.ID equals block1.ID
                   into elements
                   from t in elements.DefaultIfEmpty(null)
                   where !MovedIDs.Contains(o.ID)
                   select new
                   {
                       Element = o.Element

                   };

As mentioned, the move operation can be considered as a sequence of delete and insert operations. That’s why moved elements are also detected as deleted in this step. We remove them with where !MovedIDs.Contains(o.ID). We can detect an insert operation the same way but with a right join. The second way of detecting a delete operation is explained in the Insert section.

INSERT: inserts a node. We can get all the elements in the second version which are not in the first version, by using:

C#
var nodenotinsource = (from block2 in version
     select new { ID = block2.ID }).Except(from o in source select new { ID = o.ID });

To get the element node, we need another join because nodenotinsource just returns IDs.

C#
var inserted = from o in version
                    join block1 in nodenotinsource on o.ID equals block1.ID
                    select new
                    {
                        Element = o.Element

                    };

At the end, we can remove the attributes that we added; I prefer to keep POS, but you can remove it, too.

C#
DeletedElements.Descendants().Attributes("ID").Remove();
InsertedElements.Descendants().Attributes("ID").Remove();

Here is the felta file with a delete, insert, and move operation:

XML
<?xml version="1.0" encoding="utf-8"?>
<Delta From="Version1.xml" To="Version2.xml">
  <Delete>
    <employee salaried="yes" POS="0-0-1-3">
      <name POS="1-3-2-1">TEST DELETED</name>
      <hire_date POS="1-3-2-0">2/6/1998</hire_date>
    </employee>
  </Delete>
  <Insert>
    <employee salaried="yes" POS="0-0-1-0">
      <name POS="1-0-2-1">ZEYNEP - Inserted</name>
      <hire_date POS="1-0-2-0">1/1/2009</hire_date>
    </employee>
  </Insert>
  <Move>
    <employee From="0-0-1-2" To="0-0-1-3" />
    <employee From="0-0-1-1" To="0-0-1-2" />
    <employee From="0-0-1-0" To="0-0-1-1" />
  </Move>
</Delta>

The POS value is used for the move operation, but you can use another way for it; in fact, for all your delta file structure.

When it comes to complexity, it is always quadratic with jointures of LINQ. So it is O(n2).

LINQDiff.zip:

  • Hash.cs: to create and add the ID and POS attributes.
  • Performance.cs: I took this from MSDN to test the execution time. I removed the iteration.
  • Diff.cs: to detect changes.
  • Two XML files, versions 1 and 2.
  • Delta file

Points of interest

If you look at the code, I have detected all the operations in the same function, which is not really cool. If you want to write move, delete, insert functions, you can't pass the source and version as parameters. First, you should create an object, for example, the TagElement object, and then you can use LINQ to create the instances.

C#
var source = from o in source.Descendants(Root)
                   select new TagElement
                   {
                         ID = o.Attribute("ID").Value,
                         POS = o.Attribute("POS").Value,
                         ELEMENT = o
                    }; 

You can pass these results with IEnumerable<TagElement> source to another function.

Conclusion

I hope it was or it will be useful for someone out there :).

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)