I'm trying to find runs of consecutive <xref ref-type="bibr" rid="ref...">...</xref> nodes (three or more) in a file, separated by a comma or a space, and write them to a log file.

NOTE: The consecutive nodes I'm trying to identify should have rid values whose numeric part (the rid minus the text "ref") increases by 1 from one node to the next. Here is a small sample file: https://codeshare.io/5wOjlK

and the desired output is
XML
<xref ref-type="bibr" rid="ref2">[2]</xref>, <xref ref-type="bibr" rid="ref3">[3]</xref>, <xref ref-type="bibr" rid="ref4">[4]</xref>


XML
<xref ref-type="bibr" rid="ref11">[11]</xref>, <xref ref-type="bibr" rid="ref12">[12]</xref> <xref ref-type="bibr" rid="ref13">[13]</xref>


Here is the code that I'm using: https://codeshare.io/ar6mPA. But it throws a "DTD not found" error. How do I ignore that? I tried using the code below.

What I have tried:

C#
FileStream xmlStream = new FileStream(@"D:\test\12345.XML", FileMode.Open, FileAccess.Read);
XmlReaderSettings settings = new XmlReaderSettings();
settings.XmlResolver = null;
settings.ProhibitDtd = false;
XmlReader reader = XmlTextReader.Create(xmlStream, settings);
XmlDocument doc = new XmlDocument();
doc.Load(reader);


instead of

C#
XmlDocument doc = new XmlDocument();
doc.PreserveWhitespace = true;
doc.Load(@"D:\test\12345.XML");

But now it shows only the first match. I'm confused. Can anyone help, please?
Updated 28-Feb-18 21:03pm

1 solution

I prefer to use the XDocument class, which is very flexible when you need to implement a custom search method. See:

C#
XmlReaderSettings settings = new XmlReaderSettings();
settings.DtdProcessing = DtdProcessing.Parse;

XDocument xdoc = XDocument.Load(XmlReader.Create("fullfilename.xml", settings));

var cons = xdoc.Descendants("xref")
    .GroupBy(x=>x.Parent)
    .Select(grp=> new
        {
            Parent = grp.Key,
            ConsecutiveNodes = grp.Select((n, i)=> new
                {
                    Index = i+1,
                    Node = n
                }),
            Count = grp.Count()
        })
    .ToList();

Console.WriteLine("3 or more consecutive nodes:");
foreach(var o in cons)
{
    if (o.Count>2)
    {
        Console.WriteLine("{0}", new string('=', 30));
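        // Show only the first and last 15 characters of the parent (fewer if it is shorter)
        string parentText = o.Parent.ToString();
        int edge = Math.Min(15, parentText.Length);
        Console.WriteLine("Found in: {0} ... {1}", parentText.Substring(0, edge), parentText.Substring(parentText.Length - edge));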
        Console.WriteLine("{0}", new string('-', 50));
        foreach (var c in o.ConsecutiveNodes)
        {
            //Console.WriteLine("{0}", c.Node);
            Console.WriteLine("Original rid value [{0}] will be replaced with [{1}]", c.Node.Attribute("rid").Value, c.Index);
            c.Node.Attribute("rid").Value = c.Index.ToString();
        }
    }
}


The above code displays:
3 or more consecutive nodes:
==============================
Found in: <p>In this stud ... 15]</xref>.</p>
--------------------------------------------------
Original rid value [ref2] will be replaced with [1]
Original rid value [ref3] will be replaced with [2]
Original rid value [ref4] will be replaced with [3]
Original rid value [ref20] will be replaced with [4]
Original rid value [ref3] will be replaced with [5]
Original rid value [ref15] will be replaced with [6]
==============================
Found in: <p>The measurin ... cattering..</p>
--------------------------------------------------
Original rid value [ref11] will be replaced with [1]
Original rid value [ref12] will be replaced with [2]
Original rid value [ref13] will be replaced with [3]
Original rid value [ref4] will be replaced with [4]
Original rid value [T2] will be replaced with [5]


For further information, please see:
XDocument.Load Method (XmlReader) (System.Xml.Linq)
XmlReaderSettings.DtdProcessing Property (System.Xml)

Feel free to change code to your needs. Good luck!
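
On the "DTD not found" error from the question: a minimal sketch of one way to load the file without fetching the external DTD while still keeping whitespace nodes. The class name DtdSafeLoader and its Load method are hypothetical helpers, not part of the original code. Setting XmlResolver to null stops the reader from resolving external resources, and PreserveWhitespace is set because the original XmlDocument code used it. Whether this restores all matches depends on the search code, which is not shown here.

```csharp
using System.IO;
using System.Xml;

// Hypothetical helper (not from the original post).
public static class DtdSafeLoader
{
    public static XmlDocument Load(TextReader input)
    {
        var settings = new XmlReaderSettings
        {
            // Accept the DOCTYPE declaration instead of throwing on it...
            DtdProcessing = DtdProcessing.Parse,
            // ...but never try to open the external .dtd file.
            XmlResolver = null
        };

        // Keep whitespace-only nodes, as the original doc.Load(path) version did.
        var doc = new XmlDocument { PreserveWhitespace = true };
        using (XmlReader reader = XmlReader.Create(input, settings))
        {
            doc.Load(reader);
        }
        return doc;
    }
}
```

For a file on disk this could be called as DtdSafeLoader.Load(File.OpenText(path)), where path is your own file name.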
 
Comments
Member 12692000 1-Mar-18 7:27am    
Hi Maciej, thanks for your post. By the way, can you explain this portion:
.Select(grp=> new
{
Parent = grp.Key,
ConsecutiveNodes = grp.Select((n, i)=> new
{
Index = i+1,
Node = n
}),
Count = grp.Count()
})

I did not get that. Also, why does it show rid values other than [ref...]? How can I get only the [ref...] values (in runs of 3 or more consecutive nodes), as in the desired output?
Maciej Los 1-Mar-18 8:12am    
Well... I wasn't sure what you meant by consecutive nodes. You probably want only those nodes whose [rid] attribute contains the word [ref].
Change this piece of code:
xdoc.Descendants("xref")
.Where(x=>x.Attribute("rid").Value.Contains("ref")) //<-- condition has been added!
.GroupBy(x=>x.Parent)


As to your question...
To get consecutive nodes (one after another), we need to group them by their parent (in this case the [p] nodes). The Select statement above takes each parent node and all [xref] nodes of that parent, and adds an index so that the [rid] attribute can be changed.
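
To illustrate the indexed Select described above: the overload Select((item, index) => ...) passes each element together with its zero-based position, which is why the answer uses i+1 to get 1-based numbers. A small sketch follows, with a hypothetical IndexedSelectDemo class and plain strings standing in for the [xref] nodes.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical demo class (not from the original post).
public static class IndexedSelectDemo
{
    // Pairs every rid with its 1-based position in the sequence.
    public static List<string> Label(IEnumerable<string> rids)
    {
        return rids
            .Select((rid, i) => string.Format("{0}: {1}", i + 1, rid)) // i starts at 0
            .ToList();
    }
}
```

Calling IndexedSelectDemo.Label(new[] { "ref2", "ref3", "ref4" }) gives "1: ref2", "2: ref3", "3: ref4".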
Member 12692000 1-Mar-18 8:51am    
Thanks for your reply. Can you do me one last favor?
Instead of showing every [rid] containing the text [ref], it should show only the [ref] [rid] values that are consecutive, i.e.
Original rid value [ref2] will be replaced with [1]
Original rid value [ref3] will be replaced with [2]
Original rid value [ref4] will be replaced with [3]
Original rid value [ref20] will be replaced with [4]
Original rid value [ref3] will be replaced with [5]
Original rid value [ref15] will be replaced with [6]

should be

Original rid value [ref2] will be replaced with [1]
Original rid value [ref3] will be replaced with [2]
Original rid value [ref4] will be replaced with [3]

and so on...
Maciej Los 1-Mar-18 8:58am    
Oohhh! You mean consecutive in the sense that the rid values are consecutive (ref2, ref3, ref4), not merely that the [xref] nodes follow one another...
Member 12692000 1-Mar-18 9:33am    
yes
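
The thread stops here without code for that last requirement. The following is only a sketch of one possible approach, not the answerer's solution. It assumes that rid values have the form "ref" followed by digits and that a run is broken by any text other than commas and whitespace, by any other element, or by a rid number that is not exactly the previous number plus 1. The names ConsecutiveRefs, FindRuns and Describe are hypothetical.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

// Hypothetical helper (not from the original thread).
public static class ConsecutiveRefs
{
    // Returns N for an <xref rid="refN"> element, or null for anything else.
    private static int? RefNumber(XElement e)
    {
        if (e.Name.LocalName != "xref") return null;
        Match m = Regex.Match((string)e.Attribute("rid") ?? "", @"^ref(\d{1,9})$");
        return m.Success ? int.Parse(m.Groups[1].Value) : (int?)null;
    }

    // True when the text consists only of commas and whitespace.
    private static bool IsSeparator(string text)
    {
        return text.All(c => c == ',' || char.IsWhiteSpace(c));
    }

    // Finds runs of at least minLength xref children whose rid numbers rise by 1.
    public static List<List<XElement>> FindRuns(XElement parent, int minLength = 3)
    {
        var runs = new List<List<XElement>>();
        var current = new List<XElement>();
        int last = 0; // rid number of the last element in the current run

        foreach (XNode node in parent.Nodes())
        {
            var text = node as XText;
            if (text != null && IsSeparator(text.Value)) continue; // run may continue

            var element = node as XElement;
            int? n = element != null ? RefNumber(element) : null;

            bool extendsRun = n.HasValue && current.Count > 0 && n.Value == last + 1;
            if (!extendsRun)
            {
                // Close the current run and keep it only if it is long enough.
                if (current.Count >= minLength) runs.Add(current);
                current = new List<XElement>();
            }
            if (n.HasValue)
            {
                current.Add(element);
                last = n.Value;
            }
        }
        if (current.Count >= minLength) runs.Add(current);
        return runs;
    }

    // One line of text per run, scanning every element of the document.
    public static List<string> Describe(XDocument doc)
    {
        return doc.Descendants()
            .SelectMany(p => FindRuns(p))
            .Select(run => string.Join(", ",
                run.Select(x => x.ToString(SaveOptions.DisableFormatting))))
            .ToList();
    }
}
```

Each string returned by Describe(xdoc) can then be written to the log file. Note that the runs are joined with ", " here, so the original separator (comma or space) is not reproduced.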

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


