Hi, I'm trying to write a program that finds runs of 3 or more consecutive nodes of a particular type and prints them to the console.
For example, if my file contains consecutive nodes in the format
<xref ref-type="bibr" rid="ref11">[11]</xref>, <xref ref-type="bibr" rid="ref12">[12]</xref>, <xref ref-type="bibr" rid="ref13">[13]</xref> then this is printed to the console. However, <xref ref-type="bibr" rid="ref11">[11]</xref>, <xref ref-type="bibr" rid="ref12">[12]</xref>, <xref ref-type="bibr" rid="ref14">[14]</xref> should not be printed, because the rid value jumps from 12 to 14; a match should be found only when each rid value increases by exactly 1.
Anyway, the code below does the job.

What I have tried:

C#
using System;
using System.Collections.Generic;
using System.Linq;
using System.IO;
using System.Text;
using System.Xml;
using System.Text.RegularExpressions;
namespace CityRemover
{
	class Program
	{
		public static void Main(string[] args)
		{
			string[] files=Directory.GetFiles(@"D:\test\Jobs\12335","*.xml");
			foreach (var file in files) {
				XmlDocument doc = new XmlDocument();
				doc.PreserveWhitespace = true;
				doc.Load(file);
				//only selects elements (typically <p>) that have 3 or more refs as direct children. No need to check elements that don't even have enough refs
				XmlNodeList nodes = doc.DocumentElement.SelectNodes("//*[count(xref[@ref-type='bibr' and starts-with(@rid,'ref')])>2]");

				List<string> results = new List<string>();

				//Foreach <p>
				foreach (XmlNode x in nodes)
				{
					XmlNodeList xrefs = x.SelectNodes(".//xref[@ref-type='bibr' and starts-with(@rid,'ref')]");
					List<StartEnd> startEndOfEachTag = new List<StartEnd>(); // we mark the start and end of each ref.
					string temp = x.OuterXml; //the paragraph we're checking

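					//finds the start and end offset of each xref tag inside the paragraph
					int searchFrom = 0; //search past the previous tag's start so a repeated xref is not mapped to its first occurrence
					foreach (XmlNode xN in xrefs){
						int start = temp.IndexOf(xN.OuterXml, searchFrom, StringComparison.Ordinal);
						startEndOfEachTag.Add(new StartEnd(start, start + xN.OuterXml.Length));
						if (start >= 0) { searchFrom = start + 1; }
					}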

					/* This comment shows the regex command used and how we build the regular expression we are checking with.
					string regexTester = Regex.Escape("<xref ref-type=\"bibr\" rid=\"ref2\">2</xref>")+"([ ]|(, ))" + Regex.Escape("<xref ref-type=\"bibr\" rid=\"ref3\">3</xref>");
					Match matchTemp = Regex.Match("<xref ref-type=\"bibr\" rid=\"ref2\">2</xref> <xref ref-type=\"bibr\" rid=\"ref3\">3</xref>", regexTester);
					Console.WriteLine(matchTemp.Value);*/

					//we go through all the xrefs
					for (int i=0; i<xrefs.Count; i++)
					{
						int newIterator = i; //This iterator prevents us from creating duplicates.
						string regCompare = Regex.Escape(xrefs[i].OuterXml); // The start xref

						int count = 1; //we got one xref to start with we need at least 3
						string tempRes = ""; //the string we store the result in

						int consecutive = Int32.Parse(xrefs[i].Attributes["rid"].Value.Substring(3));

						for (int j=i+1; j<xrefs.Count; j++) //we check with the other xrefs to see if they follow immediately after.
						{
							if(consecutive == Int32.Parse(xrefs[j].Attributes["rid"].Value.Substring(3)) - 1)
							{
								consecutive++;
							}
							else { break; }

							regCompare += "([ ]|(, ))" + Regex.Escape(xrefs[j].OuterXml); //we check that the xref comes exactly after a space or a comma and space
							

							Match matchReg;

							try
							{
								matchReg = Regex.Match(temp.Substring(startEndOfEachTag[i].start, startEndOfEachTag[j].end - startEndOfEachTag[i].start),
								                       regCompare); //we get the result
							}
							catch
							{
								i = j; // we failed and i should start from here now.
								break;
							}

							if (matchReg.Success){
								count++; //it was a success so we increment the number of xrefs we matched
								tempRes = matchReg.Value; // we store it as our temporary result.
								newIterator = j; //update where i should start from next time.
							}
							else {
								i = j; // we failed and i should start from here now.
								break;
							}
						}
						i = newIterator;
						if (count > 2)
						{
							results.Add(tempRes);
						}
					}
				}
				
				
				
				Console.WriteLine("Results: {0}",file.ToString());
				foreach(string s in results)
				{
					Console.WriteLine(s+"\n");
				}
			}
			Console.ReadKey();
		}
	}
	
	class StartEnd
	{
		public int start=-1;
		public int end = -1;

		public StartEnd(int start, int end)
		{
			this.start = start;
			this.end = end;
		}
	}
}

However, I get DTD processing errors for some files because they declare a DTD, which I want to ignore.
So I tried
XmlReaderSettings settings = new XmlReaderSettings();
settings.XmlResolver = null;
settings.DtdProcessing = DtdProcessing.Ignore;
FileStream fs = new FileStream(file, FileMode.Open, FileAccess.Read);
XmlReader reader = XmlTextReader.Create(fs, settings);
XmlDocument doc = new XmlDocument();
doc.Load(reader);
instead of
XmlDocument doc = new XmlDocument();
doc.PreserveWhitespace = true;
doc.Load(file);
With this change I don't get any errors, but the matched expressions are no longer displayed either. I'm not very familiar with FileStream. Can anyone tell me what I'm doing wrong?
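One visible difference between the two loading snippets is that the XmlReader version never sets doc.PreserveWhitespace = true. Below is a minimal sketch, assuming that missing setting is what hides the matches (this has not been confirmed against the actual files). DtdTolerantLoader and its Load method are hypothetical names introduced only for illustration; the XmlReaderSettings, XmlReader.Create and XmlDocument calls are the same ones used above.

```csharp
using System.IO;
using System.Xml;

static class DtdTolerantLoader
{
    // Hypothetical helper (not part of the original program): loads XML from a
    // stream while skipping any DOCTYPE declaration and keeping whitespace nodes.
    public static XmlDocument Load(Stream input)
    {
        var settings = new XmlReaderSettings
        {
            XmlResolver = null,                   // never fetch an external DTD
            DtdProcessing = DtdProcessing.Ignore  // skip the DOCTYPE instead of throwing
        };

        var doc = new XmlDocument();
        // Must be set before Load. Without it, whitespace-only text between
        // elements (such as a single space between two xref tags) is dropped,
        // so a pattern that expects that space can no longer match.
        doc.PreserveWhitespace = true;

        using (XmlReader reader = XmlReader.Create(input, settings))
        {
            doc.Load(reader);
        }
        return doc;
    }
}
```

Inside the foreach over files it could be called as `using (FileStream fs = File.OpenRead(file)) { XmlDocument doc = DtdTolerantLoader.Load(fs); }`, which also closes the stream once the document is loaded.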
Updated 16-Dec-17 3:30am
Comments
________________ 17-Dec-17 2:20am    
In my opinion, using try-catch instead of if-else is a bad idea.
If something goes wrong, you don't know what it was, and the program just continues. That's not good.

Validate all the data up front, and decide what to do if a value is null, empty, and so on.
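
As a rough illustration of checking the data before using it, the rid value could be validated before it is parsed, so that a missing or malformed attribute is handled explicitly rather than surfacing as an exception. RidParser and TryGetRidNumber are hypothetical names, not part of the posted program; the sketch assumes rid values have the form "ref" followed by digits, as in the question.

```csharp
using System;
using System.Globalization;
using System.Xml;

static class RidParser
{
    // Hypothetical helper: returns true and the numeric part when the node has
    // a rid attribute of the form "ref<digits>" (for example "ref12").
    public static bool TryGetRidNumber(XmlNode xref, out int number)
    {
        number = 0;

        // Attributes is null for nodes that are not elements.
        if (xref == null || xref.Attributes == null) return false;

        XmlAttribute rid = xref.Attributes["rid"];
        if (rid == null) return false;

        string value = rid.Value;
        if (!value.StartsWith("ref", StringComparison.Ordinal)) return false;

        // NumberStyles.None accepts digits only: no sign, spaces or separators.
        return int.TryParse(value.Substring(3), NumberStyles.None,
                            CultureInfo.InvariantCulture, out number);
    }
}
```

The caller can then decide what a false result means, for example ending the current run of consecutive references, instead of relying on a catch block.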
Member 12692000 17-Dec-17 2:38am    
Well, the program has worked without any issues so far (for files without a DTD declaration, using the code posted above). My main question is whether my approach to ignoring the DTD is right. I'm not that good with streams, so I was hoping someone could point out whether there is a problem in that part of the code.
________________ 17-Dec-17 3:44am    
I doubt that Stream in .NET is the cause.
It is the standard object for accessing files and is used everywhere: when you call File.ReadAllText() or doc.Load(file), a Stream does the work internally.

Member 12692000 18-Dec-17 9:58am    
Is my approach to using the stream right?
________________ 19-Dec-17 2:08am    
It depends on the file size. If the files are really big (1 GB, or 100 MB with 10 files open at the same time), .NET lets you read them in parts via a Stream (I'm talking about local files on disk).
If you need to read a single XML file, even one of 30 MB, simply use the existing functions that can do it. A typical computer now has 4 GB of memory, and there is nothing wrong with using it.

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)