How do I ignore DTD processing in my program?

Question

Hi, I'm trying to write a program that finds runs of three or more consecutive nodes of a particular type and prints them to the console.
For example, if my file contains consecutive nodes in the format

<xref ref-type="bibr" rid="ref11">[11]</xref>, <xref ref-type="bibr" rid="ref12">[12]</xref>, <xref ref-type="bibr" rid="ref13">[13]</xref>

then this is printed to the console. However, for

<xref ref-type="bibr" rid="ref11">[11]</xref>, <xref ref-type="bibr" rid="ref12">[12]</xref>, <xref ref-type="bibr" rid="ref14">[14]</xref>

nothing should be printed, because the rid value jumps from 12 to 14. A match should be found only when each rid value is exactly 1 greater than the previous one.
The code below does the job.

What I have tried:

C#

using System;
using System.Collections.Generic;
using System.Linq;
using System.IO;
using System.Text;
using System.Xml;
using System.Text.RegularExpressions;
namespace CityRemover
{
	class Program
	{
		public static void Main(string[] args)
		{
			string[] files=Directory.GetFiles(@"D:\test\Jobs\12335","*.xml");
			foreach (var file in files) {
				XmlDocument doc = new XmlDocument();
				doc.PreserveWhitespace = true;
				doc.Load(file);
				//selects only elements (typically <p>) that have 3 or more matching xref children; elements with fewer cannot contain a run of three
				XmlNodeList nodes = doc.DocumentElement.SelectNodes("//*[count(xref[@ref-type='bibr' and starts-with(@rid,'ref')])>2]");

				List<string> results = new List<string>();

				//Foreach <p>
				foreach (XmlNode x in nodes)
				{
					XmlNodeList xrefs = x.SelectNodes(".//xref[@ref-type='bibr' and starts-with(@rid,'ref')]");
					List<StartEnd> startEndOfEachTag = new List<StartEnd>(); // we mark the start and end of each ref.
					string temp = x.OuterXml; //the paragraph we're checking

					//finds the start and end of each xref tag inside the paragraph
					foreach (XmlNode xN in xrefs){
						//note: IndexOf returns the first occurrence, so a repeated xref maps to the earlier position
						int start = temp.IndexOf(xN.OuterXml);
						StartEnd se = new StartEnd(start, start + xN.OuterXml.Length);
						startEndOfEachTag.Add(se);
					}

					/* This comment shows the regex command used and how we build the regular expression we are checking with.
        string regexTester = Regex.Escape("<xref ref-type=\"bibr\" rid=\"ref2\">2</xref>")+"([ ]|(, ))" + Regex.Escape("<xref ref-type=\"bibr\" rid=\"ref3\">3</xref>");
        Match matchTemp = Regex.Match("<xref ref-type=\"bibr\" rid=\"ref2\">2</xref> <xref ref-type=\"bibr\" rid=\"ref3\">3</xref>", regexTester);
        Console.WriteLine(matchTemp.Value);*/

					//we go through all the xrefs
					for (int i=0; i<xrefs.Count; i++)
					{
						int newIterator = i; //This iterator prevents us from creating duplicates.
						string regCompare = Regex.Escape(xrefs[i].OuterXml); // The start xref

						int count = 1; //we got one xref to start with we need at least 3
						string tempRes = ""; //the string we store the result in

						int consecutive = Int32.Parse(xrefs[i].Attributes["rid"].Value.Substring(3));

						for (int j=i+1; j<xrefs.Count; j++) //we check with the other xrefs to see if they follow immediately after.
						{
							if(consecutive == Int32.Parse(xrefs[j].Attributes["rid"].Value.Substring(3)) - 1)
							{
								consecutive++;
							}
							else { break; }

							regCompare += "([ ]|(, ))" + Regex.Escape(xrefs[j].OuterXml); //we check that the xref comes exactly after a space or a comma and space
							

							Match matchReg;

							try
							{
								matchReg = Regex.Match(temp.Substring(startEndOfEachTag[i].start, startEndOfEachTag[j].end - startEndOfEachTag[i].start),
								                       regCompare); //we get the result
							}
							catch
							{
								i = j; // we failed and i should start from here now.
								break;
							}

							if (matchReg.Success){
								count++; //it was a success so we increment the number of xrefs we matched
								tempRes = matchReg.Value; // we store it as our temporary result.
								newIterator = j; //update where i should start from next time.
							}
							else {
								i = j; // we failed and i should start from here now.
								break;
							}
						}
						i = newIterator;
						if (count > 2)
						{
							results.Add(tempRes);
						}
					}
				}
				
				
				
				Console.WriteLine("Results: {0}",file.ToString());
				foreach(string s in results)
				{
					Console.WriteLine(s+"\n");
				}
			}
			Console.ReadKey();
		}
	}
	
	class StartEnd
	{
		public int start=-1;
		public int end = -1;

		public StartEnd(int start, int end)
		{
			this.start = start;
			this.end = end;
		}
	}
}

However, I get DTD processing errors for some files because they declare a DTD, and I want to ignore it.
So I tried
XmlReaderSettings settings = new XmlReaderSettings();
settings.XmlResolver = null;
settings.DtdProcessing = DtdProcessing.Ignore;
FileStream fs = new FileStream(file, FileMode.Open, FileAccess.Read);
XmlReader reader = XmlReader.Create(fs, settings);
XmlDocument doc = new XmlDocument();
doc.Load(reader);
instead of
XmlDocument doc = new XmlDocument();
doc.PreserveWhitespace = true;
doc.Load(file);
I don't get any errors now, but the matched expressions are no longer displayed either. I'm not that familiar with FileStream, though. Can anyone tell me what I am doing wrong?
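
One visible difference between the two loading snippets is that the reader-based version never sets PreserveWhitespace. With the default (false), XmlDocument drops whitespace-only text nodes, so a single space between two xref elements disappears from OuterXml and the "([ ]|(, ))" separator pattern can no longer match it. Whether that explains every missing match is not confirmed here. The following is only a minimal sketch of the reader-based load with that flag restored; DtdFreeLoader is a hypothetical helper name, not something from the original code.

```csharp
using System;
using System.IO;
using System.Xml;

static class DtdFreeLoader
{
    // Hypothetical helper: loads XML from a stream while skipping any DOCTYPE declaration.
    public static XmlDocument Load(Stream input)
    {
        var settings = new XmlReaderSettings
        {
            DtdProcessing = DtdProcessing.Ignore, // skip the DOCTYPE instead of throwing
            XmlResolver = null                    // never fetch external resources
        };

        var doc = new XmlDocument();
        // Must be set before Load, otherwise whitespace-only nodes
        // (such as the " " between two xref elements) are discarded.
        doc.PreserveWhitespace = true;

        // Disposing the reader releases it deterministically once loading is done.
        using (XmlReader reader = XmlReader.Create(input, settings))
        {
            doc.Load(reader);
        }
        return doc;
    }
}
```

Inside the foreach loop it could be called as: using (FileStream fs = File.OpenRead(file)) { XmlDocument doc = DtdFreeLoader.Load(fs); ... }. Keep in mind that with the DTD ignored, any entity that is declared only in that DTD is no longer known to the parser.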

Posted 16-Dec-17 3:26am

Member 12692000

Updated 16-Dec-17 3:30am

Comments

________________ 17-Dec-17 2:20am

In my opinion, using try-catch instead of if-else is a bad idea.
Something goes wrong, you don't know what, and the program continues anyway. That's not good.

Validate all the data beforehand, and decide what to do if a value is null, empty, and so on.
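
As a rough illustration of that advice: the only call inside the try block that can fail on bad offsets is Substring, which throws when the start is negative or the length does not fit. A sketch of an explicit check might look like the following; SpanGuard and TrySlice are hypothetical names, and this is one possible shape, not the only one.

```csharp
using System;

static class SpanGuard
{
    // Hypothetical helper: returns false instead of throwing when the
    // offsets cannot describe a valid slice of the text.
    public static bool TrySlice(string text, int start, int end, out string slice)
    {
        slice = null;
        // start < 0 covers the case where IndexOf did not find the tag (-1).
        if (text == null || start < 0 || end < start || end > text.Length)
        {
            return false;
        }
        slice = text.Substring(start, end - start);
        return true;
    }
}
```

The inner loop could then call TrySlice with startEndOfEachTag[i].start and startEndOfEachTag[j].end and break when it returns false, which makes the reason for skipping a candidate explicit.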

Member 12692000 17-Dec-17 2:38am

Well, the program has worked without any issues so far (for files without a DTD declaration, using the code posted above). My main question was whether my approach to ignoring the DTD is right. I'm not that good with streams, so I was hoping someone could point out whether there is a problem in that part of the code.

________________ 17-Dec-17 3:44am

I doubt that the Stream in .NET is the cause.
It is the standard object for accessing files and is used everywhere: when you call File.ReadAllText() or doc.Load(file), a Stream does the work internally.

Member 12692000 18-Dec-17 9:58am

Is my approach to using the stream right?

________________ 19-Dec-17 2:08am

It depends on the file size. If the files are really big (1 GB, or 100 MB with 10 files at the same time), .NET lets you read them piece by piece through a Stream (I'm talking about local files on disk).
If you need to read a single XML file, even one of 30 MB, simply use the existing functions that can do it. An average computer now has 4 GB of memory. It is not forbidden to use it.

________________ 19-Dec-17 2:16am

XmlReader can only read the file forward, and its stream could be connected to a network file. Its benefits are performance and a light footprint.

XmlDocument can construct and change the loaded XML, add nodes, and so on, but it is relatively slow and takes more memory.

You decide which one to use, according to your requirements.
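
To make the difference concrete, here is a small sketch of the forward-only style: it counts xref elements in a single pass without building a document tree. XrefCounter is a hypothetical name, and the DTD settings are simply carried over from the question.

```csharp
using System;
using System.IO;
using System.Xml;

static class XrefCounter
{
    // Hypothetical helper: counts <xref> elements in one forward pass.
    public static int Count(Stream input)
    {
        var settings = new XmlReaderSettings
        {
            DtdProcessing = DtdProcessing.Ignore,
            XmlResolver = null
        };

        int count = 0;
        using (XmlReader reader = XmlReader.Create(input, settings))
        {
            // Read() advances one node at a time; there is no way to go back.
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "xref")
                {
                    count++;
                }
            }
        }
        return count;
    }
}
```

For the task in the question, which runs XPath queries and compares OuterXml strings, XmlDocument is still the more convenient fit. The reader here only feeds it.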


This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)