Is it possible to read through a malformed "XML" file and remove the malformed nodes? I really wish that rejecting malformed XML were an option, but that is not the case. Thankfully the XML is fairly simple, and the only big problem is unclosed nodes where the content was cut off. Essentially, the scenario that I'm trying to handle is the one below:

XML
<NODE1>
     <NODE2>
         <NODE3>
            <NODE4>Meaningful text goes here</NODE4>
         </NODE3>
     </NODE2>
     <NODE2>
         <NODE3>
            <NODE4>Meaningful text goes here</NODE4>
         </NODE3>
     </NODE2>
     <NODE2>
         <NODE3>
            <NODE4>TEXT IS CUT OFF
</NODE1>


What I need to do is find a way to 'fix' the XML by removing the unclosed items, which would turn it into the following:

XML
<NODE1>
     <NODE2>
         <NODE3>
            <NODE4>Meaningful text goes here</NODE4>
         </NODE3>
     </NODE2>
     <NODE2>
         <NODE3>
            <NODE4>Meaningful text goes here</NODE4>
         </NODE3>
     </NODE2>
</NODE1>


I'm working with C# on .NET 3.5, but any concepts would help as well.

Thanks in advance!
Comments
Sergey Alexandrovich Kryukov 25-Apr-13 17:42pm    
First of all, you can invent whatever you want, but don't call it XML. Secondly, you are opening a can of worms, something no one would be happy to deal with. Why? And, in practice, posting some problem can only make things worse. It's much better to demand well-formed XML in all cases.
—SA
Member 9666734 25-Apr-13 17:49pm    
I agree with what you're saying. So I'll start off by renaming this crap data that I'm getting FML. As for your second item, you have no idea how much I wish I could do that, but it simply isn't possible in my circumstances. We are stuck being the data janitors in this instance.
Sergey Alexandrovich Kryukov 25-Apr-13 17:51pm    
I would not advise getting into it...
—SA

By definition, a correct XML parser can never do such a thing, as it contradicts the criteria for a correct XML parser. So you need something else, which could be anything but an XML parser. For example, it could be a pre-processor used to "convert" trash into XML.

I have two notes here:
  1. You did not specify the exact behavior of such processing code, so the question as posed is not well defined.
  2. In practice, if you need something like this, I'm pretty sure you would have to design and implement the behavior you think you require yourself. Technically, it's quite possible one way or another, but you should not expect enthusiasm from other people or, hence, much help.


I tried to explain briefly the uselessness of such an approach in my comments to the question. See also: http://en.wikipedia.org/wiki/Garbage_in,_garbage_out.

Are you getting the point?

—SA
 
 
Comments
Espen Harlinn 25-Apr-13 18:29pm    
5'ed!
Sergey Alexandrovich Kryukov 25-Apr-13 18:42pm    
Thank you, Espen.
—SA
H.Brydon 25-Apr-13 20:56pm    
+5 from me too. [I give a lot of them to you, just don't usually comment on them...]
Sergey Alexandrovich Kryukov 25-Apr-13 21:06pm    
Thank you very much, Harvey.
—SA
This isn't a complete solution, but here's my idea for how to solve your problem: roll your own parser (shouldn't be too difficult, XML has simple syntax), keeping track of the current open node on a stack, popping it when the close tag is hit. When you hit a close tag that doesn't match the currently open tag, keep popping the stack until you reach the matching open tag. Then you should be able to keep the valid parts and ignore the rest.
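
To make the stack idea concrete, here is a rough C# sketch (it should compile against .NET 3.5). The names `TruncatedXmlRepair`, `Repair`, and `OpenElement` are made up for illustration. It assumes the input has no comments, CDATA sections, or `<`/`>` characters inside attribute values, that tags themselves are never cut in half, and that the root's close tag is present, as in the sample from the question. Each stack entry remembers where its element started in the output, so an unclosed element can be cut back out when a non-matching close tag shows up.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

public static class TruncatedXmlRepair
{
    // Matches <name ...>, </name> and <name ... />.
    // Assumption: no comments, CDATA, or '<' / '>' inside attribute values.
    private static readonly Regex Tag =
        new Regex(@"<(/?)([A-Za-z_][\w.:\-]*)([^<>]*?)(/?)>");

    private sealed class OpenElement
    {
        public string Name;
        public int OutputStart; // index in the output where the start tag begins
    }

    public static string Repair(string input)
    {
        StringBuilder output = new StringBuilder();
        Stack<OpenElement> open = new Stack<OpenElement>();
        int position = 0;

        foreach (Match m in Tag.Matches(input))
        {
            // Copy the text between the previous tag and this one.
            output.Append(input, position, m.Index - position);
            position = m.Index + m.Length;

            string name = m.Groups[2].Value;
            bool isClose = m.Groups[1].Value == "/";
            bool isSelfClosing = m.Groups[4].Value == "/";

            if (!isClose)
            {
                if (!isSelfClosing)
                {
                    OpenElement e = new OpenElement();
                    e.Name = name;
                    e.OutputStart = output.Length;
                    open.Push(e);
                }
                output.Append(m.Value);
                continue;
            }

            // Stray close tag with no matching open tag: drop it.
            if (!IsOpen(open, name))
                continue;

            // Pop every unclosed element above the matching one and cut the
            // output back to where the outermost of them started.
            while (open.Peek().Name != name)
                output.Length = open.Pop().OutputStart;

            open.Pop();
            output.Append(m.Value);
        }

        // Copy whatever follows the last tag.
        output.Append(input, position, input.Length - position);
        return output.ToString();
    }

    private static bool IsOpen(Stack<OpenElement> open, string name)
    {
        foreach (OpenElement e in open)
            if (e.Name == name) return true;
        return false;
    }
}
```

It would be wise to load the result of `Repair` into a real parser afterwards (for example `XmlDocument.LoadXml`), so that any input this sketch cannot handle still gets rejected rather than passed along.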

EDIT: Come to think of it, you could probably use XmlReader to do it, as long as you don't need anything after the malformed XML.
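
A minimal sketch of the XmlReader variant, again in C# for .NET 3.5, with the hypothetical names `XmlSalvage` and `KeepCompleteChildren`. It relies on `ReadOuterXml` throwing an `XmlException` when the element it is reading turns out to be malformed, so only children that were read completely get copied. It assumes the root element has no attributes or namespace declarations (they would be lost), and it drops any text or whitespace sitting directly under the root.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class XmlSalvage
{
    // Returns the root element with only the children that could be read
    // completely, or null if not even the root start tag could be read.
    public static string KeepCompleteChildren(string input)
    {
        StringBuilder output = new StringBuilder();
        string rootName = null;

        using (XmlReader reader = XmlReader.Create(new StringReader(input)))
        {
            try
            {
                reader.MoveToContent();   // position on the root start tag
                rootName = reader.Name;
                output.Append("<" + rootName + ">");
                reader.Read();            // step inside the root

                while (!reader.EOF && reader.NodeType != XmlNodeType.EndElement)
                {
                    if (reader.NodeType == XmlNodeType.Element)
                    {
                        // Throws XmlException if this child is malformed;
                        // otherwise leaves the reader just past the child.
                        output.Append(reader.ReadOuterXml());
                    }
                    else
                    {
                        reader.Read();    // skip whitespace, text, comments
                    }
                }
            }
            catch (XmlException)
            {
                // Hit the malformed part: keep what was copied so far.
            }
        }

        if (rootName == null)
            return null;

        output.Append("</" + rootName + ">");
        return output.ToString();
    }
}
```

Since reading stops at the first error, any well-formed children that come after the broken one are not recovered, which is the limitation mentioned above.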
 
Comments
Espen Harlinn 25-Apr-13 18:29pm    
Pretty much what I would have suggested; how well it would actually work depends on the schema :-D

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


