Hello,

I have an HTML file that I'm trying to parse in VB.NET. I want to read the whole file and get all of the values inside the tags.
It has the following format:
<html>
<head>
<td>
</td>
</head>
<body>
</body>
</html>

What I have tried:

I used XmlTextReader and called .Read() in a loop to go through the file, but the problem is that .Read() stops reading the file after the first occurrence of an ending tag.
How can I read the whole file? I even tried checking .EOF, but that doesn't work either.

Thanks.
Posted
Updated 20-Aug-19 4:14am

Well, strictly speaking, an HTML file is not XML at all. It may look very similar to XML because it uses tags, but it isn't. See: XML - Wikipedia
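
To illustrate the difference, here is a minimal sketch of what happens when an XML reader meets markup that is acceptable HTML but not well-formed XML. The snippet and the module name are made up for illustration, and XmlReader.Create is used here rather than the XmlTextReader from the question.

```vb
Imports System
Imports System.IO
Imports System.Xml

Module HtmlIsNotXml
    Sub Main()
        ' Hypothetical snippet: <br> has no closing tag, which is fine in HTML but not in XML.
        Dim markup As String = "<p>one<br>two</p>"

        Try
            Using reader As XmlReader = XmlReader.Create(New StringReader(markup))
                ' Each Read() call advances to the next node until the parser hits an error.
                While reader.Read()
                    Console.WriteLine(reader.NodeType.ToString() & " " & reader.Name)
                End While
            End Using
        Catch ex As XmlException
            ' The XML parser rejects the </p> end tag because <br> is still open.
            Console.WriteLine("Not well-formed XML: " & ex.Message)
        End Try
    End Sub
End Module
```

The first few nodes are read normally, and the exception is raised only when the reader reaches the mismatched end tag.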

If you want to convert an HTML file into an XML file, I'd suggest using SgmlReader. SgmlReader is a .NET library that is handy for converting SGML content (like HTML and OFX) into well-formed XML via XmlReader, XmlDocument, XDocument or XPathDocument. It runs on Windows and Linux using Mono.

If you only want to get the data between the tags, you have to create an "HTML parser". For suggestions, please see: Google
 
Comments
Zainab_m 20-Aug-19 8:06am    
I only want to get the data between the tags, and I found that XmlTextReader is working fine for me. I just want to know how to solve this issue.
I used the Read method in a While loop:
While Reader.Read()

End While

While XML methods can be used on HTML documents, HTML documents generally are not valid XML.

What you may want to try is a library specifically created for parsing HTML documents, such as the HTML Agility Pack. It is a widely used package, so there is a lot of documentation and many code samples available.
Html Agility Pack

Also... the "outline" you provided is not valid HTML; a TD needs to be within a TR which needs to be in a TABLE, which must be in the BODY of the document.
 
Comments
Zainab_m 21-Aug-19 8:15am    
I know that HTML is not XML, but if the XML methods work fine with HTML, why can't I use them?
The HAP is written in C#. How can I use it in VB.NET?
MadMyche 21-Aug-19 9:46am    
You can do what you want, but if you ask code to work on something it was not designed for, then you can only blame yourself. Knives are designed for cutting, not for turning screws.

As for HAP being written in C#, it really doesn't matter, as it is a .NET library. Add a reference to it in your project and then all of its public methods are available to you.
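
As a rough sketch of what that could look like from VB.NET, assuming the HtmlAgilityPack NuGet package has been added to the project: the input string and module name below are hypothetical, and this is only one way to collect the text between tags.

```vb
Imports System
Imports HtmlAgilityPack ' assumes the HtmlAgilityPack NuGet package is referenced

Module HapSketch
    Sub Main()
        ' Hypothetical input; for a file on disk, doc.Load("page.html") could be used instead.
        Dim html As String = "<html><body><table><tr><td>One</td><td>Two</td></tr></table></body></html>"

        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)

        ' Walk every node and print the non-empty text found between tags.
        For Each node As HtmlNode In doc.DocumentNode.Descendants()
            If node.NodeType = HtmlNodeType.Text Then
                Dim value As String = node.InnerText.Trim()
                If value.Length > 0 Then
                    Console.WriteLine(node.ParentNode.Name & ": " & value)
                End If
            End If
        Next
    End Sub
End Module
```

Unlike an XML reader, HAP is built to tolerate unclosed or badly nested tags, so the loop should keep going on real-world HTML.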

Did you try looking up VB in their knowledge base?
Html Agility Pack Knowledge Base | Tagged VB-Net

That's the correct behaviour of the Read method; see the documentation.
In order to process the whole file, you have to call the Read method repeatedly (see the sample code in the documentation).
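
A minimal sketch of such a loop, assuming the input is well-formed XML (the markup string and module name are made up for illustration, and XmlReader.Create is used in place of XmlTextReader):

```vb
Imports System
Imports System.IO
Imports System.Xml

Module ReadAllNodes
    Sub Main()
        ' Hypothetical well-formed input; real HTML is often not well-formed XML.
        Dim markup As String = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"

        Using reader As XmlReader = XmlReader.Create(New StringReader(markup))
            ' Read() moves to the next node and returns False once the input is exhausted.
            While reader.Read()
                Select Case reader.NodeType
                    Case XmlNodeType.Element
                        Console.WriteLine("Start: " & reader.Name)
                    Case XmlNodeType.Text
                        Console.WriteLine("Text:  " & reader.Value)
                    Case XmlNodeType.EndElement
                        Console.WriteLine("End:   " & reader.Name)
                End Select
            End While
        End Using
    End Sub
End Module
```

Each call to Read() advances by a single node (start tag, text, end tag), so the work has to happen inside the loop body. If the loop ends early on an HTML file, an XmlException raised by malformed markup is one possible cause worth checking.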
 
Comments
Zainab_m 20-Aug-19 8:04am    
Yes, I'm calling the Read method repeatedly in a While loop, but it is still exiting after the first ending tag.

While reader.Read()

End While

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


