
How to Parse HTML Using C#

Anis Derbel, 6 Aug 2014
Get information from any website you want

Introduction

Many websites publish an RSS feed that we can parse to get their latest news. Some do not, however, and in that case we have to parse the site's HTML directly.

You can download this sample here

Using the code

First of all, we need to add a reference to HtmlAgilityPack, which you can install from NuGet in Visual Studio.

PS: On Windows Phone this DLL causes some problems, so you must also add two assemblies, System.Net.Http and System.Xml.XPath, which you can also find on NuGet.

We create a new function that takes as a parameter the website you want to parse:

 Parsing("http://www.mytek.tn/");

Then we send a request to the website to download the whole HTML page:

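// Requires: using System.Net; using System.Net.Http; using System.Text; using HtmlAgilityPack;
HttpClient http = new HttpClient();
byte[] response = await http.GetByteArrayAsync(website);
// Decode the full byte array as UTF-8 (Length, not Length - 1, so the last byte is kept).
string source = Encoding.GetEncoding("utf-8").GetString(response, 0, response.Length);
// Decodes entities in the whole page up front so that InnerText is readable later.
// Note that this can also turn escaped markup such as &lt;b&gt; into real tags.
source = WebUtility.HtmlDecode(source);
// Parse the markup into a DOM-like tree that we can query.
HtmlDocument resultat = new HtmlDocument();
resultat.LoadHtml(source);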

PS: Pay attention to the encoding. Each website declares its own, and this example uses UTF-8. You can find it in the charset attribute in the page's HTML, for example in a meta tag.
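Instead of hard-coding the encoding, the charset can often be read from the HTTP Content-Type header. The following is a minimal sketch under that assumption. PageDownloader, ResolveEncoding and DownloadHtmlAsync are hypothetical names that are not part of the original sample, and the sketch does not look at the meta tag inside the HTML itself. It falls back to UTF-8 when the header is missing or names a charset the runtime does not know.

```csharp
using System;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

public static class PageDownloader
{
    // Hypothetical helper: maps a charset name to an Encoding, falling back to UTF-8.
    public static Encoding ResolveEncoding(string charset)
    {
        if (string.IsNullOrWhiteSpace(charset))
            return Encoding.UTF8;
        try
        {
            // Some servers quote the value, e.g. charset="utf-8".
            return Encoding.GetEncoding(charset.Trim().Trim('"', '\''));
        }
        catch (ArgumentException)
        {
            // Unknown or unsupported charset name.
            return Encoding.UTF8;
        }
    }

    // Hypothetical helper: downloads a page and decodes it with the charset
    // declared in the Content-Type response header, if there is one.
    public static async Task<string> DownloadHtmlAsync(HttpClient http, string website)
    {
        using (HttpResponseMessage response = await http.GetAsync(website))
        {
            response.EnsureSuccessStatusCode();
            byte[] bytes = await response.Content.ReadAsByteArrayAsync();
            string charset = response.Content.Headers.ContentType?.CharSet;
            return ResolveEncoding(charset).GetString(bytes, 0, bytes.Length);
        }
    }
}
```

A call such as `string source = await PageDownloader.DownloadHtmlAsync(http, website);` would then produce the string that is passed to LoadHtml.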

After that, we inspect the element we want to parse and note its id or class so that we can retrieve it easily.

 

As you can see in the picture, we want to parse the information about these devices, which are all wrapped in a ul. Before that, we must find an ancestor div that has an id or a class. In this example the div has a class named block_content.

So now we filter the HTML down to the content of this div, then get all the li tags that contain the information we want.

 List<HtmlNode> toftitle = resultat.DocumentNode.Descendants().Where
 (x => (x.Name == "div" && x.Attributes["class"] != null && x.Attributes["class"].Value.Contains("block_content"))).ToList();
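The same filter can also be written as an XPath query, on builds of HtmlAgilityPack that expose SelectNodes. This is only a sketch: BlockFinder and FindDivsByClass are hypothetical names, and the class name is assumed to contain no single quote, because it is pasted into the XPath expression as is.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class BlockFinder
{
    // Hypothetical helper: returns every div whose class attribute contains className.
    // Assumes className contains no single quote.
    public static List<HtmlNode> FindDivsByClass(HtmlDocument document, string className)
    {
        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection nodes = document.DocumentNode.SelectNodes(
            "//div[contains(@class, '" + className + "')]");
        return nodes == null ? new List<HtmlNode>() : nodes.ToList();
    }
}
```

With the document from this article, the call would be `BlockFinder.FindDivsByClass(resultat, "block_content")`.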

After each filter, it is a good idea to set a breakpoint and check the result in the debugger.

 

As a result we get 11 divs with the class block_content, so you should check which item contains the information we want. In this example it is the item at index 6.
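A fixed index stops working as soon as the site adds or removes a block. As an alternative, the block could be picked by what it contains. The sketch below assumes that the block we want is the first block_content div holding li elements with an h5 title. ProductBlockPicker and FindProductBlock are hypothetical names.

```csharp
using System.Linq;
using HtmlAgilityPack;

public static class ProductBlockPicker
{
    // Hypothetical helper: returns the first block_content div that contains
    // at least one li with an h5 inside it, or null when there is none.
    public static HtmlNode FindProductBlock(HtmlDocument document)
    {
        return document.DocumentNode.Descendants("div")
            .Where(div => div.GetAttributeValue("class", "").Contains("block_content"))
            .FirstOrDefault(div => div.Descendants("li")
                .Any(li => li.Descendants("h5").Any()));
    }
}
```

The result should be checked for null before use, since the page layout may change.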

 // Take the block at index 6 and read every li inside it.
 var li = toftitle[6].Descendants("li").ToList();
 foreach (var item in li)
 {
   // First a, img and h5 inside the item; [0] throws if the tag is missing.
   var link = item.Descendants("a").ToList()[0].GetAttributeValue("href", null);
   var img = item.Descendants("img").ToList()[0].GetAttributeValue("src", null);
   var title = item.Descendants("h5").ToList()[0].InnerText;
 }

Inside each li item, we get the link, the image and the title.

Descendants returns all tags with the specified name inside the item.

GetAttributeValue returns the value of an attribute of the tag, or the given default when the attribute is missing.

InnerText returns the text between the tags.

InnerHtml returns the HTML between the tags.
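The calls described above can be combined into a small helper that keeps the values instead of discarding them. The following is a sketch, not the article's original code. ScrapedItem and ItemExtractor are hypothetical names. It assumes that each entry has an a and an h5, skips entries that do not, and resolves relative URLs against a base address supplied by the caller.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical container for one scraped entry.
public class ScrapedItem
{
    public string Link { get; set; }
    public string Image { get; set; }
    public string Title { get; set; }
}

public static class ItemExtractor
{
    // Hypothetical helper: reads link, image and title from every li under block.
    public static List<ScrapedItem> Extract(HtmlNode block, Uri baseUri)
    {
        var items = new List<ScrapedItem>();
        foreach (HtmlNode li in block.Descendants("li"))
        {
            // FirstOrDefault returns null instead of throwing when a tag is missing.
            HtmlNode a = li.Descendants("a").FirstOrDefault();
            HtmlNode img = li.Descendants("img").FirstOrDefault();
            HtmlNode h5 = li.Descendants("h5").FirstOrDefault();
            if (a == null || h5 == null)
                continue; // skip entries that do not have the expected shape

            string href = a.GetAttributeValue("href", null);
            string src = img == null ? null : img.GetAttributeValue("src", null);
            items.Add(new ScrapedItem
            {
                // Turn relative addresses such as /p/1 into absolute ones.
                Link = href == null ? null : new Uri(baseUri, href).AbsoluteUri,
                Image = src == null ? null : new Uri(baseUri, src).AbsoluteUri,
                // Decode any remaining entities and trim surrounding whitespace.
                Title = HtmlEntity.DeEntitize(h5.InnerText).Trim()
            });
        }
        return items;
    }
}
```

With the variables from this article, a call could look like `ItemExtractor.Extract(toftitle[6], new Uri("http://www.mytek.tn/"))`.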

History

The difficulty of parsing HTML depends on the structure of the website.

 

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


About the Author

Anis Derbel
Software Developer (Junior) Microsoft Student Partners
Tunisia
I study Software Engineering, I am 23 years old, and I am motivated by all Microsoft technologies.
Since joining the Microsoft community as a Microsoft Student Partner, I have developed many apps for the Windows and Windows Phone platforms. Now it is time to share what I have learned here, and I am ready to help everyone.
You can contact me at any time (anisderbel@outlook.com)

Last Updated 6 Aug 2014
Article Copyright 2014 by Anis Derbel
Everything else Copyright © CodeProject, 1999-2014