Most websites publish an RSS feed that we can parse to get the latest news. Some sites do not provide one, however, so we have to parse the site's HTML directly.
You can download this sample here
Using the code
First of all, add a reference to HtmlAgilityPack. You can install it from NuGet in Visual Studio.
PS: If you are working on Windows Phone, that DLL causes some problems, so you must also add two more DLL references, including System.Xml.XPath, which you can also find on NuGet.
We create a new function that takes as a parameter the website you want to parse:
Then we send a request to the website to get the whole HTML page:
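// Requires: using System.Net; using System.Net.Http; using System.Text; using HtmlAgilityPack;
HttpClient http = new HttpClient();
var response = await http.GetByteArrayAsync(website);
// Decode every byte of the response with the page's encoding (utf-8 here).
String source = Encoding.GetEncoding("utf-8").GetString(response, 0, response.Length);
source = WebUtility.HtmlDecode(source);
HtmlDocument resultat = new HtmlDocument();
resultat.LoadHtml(source);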
PS: Pay attention to the encoding, because each website declares its own. This example uses utf-8; you can find the value in the charset attribute in the website's HTML.
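The charset can also be read programmatically rather than by eye. The following is a minimal sketch, assuming the page declares its charset somewhere in its first 2048 bytes. CharsetSniffer and Detect are hypothetical names, not part of HtmlAgilityPack or of the sample.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper: looks for a charset declaration at the top of a page.
public static class CharsetSniffer
{
    // Matches charset=utf-8, charset="utf-8" or charset='utf-8' (case-insensitive).
    private static readonly Regex CharsetPattern =
        new Regex(@"charset\s*=\s*[""']?\s*([A-Za-z0-9_\-]+)", RegexOptions.IgnoreCase);

    // Returns the declared encoding, or the fallback when no declaration is
    // found or the declared name is not supported by the runtime.
    public static Encoding Detect(byte[] pageBytes, Encoding fallback)
    {
        // The declaration is plain ASCII, so decoding the head as UTF-8 is safe:
        // invalid byte sequences become replacement characters instead of throwing.
        int count = Math.Min(pageBytes.Length, 2048);
        string head = Encoding.UTF8.GetString(pageBytes, 0, count);

        Match match = CharsetPattern.Match(head);
        if (!match.Success)
        {
            return fallback;
        }

        try
        {
            return Encoding.GetEncoding(match.Groups[1].Value);
        }
        catch (Exception ex) when (ex is ArgumentException || ex is NotSupportedException)
        {
            return fallback; // unknown or unsupported charset name
        }
    }
}
```

The returned encoding can then stand in for Encoding.GetEncoding("utf-8") when decoding the downloaded bytes, for example CharsetSniffer.Detect(response, Encoding.UTF8).GetString(response, 0, response.Length). Some runtimes only ship a few encodings by default, which is why the sketch falls back instead of failing.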
After that, we inspect the element we want to parse and note its id or class so that we can retrieve it easily.
As you can see in the picture, we want to parse the information about these devices, which are all wrapped in a ul. Before that, we must find the ancestor div that has an id or a class. In this example the div has a class named block_content.
So now we filter the HTML down to the content of this div, then get all the li tags that contain the information we want.
List<HtmlNode> toftitle = resultat.DocumentNode.Descendants()
    .Where(x => x.Name == "div" && x.Attributes["class"] != null && x.Attributes["class"].Value.Contains("block_content"))
    .ToList();
After each filter, it is a good idea to set a breakpoint and verify the result.
As a result we get 11 divs that have the class block_content, so you should check which item contains the information we want. In this example it is item No. 6.
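// toftitle[6] is the div identified with the debugger; its li descendants hold the device entries.
var li = toftitle[6].Descendants("li").ToList();
foreach (var item in li)
{
    // Take the first a, img and h5 tag found inside each li.
    var link = item.Descendants("a").First().GetAttributeValue("href", null);
    var img = item.Descendants("img").First().GetAttributeValue("src", null);
    var title = item.Descendants("h5").First().InnerText;
}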
Inside each li item, we get the link, the image and the title.
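An index found in the debugger is tied to the page's current layout and can shift when the site adds or removes a block. One possible alternative is to pick the div by its content. This is a minimal sketch, assuming the wanted block is the first candidate div whose li items contain an h5 title; that is an assumption about this page, not something the sample checks. BlockFinder and FindDeviceBlock are hypothetical names.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper: selects a div by what it contains, not by its position.
public static class BlockFinder
{
    // Returns the first candidate that has at least one li with an h5 inside,
    // or null when no candidate matches.
    public static HtmlNode FindDeviceBlock(IEnumerable<HtmlNode> candidates)
    {
        return candidates.FirstOrDefault(div =>
            div.Descendants("li").Any(li => li.Descendants("h5").Any()));
    }
}
```

With the list built earlier, the call would be BlockFinder.FindDeviceBlock(toftitle). A null result means the page structure no longer matches the assumption.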
Descendants lets you get all the tags with the specified name inside the item.
GetAttributeValue lets you get the value of an attribute of the tag.
InnerText lets you get the text between the tags.
InnerHtml lets you get the HTML between the tags.
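The four members above can be tried on a small inline HTML string. The following is a minimal sketch, not the sample's own code: DeviceEntry and ListParser are hypothetical names, and the markup in the usage note is made up for illustration. It uses FirstOrDefault so that an li without the expected tags yields null fields instead of an exception.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical container for one parsed li; not part of HtmlAgilityPack.
public class DeviceEntry
{
    public string Link { get; set; }
    public string Image { get; set; }
    public string Title { get; set; }
    public string TitleHtml { get; set; }
}

public static class ListParser
{
    // Parses every li in the given HTML into a DeviceEntry.
    public static List<DeviceEntry> ParseItems(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var entries = new List<DeviceEntry>();
        foreach (HtmlNode item in doc.DocumentNode.Descendants("li"))
        {
            // FirstOrDefault returns null when the tag is absent.
            HtmlNode a = item.Descendants("a").FirstOrDefault();
            HtmlNode img = item.Descendants("img").FirstOrDefault();
            HtmlNode h5 = item.Descendants("h5").FirstOrDefault();

            entries.Add(new DeviceEntry
            {
                Link = a?.GetAttributeValue("href", null),    // attribute value
                Image = img?.GetAttributeValue("src", null),  // attribute value
                Title = h5?.InnerText.Trim(),                 // text only, tags stripped
                TitleHtml = h5?.InnerHtml                     // markup kept
            });
        }
        return entries;
    }
}
```

For example, calling ListParser.ParseItems on `<ul><li><a href="/phone/1"><img src="/img/1.png"></a><h5>Phone <b>One</b></h5></li></ul>` returns one entry. Its Title comes from InnerText and so has the b tag stripped, while TitleHtml comes from InnerHtml and keeps it.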
The difficulty of parsing HTML depends on the structure of the website.