C#: Website HTML Content Parsing, or How to Get the Info You Need From a Website
How to get and parse website content.
Introduction
How can we get some content from a website?
We can use one of three approaches:
1. Open the website in a browser engine, i.e. the standard WebBrowser control or some third-party engine (here is an article about WebBrowser and third-party engines), and read the content of some DOM element of the page.
2. Download the HTML content via System.Net.WebClient and then parse it with String.IndexOf()/Substring(), regular expressions, or the HtmlAgilityPack library.
3. Use the website's API (if one exists): send a query to the API and get the response, also using System.Net.WebClient or other System.Net classes.
Way 1 - Via browser engine
For example, we have a weather website with the following HTML content:
<html>
<head><title>Weather</title></head>
<body>
City: <div id="city">Monte-Carlo</div>
Precipitation:
<div id="precip">
<img src="/rain.jpg" />
</div>
Day temperature: <div class="t">20 C</div>
Night temperature: <div class="t">18 C</div>
</body>
</html>
Tip: if you don't have internet access or can't reach my site (or want to create your own), you can navigate to a local *.html file with the same HTML content.
Let's get the city name (i.e. Monte-Carlo).
You create a WebBrowser (programmatically or in the form designer), navigate to the website, and when the page has loaded (in the DocumentCompleted event; make sure the page really is fully loaded), you get the DOM element (the first div) by its id "city" and read its inner text ("Monte-Carlo"):
// getting the city
var divCity = webBrowser1.Document.GetElementById("city"); // the div with id "city"
var city = divCity.InnerText;
label1.Text = "City: " + city;
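For context, the navigation and DocumentCompleted wiring described above can be sketched like this (a sketch only; the URL and the webBrowser1/label1 control names are taken from this article's examples):

```csharp
// e.g. in the form constructor or Load handler:
webBrowser1.DocumentCompleted += webBrowser1_DocumentCompleted;
webBrowser1.Navigate("http://csharp-novichku.ucoz.org/pagetoparse.html");

// ...

private void webBrowser1_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    // DocumentCompleted can also fire for frames; parse only when the main page is done
    if (e.Url != webBrowser1.Url) return;

    var divCity = webBrowser1.Document.GetElementById("city");
    if (divCity != null)
        label1.Text = "City: " + divCity.InnerText;
}
```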
Next, let's get the precipitation image link to show it in a PictureBox:
// getting precipitation
var divPrecip = webBrowser1.Document.GetElementById("precip");
var img = divPrecip.Children[0]; // first child element of precip, i.e. <img>
var imgSrc = img.GetAttribute("src"); // get src attribute of <img>
pictureBox1.ImageLocation = imgSrc;
Lastly, let's get day and night temperature:
// IE's document object doesn't have a GetElementsByClassName method, so we write it ourselves
// (requires: using System.Collections.Generic;)
private HtmlElement[] GetElementsByClassName(WebBrowser wb, string tagName, string className)
{
    var l = new List<HtmlElement>();
    var els = wb.Document.GetElementsByTagName(tagName); // all elements with the given tag
    foreach (HtmlElement el in els)
    {
        // getting the "class" attribute value...
        // but wait! it isn't "class", it is "className"! 0_o
        // el.GetAttribute("className") works, and el.GetAttribute("class") doesn't!
        // IE is so IE...
        if (el.GetAttribute("className") == className)
        {
            l.Add(el);
        }
    }
    return l.ToArray();
}
// ...
// getting day and night temperature
var divsTemp = GetElementsByClassName(webBrowser1, "div", "t");
// day
var divDayTemp = divsTemp[0]; // day temperature div
var dayTemp = divDayTemp.InnerText; // day temperature (i.e. 20 C)
label2.Text = "Day temperature: " + dayTemp;
// night
var divNightTemp = divsTemp[1]; // night temperature div
var nightTemp = divNightTemp.InnerText; // night temperature (i.e. 18 C)
label3.Text = "Night temperature: " + nightTemp;
Way 2 - Via WebClient And HtmlAgilityPack
You can download the full HTML content of a website via System.Net.WebClient:
using System.Net;
// ...
string HTML;
using (var wc = new WebClient()) // the "using" block disposes of the WebClient when we are done
{
    HTML = wc.DownloadString("http://csharp-novichku.ucoz.org/pagetoparse.html");
}
And then you can parse it with the HtmlAgilityPack third-party library, much like with the browser engine:
// create HtmlAgilityPack document object from HTML
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(HTML);
// parsing HTML
label1.Text = "City: " + doc.GetElementbyId("city").InnerText;
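The precipitation image can be extracted the same way. Note one difference from the WebBrowser: HtmlAgilityPack returns the src attribute exactly as written in the HTML, so a relative path like /rain.jpg has to be resolved against the site's base URL yourself (a sketch, reusing the doc object and the article's example page URL):

```csharp
// find the <img> inside the "precip" div and read its src attribute
var divPrecip = doc.GetElementbyId("precip");
var img = divPrecip.Element("img"); // first child <img> node
var imgSrc = img.GetAttributeValue("src", ""); // e.g. "/rain.jpg" (relative!)

// resolve the relative path against the page's URL
var baseUri = new Uri("http://csharp-novichku.ucoz.org/pagetoparse.html");
pictureBox1.ImageLocation = new Uri(baseUri, imgSrc).ToString();
```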
Note that HtmlAgilityPack does NOT support all of the WebBrowser engine's methods! For example, there is no GetElementsByTagName() method. You have to define such helpers yourself.
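For example, the GetElementsByClassName helper from Way 1 could be rebuilt on top of HtmlAgilityPack's Descendants() (a sketch; note that here the attribute really is called "class", unlike IE's "className"):

```csharp
using System.Linq;
using HtmlAgilityPack;

static HtmlNode[] GetElementsByClassName(HtmlDocument doc, string tagName, string className)
{
    // Descendants(tagName) enumerates all elements with the given tag anywhere in the document
    return doc.DocumentNode
              .Descendants(tagName)
              .Where(el => el.GetAttributeValue("class", "") == className)
              .ToArray();
}

// usage, continuing the weather example:
// var divsTemp = GetElementsByClassName(doc, "div", "t");
// label2.Text = "Day temperature: " + divsTemp[0].InnerText;
// label3.Text = "Night temperature: " + divsTemp[1].InnerText;
```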
Way 3 - Via Website API
To be continued...
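As a teaser of the idea in the meantime (everything here is hypothetical: the endpoint, the query parameter, and the JSON shape are invented for illustration):

```csharp
using System.Net;

// a hypothetical weather API returning JSON like
// {"city":"Monte-Carlo","day_t":20,"night_t":18}
string json;
using (var wc = new WebClient())
{
    json = wc.DownloadString("http://api.example.com/weather?city=monte-carlo");
}
// the response can then be parsed with a JSON library,
// e.g. Json.NET: (int)JObject.Parse(json)["day_t"]
```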
Which way is better?
A website API is usually the most convenient and lightweight way. But it usually comes with limitations, mainly for security reasons.
The WebBrowser way is the easiest when the site has no API. It also very naturally simulates user actions and sometimes lets you bypass a website's anti-bot protection. And it is the only way if the site's content is loaded by JavaScript, because WebClient cannot execute JS.
The WebClient way is faster and usually more robust than the WebBrowser way.