Multiple values using HtmlAgility

Question

Hi Everyone,

I am posting this as a separate question because it is more likely to get an answer that way. I just started experimenting with HAP (HtmlAgilityPack) and I am having difficulty figuring out how to get some of my values.

I am using this file as an example and storing the returned values in a ListView. The problem is that I don't know how to get each value.

XML

<?xml version="1.0" encoding="UTF-8"?>

<bookstore>

<book>
   <title lang="en">Harry Potter</title>
   <price>29.99</price>
   <available>In Stock</available>
</book>

<book>
   <title lang="en">Learning XML</title>
   <price>39.95</price>
   <available>In Stock</available>
</book>

<book>
   <title lang="en">Learning C#</title>
   <price>59.95</price>
   <available>Backorder</available>
</book>

<book>
   <title lang="en">Learning Java</title>
   <price>39.95</price>
   <available>In Stock</available>
</book>

</bookstore>

Can someone show me an example of how to traverse the tree and get each value for each of the books, one book at a time?

This is all I know how to do right now.

C#

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.Load("sample.txt");

foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//title"))
{
    ListViewItem lView = new ListViewItem();
    lView.Text = node.InnerText;
    listView1.Items.Add(lView);
}

Appreciate any help.

Posted 13-May-15 15:00pm

theadmin

Updated 13-May-15 23:07pm

Mario Z

v2

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Mario Z · Accepted Answer · 2015-05-13T23:18:00

Solution 1

Why are you using HtmlAgilityPack? Can't you just use an XML reader?
.NET has a few of those available out of the box.

Here is how you can read that content with the XmlDocument class (it is quite similar to HtmlAgilityPack's HtmlDocument):

C#

// Requires: using System.Xml;
XmlDocument doc = new XmlDocument();
doc.Load("sample.txt");

foreach (XmlNode node in doc.SelectNodes("//book"))
{
    string title = node.SelectSingleNode("title").InnerText;
    string price = node.SelectSingleNode("price").InnerText;
    string available = node.SelectSingleNode("available").InnerText;

    // Do something with these values ...
}
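
As a further illustration of the "out of the box" readers mentioned above, here is a minimal sketch that reads the same values with LINQ to XML (XDocument from System.Xml.Linq). It is not part of the original answer. The BookReader class and ReadBooks method names are made up for the example, and the inline XML is a shortened copy of the sample file from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class BookReader
{
    // Hypothetical helper: returns one (title, price, available) tuple per <book> element.
    public static List<(string Title, string Price, string Available)> ReadBooks(string xml)
    {
        XDocument doc = XDocument.Parse(xml);

        return doc.Descendants("book")
            .Select(book => (
                // Casting an XElement to string yields its text, or null if the element is missing.
                Title: (string)book.Element("title"),
                Price: (string)book.Element("price"),
                Available: (string)book.Element("available")))
            .ToList();
    }

    public static void Main()
    {
        // Shortened copy of the sample file from the question.
        string xml = @"<bookstore>
  <book>
    <title lang=""en"">Harry Potter</title>
    <price>29.99</price>
    <available>In Stock</available>
  </book>
  <book>
    <title lang=""en"">Learning C#</title>
    <price>59.95</price>
    <available>Backorder</available>
  </book>
</bookstore>";

        foreach (var book in ReadBooks(xml))
        {
            Console.WriteLine($"{book.Title} | {book.Price} | {book.Available}");
        }
    }
}
```

To read from a file instead of a string, XDocument.Load("sample.txt") can replace XDocument.Parse(xml).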

Posted 13-May-15 23:18pm

Mario Z

Comments

theadmin 14-May-15 18:10pm

Well, here is the story. I was manually parsing web pages for my values, and someone asked me why I was parsing by hand instead of just using HtmlAgilityPack. I decided to give it a shot and I am trying to learn how to use it, and now you present an even easier way. I am about 100% more confused than I was before, but I do like your solution and I am definitely going to use it.

theadmin 14-May-15 18:10pm

I just found out that this isn't going to work with the HTML files. I was only using that file as a sample to test with.

Mario Z 15-May-15 3:54am

Actually, you could use XmlDocument to parse HTML content as well (provided the markup is well-formed XML), but to be honest I would also have gone with HtmlAgilityPack in that case, because it is a library focused on parsing HTML content and comes with quite a few nice HTML-only features.

Now, regarding the provided sample file, I would strongly recommend that you do not do that in the future. You can easily misrepresent your requirements by providing an incorrect sample: there is a difference between HTML and XML content, and different tools are generally used for processing them.

Also, as a side note on the misleading sample: this is like asking for help learning to ride a skateboard when what you actually want to learn is how to ride a snowboard.
I hope you understand what I'm saying...

theadmin 15-May-15 7:59am

I understand what you are saying. I thought that once I understood how it works on a much simpler file, I would know what to do with a more complex one. That really backfired and I am still lost. Can you show me how to get the values from this site, if you don't mind? http://www.butchsreloading.com/shop/35-powder?id_category=35&n=50 Start with this product (Hodgdon Powder, H414, 1lbs) and get the product, price, and availability. Using Chrome I get the following XPaths. I will be getting all the products available on the page.

//*[@id="product_list"]/li[1]/div[2]/h3/a
//*[@id="product_list"]/li[1]/div[3]/div/span[1]
//*[@id="product_list"]/li[1]/div[3]/div/span[2]

Mario Z 18-May-15 7:25am

I apologize for the late response; I did not have much free time.
Nevertheless, please take a look at "Solution 4".

Mario Z 25-May-15 3:21am

Hi, for further clarification here is what the code in "Solution 4" does:

"//ul[@id='product_list']" targets the "ul" element with the specified id attribute anywhere in the document. It matches anywhere because of the "//" at the start.
The follow-up "/li" targets all of its child "li" elements.

Now that you have all the list items, you can select the required elements inside each one. The product's name is located in the list item's second "div", under the "h3", under the "a". To target this content you use "./div[2]/h3/a". Notice the "." at the beginning: it indicates that this particular XPath is evaluated relative to the current node.

I would recommend that you familiarize yourself with XPath. You can go through the w3schools XPath lessons; after that I believe everything will be clearer to you.

I hope this helps.
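
The difference between a relative path (starting with ".") and an absolute one (starting with "//") can be seen in a small sketch. To stay runnable without extra packages it uses XmlDocument on a tiny, well-formed snippet rather than HtmlAgilityPack on the real page; the XPath expressions behave the same way. The XPathDemo class, its method names, and the second product name are made up for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class XPathDemo
{
    // Hypothetical, simplified stand-in for the product list markup.
    public const string Sample = @"<html><body><ul id='product_list'>
  <li><div>image</div><div><h3><a href='#'>Hodgdon Powder, H414, 1lbs</a></h3></div></li>
  <li><div>image</div><div><h3><a href='#'>Example Powder B</a></h3></div></li>
</ul></body></html>";

    // Collects the product name of every list item using a relative XPath.
    public static List<string> ProductNames(string markup)
    {
        var doc = new XmlDocument();
        doc.LoadXml(markup);

        var names = new List<string>();
        foreach (XmlNode listItem in doc.SelectNodes("//ul[@id='product_list']/li"))
        {
            // "./" starts the search at the current list item.
            XmlNode anchor = listItem.SelectSingleNode("./div[2]/h3/a");
            if (anchor != null)
            {
                names.Add(anchor.InnerText.Trim());
            }
        }
        return names;
    }

    // Evaluates the given XPath with the first list item as the context node
    // and returns how many nodes it matches.
    public static int CountFromFirstItem(string markup, string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(markup);
        XmlNode firstItem = doc.SelectSingleNode("//ul[@id='product_list']/li[1]");
        return firstItem.SelectNodes(xpath).Count;
    }

    public static void Main()
    {
        foreach (string name in ProductNames(Sample))
        {
            Console.WriteLine(name);
        }

        // Relative: only the anchor inside the first list item.
        Console.WriteLine(CountFromFirstItem(Sample, ".//h3/a")); // prints 1
        // Absolute: every matching anchor in the whole document, even though
        // the call is made on the first list item.
        Console.WriteLine(CountFromFirstItem(Sample, "//h3/a"));  // prints 2
    }
}
```

Forgetting the leading "." inside the loop is a common mistake: every iteration then returns the first product on the page instead of the one belonging to the current list item.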

Mario Z · Accepted Answer · 2015-05-18T01:25:00

This answers the OP's requirements mentioned in the comments on "Solution 1".

C#

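// Requires: using HtmlAgilityPack;
// The page URL given by the OP in the comments on "Solution 1".
string url = "http://www.butchsreloading.com/shop/35-powder?id_category=35&n=50";

HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(url);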

foreach (HtmlNode listItem in doc.DocumentNode.SelectNodes("//ul[@id='product_list']/li"))
{
    string product = listItem.SelectSingleNode("./div[2]/h3/a").InnerText;
    string price = listItem.SelectSingleNode("./div[3]/div/span[1]").InnerText;
    string availability = listItem.SelectSingleNode("./div[3]/div/span[2]").InnerText;

    // Do something with these values ...
}

EDIT: This answers the OP's requirements mentioned in the comments on this solution.

There is a slight difference between the XPath results that you get from Chrome and what HtmlAgilityPack requires.
So may I suggest that you instead use the following tool to get the required XPaths:
https://hapxpathfinder.codeplex.com/

Now for the targeted products in "http://store.thirdgenerationshootingsupply.com/browse.cfm/2,3612.html":

C#

string rowsXPath = "/html[1]/body[1]/table[1]/tr[4]/td[1]/table[1]/tr[1]/td[2]/table[2]/tr";
HtmlNodeCollection rows = doc.DocumentNode.SelectNodes(rowsXPath);
for (int i = 0; i < rows.Count; i++)
{
    // Only even indexed rows have content, odd indexed rows have only a grey line ("<hr>" element). 
    if (i % 2 != 0)
        continue;

    HtmlNode row = rows[i];
    string product = row.SelectSingleNode("./td[2]/a/b").InnerText;

    // Availability is a bit tricky.
    // You either have a "div" with the "Out of stock" content or you have none.
    // There is no explicit indication that the product is available, so you can use something like this:
    HtmlNode availabilityDiv = row.SelectSingleNode("./td[3]/div");
    string availability = (availabilityDiv != null) ? availabilityDiv.InnerText : "Available";

    string price = row.SelectSingleNode("./td[3]/table[1]/tr[1]/td[2]/span").InnerText;

    // Do something with these values ...
}
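
The "missing node means available" check above relies on SelectSingleNode returning null when nothing matches. The sketch below illustrates the same pattern on a tiny, made-up table, using XmlDocument so that it runs without extra packages (HtmlAgilityPack's SelectSingleNode likewise returns null when nothing matches). As a variation, it skips the separator rows by checking for the product node rather than by row index. The AvailabilityDemo class, the markup, and the product names are assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class AvailabilityDemo
{
    // Hypothetical markup: content rows alternate with separator rows that hold only an <hr/>.
    public const string Sample = @"<table>
  <tr><td>1</td><td><a><b>Powder A</b></a></td><td><div>Out of stock</div></td></tr>
  <tr><td><hr/></td></tr>
  <tr><td>2</td><td><a><b>Powder B</b></a></td><td></td></tr>
  <tr><td><hr/></td></tr>
</table>";

    // Returns one "name: availability" string per content row.
    public static List<string> ReadAvailability(string markup)
    {
        var doc = new XmlDocument();
        doc.LoadXml(markup);

        var result = new List<string>();
        foreach (XmlNode row in doc.SelectNodes("/table/tr"))
        {
            // Separator rows have no product node, so the lookup returns null and they are skipped.
            XmlNode product = row.SelectSingleNode("./td[2]/a/b");
            if (product == null)
            {
                continue;
            }

            // The "div" only exists for out-of-stock products; null means "Available".
            XmlNode availabilityDiv = row.SelectSingleNode("./td[3]/div");
            string availability = (availabilityDiv != null) ? availabilityDiv.InnerText.Trim() : "Available";

            result.Add(product.InnerText.Trim() + ": " + availability);
        }
        return result;
    }

    public static void Main()
    {
        foreach (string line in ReadAvailability(Sample))
        {
            Console.WriteLine(line);
        }
        // prints:
        // Powder A: Out of stock
        // Powder B: Available
    }
}
```

Checking for the product node instead of the row index keeps working if the page ever adds or drops a separator row, at the cost of one extra lookup per row.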

And for the targeted products in "https://www.americanreloading.com/en/31-gunpowder":

C#

foreach (HtmlNode listItem in doc.DocumentNode.SelectNodes("//ul[@id='product_list']/li"))
{
    // The full product name is stored in the anchor's "title" attribute.
    HtmlNode productAnchor = listItem.SelectSingleNode("./div[1]/h3/a");
    string product = productAnchor.Attributes["title"].Value;

    string availability = listItem.SelectSingleNode("./div[1]/div[1]/span[1]").InnerText;
    string price = listItem.SelectSingleNode("./div[2]/span[@class='price']").InnerText;

    // Do something with these values ...
}