Hi Everyone,

I am posting this as a separate question, since it is more likely to get an answer that way. I just started working with HAP (HtmlAgilityPack) and I am having difficulty figuring out how to get some of my values.

I am using this file as an example and storing the returned values in a ListView. The problem is that I don't know how to get each value.
XML
<?xml version="1.0" encoding="UTF-8"?>

<bookstore>

<book>
   <title lang="en">Harry Potter</title>
   <price>29.99</price>
   <available>In Stock</available>
</book>

<book>
   <title lang="en">Learning XML</title>
   <price>39.95</price>
   <available>In Stock</available>
</book>

<book>
   <title lang="en">Learning C#</title>
   <price>59.95</price>
   <available>Backorder</available>
</book>

<book>
   <title lang="en">Learning Java</title>
   <price>39.95</price>
   <available>In Stock</available>
</book>

</bookstore>

Can someone show me an example of how to traverse the tree and get each value for each of the books, one at a time?

This is all I know how to do right now.
C#
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.Load("sample.txt");

foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//title"))
{
    ListViewItem lView = new ListViewItem();
    lView.Text = node.InnerText;
    listView1.Items.Add(lView);
}


Appreciate any help.
Updated 13-May-15 23:07pm

Why are you using HtmlAgilityPack? Can't you just use an XML reader?
.NET has a few of those available out of the box.

Here is how you can read that content with the XmlDocument class (it is quite similar to HtmlAgilityPack's HtmlDocument):
C#
XmlDocument doc = new XmlDocument();
doc.Load("sample.txt");

foreach (XmlNode node in doc.SelectNodes("//book"))
{
    string title = node.SelectSingleNode("title").InnerText;
    string price = node.SelectSingleNode("price").InnerText;
    string available = node.SelectSingleNode("available").InnerText;

    // Do something with these values ...
}
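
As another of the built-in options, here is a minimal sketch using LINQ to XML (XDocument from System.Xml.Linq). The Book class and BookReader.ReadBooks are hypothetical names introduced only for illustration; the element and attribute names come from the sample file in the question. It assumes every book has a price element, otherwise the decimal conversion throws.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical container type, introduced only for this illustration.
public sealed class Book
{
    public string Title { get; set; }
    public string Language { get; set; }
    public decimal Price { get; set; }
    public string Available { get; set; }
}

public static class BookReader
{
    // Parses the bookstore XML text and returns one Book per <book> element.
    public static List<Book> ReadBooks(string xml)
    {
        XDocument doc = XDocument.Parse(xml);

        return doc.Descendants("book")
            .Select(b => new Book
            {
                // The (string) conversions yield null when the element
                // or attribute is missing instead of throwing.
                Title = (string)b.Element("title"),
                Language = (string)b.Element("title")?.Attribute("lang"),
                // The (decimal) conversion is culture-invariant and throws
                // if the <price> element is missing.
                Price = (decimal)b.Element("price"),
                Available = (string)b.Element("available")
            })
            .ToList();
    }
}
```

For the sample file, something like BookReader.ReadBooks(System.IO.File.ReadAllText("sample.txt")) gives one Book per entry, and each Title can then be put into a ListViewItem as in the question.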
 
Comments
theadmin 14-May-15 18:10pm    
Well, here is the story. I was manually parsing web pages for my values, and someone asked me why I was parsing by hand instead of just using HtmlAgilityPack. I decided to give it a shot and I am trying to learn how to use it; now you present an even easier way. I am 100% more confused than I was before, but I do like your solution and I am definitely going to use it.
theadmin 14-May-15 18:10pm    
I just found out that this isn't going to work with the HTML files. I was only using that file as a sample to test with.
Mario Z 15-May-15 3:54am    
Actually, you could use XmlDocument to parse HTML content as well, as long as the markup is well-formed XML, but to be honest I would personally also go with HtmlAgilityPack in that case, because it is a library focused on parsing HTML content and comes with quite a few nice HTML-specific features.

Now, regarding the sample file you provided, I would strongly recommend that you do not do that in the future. An incorrect sample can easily misrepresent your requirements: HTML and XML are different kinds of content, and different tools are generally used for processing them.

As a side note on the misleading sample: it is like asking for help learning to use a skateboard when what you actually want to learn is how to use a snowboard.
I hope you understand what I'm saying...
theadmin 15-May-15 7:59am    
I understand what you are saying. I thought that once I understood how it works with a much simpler file, I would know what to do with a more complex one. That really backfired and I am still lost. Can you show me how to get the values from this site, if you don't mind? http://www.butchsreloading.com/shop/35-powder?id_category=35&n=50 Start with this product (Hodgdon Powder, H414, 1lbs) and get the product, price, and availability. Using Chrome I get the following XPaths. I will be getting all the products available on the page.

//*[@id="product_list"]/li[1]/div[2]/h3/a
//*[@id="product_list"]/li[1]/div[3]/div/span[1]
//*[@id="product_list"]/li[1]/div[3]/div/span[2]
Mario Z 18-May-15 7:25am    
I apologize for the somewhat late response; I did not have much free time.
Nevertheless, please take a look at "Solution 4".
This is an answer to the OP's requirements mentioned in the comments of "Solution 1".
C#
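// URL taken from the OP's comment on "Solution 1".
string url = "http://www.butchsreloading.com/shop/35-powder?id_category=35&n=50";

HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(url);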

foreach (HtmlNode listItem in doc.DocumentNode.SelectNodes("//ul[@id='product_list']/li"))
{
    string product = listItem.SelectSingleNode("./div[2]/h3/a").InnerText;
    string price = listItem.SelectSingleNode("./div[3]/div/span[1]").InnerText;
    string availability = listItem.SelectSingleNode("./div[3]/div/span[2]").InnerText;

    // Do something with these values ...
}
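
One thing to watch for with code like this: SelectSingleNode returns null when the XPath matches nothing, so calling InnerText on the result directly throws a NullReferenceException. Below is a minimal sketch of a guard. It is written against XmlDocument so that it runs without extra packages, and NodeText.Get is a hypothetical helper name; the same null check should apply to HtmlAgilityPack's HtmlNode, where, as far as I know, SelectNodes also returns null rather than an empty collection when nothing matches.

```csharp
using System.Xml;

// Hypothetical helper, introduced only for this illustration.
public static class NodeText
{
    // Returns the trimmed inner text of the first node matching the relative
    // XPath, or the fallback value when nothing matches.
    public static string Get(XmlNode context, string xpath, string fallback = "")
    {
        XmlNode match = context.SelectSingleNode(xpath);
        return match != null ? match.InnerText.Trim() : fallback;
    }
}
```

Inside the loop, a call such as NodeText.Get(listItem, "./div[3]/div/span[2]", "unknown") then yields "unknown" instead of stopping the program when a list item lacks that element.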


EDIT: This is an answer to the OP's requirements mentioned in the comments of this solution.

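There is a slight difference between the XPath that you get from Chrome and what HtmlAgilityPack requires. For example, Chrome's paths contain "tbody" steps that the browser adds to its DOM and that may not exist in the page source that HtmlAgilityPack parses.
So may I suggest that you use the following tool instead to get the required XPaths:
https://hapxpathfinder.codeplex.com/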
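
If you want to keep copying paths from Chrome, one difference visible in this thread is the "tbody" step: browsers add tbody to tables in their DOM even when the source markup omits it, so the copied path contains steps that are not in the raw HTML. Here is a rough sketch that drops them. ChromeXPath.StripTbody is a hypothetical helper, and it assumes the page source really has no tbody elements; if the source does contain them, stripping will break the path.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, introduced only for this illustration.
public static class ChromeXPath
{
    // Removes "tbody" steps (with an optional positional predicate such
    // as [1]) from an XPath copied out of the browser's developer tools.
    public static string StripTbody(string xpath)
    {
        return Regex.Replace(xpath, @"/tbody(\[\d+\])?(?=/|$)", "");
    }
}
```

Applied to the table path from the OP's comment, /html/body/table/tbody/tr[4]/td/table/tbody/tr/td[2]/table, this gives /html/body/table/tr[4]/td/table/tr/td[2]/table. Other differences may remain, so the result still needs checking against the actual page.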

Now for the targeted products in "http://store.thirdgenerationshootingsupply.com/browse.cfm/2,3612.html":
C#
string rowsXPath = "/html[1]/body[1]/table[1]/tr[4]/td[1]/table[1]/tr[1]/td[2]/table[2]/tr";
HtmlNodeCollection rows = doc.DocumentNode.SelectNodes(rowsXPath);
for (int i = 0; i < rows.Count; i++)
{
    // Only even indexed rows have content, odd indexed rows have only a grey line ("<hr>" element). 
    if (i % 2 != 0)
        continue;

    HtmlNode row = rows[i];
    string product = row.SelectSingleNode("./td[2]/a/b").InnerText;

    // Availability is a bit tricky.
    // There is either a "div" with the "Out of stock" content or no "div" at all.
    // Nothing explicitly marks a product as available, so you can use something like this:
    HtmlNode availabilityDiv = row.SelectSingleNode("./td[3]/div");
    string availability = (availabilityDiv != null) ? availabilityDiv.InnerText : "Available";

    string price = row.SelectSingleNode("./td[3]/table[1]/tr[1]/td[2]/span").InnerText;

    // Do something with these values ...
}


And for the targeted products in "https://www.americanreloading.com/en/31-gunpowder":
C#
foreach (HtmlNode listItem in doc.DocumentNode.SelectNodes("//ul[@id='product_list']/li"))
{
    HtmlNode productAnchor = listItem.SelectSingleNode("./div[1]/h3/a");
    string product = productAnchor.Attributes["title"].Value;

    string availability = listItem.SelectSingleNode("./div[1]/div[1]/span[1]").InnerText;
    string price = listItem.SelectSingleNode("./div[2]/span[@class='price']").InnerText;

    // Do something with these values ...
}
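
On the question of how to get the href or title value: attributes are read from the node itself, not through InnerText. Below is a minimal sketch of a null-safe lookup, written against XmlDocument so that it runs without extra packages. AttributeReader.Get is a hypothetical helper name. With HtmlAgilityPack the Attributes["title"] pattern above works the same way, and HtmlNode also has GetAttributeValue(name, defaultValue), which I believe covers the missing-attribute case for you.

```csharp
using System.Xml;

// Hypothetical helper, introduced only for this illustration.
public static class AttributeReader
{
    // Returns the value of the named attribute, or the fallback when the
    // node is null, has no attributes, or lacks that attribute.
    public static string Get(XmlNode node, string name, string fallback = "")
    {
        XmlAttribute attribute = node?.Attributes?[name];
        return attribute != null ? attribute.Value : fallback;
    }
}
```

For an anchor node selected with "./div[1]/h3/a", AttributeReader.Get(anchor, "href") and AttributeReader.Get(anchor, "title") return the link target and the descriptive title, or an empty string if either is absent.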
 
Comments
theadmin 25-May-15 13:31pm    
Thanks, Mario, for working with me on this. I am going to continue the conversation here, since others will benefit from it too.
When you posted the code I thought I had it all figured out; little did I realize that it was going to be a LOT more difficult than that.

On this page: http://store.thirdgenerationshootingsupply.com/browse.cfm/2,3612.html I use Chrome to get the XPath, to make dealing with XPath a lot easier for me. Once I had figured out how you got the code working on the first link I posted, I did pretty much the same thing: I got the path for the entire grid of products, then went back one position in order to get the other values as I go down the tree.

// path to the entire table
/html/body/table/tbody/tr[4]/td/table/tbody/tr/td[2]/table

// this is what I used
/html/body/table/tr[4]/td/table/tr/td[2]/table/tr

// then to get the values under my first selection while traversing the tree I would use.
string product = listItem.SelectSingleNode("./td[1]/a").InnerText; (debugger stops the program)

I have no idea how to get the values that I need under these nodes, since everything I try crashes. Basically, I am just trying to get the following info on every page I am scraping.

// productlink (./td[1]/a and //td[2]/a and ./td[2]/b crashes the debugger)

RED DOT 1 LB

<br>

// item # (difficult to get because of location)
Item #:
 ALLREDDOT1LB
<br>

// qty in stock
Qty In Stock:
 0
<br>

// price
<table class="itemPriceTable" cellpadding=0 cellspacing=0 border=0><tr class="itemSellPriceRow"><td class="bodyTextSmallBold"><span class="itemSellPriceLabel">Your Price: </span></td><td class="bodyTextSmall"><span class="itemSellPrice">$18.59</span></td></tr></table>

Hopefully the picture explains this better than I can.

-----------------------------------------------------------------------------------------------------------------------------------

Here is another scenario. For this other page I used the same code that you posted the first time.

https://www.americanreloading.com/en/31-gunpowder

//*[@id="product_list"]/li

When I tried to duplicate the code, nothing worked. Some values are more descriptive than others and it is better to get those instead; unfortunately, I couldn't get past go.

When you go back up the tree like this:

./a (how do you get the href value or the title value?)

I searched Google for many hours and read many XPath tutorials, and I am still not getting anywhere efficiently.

I was going to post two pictures with this post, but it looks like it's not possible.


Thanks again for the help.
Mario Z 26-May-15 5:02am    
Hi, I edited the previous solution to include these two new scenarios.
Please check it out, I hope it helps.
theadmin 26-May-15 10:27am    
Thanks for pointing me to that tool, Mario. I have been looking for something to simplify finding the correct path.

The first solution definitely would have been impossible, since it's not the standard way of getting the results. Hopefully I can get a bit further with the tool and not have to bother you too much.

Thanks again for the help.

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


