I am pulling data from a website where it is organized in a table. The first two rows look like this (I deleted some style info):

<table id="loads">
   <thead>
   <tr class="tableHeading">
     <th><a original='Load ID'></a></th>
     <th><a original='# of cars'></a></th>
     <th><a original='Year/Make/Model'></a></th>
     <th><a original='Origin City'></a></th>
     <th><a original='Origin State'></a></th>
     <th><a original='Destination City'></a></th>
     <th><a original='Destination State'></a></th>
     <th><a original='Mileage'></a></th>
     <th><a original='Price per Shipment'></a></th>
     <th><a original='Price per Mile'></a></th>
     <th>View</th>
     <th><a original='Comments'></a></th>
   </tr>
   </thead>

   <tbody>
   <tr>
     <td>123456789</td>
     <td>1</td>
     <td>2015 GMC TERRAIN SLE</td>
     <td>Los Angeles</td>
     <td>CA</td>
     <td>San Francisco</td>
     <td>CA</td>
     <td>400</td>
     <td>$400</td>
     <td>$1</td>
     <td>
        <a href="/ViewLoad.asp?nload_id=123456789&npickup_code=">
         <img src="/images/icons/view.gif" >
         </a>
     </td>
     <td>Some Text</td>
   </tr>


There are 12 cells per row. All hold plain strings except the 11th, which holds a link, and that cell is one of the main reasons I am posting this question.

What I have tried:

I created a class that has 13 string properties. The extra one (which I made the first) is a Status property that will be New or Old. Later I am going to do some things with New rows, but that is not my issue right now.

So now I want to grab the InnerText of each cell (except the 11th) and assign the strings to an array. Here are my steps:

string collect = webBrowser1.Document.Body.InnerHtml;
string data = WebUtility.HtmlDecode(collect);
HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
htmlDoc.LoadHtml(data);
HtmlNodeCollection rows = htmlDoc.DocumentNode.SelectNodes("//table[@id='loads']//tbody//tr");


Note - I checked up until this point, and so far all of this works, and the rows collection is collecting all of the rows in the table except the header (I only showed one non-header row above, but there are many).

On the next step I get lost. I am trying to get the cell strings into a string array, and then into a BindingList that is set up at the form level:

BindingSource source = new BindingSource(); // this binds to the DataGridView
BindingList<Load> list = new BindingList<Load>();
BindingList<Load> listDeleted = new BindingList<Load>();
List<Load> sortList = new List<Load>();


Here is my code:

int rowIndex = 0;

foreach (HtmlNode row in rows)
{
    int columnIndex = 0;
    string[] rowData = new string[13];

    foreach (HtmlNode cell in row.ChildNodes)
    {
        if (columnIndex != 0 && columnIndex != 11)
        {
            rowData[columnIndex - 1] = cell.InnerText;
        }

        rowData[11] = cell.FirstChild.Attributes["href"].Value;

        MessageBox.Show(rowData[11]);
        columnIndex++;
    }

    Load newLoad = new Load(rowData);

    if (!list.Contains(newLoad) && !listDeleted.Contains(newLoad))
    {
        list.Add(newLoad);
        updated = true;
    }
    else
    {
        int itemIndex = list.IndexOf(newLoad);
        if (itemIndex > 0)
        {
            if (!list[itemIndex].Comments.Equals(newLoad.Comments))
            {
                list[itemIndex].Comments = newLoad.Comments;
                list[itemIndex].Status = "MODIFIED";
                updated = true;
            }
        }
    }
    rowIndex++;
}

}

I am not sure what I am doing wrong in this last code block, and I would greatly appreciate any help.
Updated 29-Jan-17 15:22pm
Comments
Suvabrata Roy 27-Jan-17 4:44am    
What is your exact problem? I am unable to understand it.
Ken-in-California 27-Jan-17 5:47am    
It works until the point (above) where I create the array "rowData" and try to assign the table cell InnerText strings to that array (the section with the nested foreach loops).
The node collection called "rows" is collecting all of the row elements correctly, but when I try to get the cell nodes from that collection into the array, nothing happens.

Once I get the data into the array I'm still going to need to get it into a BindingList, but I am not there yet. Right now my problem is getting to the array.
Richard Deeming 27-Jan-17 8:39am    
rowData[11] = cell.FirstChild.Attributes["href"].Value;

You're executing that line for every cell, but only one cell has a child <a> element. You need to change your code so that it only tries to extract that value from the correct cell.

You might also need to check whether the <a> element really is the first child, or whether the HAP includes the white-space as a text node.
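
To make the two points above concrete, here is a minimal sketch (not the asker's actual code). It lists what HAP reports as the children of a row, then looks for the anchor anywhere inside each cell instead of relying on FirstChild. The class name AnchorCellSketch, the helper GetHref, and the shortened sample markup are assumptions made for illustration.

```csharp
using System;
using HtmlAgilityPack;

public static class AnchorCellSketch
{
    // Returns the href of the first <a href> found anywhere inside the cell,
    // or null when the cell contains no such anchor.
    public static string GetHref(HtmlNode cell)
    {
        HtmlNode anchor = cell.SelectSingleNode(".//a[@href]");
        return anchor == null ? null : anchor.Attributes["href"].Value;
    }

    public static void Main()
    {
        // Shortened sample row (assumption): one text cell and the "View" cell.
        var doc = new HtmlAgilityPack.HtmlDocument();
        doc.LoadHtml(
            "<table><tr>\n" +
            "  <td>123456789</td>\n" +
            "  <td>\n" +
            "    <a href='/ViewLoad.asp?nload_id=123456789'><img src='/images/icons/view.gif'></a>\n" +
            "  </td>\n" +
            "</tr></table>");

        HtmlNode row = doc.DocumentNode.SelectSingleNode("//tr");

        // List every child node of the row, text nodes included, to see
        // whether the white-space between cells shows up as extra nodes.
        foreach (HtmlNode child in row.ChildNodes)
        {
            Console.WriteLine(child.NodeType + " " + child.Name);
        }

        // Ask each <td> element for its anchor; cells without one give null.
        foreach (HtmlNode cell in row.Elements("td"))
        {
            Console.WriteLine(GetHref(cell) ?? "(no anchor)");
        }
    }
}
```

Because the XPath only matches anchors that carry an href attribute, the helper never dereferences a missing attribute, which is what the original line risks on every cell without a link.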
Ken-in-California 27-Jan-17 13:26pm    
Richard - Thanks.
The code is almost exactly as shown above. I only deleted some styles and changed some text (like a number that I changed to "123456789").
You seem to be validating my use of the FirstChild property of the HtmlNode class. I've never used that before, and was not sure it was the right way to go.
Can you recommend a different way to get that "href" text string?
Richard Deeming 27-Jan-17 13:41pm    
As I said, you'll need to change your code so that line only executes for the cell that contains an anchor. It's currently executing for every cell, which won't work.
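
Separately from the anchor cell, the duplicate check in the question's loop may be worth a look. BindingList<T>.Contains and IndexOf compare items with Equals, so unless Load overrides it, a freshly constructed Load would never match an existing entry. Also, the test itemIndex > 0 skips a match at index 0. The sketch below shows one possible shape under stated assumptions: the Load class here is a hypothetical stand-in with only three properties, equality by LoadId is a guess at what identifies a row, and the listDeleted check is left out for brevity.

```csharp
using System;
using System.ComponentModel;

// Hypothetical stand-in for the question's Load class (only three properties).
public class Load : IEquatable<Load>
{
    public string LoadId { get; set; }
    public string Comments { get; set; }
    public string Status { get; set; }

    // Assumption: two loads describe the same row when their IDs match.
    public bool Equals(Load other)
    {
        return other != null && LoadId == other.LoadId;
    }

    public override bool Equals(object obj)
    {
        return Equals(obj as Load);
    }

    public override int GetHashCode()
    {
        return LoadId == null ? 0 : LoadId.GetHashCode();
    }
}

public static class DuplicateCheckSketch
{
    // Adds the load when it is not in the list yet, or updates the stored
    // comments when they differ. Returns true when the list was changed.
    public static bool AddOrUpdate(BindingList<Load> list, Load incoming)
    {
        int index = list.IndexOf(incoming); // -1 means "not found"; 0 is a valid hit
        if (index < 0)
        {
            incoming.Status = "New";
            list.Add(incoming);
            return true;
        }

        if (!string.Equals(list[index].Comments, incoming.Comments))
        {
            list[index].Comments = incoming.Comments;
            list[index].Status = "MODIFIED";
            return true;
        }

        return false;
    }

    public static void Main()
    {
        var list = new BindingList<Load>();
        AddOrUpdate(list, new Load { LoadId = "123456789", Comments = "Some Text" });
        AddOrUpdate(list, new Load { LoadId = "123456789", Comments = "Changed" });
        Console.WriteLine(list.Count + " " + list[0].Status); // prints "1 MODIFIED"
    }
}
```

Using the static string.Equals also avoids a NullReferenceException when a stored Comments value is null, which the instance call in the original would throw.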

1 solution

It turned out that the website was returning some escape characters that were showing up as additional rows, so I was able to handle that by rewriting my conditionals.
Thanks for taking the time to respond to my question, Richard. It helped.
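
For anyone hitting the same thing, one way to keep stray text nodes out of the loop is to iterate only the td elements of each row instead of row.ChildNodes. This is a rough sketch and not necessarily the exact fix used here. The names RowReaderSketch, ReadRow, and ViewColumn are made up for illustration, the sample href is shortened, and the Status slot from the question is left to the caller.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class RowReaderSketch
{
    // 0-based position of the "View" column in the sample table (the 11th cell).
    private const int ViewColumn = 10;

    // Returns one string per <td>: the href for the View cell (null if it has
    // no anchor) and the trimmed, entity-decoded text for every other cell.
    public static string[] ReadRow(HtmlNode row)
    {
        var values = new List<string>();

        // Elements("td") yields only <td> element children, so text nodes
        // between the cells never reach the loop body.
        foreach (HtmlNode cell in row.Elements("td"))
        {
            if (values.Count == ViewColumn)
            {
                HtmlNode anchor = cell.SelectSingleNode(".//a[@href]");
                values.Add(anchor == null ? null : anchor.Attributes["href"].Value);
            }
            else
            {
                values.Add(HtmlEntity.DeEntitize(cell.InnerText).Trim());
            }
        }

        return values.ToArray();
    }

    public static void Main()
    {
        // Sample row based on the table in the question (href shortened).
        string html =
            "<table id='loads'><tbody>\n<tr>\n" +
            "<td>123456789</td>\n<td>1</td>\n<td>2015 GMC TERRAIN SLE</td>\n" +
            "<td>Los Angeles</td>\n<td>CA</td>\n<td>San Francisco</td>\n<td>CA</td>\n" +
            "<td>400</td>\n<td>$400</td>\n<td>$1</td>\n" +
            "<td>\n<a href='/ViewLoad.asp?nload_id=123456789'>" +
            "<img src='/images/icons/view.gif'></a>\n</td>\n" +
            "<td>Some Text</td>\n</tr>\n</tbody></table>";

        var doc = new HtmlAgilityPack.HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when nothing matches, so guard before looping.
        HtmlNodeCollection rows =
            doc.DocumentNode.SelectNodes("//table[@id='loads']//tbody//tr");
        if (rows == null)
        {
            Console.WriteLine("No rows found.");
            return;
        }

        foreach (HtmlNode row in rows)
        {
            Console.WriteLine(string.Join(" | ", ReadRow(row)));
        }
    }
}
```

Counting cells by how many values have been collected so far, instead of by position in ChildNodes, keeps the column index aligned with the table header even when the markup contains extra white-space.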
 
 

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


