I am trying to scrape data from a website. In the page source, the header rows of the tables are distinguished from the content rows by different class names. Because I want all the table information, I collected all the headers into one array and all the content rows into another. The problem comes when I write the arrays to a file: I can write a header, but the second array holds the content rows of every table, and I cannot tell where the first table's contents end. HtmlAgilityPack returns every node that matches the selector, so I get all the content rows at once. First let me show the markup to make it clear:
HTML
<tr class=tableHeader>
<th width=16%>Caught</th>
<th width=16%><p><a href="/url">Normal Range</a></p></th>
</tr>
<TR class=content><TD><a href="/url">Bluegill</a></TD>
<TD>trap net</TD>
<TD align=CENTER>4.05</TD>
<TD align=CENTER>    7.9 -    37.7</TD>
<TD align=CENTER>0.26</TD>
<TD align=CENTER>    0.1 -     0.2</TD>
</TR>
<TR class=content><TD></TD>
<TD>Gill net</TD>
<TD align=CENTER>1.50</TD>
<TD align=CENTER>N/A</TD>
<TD align=CENTER>0.07</TD>
<TD align=CENTER>N/A</TD>
</TR>
<tr class=tableHeader>
<th>0-5</th>
<th>6-8</th>
<th>9-11</th>
<th>12-14</th>
<th>15-19</th>
<th>20-24</th>
<th>25-29</th>
<th>30+</th>
<th>Total</th>
</tr>
<TR class=content><TD>bluegill</TD>
<TD align=CENTER>19</TD>
<TD align=CENTER>65</TD>
<TD align=CENTER>0</TD>
<TD align=CENTER>0</TD>
<TD align=CENTER>0</TD>
<TD align=CENTER>0</TD>
<TD align=CENTER>0</TD>
<TD align=CENTER>0</TD>
<TD align=CENTER>84</TD>
</TR>

Below is my code, which saves the headers and contents into arrays and tries to write them out exactly as they appear on the website.
C#
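int count = 0;
foreach (var trTag4Pale in trTags4Pale)
{
    // Text of the current header row (class=tableHeader)
    string trText4Pale = trTag4Pale.InnerText;
    paleLake[count] = trText4Pale;
    if (trTags4Small != null)
    {
        // trTags4Small holds the content rows (class=content) of every table,
        // so smallText is refilled with all of them on each pass
        int counter = 0;
        foreach (var trTag4Small in trTags4Small)
        {
            string trText4Small = trTag4Small.InnerText;
            smallText[counter] = trText4Small;
            counter++;
        }
    }
    File.AppendAllText(path, paleLake[count] + Environment.NewLine + smallText[count] + Environment.NewLine);
    count++; // move on to the next header slot
}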

As you can see, when I append the array contents to the file, I get the first header followed by the contents of all the tables. I only want the contents of the first table, and then to repeat the process for the second table, and so on. If I could get just the content rows that sit between two tableHeader tr tags, each table's contents would end up in its own array. I don't know how to do this.
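One possible approach, offered as a minimal sketch and not a tested solution for the actual page: walk the header and content rows together in document order, and start a new group each time a tableHeader row appears. The `TableGrouper` class and `GroupByHeader` method below are hypothetical names introduced for illustration. The rows are passed in as plain (class, text) pairs so the sketch has no external dependencies. With HtmlAgilityPack, the pairs could presumably be produced from `SelectNodes("//tr[@class='tableHeader' or @class='content']")`, reading each node's `GetAttributeValue("class", "")` and `InnerText`, assuming the real page uses those class names consistently.

```csharp
using System;
using System.Collections.Generic;

public static class TableGrouper
{
    // Hypothetical helper: walks the rows in document order and opens a new
    // group whenever a row with class "tableHeader" is seen. Every other row
    // is attached to the most recently opened group.
    public static List<(string Header, List<string> Rows)> GroupByHeader(
        IEnumerable<(string CssClass, string Text)> rows)
    {
        var tables = new List<(string Header, List<string> Rows)>();
        foreach (var (cssClass, text) in rows)
        {
            if (string.Equals(cssClass, "tableHeader", StringComparison.OrdinalIgnoreCase))
            {
                // A header row marks the start of the next table.
                tables.Add((text.Trim(), new List<string>()));
            }
            else if (tables.Count > 0)
            {
                // A content row belongs to the last header seen so far.
                tables[tables.Count - 1].Rows.Add(text.Trim());
            }
            // Content rows that appear before any header are skipped.
        }
        return tables;
    }
}
```

Each entry in the returned list then holds one header and only the content rows that followed it, so writing the file becomes a loop over the groups: append `Header`, then each string in `Rows`, then move to the next group.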

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


