C#: How to scrape a page when there are more than two HTML tags?

Question

I am having a problem with scraping this:

HTML

<td class="main txt"><a href="http://bors-nliv.svd.se/index.php/detail/index/4600">Afarak Group</a></td>

I would like to scrape the name of the stock (in this example, Afarak Group), but I couldn't figure out how despite many attempts and a lot of searching. I have managed to scrape the stock prices with this code:

C#

// Requires: using System.Net; using System.Text.RegularExpressions;
private void button3_Click(object sender, EventArgs e)
{
    List<string> aktier = new List<string>();
    WebClient web = new WebClient();
    String html = web.DownloadString("http://bors-nliv.svd.se/index.php/aktier/index/35244");
    // Capture the text of every attribute-less <td>...</td> cell.
    // The trailing \s* (originally a literal "s*") trims whitespace before </td>.
    MatchCollection m1 = Regex.Matches(html, @"<td>\s*(.+?)\s*</td>", RegexOptions.Singleline);

    foreach (Match m in m1)
    {
        string aktie = m.Groups[1].Value;
        // Skip cells whose text is "3" or "Aktier".
        if (aktie != "3" && aktie != "Aktier")
        {
            aktier.Add(aktie);
        }
    }
    listBox2.DataSource = aktier;
}

Here is the stock price, which is wrapped in only these two HTML tags:

HTML

<td>0,41</td>

But how do I scrape the stock's name from the page when the markup looks like this?

HTML

<td class="main txt"><a href="http://bors-nliv.svd.se/index.php/detail/index/4600">Afarak Group</a></td>

There are a couple more HTML tags.

I've tried changing the match pattern to this:

C#

MatchCollection m1 = Regex.Matches(html, @"<a href"">\s*(.+?)s*</td>", RegexOptions.Singleline);

But it still doesn't work. What am I missing?

Posted 5-Jul-16 14:18pm

Member 12620371

Updated 5-Jul-16 20:25pm

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Karthik_Mahalingam · Answer 1 · 2016-07-05T20:25:00

Solution 1

Try this using Html Agility Pack.
Add a reference to its DLL in your project (pick the build that matches your target framework).

C#

// Requires: using System.Linq; using System.Net; plus a reference to HtmlAgilityPack.
List<string> aktier = new List<string>();
WebClient web = new WebClient();
String html = web.DownloadString("http://bors-nliv.svd.se/index.php/detail/index/4600");
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
// Take the first <div> whose class attribute contains "secondary-nr"; First() throws if there is none.
var div = doc.DocumentNode.Descendants("div")
    .Where(d => d.Attributes.Contains("class") && d.Attributes["class"].Value.Contains("secondary-nr"))
    .First();
// Collect the non-empty text of each direct child node.
for (int i = 0; i < div.ChildNodes.Count; i++)
{
    var node = div.ChildNodes[i];
    string temp = node.InnerText.Trim();
    if (temp.Length > 0)
        aktier.Add(temp);
}
listBox2.DataSource = aktier;

Posted 5-Jul-16 20:25pm

Karthik_Mahalingam

Updated 5-Jul-16 20:26pm

Comments

Member 12620371 6-Jul-16 8:49am

How do I do it without using Html Agility Pack? What's the regex? I can't figure it out.

Karthik_Mahalingam 6-Jul-16 9:50am

Regex stands for regular expression; it is part of the core library and is used to search strings for a given pattern.
Html Agility Pack, on the other hand, is a third-party library used to parse the HTML/DOM.
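
For the regex-only route asked about here, a minimal sketch follows. It assumes the name cell always looks exactly like the snippet in the question (`<td class="main txt">` directly wrapping a single `<a href="...">`); real pages may differ, and the class and method names (`StockNameScraper`, `ExtractNames`) are made up for illustration. Note that the pattern tried in the question, `<a href"">`, is the literal text `<a href">` in a verbatim string, which would not occur in this markup because the real tag has `="..."` between `href` and `>`.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

public static class StockNameScraper
{
    // Matches <td class="main txt"><a href="...">NAME</a></td> and captures NAME.
    // Assumption: attribute order and class value are exactly as in the question's snippet.
    private static readonly Regex NameCell = new Regex(
        @"<td\s+class=""main txt"">\s*<a\s+href=""[^""]*"">\s*(.+?)\s*</a>\s*</td>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    public static List<string> ExtractNames(string html)
    {
        var names = new List<string>();
        foreach (Match m in NameCell.Matches(html))
        {
            // Decode HTML entities, e.g. "&amp;" becomes "&".
            names.Add(WebUtility.HtmlDecode(m.Groups[1].Value));
        }
        return names;
    }

    public static void Main()
    {
        // Sample markup copied from the question; no network access needed.
        string sample = "<td class=\"main txt\"><a href=\"http://bors-nliv.svd.se/index.php/detail/index/4600\">Afarak Group</a></td><td>0,41</td>";
        foreach (string name in ExtractNames(sample))
        {
            Console.WriteLine(name);
        }
    }
}
```

In the click handler, the downloaded `html` string could be passed to `ExtractNames` and the result assigned to `listBox2.DataSource`. A regex like this is brittle: any change in attribute order or extra attributes on the tags would stop it from matching, which is the usual argument for an HTML parser.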

Member 12620371 6-Jul-16 8:54am

I tried your code and it does not work. I'm getting the wrong strings.

Karthik_Mahalingam 6-Jul-16 9:48am

Which string data do you need exactly?
Please provide more information.

Member 12620371 6-Jul-16 11:55am

<td class="main txt"><td class="main txt">Afarak Group</td>

I need the string "Afarak Group". Between the link( and the

Karthik_Mahalingam 6-Jul-16 13:19pm

Try this for this link: http://bors-nliv.svd.se/index.php/detail/index/4600
var h1 = doc.DocumentNode.Descendants("h1").First().InnerText;

Karthik_Mahalingam 6-Jul-16 13:31pm

WebClient web = new WebClient();
String html = web.DownloadString("http://bors-nliv.svd.se/index.php/aktier/index/35244");
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
var table = doc.DocumentNode.Descendants("table").First();
var tbody= table.ChildNodes.Where(k => k.Name == "tbody").First();
var rows = tbody.ChildNodes.Where(k => k.Name == "tr").ToList();
var target = rows[0].ChildNodes[7].InnerText; //Afarak Group
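
The positional lookup `rows[0].ChildNodes[7]` depends on the exact number of child nodes (including whitespace text nodes) in the row, so it may break if the page layout shifts. A possible alternative, sketched under the assumption that every name cell carries exactly `class="main txt"` as in the question's snippet, is to select the links by XPath. The class and method names (`StockNameParser`, `ExtractNames`) are hypothetical; `SelectNodes` and `HtmlEntity.DeEntitize` are Html Agility Pack members.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // third-party package, must be referenced in the project

public static class StockNameParser
{
    public static List<string> ExtractNames(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var names = new List<string>();
        // Assumption: the class attribute is exactly "main txt", as in the question.
        // SelectNodes returns null (not an empty collection) when nothing matches.
        var links = doc.DocumentNode.SelectNodes("//td[@class='main txt']/a");
        if (links == null)
        {
            return names;
        }

        foreach (HtmlNode link in links)
        {
            // Decode entities and trim surrounding whitespace.
            names.Add(HtmlEntity.DeEntitize(link.InnerText).Trim());
        }
        return names;
    }

    public static void Main()
    {
        // Sample row built from the snippets in the question; no network access needed.
        string sample = "<table><tr><td class=\"main txt\"><a href=\"http://bors-nliv.svd.se/index.php/detail/index/4600\">Afarak Group</a></td><td>0,41</td></tr></table>";
        foreach (string name in ExtractNames(sample))
        {
            Console.WriteLine(name);
        }
    }
}
```

Inside a Windows Forms project, `HtmlDocument` would need to be written as `HtmlAgilityPack.HtmlDocument` (as in the code above this sketch) to avoid a clash with `System.Windows.Forms.HtmlDocument`.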