I am looking for help extracting the value 'Editor's picks' from the following HTML source:
<P align=justify>Editor's picks<BR><A href="/Article.asp?PUB=250&ISS=22792&SID=52855&TS=1&article=Idiosyncratic risk" name="" target=_blank>Idiosyncratic risk?</A>


So far, I have created the following function, which is currently returning null.
C#
public static string getHTMLTags()
    {

        string url = "";

        string data = storyMethod();

        HtmlDocument html = new HtmlDocument();
        html.LoadHtml(data);

        var nodes = html.DocumentNode.SelectNodes("//p[@align=justify]//strong[1]");

        if (nodes != null)
        {
            foreach (var node in nodes)
            {
                string Description = node.InnerHtml;
                return Description;
            }
        }

        return null;

    }


I would appreciate any pointers on which methods or properties of the Html Agility Pack could help me solve this task.

Expected output:
Editor's picks


Thank you for any further assistance.

1 solution

The problem is that your <P> tag isn't closed, so HAP is treating the <strong> tag as a sibling element, not a child element.

The solution is buried in the discussions on the CodePlex site:

Now, you can tweak the HTML agility pack to better suit what you expect using the HtmlNode.ElementsFlags static property ... What you can do is tell it you don't want to support unclosed <p> tags:
C#
HtmlNode.ElementsFlags.Remove("p"); // remove the Empty and Closed flags
HtmlDocument doc = new HtmlDocument();
doc.Load(...);


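You're also missing quotes around the attribute value in the XPath expression, and you should use a single / to select the <strong> element as a direct child of the <p> (a double // selects descendants at any depth):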
C#
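// Requires: using System.Linq; using HtmlAgilityPack;
HtmlNode.ElementsFlags.Remove("p"); // remove the Empty and Closed flags
HtmlDocument html = new HtmlDocument();
html.LoadHtml(data);

// Quoted attribute value; the single / selects a direct child of <p>
var nodes = html.DocumentNode.SelectNodes("//p[@align='justify']/strong[1]");
return nodes == null ? null : nodes.Select(n => n.InnerHtml).FirstOrDefault();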

// Result:
// Editor's picks<br><a href="/Article.asp?PUB=250&ISS=22792&SID=52855&TS=1&article=Idiosyncratic risk" name="" target="_blank">Idiosyncratic risk?</a>
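
The result above still contains the <br> and the link, whereas the expected output in the question is only the leading text. One possible way to narrow it down is to take the first non-empty text node among the children of the <strong> element. The following is only a sketch under assumptions: the class name EditorsPickExample, the method name GetLeadingText, and the well-formed sample markup are hypothetical and not from the original post, and the real page may parse differently.

```csharp
using System;
using HtmlAgilityPack;

public static class EditorsPickExample
{
    // Returns the first non-empty text node directly inside the first
    // <strong> child of <p align="justify">, or null when nothing matches.
    public static string GetLeadingText(string html)
    {
        // As in the answer: remove the Empty and Closed flags registered for <p>.
        HtmlNode.ElementsFlags.Remove("p");

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode strong = doc.DocumentNode.SelectSingleNode("//p[@align='justify']/strong[1]");
        if (strong == null)
        {
            return null;
        }

        // Walk the direct children and skip element nodes such as <br> and <a>.
        foreach (HtmlNode child in strong.ChildNodes)
        {
            if (child.NodeType == HtmlNodeType.Text)
            {
                string text = HtmlEntity.DeEntitize(child.InnerText).Trim();
                if (text.Length > 0)
                {
                    return text;
                }
            }
        }

        return null;
    }

    public static void Main()
    {
        // Hypothetical, well-formed sample input (assumption, not the real page).
        string sample = "<p align=\"justify\"><strong>Editor's picks<br>"
                      + "<a href=\"#\">Idiosyncratic risk?</a></strong></p>";
        Console.WriteLine(GetLeadingText(sample));
    }
}
```

Filtering on HtmlNodeType.Text keeps the XPath from the answer unchanged and only alters how the matched node is read, so it can be dropped into the original function in place of the InnerHtml call.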
 
 

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


