How to extract image URLs using HTML Agility Pack?

Question

I'm using HTML Agility Pack to extract image URLs from a web address entered by the user.

I'm able to fetch images from every site except "Paytm.com".

On paytm.com, the page source shows 5 "img" tags, whereas I am getting only 3.

Can anyone tell me why I'm getting only three images in the list instead of five, and how I can solve this issue?

What I have tried:

C#

string[] imgList = new string[20];
// HtmlDocument.Load expects a file path or stream; HtmlWeb.Load downloads the page from a URL.
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlWeb().Load("https://paytm.com/");
var i = 0;
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//img"))
{
    imgList[i] = node.Attributes["src"].Value;
    i++;
}

Posted 29-Apr-16 4:11am

Sachin Makwana

Updated 22-Feb-18 1:44am

Comments

Sergey Alexandrovich Kryukov 29-Apr-16 13:11pm

Show the fragments of the original HTML code in the question. Look at the 5 "img" cases and try to see the difference between them. Chances are, the other IMG elements simply have a different level of nesting.
—SA

[no name] 2-May-16 4:54am

Currently, I'm using the above code in my project. It's working fine except for "paytm.com".

How can I check whether the IMG elements have a different level of nesting?

Sergey Alexandrovich Kryukov 2-May-16 9:27am

Your logic is failing.
"Your problem is different nesting" does not mean "need to check nesting"...
Did you notice: I already answered your question.

And note again: you are asking this question, and still don't show HTML with the anchor in question.

—SA

[no name] 3-May-16 2:54am

I'm not getting your point!
I'm using the above code. There's no HTML page file. I load the document using HTML Agility Pack if I get the response code "OK".

I was verifying the code by viewing the source of the URL, and that is where I found the problem. I was getting 3 images, whereas I can see 5 images in the page source of "https://www.paytm.com".

I wasn't able to find out why, so I posted the question. Since I am using the above code, there is no separate HTML file to show.

Sergey Alexandrovich Kryukov 3-May-16 9:31am

Look at my solution and please tell me what's unclear.
I have no idea what you are talking about. If you are using HTML Agility Pack, you do have an HTML page.
—SA

Richard Deeming 29-Apr-16 13:23pm

I'm surprised you're seeing any image URLs; that site uses AngularJS, so unless you execute the script on the page, none of the <img> tags have a src attribute set.

Also, you should use a List<string> instead of an array. If the page has more than 20 images, your current code will crash with an IndexOutOfRangeException.

[no name] 2-May-16 4:56am

Yes, it uses AngularJS. I can extract ng-src values, but the problem is that the page source has five img tags and I'm getting only three at run time.

The array of size 20 is just for testing. Later on I am going to use List<string>.
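
The two points raised in this thread (fall back to `ng-src` when `src` is absent, and collect into a `List<string>` rather than a fixed array) can be sketched roughly as below. The class and method names are hypothetical, and the sketch parses an HTML string rather than downloading a page. Note that `ng-src` values taken from the raw page source may still contain unevaluated AngularJS expressions, because no script is executed.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper for illustration; not part of HTML Agility Pack.
public static class ImgSourceSketch
{
    // Returns the src value (or, failing that, the ng-src value) of every <img> in the markup.
    public static List<string> GetImageSources(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var sources = new List<string>();

        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//img");
        if (nodes == null)
            return sources;

        foreach (HtmlNode node in nodes)
        {
            // GetAttributeValue returns the supplied default instead of throwing
            // when the attribute is missing.
            string src = node.GetAttributeValue("src", "");
            if (src.Length == 0)
                src = node.GetAttributeValue("ng-src", "");

            if (src.Length > 0)
                sources.Add(src);
        }

        return sources;
    }
}
```

Because the list grows as needed, the number of images on the page no longer matters, and an `<img>` without either attribute is skipped instead of causing a NullReferenceException.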

2 solutions

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Sergey Alexandrovich Kryukov · Answer 1 · 2016-04-29T07:15:00

Please see my comment to the question.

The question makes little sense because you did not show how the elements you search for are written in the original HTML. Actually, you can create some local files and just experiment with this simple stuff while reading the documentation. How can anyone help if you don't provide the source information?

Nevertheless, I think the most likely reason for missing some of the elements is that they sit at different levels of nesting. Look at your expression "//img": '//' selects descendants at any depth, and '/' selects only immediate children. If any step of the expression you actually run uses '/', that step searches only one level. To get the idea, please see the Wikipedia article on XPath.
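
To see how nesting interacts with the XPath expression, here is a minimal sketch that counts matches for a given expression in an HTML string. The class name, method name, and sample markup are made up for illustration.

```csharp
using System;
using HtmlAgilityPack;

// Hypothetical helper for illustration; not part of HTML Agility Pack.
public static class XPathDepthSketch
{
    // Counts the nodes matched by an XPath expression.
    // SelectNodes returns null when nothing matches, so that case is mapped to 0.
    public static int Count(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        return nodes == null ? 0 : nodes.Count;
    }
}
```

Given markup such as `<html><body><img src='top.png'><div><span><img src='nested.png'></span></div></body></html>`, the expression `//img` matches both images regardless of depth, while `/html/body/img` matches only the one that is a direct child of `body`.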

Yes, the HTML Agility Pack documentation explains that it supports XPATH and XSLT but claims: "you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry…" Well, I can probably understand what they mean, but it hardly means that you can use it without any understanding at all, having no clue what, say, XPATH does. You still have to learn the very basics of it.

Alternatively, you can collect all elements matching certain criteria by walking the whole document: visit each and every element, filter it, and add the matches to a collection. Usually this takes no more time than loading the resource itself (you download it all anyway, and the tool parses it all), but it gives a result that depends less on the particular document structure.
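
One way to read the "walk every element and filter" approach is the sketch below, which uses `Descendants()` and LINQ instead of an XPath expression. The class and method names are hypothetical, and filtering on the `img` tag name and the `src` attribute is only one example of a matching criterion.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper for illustration; not part of HTML Agility Pack.
public static class DescendantWalkSketch
{
    // Visits every node in the document and keeps the src of each <img> element.
    public static List<string> CollectImageSources(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        return doc.DocumentNode
            .Descendants() // every node below the root, at any depth
            .Where(n => n.NodeType == HtmlNodeType.Element
                        && string.Equals(n.Name, "img", StringComparison.OrdinalIgnoreCase))
            .Select(n => n.GetAttributeValue("src", ""))
            .Where(src => src.Length > 0) // drop images without a src attribute
            .ToList();
    }
}
```

The filter is ordinary C#, so the criterion can be changed (for example, to also accept another attribute) without touching any XPath.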

—SA

Amit Dubey · Answer 2 · 2018-02-22T01:44:00

C#

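// These methods belong inside a class and need:
// using System; using System.Collections.Generic; using HtmlAgilityPack;

// Returns the absolute URL of every <img src="..."> found on the page.
public static List<string> AllImages(string startURL)
{
    return SpecificLinks(startURL, "//img", "src");
}

// Loads startUrl, selects nodes with the XPath in elementSelector, and returns
// the attributeSelector value of each node, resolved against startUrl.
public static List<string> SpecificLinks(string startUrl, string elementSelector, string attributeSelector)
{
    List<string> links = new List<string>();

    HtmlWeb hw = new HtmlWeb();
    HtmlDocument doc = hw.Load(startUrl);
    HtmlNodeCollection docNodes;

    try
    {
        // SelectNodes returns null when nothing matches.
        docNodes = doc.DocumentNode.SelectNodes(elementSelector);
    }
    catch (System.Xml.XPath.XPathException)
    {
        // An invalid selector is treated the same as "no matches".
        docNodes = null;
    }

    if (docNodes == null)
        return links;

    Uri baseUri = new Uri(startUrl);

    foreach (HtmlNode link in docNodes)
    {
        // "#" marks a missing attribute; such nodes are skipped.
        string elementSource = link.GetAttributeValue(attributeSelector, "#");
        if (elementSource.Equals("#"))
            continue;

        // Resolve relative values against the page URL; absolute values are kept as they are.
        Uri uri;
        if (Uri.TryCreate(baseUri, elementSource, out uri))
            links.Add(uri.ToString());
    }

    return links;
}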
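
The URL handling in this answer relies on resolving each attribute value against the page address. A minimal sketch of just that step, using only the base class library, is shown below; the class and method names are hypothetical, and the example.com addresses are placeholders.

```csharp
using System;

// Hypothetical helper for illustration.
public static class UrlResolveSketch
{
    // Resolves src against pageUrl; returns null when either value cannot be parsed.
    public static string ToAbsolute(string pageUrl, string src)
    {
        Uri baseUri;
        if (!Uri.TryCreate(pageUrl, UriKind.Absolute, out baseUri))
            return null;

        Uri resolved;
        return Uri.TryCreate(baseUri, src, out resolved) ? resolved.ToString() : null;
    }
}
```

For a page at `https://example.com/shop/index.html`, the relative value `img/logo.png` resolves to `https://example.com/shop/img/logo.png`, and an already absolute value such as `https://cdn.example.com/a.png` is returned unchanged.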
