I'm using HTML Agility Pack to extract image URLs from a given web address.

I'm able to fetch images except for "Paytm.com".

On paytm.com, when I view the page source, it shows 5 "img" tags, whereas I am getting only 3.

Can anyone tell me why I'm getting only three images in the list instead of five, and how I can solve this issue?

What I have tried:

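C#
// Needs: using System.Collections.Generic; using HtmlAgilityPack;
var imgList = new List<string>();   // grows as needed, unlike a fixed string[20]
// HtmlDocument.Load reads a local file or stream; HtmlWeb.Load fetches a URL.
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlWeb().Load("https://paytm.com/");
// SelectNodes returns null when no node matches.
var nodes = doc.DocumentNode.SelectNodes("//img");
if (nodes != null)
{
    foreach (HtmlNode node in nodes)
    {
        // An <img> without a src attribute is skipped instead of throwing.
        string src = node.GetAttributeValue("src", string.Empty);
        if (src.Length > 0)
            imgList.Add(src);
    }
}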
Updated 22-Feb-18 1:44am
Comments
Sergey Alexandrovich Kryukov 29-Apr-16 13:11pm    
Show the fragments of the original HTML code in question. Look at the 5 "img" cases and try to see the difference between them. Chances are, the other IMG elements simply have a different level of nesting.
—SA
[no name] 2-May-16 4:54am    
Currently, I'm using the above code in my project. It works fine except for "paytm.com".

How can I check whether the IMG elements have different levels of nesting?
Sergey Alexandrovich Kryukov 2-May-16 9:27am    
Your logic is failing.
"Your problem is different nesting" does not mean "need to check nesting"...
Did you notice: I already answered your question.

And note again: you are asking this question and still haven't shown the HTML with the elements in question.

—SA
[no name] 3-May-16 2:54am    
I don't get your point!
I'm using the above code. There's no HTML page file. I load the document using HTML Agility Pack if I get the response code "OK".

I was verifying the code by viewing the source of the URL, and that is where I found the problem. I was getting 3 images, whereas I can see 5 images in the page source of "https://www.paytm.com".

I wasn't able to find out why, so I posted the question. Since I'm only using the above code, there is no separate HTML file to show.
Sergey Alexandrovich Kryukov 3-May-16 9:31am    
Look at my solution and please tell me what's unclear.
I have no idea what you are talking about. If you are using HTML Agility Pack, you do have an HTML page.
—SA

Please see my comment to the question.

The question makes little sense because you did not show how the elements you are searching for are written in the original HTML. You can create some local files and experiment with this simple stuff while reading the documentation. How can anyone help if you don't provide the source information?

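Nevertheless, one thing to check is the nesting level of the missing images. In XPath, '//' selects descendants at any depth, while '/' selects immediate children only, so an expression built purely from '/' steps searches a single level. Your expression "//img" already matches "img" elements at every depth, so nesting alone should not hide them; it is worth saving the HTML that the parser actually received and looking at how the missing elements are written there. To get the idea, please see the Wikipedia article on XPath.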
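
To make the difference between the two path separators concrete, here is a minimal sketch that runs both kinds of expression against an inline fragment with HTML Agility Pack's `LoadHtml`. The markup, the file names in it, and the class name `XPathAxesDemo` are invented for illustration; only the XPath behaviour matters.

```csharp
using System;
using HtmlAgilityPack;

public static class XPathAxesDemo
{
    // Counts the nodes an XPath expression selects.
    // SelectNodes returns null (not an empty collection) when nothing matches.
    public static int Count(HtmlDocument doc, string xpath)
    {
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        return nodes == null ? 0 : nodes.Count;
    }

    // Hypothetical markup: one <img> directly under <body>, two nested deeper.
    public static HtmlDocument Sample()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(
            "<html><body>" +
            "<img src='a.png'>" +
            "<div><img src='b.png'><div><span><img src='c.png'></span></div></div>" +
            "</body></html>");
        return doc;
    }

    public static void Main()
    {
        HtmlDocument doc = Sample();
        // "//img": every <img> descendant, whatever its depth.
        Console.WriteLine(Count(doc, "//img"));
        // "/html/body/img": only <img> elements that are direct children of <body>.
        Console.WriteLine(Count(doc, "/html/body/img"));
    }
}
```

If "//img" returns fewer nodes than expected on a real page, the same helper can be pointed at the downloaded HTML saved to a local file to see which elements are actually present.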

Yes, the HTML Agility Pack documentation explains that it supports XPATH and XSLT but claims: "you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry…" Well, I can probably understand what they mean, but it hardly means that you can use it without any understanding at all, having no clue what, say, XPATH does. You still have to learn the very basics of it.

Alternatively, you can collect all elements matching certain criteria by walking the whole document: visit each and every element, filter them, and collect the matches in some collection. Usually, it won't take more time than loading the resource itself (you download it all anyway, and the tool parses it all), but it gives you a result that depends less on the particular document structure.
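
The walk-and-filter approach described above could look roughly like this. It is a sketch, not a definitive implementation: the class and method names (`ImageWalker`, `CollectImageSources`) are made up, and it takes the HTML as a string so it can be tried on local fragments before pointing it at a downloaded page.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class ImageWalker
{
    // Visits every node in the parsed document and keeps the src of each <img> element.
    public static List<string> CollectImageSources(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var sources = new List<string>();
        foreach (HtmlNode node in doc.DocumentNode.Descendants())
        {
            // Descendants() also yields text and comment nodes, so filter on elements first.
            if (node.NodeType != HtmlNodeType.Element)
                continue;
            if (!string.Equals(node.Name, "img", StringComparison.OrdinalIgnoreCase))
                continue;

            // Skip <img> elements that have no src attribute at all.
            string src = node.GetAttributeValue("src", string.Empty);
            if (src.Length > 0)
                sources.Add(src);
        }
        return sources;
    }
}
```

The filter inside the loop is the place to add further criteria, for example also accepting a lazy-loading attribute if the page happens to use one.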

—SA
 
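<pre lang="c#">
// Requires: using System; using System.Collections.Generic; using HtmlAgilityPack;

public static List<string> AllImages(string startURL)
{
    return SpecificLinks(startURL, "//img", "src");
}

// Collects the given attribute from every node matched by the XPath selector,
// resolved against startUrl.
public static List<string> SpecificLinks(string startUrl, string elementSelector, string attributeSelector)
{
    List<string> links = new List<string>();

    // HtmlWeb downloads the page; HtmlDocument.Load on its own expects a local file or stream.
    HtmlWeb hw = new HtmlWeb();
    HtmlDocument doc = hw.Load(startUrl);
    HtmlNodeCollection docNodes;

    try
    {
        // SelectNodes returns null when nothing matches and throws on an invalid expression.
        docNodes = doc.DocumentNode.SelectNodes(elementSelector);
    }
    catch (Exception)
    {
        docNodes = null;
    }

    if (docNodes != null)
    {
        // Reuse the collection selected above instead of running the query a second time.
        foreach (HtmlNode link in docNodes)
        {
            // "#" serves as the marker for "attribute missing or unusable".
            string elementSource = link.GetAttributeValue(attributeSelector, "#");

            if (!elementSource.Equals("#"))
            {
                try
                {
                    Uri uri = new Uri(new Uri(startUrl), elementSource);

                    // Note: a value that is already absolute (and identical after resolution)
                    // is discarded here; remove this check if those should be kept.
                    if (!elementSource.Equals(uri.ToString()))
                        elementSource = uri.ToString();
                    else
                        elementSource = "#";
                }
                catch (Exception)
                {
                    elementSource = "#";
                }
            }

            if (!elementSource.Equals("#"))
                links.Add(elementSource);
        }
    }

    return links;
}
</pre>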
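
The step in the code above that turns each attribute value into a full address relies on the `Uri(Uri, string)` constructor. The following sketch isolates that step; the class name `UriResolutionDemo` and the example.com addresses are placeholders, not taken from the question.

```csharp
using System;

public static class UriResolutionDemo
{
    // Resolves a src value against the address of the page it was found on.
    public static string Resolve(string pageUrl, string src)
    {
        return new Uri(new Uri(pageUrl), src).ToString();
    }

    public static void Main()
    {
        const string page = "https://example.com/shop/index.html"; // placeholder address

        // Relative to the page's folder.
        Console.WriteLine(Resolve(page, "img/logo.png"));
        // Relative to the site root.
        Console.WriteLine(Resolve(page, "/static/banner.jpg"));
        // Already absolute: comes back unchanged.
        Console.WriteLine(Resolve(page, "https://cdn.example.com/a.png"));
    }
}
```

The third case matters for the code above: because it compares the resolved string with the original value, an image whose `src` is already a full URL ends up excluded from the returned list.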
 
 

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


