See more: C++ C# VB.NET
Good evening, guys:
 
How can I extract links from a web page?
I notice that a link starts with "http://". Is that right?
And how do I tell where it ends?
 
Please tell me if there is another one.
Thanks.
Posted 25-Jan-13 7:22am
Comments
Logi Guna at 25-Jan-13 13:34pm
   
http://stackoverflow.com/questions/2248411/get-all-links-on-html-page
alcitect at 25-Jan-13 13:44pm
   
Thanks, let me see.

Solution 1

website link extractor
or use an easy tool:
Extract All Links from the Current Page
or find the anchor tags, since each link is stored in the href attribute of an <a> tag:
<a href="http://www.codeproject.com">CodeProject-For those who code</a>
Capture Current Page Content
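
As a rough illustration of the anchor-tag approach, here is a minimal Python sketch using only the standard library's html.parser. The names AnchorCollector and collect_hrefs are made up for this example and are not part of any tool linked above. It assumes you already have the page's HTML as a string.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser passes tag and attribute names in lowercase.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def collect_hrefs(html_text):
    """Return the raw href strings found in html_text, in document order."""
    parser = AnchorCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links


sample = '<a href="http://www.codeproject.com">CodeProject-For those who code</a>'
print(collect_hrefs(sample))  # ['http://www.codeproject.com']
```

Note that the link is whatever sits inside the href attribute, so it ends at the closing quote. It does not have to start with "http://": it may also be "https://", a relative path such as "/about", or another scheme.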
Comments
alcitect at 25-Jan-13 14:06pm
   
Thanks for the answer.
I'm not a professional in web scripting. How can I convert the page to HTML?
I can open the web page using WebClient in VB.NET. Then what?
Sergey Alexandrovich Kryukov at 25-Jan-13 18:08pm
   
Nice links, could be helpful, my 5.
OP is advised to accept both Solution 1 and Solution 2 formally.
—SA
Abhishek Pant at 25-Jan-13 18:12pm
   
thank you Sergey
H.Brydon at 25-Jan-13 21:59pm
   
Nice links +5 from me.
Abhishek Pant at 26-Jan-13 2:32am
   
thanks H.Brydon.

Solution 2

I'm not going to write out the code, because there are plenty of examples online that you can find by googling "web scraping", but the basics are:
1.) Get the HTML using something like libcurl.
2.) Convert it to XML using something like libtidy.
(Step 2 is optional, but depending on the language you use, it's easier to find good XML parsers.)
3.) Parse the results to pull out the links using something like libxml.
(If you don't convert to XML, you'd have to find a good HTML parser or create your own parser.)
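
The steps above can be sketched in Python with the standard library alone. This is only an outline under a few assumptions: urllib stands in for libcurl, the optional tidy step is skipped, and html.parser stands in for libxml. The names LinkParser, fetch_html, extract_links and main are hypothetical, chosen for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Step 3: pull href values out of <a> tags, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Relative links such as "/about" become absolute URLs here.
            self.links.append(urljoin(self.base_url, href))


def fetch_html(url, timeout=10):
    """Step 1: download the page and decode it to text."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract_links(html_text, base_url):
    """Parse html_text and return absolute link URLs in document order."""
    parser = LinkParser(base_url)
    parser.feed(html_text)
    parser.close()
    return parser.links


def main(url):
    """Fetch one page and print every link on it (needs network access)."""
    for link in extract_links(fetch_html(url), url):
        print(link)
```

Calling main("http://www.codeproject.com/") would print the links found on that page. extract_links can also be used on HTML you obtained some other way, as long as you pass the URL the page came from so that relative links resolve correctly.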
 
Generally, if you're doing web scraping, it's a lot easier to use a language like Python, and there are tons of resources on web scraping with Python.
 
Everything I listed above has plenty of documentation showing how to use it properly for what you need.
Comments
Sergey Alexandrovich Kryukov at 25-Jan-13 18:07pm
   
Agree, a 5.
—SA

Solution 3

You can use the HTML Agility Pack (HAP) for that purpose.
 
For simple tasks like getting all links, it is easy to use, as you can use a simple pattern to search for matches...
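
HAP is a .NET library, and the "simple pattern" is, as far as I recall, an XPath query along the lines of //a[@href]. To give an idea of what such a pattern does without depending on HAP, here is a hedged Python analogue using the standard library's xml.etree.ElementTree, which supports a small XPath subset. The function name find_links is made up for this example, and the sketch assumes the input is well-formed XHTML. Real pages often are not, and that is the problem a forgiving parser like HAP exists to solve.

```python
import xml.etree.ElementTree as ET


def find_links(xhtml_text):
    """Return the href of every <a> element that has one (hypothetical helper)."""
    root = ET.fromstring(xhtml_text)
    # ".//a[@href]" means: any <a> descendant that carries an href attribute.
    return [a.get("href") for a in root.findall(".//a[@href]")]


sample = """<html><body>
<a href="http://www.codeproject.com">CodeProject</a>
<a name="top">not a link</a>
<p><a href="/Questions">Questions</a></p>
</body></html>"""

print(find_links(sample))  # ['http://www.codeproject.com', '/Questions']
```

The anchor without an href is skipped by the pattern itself, so no string searching for "http://" is needed.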
Comments
alcitect at 26-Jan-13 13:43pm
   
Thanks, Philippe Mori, for the answer.
Can you tell me how I can do a "simple pattern to search for matches"?
Do you mean using find in a string? OK, that is good, but how can I know the address?
I await your reply.
 
Thanks for the help.
Philippe Mori at 26-Jan-13 13:54pm
   
You start by reading the documentation to get an idea of how it works, and then you try it with different queries to understand it better.
 
By using a debugger, you can inspect the result, and if you put the query in a variable, you can easily change the string so that you don't have to recompile the whole application.
 
The example on the web site should give you a good idea of how to work with links:
Examples.
alcitect at 26-Jan-13 14:43pm
   
Hi, I saw the example and I can understand how it works, but I have many questions about it. Can we chat now, if you have time?
Thanks again.
Philippe Mori at 26-Jan-13 14:39pm
   
I used it many months ago, so I don't remember all the details. The example should be a good start, as it shows how to modify all links in a page. You would replace the line that fixes links with your own code (for example, adding the link to a list).
 
By the way, if you inspect the variable with a debugger, it is much easier to figure out which field contains what, but it should not be that hard to work out from the documentation either.

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Last Updated 25 Jan 2013
Copyright © CodeProject, 1999-2014