Good evening guys,

How can I extract links from a web page?
I notice they start with "http://". Is that right?
And what marks the end of a link?

Please tell me if there is another one.
Thanks.
Comments
Logi Guna 25-Jan-13 13:34pm    
http://stackoverflow.com/questions/2248411/get-all-links-on-html-page
alcitect 25-Jan-13 13:44pm    
Thanks, let me see.

Website link extractor[^]
Or use a ready-made tool:
Extract All Links from the Current Page[^]
Or look for the anchor tags, since each link is stored in the href attribute of an <a> tag:
<a href="http://www.codeproject.com">CodeProject-For those who code</a>
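
To make the anchor-tag idea concrete, here is a minimal sketch in Python using only the standard library's html.parser module. The names LinkCollector and extract_links are made up for this illustration and are not part of any tool linked above; the sketch assumes the page is already available as a string.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser lower-cases tag and attribute names before calling us.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(value)


def extract_links(html_text):
    """Return the href values of all anchor tags in html_text, in order."""
    collector = LinkCollector()
    collector.feed(html_text)
    collector.close()
    return collector.links


sample = '<a href="http://www.codeproject.com">CodeProject-For those who code</a>'
print(extract_links(sample))
```

A link therefore does not "end" at a fixed character sequence: it is whatever value the href attribute holds, and a parser reads that value regardless of quoting style or attribute order.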

Capture Current Page Content
 
Share this answer
 
v5
Comments
alcitect 25-Jan-13 14:06pm    
Thanks for the answer.
I'm not proficient in web scripting. How can I convert the page to HTML?
I can open the web page using WebClient in VB.NET, then what?
Sergey Alexandrovich Kryukov 25-Jan-13 18:08pm    
Nice links, could be helpful, my 5.
OP is advised to accept both Solution 1 and Solution 2 formally.
—SA
Abhishek Pant 25-Jan-13 18:12pm    
thank you Sergey
H.Brydon 25-Jan-13 21:59pm    
Nice links +5 from me.
Abhishek Pant 26-Jan-13 2:32am    
thanks H.Brydon.
I'm not going to write out the code, because there are plenty of examples online that you can find by googling "web scraping", but the basics are:
1.) Get the HTML using something like libcurl.
2.) Convert it to XML using something like libtidy.
(Step 2 is optional, but it's easier to find good XML parsers, depending on the language you use.)
3.) Parse the results to pull out the links using something like libxml.
(If you don't convert to XML, you'd have to find a good HTML parser or create your own parser.)

Generally, if you're doing web scraping, it's a lot easier to use a language like Python, and there are tons of resources on web scraping with Python.

Everything I listed above has plenty of documentation to show you how to use it properly for what you need.
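
As a rough illustration of those steps in Python, the sketch below uses the standard library in place of libcurl and libxml, and skips the optional XML conversion by parsing the HTML directly. The helper names fetch_html and links_in are hypothetical, and the download step assumes network access, so the demonstration at the end runs on an inline string instead.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class _AnchorParser(HTMLParser):
    """Records the href of each <a> tag (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def fetch_html(url, timeout=10):
    """Step 1: download the page and decode it to text (needs network access)."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def links_in(html_text, base_url):
    """Step 3: parse the HTML and return the links as absolute URLs."""
    parser = _AnchorParser()
    parser.feed(html_text)
    parser.close()
    # Relative hrefs such as "/about" do not start with "http://",
    # so resolve each one against the URL of the page it came from.
    return [urljoin(base_url, href) for href in parser.hrefs]


# Offline demonstration; for a live page use links_in(fetch_html(url), url).
html_text = '<a href="/about">About</a> <a href="http://example.com/x">X</a>'
print(links_in(html_text, "http://www.codeproject.com/"))
```

Resolving against the page URL matters because many links on a page are relative, so searching only for text that begins with "http://" would miss them.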
Comments
Sergey Alexandrovich Kryukov 25-Jan-13 18:07pm    
Agree, a 5.
—SA
You can use the HTML Agility Pack (HAP)[^] for that purpose.

For simple tasks like getting all links, it is easy to use, as a simple pattern is enough to search for matches...
Comments
alcitect 26-Jan-13 13:43pm    
Thanks, Philippe Mori, for the answer.
Can you tell me how to do the "simple pattern to search for matches"?
Do you mean using find in a string? OK, that is good, but how can I know the address?
I await your reply.

Thanks for the help.
Philippe Mori 26-Jan-13 13:54pm    
You start by reading the documentation to get an idea of how it works, and then you try it with different queries to understand it better.

By using a debugger you can inspect the result, and if you put the query in a variable, you can easily change the string without having to recompile the whole application.

The example on the web site should give you a good idea of how to work with links:
Examples.
alcitect 26-Jan-13 14:43pm    
Hi, I saw the example and I can understand how it works, but I have many questions about it. Can we chat now, if you have time?
Thanks again.
Philippe Mori 26-Jan-13 14:39pm    
I used it many months ago, so I don't remember all the details. The example should be a good start, as it shows how to modify all links in a page. You would replace the line that fixes links with your own code (for example, adding the link to a list).

By the way, if you inspect the variable with a debugger, it is much easier to figure out which field contains what, but it should not be that hard to work out from the documentation either.
You can do this in 3 ways.

CSS selectors: Just use the "a" selector and extract the href attribute. See some examples with this Web scraping Chrome extension.

XPath: //body//a

Or use a regex for link extraction: <a href="([^"]+)"
Note that this pattern only matches anchors whose first attribute is a double-quoted href.
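
Two of these approaches can be tried with Python's standard library alone; the following is a sketch under a few assumptions. The name STRICT_LINK holds the pattern from this answer, while LOOSE_LINK is a more tolerant variant added here for comparison and is not from the answer. The XPath part uses xml.etree.ElementTree, which supports only a subset of XPath and only accepts well-formed markup, so it suits pages that have already been tidied into XML. CSS selectors need a third-party library and are left out.

```python
import re
import xml.etree.ElementTree as ET

# The pattern from the answer: matches only when href is the first
# attribute and its value is wrapped in double quotes.
STRICT_LINK = re.compile(r'<a href="([^"]+)"')

# A more tolerant variant (an assumption for illustration): allows other
# attributes before href and either quote style.
LOOSE_LINK = re.compile(r'<a\s[^>]*?href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)

page = (
    "<html><body>"
    '<a href="http://www.codeproject.com">CodeProject</a>'
    '<p><a class="nav" href="/about">About</a></p>'
    "</body></html>"
)

print(STRICT_LINK.findall(page))  # ['http://www.codeproject.com']
print(LOOSE_LINK.findall(page))   # ['http://www.codeproject.com', '/about']

# ElementTree paths are relative to an element, hence the leading ".".
root = ET.fromstring(page)
xpath_links = [a.get("href") for a in root.findall(".//body//a")]
print(xpath_links)                # ['http://www.codeproject.com', '/about']
```

The strict pattern silently misses the second link, which is the usual argument for preferring a parser over a regex once pages get less uniform.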
Comments
Dave Kreskowiak 30-Dec-15 9:56am    
This question is 2 years old. I don't think they are looking for an answer any more.

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)