Extract links from a web page

Question

Good evening guys:

How can I extract links from a web page?
I notice that they start with "http://". Is that right?
And what marks the end of a link?

Please tell me if there is another way.
Thanks.

Posted 25-Jan-13 7:22am

alcitect

Comments

Logi Guna 25-Jan-13 13:34pm

http://stackoverflow.com/questions/2248411/get-all-links-on-html-page

alcitect 25-Jan-13 13:44pm

thanks let me see

4 solutions

Solution 3

You can use the HTML Agility Pack (HAP) for that purpose.

For simple tasks like getting all links, it is easy to use, as you can search for matches with a simple pattern...

Posted 25-Jan-13 14:22pm

Philippe Mori

Comments

alcitect 26-Jan-13 13:43pm

Thanks, Philippe Mori, for the answer.
Can you tell me how to do the "simple pattern to search for matches"?
Do you mean using find on a string? OK, that is good, but how can I know the address?
I will wait for your reply.

Thanks for the help.

Philippe Mori 26-Jan-13 13:54pm

You start by reading the documentation to get an idea of how it works, and then you try it with different queries to understand it better.

By using a debugger you can inspect the result, and if you put the query in a variable, you can easily change the string so that you don't have to recompile the whole application.

The example on the web site should give you a good idea of how to work with links:
Examples.

alcitect 26-Jan-13 14:43pm

Hi, I saw the example and I can understand how it works, but I have many questions about it. Can we chat now, if you have time?
Thanks again.

Philippe Mori 26-Jan-13 14:39pm

I used it many months ago, so I don't remember all the details. The example should be a good start, as it shows how to modify all links in a page. You would replace the line that fixes links with your own code (for example, adding the link to a list).

By the way, if you inspect the variable with a debugger, it is much easier to figure out which field contains what, but it should not be that hard to work out from the documentation either.

Solution 4

You can do this in 3 ways.

CSS selectors: just use the "a" selector and extract the href attribute. See some examples with this web scraping Chrome extension.

XPath: //body//a

Or use a regex for link extraction: <a href="([^"]+)"
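
As a rough illustration of the regex and XPath options above, here is a minimal Python sketch. The function names and the sample markup are made up for this example. The XPath variant uses the standard library's xml.etree.ElementTree, which only supports a subset of XPath and only accepts well-formed XML, so it uses ".//a" rather than the "//body//a" expression from the answer.

```python
import re
import xml.etree.ElementTree as ET

# The pattern from the answer: it only matches when href comes directly
# after "<a " and the value is wrapped in double quotes.
LINK_PATTERN = re.compile(r'<a href="([^"]+)"')


def extract_links_regex(html):
    """Return every href value matched by LINK_PATTERN, in document order."""
    return LINK_PATTERN.findall(html)


def extract_links_xpath(xhtml):
    """Return the href of every <a> element in a well-formed XML/XHTML string."""
    root = ET.fromstring(xhtml)
    return [a.get("href") for a in root.findall(".//a") if a.get("href")]


# Hypothetical sample markup, kept well-formed so ElementTree can parse it.
sample = (
    '<body><a href="http://www.codeproject.com">CodeProject</a> '
    '<a class="x" href="/about">About</a></body>'
)

regex_links = extract_links_regex(sample)
xpath_links = extract_links_xpath(sample)
print(regex_links)
print(xpath_links)
```

On this sample the regex misses the second link because another attribute comes before href, while the element-based query finds both. That is the usual reason to prefer a parser over a regex for anything beyond quick, one-off extraction.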

Posted 30-Dec-15 2:30am

Vicky Rathee

Comments

Dave Kreskowiak 30-Dec-15 9:56am

This question is 2 years old. I don't think they are looking for an answer any more.


This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

*Abhishek Pant* · Accepted Answer · 2013-01-25T07:42:00

Solution 1

Website link extractor
or use an easy tool:
Extract All Links from the Current Page
or find the anchor tags, since each link is stored in the href attribute of an anchor tag:

<a href="http://www.codeproject.com">CodeProject-For those who code</a>
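
To make the anchor-tag idea concrete, here is a small Python sketch that uses the standard library's html.parser to collect the href of every anchor tag. The class name AnchorCollector is invented for this example; it is one possible approach, not what the linked tools do.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect the href attribute of every <a> start tag that is fed in."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(value)


collector = AnchorCollector()
collector.feed('<a href="http://www.codeproject.com">CodeProject-For those who code</a>')
print(collector.links)
```

The link ends wherever the quoted href value ends, so there is no need to search the text for "http://" and guess where the address stops.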

Capture Current Page Content

Posted 25-Jan-13 7:42am

Abhishek Pant

Updated 25-Jan-13 12:14pm

v5

Comments

alcitect 25-Jan-13 14:06pm

Thanks for the answer.
I'm not proficient in web scripting. How can I get the page as HTML?
I can open the web page using WebClient in VB.NET, but then what?

Sergey Alexandrovich Kryukov 25-Jan-13 18:08pm

Nice links, could be helpful, my 5.
OP is advised to accept both Solution 1 and Solution 2 formally.
—SA

Abhishek Pant 25-Jan-13 18:12pm

thank you Sergey

H.Brydon 25-Jan-13 21:59pm

Nice links +5 from me.

Abhishek Pant 26-Jan-13 2:32am

thanks H.Brydon.

K0t4 · Accepted Answer · 2013-01-25T07:47:00

I'm not going to write out the code, because there are plenty of examples online that you can find by googling "web scraping", but the basics are:
1.) Get the HTML using something like libcurl
2.) Convert it to XML using something like libtidy
(Step 2 is optional, but it's easier to find good XML parsers, depending on the language you use)
3.) Parse the results to pull out the links using something like libxml
(If you don't convert to XML, you'd have to find a good HTML parser or create your own parser)
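
The steps above can be sketched in Python using only the standard library. This is an assumption-laden stand-in, not the libcurl/libtidy/libxml route itself: urllib.request takes the place of libcurl, html.parser parses the HTML directly so the optional XML conversion is skipped, and the names fetch_html, parse_links and LinkParser are made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


def fetch_html(url, timeout=10):
    """Step 1: download the page and decode it to text."""
    with urlopen(url, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class LinkParser(HTMLParser):
    """Step 3: collect href values from <a> tags as absolute URLs."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links such as "/about" against the page URL.
                self.links.append(urljoin(self.base_url, value))


def parse_links(html, base_url):
    """Return every link found in html, resolved against base_url."""
    parser = LinkParser(base_url)
    parser.feed(html)
    return parser.links


# Usage (needs network access, so it is left as a comment):
#   url = "http://www.codeproject.com/"
#   for link in parse_links(fetch_html(url), url):
#       print(link)
```

Keeping the download and the parsing in separate functions means the parsing step can be tried on a saved HTML string without touching the network.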

Generally, if you're doing web scraping, it's a lot easier to use a language like Python, and there are tons of resources on web scraping with Python.

Everything I listed above has plenty of documentation showing how to use it properly to do what you need.