How to extract data from HTML pages?

Question

I want to extract job vacancies from websites, but I am having difficulty extracting data from HTML pages. Someone suggested using the sites' RSS feeds instead, but the sites I am interested in do not seem to offer any.
Can you please help me out?

Posted 11-May-15 22:33pm by Member 11683619

Updated 22-May-15 0:32am by Rahul VB (v2)

Comments

ZurdoDev 12-May-15 8:40am

Where are you stuck?

Mohibur Rashid 12-May-15 21:25pm

It depends on the service. If the website offers such a service, it will also give you the outline of it. Try to find that outline.

2 solutions

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Rahul VB · Answer 1 · 2015-05-22T00:31:00

Hello,

Take a look at the code below:

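from urllib.request import urlopen

# Python 3 version: urlopen lives in urllib.request.
with urlopen("http://en.wikipedia.org/wiki/Tkinter") as sock:
    htmlsrc = sock.read().decode("utf-8")  # read() returns bytes; decode to text
print(htmlsrc)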

This code fetches the page from the site, but it prints the raw HTML source rather than the extracted data.
You can dump the source into a file and then work on it by reading the file line by line.
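
As a rough illustration of the "dump it into a file, then read it line by line" idea, here is a minimal Python 3 sketch. The names `SAMPLE_HTML`, `dump_html` and `find_title` are made up for this example, and the sample string only stands in for the downloaded `htmlsrc`. It assumes the tag you are looking for sits on a single line, which is often not true of real pages.

```python
import re
import tempfile
from pathlib import Path

# Hypothetical sample standing in for the downloaded htmlsrc.
SAMPLE_HTML = """<html>
<head><title>Tkinter - Wikipedia</title></head>
<body><p>Some text</p></body>
</html>
"""

# Matches a <title> element that opens and closes on the same line.
TITLE_RE = re.compile(r"<title>(.*?)</title>", re.IGNORECASE)


def dump_html(htmlsrc, path):
    """Write the HTML source to a file so it can be processed later."""
    Path(path).write_text(htmlsrc, encoding="utf-8")


def find_title(path):
    """Read the file line by line and return the first <title> text, or None."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            match = TITLE_RE.search(line)
            if match:
                return match.group(1).strip()
    return None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "page.html"
        dump_html(SAMPLE_HTML, target)
        print(find_title(target))
```

Line-by-line matching is fine for quick checks, but it breaks as soon as an element spans several lines. For anything beyond that, an HTML parser is usually the safer choice.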

Thanks

BD Star · Answer 2 · 2015-05-31T20:37:00

Suppose a web page contains the following HTML segment:

HTML

<div name="test_div">This is sample text</div>

You can extract its text with DOMDocument and XPath as follows:

PHP

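// $url is assumed to hold the address of the page to read.
$html = file_get_contents($url);

// Real-world HTML is rarely valid; collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// query() returns a DOMNodeList; item(0) is null when nothing matches.
$node = $xpath->query('//div[@name="test_div"]')->item(0);

if ($node !== null) {
    echo $node->textContent; // Output should be: This is sample text
}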
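
For readers working in Python instead, the following is a rough counterpart of the lookup above, using only the standard library's `html.parser`. It is a sketch, not an XPath engine: `DivTextExtractor` and `extract_div_text` are hypothetical names introduced here, and the code only handles the one case shown (the text of the first `div` whose `name` attribute matches).

```python
from html.parser import HTMLParser


class DivTextExtractor(HTMLParser):
    """Collect the text inside the first <div> whose name attribute matches."""

    def __init__(self, target_name):
        super().__init__()
        self.target_name = target_name
        self.depth = 0      # greater than 0 while inside the matching div
        self.done = False   # True once the matching div has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth:
            # Track nested divs so the right closing tag ends the capture.
            if tag == "div":
                self.depth += 1
        elif tag == "div" and dict(attrs).get("name") == self.target_name:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == "div":
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_div_text(html, name):
    """Return the text content of the first div with the given name attribute."""
    parser = DivTextExtractor(name)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


if __name__ == "__main__":
    segment = '<div name="test_div">This is sample text</div>'
    print(extract_div_text(segment, "test_div"))
```

Like `textContent` in the PHP version, this includes the text of nested elements. If no `div` matches, it returns an empty string.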

Hope it will be helpful.
