I want to extract job vacancies from websites, but I am having difficulty extracting data from HTML pages. Someone suggested using the sites' RSS feeds instead, but the sites I am interested in do not seem to offer them.
Can you please help me out?
Updated 22-May-15 0:32am
Comments
ZurdoDev 12-May-15 8:40am    
Where are you stuck?
Mohibur Rashid 12-May-15 21:25pm    
It depends on the service. If the website offers such a service, the site will also give you its outline. Try to find that outline.

Hello,

Take a look at the code below:

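Python
# Python 3: urlopen lives in urllib.request
from urllib.request import urlopen

# Download the page; read() returns bytes, so decode them to text
with urlopen("http://en.wikipedia.org/wiki/Tkinter") as sock:
    htmlsrc = sock.read().decode("utf-8")

print(htmlsrc)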


This code fetches the page from the site, but note that it prints the raw HTML source rather than the extracted data.
You can save that source to a file and then process it by reading the file line by line.
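
As a rough sketch of that file-based approach (Python 3, standard library only): the helper names `save_html` and `find_lines`, the file name `page_dump.html`, and the sample markup are all illustrative assumptions, not something from the site in question. In practice the `html` argument would be the `htmlsrc` string fetched above.

```python
import os
import tempfile


def save_html(html, path):
    """Write the HTML source to a text file."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)


def find_lines(path, keyword):
    """Return (line_number, stripped_line) pairs for lines containing keyword."""
    matches = []
    with open(path, encoding="utf-8") as f:
        for number, line in enumerate(f, start=1):
            if keyword in line:
                matches.append((number, line.strip()))
    return matches


# Hypothetical sample standing in for a downloaded page
sample = "<html>\n<body>\n<div class=\"job\">Python developer</div>\n</body>\n</html>\n"
path = os.path.join(tempfile.gettempdir(), "page_dump.html")
save_html(sample, path)
result = find_lines(path, "job")
print(result)
```

Keep in mind that line-by-line matching tends to break when a tag spans several lines, so an HTML parser is usually the more robust choice for real pages.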

Thanks
 
 
Suppose a web page contains the following HTML segment:
HTML
<div name="test_div">This is sample text</div>


You can extract the data with DOMDocument as follows:

PHP
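// $url is assumed to hold the address of the page to fetch
$html = file_get_contents($url);

// Suppress the warnings libxml raises on malformed real-world HTML
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Select the first <div> whose name attribute is "test_div"
$node = $xpath->query('//div[@name="test_div"]')->item(0);

// item(0) returns null when nothing matched
if ($node !== null) {
    echo $node->textContent; // Output should be: This is sample text
}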
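
For anyone working in Python rather than PHP, roughly the same extraction can be sketched with the standard library's `html.parser`. The names `DivTextExtractor` and `extract_div_text` are hypothetical, chosen only for illustration, and this is not a full XPath equivalent: it is a minimal sketch that returns the text of the first `<div>` whose `name` attribute matches.

```python
from html.parser import HTMLParser


class DivTextExtractor(HTMLParser):
    """Collect the text inside the first <div> whose name attribute matches."""

    def __init__(self, target_name):
        super().__init__()
        self.target_name = target_name
        self.depth = 0      # greater than 0 while inside the target <div>
        self.done = False   # True once the target <div> has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done or tag != "div":
            return
        if self.depth > 0:
            self.depth += 1  # nested <div> inside the target
        elif dict(attrs).get("name") == self.target_name:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth > 0:
            self.parts.append(data)


def extract_div_text(html, name):
    """Return the text content of the first matching <div>, or '' if none."""
    parser = DivTextExtractor(name)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


text = extract_div_text('<div name="test_div">This is sample text</div>', "test_div")
print(text)
```

Tracking the nesting depth is what lets the sketch include text from any `<div>` nested inside the target while still stopping at the target's own closing tag.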


Hope it will be helpful.
 
 

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


