We need to scrape some sections of a website that contain PDF documents. After each scrape, we also need to store a snapshot of the site's content at that moment. For example, if 3 PDF files are available on the website, I should be able to store that state so that it tells me which files were available last time (with the PDF content stored as text). The next time I scrape the same website, suppose there are 5 PDF files (2 old PDFs, 1 old PDF with updated content, and 2 new PDFs). I then need to compare this state with the previous version of the website and automatically work out which files are "NEW", which are "UPDATED", and which are "DELETED". Once I detect these changes, I need to create a manifest.txt file that records the changes for the last scraping session.

Can you please share your thoughts on designing and implementing such a requirement (snapshot comparison)?

Thanks in advance.

I would do it like this:

1. Scrape the website and store its full state.
2. On the next scrape, run a diff algorithm against the stored state and note which lines were added, changed, or deleted.
3. If those lines are links to PDF documents, you can tell which documents were added, changed, or deleted.
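
The steps above can be sketched roughly as follows. This is only a minimal illustration, not a complete scraper: it assumes you have already downloaded each PDF and extracted its text, and it compares a hash of that text per URL rather than running a line-by-line diff. The function names (`fingerprint`, `build_snapshot`, `compare_snapshots`, `write_manifest`) and the tab-separated manifest layout are made up for this example.

```python
import hashlib
import json


def fingerprint(text: str) -> str:
    """Return a SHA-256 hash of a document's extracted text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def build_snapshot(documents: dict) -> dict:
    """Map each PDF URL to the hash of its extracted text.

    `documents` is assumed to be {url: extracted_text}; how the text
    is obtained (download + PDF-to-text) is outside this sketch.
    """
    return {url: fingerprint(text) for url, text in documents.items()}


def save_snapshot(snapshot: dict, path: str) -> None:
    """Persist a snapshot as JSON so the next run can load it."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(snapshot, fh, indent=2, sort_keys=True)


def load_snapshot(path: str) -> dict:
    """Load a snapshot previously written by save_snapshot."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def compare_snapshots(old: dict, new: dict) -> dict:
    """Classify URLs as NEW, UPDATED or DELETED between two snapshots."""
    old_urls, new_urls = set(old), set(new)
    common = old_urls & new_urls
    return {
        "NEW": sorted(new_urls - old_urls),
        "UPDATED": sorted(url for url in common if old[url] != new[url]),
        "DELETED": sorted(old_urls - new_urls),
    }


def write_manifest(changes: dict, path: str = "manifest.txt") -> None:
    """Write one 'STATUS<TAB>url' line per changed document."""
    with open(path, "w", encoding="utf-8") as fh:
        for status in ("NEW", "UPDATED", "DELETED"):
            for url in changes[status]:
                fh.write(f"{status}\t{url}\n")
```

On each run you would build a snapshot from the freshly scraped documents, load the previous one with `load_snapshot`, pass both to `compare_snapshots`, write the result with `write_manifest`, and then save the new snapshot for next time. Only the hashes are needed to detect changes; if the extracted text itself must be kept, as the question asks, it can be stored alongside them.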
 
Share this answer
 
Comments
Sergey Alexandrovich Kryukov 5-Jun-13 21:06pm    
In a nutshell, that's it; my vote is 5.
It's important to note that the notion of a "current state" does not always have a valid meaning.
For this, and for details on scraping techniques, please see my answer, Solution 2.
—SA
For Web scraping (http://en.wikipedia.org/wiki/Web_scraping), please see my past answers:
get specific data from web page,
How to get the data from another site.

Not every piece of content can be scraped, and not all of it can be scraped easily. Imagine a page whose server side generates randomized content (which is often the case in practice). It could be an interactive application, or even a game. In such cases, the notion of a "current state" simply makes no sense. An example of content that is notoriously difficult to scrape (probably by the site owners' intent) is the YouTube video…
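
One rough way to find out whether a "current state" is meaningful for a given URL is to fetch it several times in a row and check that the content is identical each time. The sketch below assumes a hypothetical zero-argument callable `fetch` that returns the raw bytes of the page or file; the name `is_stable` and the default of three attempts are arbitrary choices for illustration.

```python
import hashlib
from typing import Callable


def is_stable(fetch: Callable[[], bytes], attempts: int = 3) -> bool:
    """Return True if repeated fetches yield byte-identical content.

    `fetch` is a hypothetical callable supplied by the caller, for
    example a function that downloads one URL and returns its body.
    """
    digests = {hashlib.sha256(fetch()).hexdigest() for _ in range(attempts)}
    # A single distinct digest means every fetch returned the same bytes.
    return len(digests) == 1
```

Content that fails this check (randomized pages, embedded timestamps, session tokens) would be reported as "UPDATED" on every run, so it needs to be normalized or excluded before snapshots are compared.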

—SA
 
Share this answer
 

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)



CodeProject, 20 Bay Street, 11th Floor Toronto, Ontario, Canada M5J 2N8 +1 (416) 849-8900