How to capture current state of content on a website (snapshot)

Question

We need to scrape some sections of a website which contain PDF documents. Once we scrape, we also need to store a snapshot of the current content of that website. For example, if there are 3 PDF files available on the website, I should be able to store this state of content in a way that tells me which files were available last time (with the PDF content stored as text). The next time I scrape the same website, suppose there are now 5 PDF files (2 old PDFs, 1 old PDF with updated content and 2 new PDFs). I then need to compare this state of content with the previous version of the website and automatically figure out which files are "NEW", which are "UPDATED" and which are "DELETED". Once I detect such changes, I need to create a manifest.txt file which captures the changes for the last web scraping session.

Can you please share your thoughts on designing/implementing such a requirement (snapshot comparison)?

Thanks in advance.

Posted 5-Jun-13 7:43am

Tarandeep Singh Sawhney


2 solutions


This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Ron Beyer · Answer 1 · 2013-06-05T07:59:00

Solution 1

I would do it like this:

1. Scrape the website and store its full state.
2. On the next scrape operation, run a diff algorithm against the stored state and note the changed, deleted, etc. lines.
3. If those lines are links to PDF documents, then you can tell which documents were changed or deleted.
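
A minimal sketch of the comparison step in Python, under some assumptions: each snapshot is a plain dictionary mapping a PDF's URL to the text already extracted from it (the scraping and PDF-to-text steps are not shown), and a file counts as "UPDATED" when the hash of its text changes. Note that this compares per-file hashes instead of running a line diff on the page itself, which is a substitution for illustration. The names fingerprint, build_snapshot, compare_snapshots and format_manifest are hypothetical.

```python
import hashlib
from typing import Dict, List


def fingerprint(text: str) -> str:
    """Return a SHA-256 hex digest of the extracted PDF text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def build_snapshot(documents: Dict[str, str]) -> Dict[str, str]:
    """Map each PDF URL to the fingerprint of its text (assumed input shape)."""
    return {url: fingerprint(text) for url, text in documents.items()}


def compare_snapshots(old: Dict[str, str], new: Dict[str, str]) -> Dict[str, List[str]]:
    """Classify URLs as NEW, UPDATED, DELETED or UNCHANGED between two snapshots."""
    old_urls, new_urls = set(old), set(new)
    common = old_urls & new_urls
    return {
        "NEW": sorted(new_urls - old_urls),
        "UPDATED": sorted(u for u in common if old[u] != new[u]),
        "DELETED": sorted(old_urls - new_urls),
        "UNCHANGED": sorted(u for u in common if old[u] == new[u]),
    }


def format_manifest(changes: Dict[str, List[str]]) -> str:
    """Render the changes as tab-separated 'STATUS<TAB>URL' lines."""
    lines = [
        f"{status}\t{url}"
        for status in ("NEW", "UPDATED", "DELETED")
        for url in changes[status]
    ]
    return "".join(line + "\n" for line in lines)
```

The string returned by format_manifest can be written to manifest.txt after each session, and the new snapshot dictionary can be saved (for example as JSON) so it serves as the "old" state for the next scrape.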

Comments

Sergey Alexandrovich Kryukov 5-Jun-13 21:06pm

In a nutshell, that's it, a 5.
It's important to note that the notion of a "current state" does not always have a valid meaning.
For this, and for details on scraping techniques, please see my answer, Solution 2.
—SA

Sergey Alexandrovich Kryukov · Answer 2 · 2013-06-05T15:04:00

For Web scraping (http://en.wikipedia.org/wiki/Web_scraping), please see my past answers:
"get specific data from web page",
"How to get the data from another site".

Note that not every piece of content can be scraped, and some content can be scraped only with difficulty. Imagine a page whose server-side code generates randomized content (which is often the case in practice). It could be an interactive application, even a game. In such cases, the notion of a "current state" simply makes no sense. An example of content that is notoriously difficult to scrape (which is probably intended by the site owners) is the YouTube video…

—SA