We need to scrape some sections of a website that contain PDF documents. After each scrape, we also need to store a snapshot of the site's current content. For example, if 3 PDF files are available on the website, I should be able to store that state so it tells me which files were available last time (with each PDF's content stored as text). The next time I scrape the same website, suppose there are now 5 PDF files (2 old PDFs that are unchanged, 1 old PDF with updated content, and 2 new PDFs). I then need to compare this state with the previous snapshot and automatically figure out which files are "NEW", which are "UPDATED", and which are "DELETED". Once I detect these changes, I need to create a manifest.txt file that records the changes found in the latest scraping session.
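To make the comparison step concrete, here is a minimal sketch of what I have in mind. It assumes each PDF is identified by a stable key such as its URL and that its text has already been extracted; the names `build_snapshot`, `diff_snapshots`, and `format_manifest`, the SHA-256 fingerprint, and the tab-separated manifest layout are all my own assumptions, not a settled design.

```python
import hashlib
import json
from pathlib import Path


def fingerprint(text: str) -> str:
    """Hash the extracted PDF text so snapshots can be compared cheaply."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def build_snapshot(documents: dict) -> dict:
    """Map each document key (assumed here to be its URL) to a content hash."""
    return {key: fingerprint(text) for key, text in documents.items()}


def diff_snapshots(previous: dict, current: dict) -> dict:
    """Classify keys as NEW, UPDATED, or DELETED between two snapshots."""
    shared = previous.keys() & current.keys()
    return {
        "NEW": sorted(current.keys() - previous.keys()),
        "UPDATED": sorted(k for k in shared if previous[k] != current[k]),
        "DELETED": sorted(previous.keys() - current.keys()),
    }


def format_manifest(changes: dict) -> str:
    """Render one 'STATUS<TAB>key' line per change (hypothetical layout)."""
    lines = [
        f"{status}\t{key}"
        for status in ("NEW", "UPDATED", "DELETED")
        for key in changes[status]
    ]
    return "\n".join(lines) + ("\n" if lines else "")


def save_session(snapshot: dict, changes: dict, folder: str) -> None:
    """Persist the snapshot for the next run and write manifest.txt."""
    out = Path(folder)
    out.mkdir(parents=True, exist_ok=True)
    (out / "snapshot.json").write_text(json.dumps(snapshot, indent=2), encoding="utf-8")
    (out / "manifest.txt").write_text(format_manifest(changes), encoding="utf-8")


# Example mirroring the scenario above: 3 PDFs last time, 5 PDFs now.
previous = build_snapshot({"a.pdf": "alpha", "b.pdf": "beta", "c.pdf": "gamma"})
current = build_snapshot({
    "a.pdf": "alpha",          # unchanged
    "b.pdf": "beta",           # unchanged
    "c.pdf": "gamma revised",  # same file, updated content
    "d.pdf": "delta",          # new
    "e.pdf": "epsilon",        # new
})
changes = diff_snapshots(previous, current)
```

Comparing hashes rather than the full text keeps the stored snapshot small, but it only says that a file changed, not what changed; the extracted text would still have to be stored separately if a textual diff is needed.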
Could you please share your thoughts on how to design and implement this kind of snapshot comparison?
Thanks in advance.