Hi All,

I am very new to data mining and I need help to accomplish the following: crawl a list of sites with a common theme, e.g. car sales websites, and extract all the common elements that may be present. For example:

Download/crawl --> site A
Download/crawl --> site B
Download/crawl --> site C
Download/crawl --> site ....

Process, index and store all the raw data in a structured way into a database and then produce a comparison of all the common elements. For example:

car: xyz 123 is present on sites A and B but not C
car: abc 123 is present on sites A and C but not B
...etc.
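The kind of cross-site comparison described above can be sketched with plain set logic once each site's listings have been reduced to normalized keys. This is only an illustration: the site names and listing keys below are made-up placeholders, and in practice the keys would come from your parsed and stored data.

```python
# Sketch: given per-site sets of normalized listing keys, report which
# sites carry each car. Site names and keys are hypothetical examples.
from collections import defaultdict

def compare_listings(site_listings):
    """Map each listing key to the set of sites where it appears."""
    presence = defaultdict(set)
    for site, listings in site_listings.items():
        for key in listings:
            presence[key].add(site)
    return dict(presence)

sites = {
    "A": {"xyz 123", "abc 123"},
    "B": {"xyz 123"},
    "C": {"abc 123"},
}
for car, found_on in sorted(compare_listings(sites).items()):
    missing = sorted(set(sites) - found_on)
    print(f"car {car}: present on {sorted(found_on)}, absent from {missing}")
```

The hard part is not the comparison itself but producing a stable key per listing; the matching step below is where most of the effort goes.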

I know there will be a level of AI involved to recognize patterns and perform pattern matching in order to match the listings from the various sites.
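Full AI is often not needed to get started: simple string normalization plus a fuzzy similarity score already matches many near-identical listings across sites. Below is a minimal sketch using the Python standard library's `difflib`; the 0.8 threshold is an arbitrary starting point you would tune against real data, not a recommended value.

```python
# Sketch: matching the same car across sites whose titles differ slightly.
# Uses only stdlib difflib; the threshold is an assumed starting point.
from difflib import SequenceMatcher

def normalize(title):
    """Lowercase and collapse whitespace so trivial differences don't matter."""
    return " ".join(title.lower().split())

def same_listing(a, b, threshold=0.8):
    """Treat two titles as the same listing if their similarity ratio clears the threshold."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold
```

For larger data sets you would avoid comparing every pair directly, e.g. by first blocking on make/model, but the per-pair check can stay this simple.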

Another issue is how to keep the database content fresh and up to date as the listings on each site change, and how to propagate those changes into my back-end database.

For example, if site A were to modify or delete car xyz 123, how would I check for this change? The only way I can think of is re-crawling the whole website, comparing each entry in my database against what I have just re-downloaded, and overwriting changes as they come along. Is there a better way of accomplishing this, e.g. through open-source software?
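Re-crawl-and-diff is a common approach; it can be made cheap by storing a fingerprint (hash) per listing rather than comparing every field. The sketch below assumes each listing is a flat dict of fields and that you keep the previous crawl's fingerprints keyed by listing id; all names here are illustrative.

```python
# Sketch: detect added/changed/removed listings by hashing each listing's
# fields and diffing against fingerprints stored from the previous crawl.
import hashlib

def fingerprint(listing):
    """Stable hash of a listing's fields; listing is assumed to be a flat dict."""
    canonical = "|".join(f"{k}={listing[k]}" for k in sorted(listing))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def diff_crawls(old, new):
    """old/new map listing id -> fingerprint. Returns (added, changed, removed) id sets."""
    added = set(new) - set(old)
    removed = set(old) - set(new)
    changed = {i for i in set(old) & set(new) if old[i] != new[i]}
    return added, changed, removed
```

After each crawl you only touch the database rows in `added`, `changed`, and `removed` instead of rewriting everything. Some sites also expose sitemaps, RSS feeds, or `Last-Modified`/`ETag` headers that let you skip pages that have not changed at all.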

If anyone has done something similar, could you please advise or provide guidance on how you accomplished it? Also, are there any tools that may help speed up searching of records? I have heard Apache Solr is quite good for indexing and searching large data sets.

Any advice and help is much appreciated.



