Hi All,

I am very new to data mining and I need help to accomplish the following: crawl a list of sites with a common theme, e.g. car sales websites, and extract all the common elements that may be present. For example:

Download/crawl --> site A
Download/crawl --> site B
Download/crawl --> site C
Download/crawl --> site ....

Process, index and store all the raw data in a structured way into a database and then produce a comparison of all the common elements. For example:

car: xyz 123 is present on sites A and B but not C
car: abc 123 is present on sites A and C but not B
...etc.
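The kind of cross-site comparison described above can be sketched with plain set logic once each site's listings have been reduced to normalized keys. This is only an illustration: the site names and listing keys below are made-up placeholders, and in practice the keys would come from your parsed and stored data.

```python
# Sketch: given per-site sets of normalized listing keys, report which
# sites carry each car. Site names and keys are hypothetical examples.
from collections import defaultdict

def compare_listings(site_listings):
    """Map each listing key to the set of sites where it appears."""
    presence = defaultdict(set)
    for site, listings in site_listings.items():
        for key in listings:
            presence[key].add(site)
    return dict(presence)

sites = {
    "A": {"xyz 123", "abc 123"},
    "B": {"xyz 123"},
    "C": {"abc 123"},
}
for car, found_on in sorted(compare_listings(sites).items()):
    missing = sorted(set(sites) - found_on)
    print(f"car {car}: present on {sorted(found_on)}, absent from {missing}")
```

The hard part is not the comparison itself but producing a stable key per listing; the matching step below is where most of the effort goes.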

I know there will be a level of AI involved to recognize patterns and perform pattern matching in order to match the listings from the various sites.
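Full AI is often not needed to get started: simple string normalization plus a fuzzy similarity score already matches many near-identical listings across sites. Below is a minimal sketch using the Python standard library's `difflib`; the 0.8 threshold is an arbitrary starting point you would tune against real data, not a recommended value.

```python
# Sketch: matching the same car across sites whose titles differ slightly.
# Uses only stdlib difflib; the threshold is an assumed starting point.
from difflib import SequenceMatcher

def normalize(title):
    """Lowercase and collapse whitespace so trivial differences don't matter."""
    return " ".join(title.lower().split())

def same_listing(a, b, threshold=0.8):
    """Treat two titles as the same listing if their similarity ratio clears the threshold."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold
```

For larger data sets you would avoid comparing every pair directly, e.g. by first blocking on make/model, but the per-pair check can stay this simple.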

Another issue is how to keep the database content fresh and up to date as the listings on each site change, and how to propagate those changes into my back-end database.

For example, if site A were to modify or delete car xyz 123, how would I check for this change? The only way I can think of is re-crawling the whole website, comparing each entry in my database against what I have just re-downloaded, and overwriting changes as they come along. Is there a better way of accomplishing this, e.g. through open-source software?
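Re-crawl-and-diff is a common approach; it can be made cheap by storing a fingerprint (hash) per listing rather than comparing every field. The sketch below assumes each listing is a flat dict of fields and that you keep the previous crawl's fingerprints keyed by listing id; all names here are illustrative.

```python
# Sketch: detect added/changed/removed listings by hashing each listing's
# fields and diffing against fingerprints stored from the previous crawl.
import hashlib

def fingerprint(listing):
    """Stable hash of a listing's fields; listing is assumed to be a flat dict."""
    canonical = "|".join(f"{k}={listing[k]}" for k in sorted(listing))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def diff_crawls(old, new):
    """old/new map listing id -> fingerprint. Returns (added, changed, removed) id sets."""
    added = set(new) - set(old)
    removed = set(old) - set(new)
    changed = {i for i in set(old) & set(new) if old[i] != new[i]}
    return added, changed, removed
```

After each crawl you only touch the database rows in `added`, `changed`, and `removed` instead of rewriting everything. Some sites also expose sitemaps, RSS feeds, or `Last-Modified`/`ETag` headers that let you skip pages that have not changed at all.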

If anyone has done something similar, could you please advise or provide guidance on how you accomplished it? Also, are there any tools that may help speed up searching of records? I have heard Apache Solr is quite good for indexing and searching large data sets.

Any advice and help is much appreciated.



