I have HTML pages from several websites, found through Google using a set of keywords.
My task is to:
1. Detect groups of pages that were generated from the same template.
2. Create rules for extracting the keyword data from each detected template: remove the template and keep only the non-template content that matches the values used in the Google search.
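
To make step 1 concrete, here is a minimal sketch of one possible approach. It is an assumption for illustration, not something taken from the articles: each page is reduced to the set of its root-to-element tag paths, and pages are grouped greedily when the Jaccard similarity of those sets reaches a threshold. The names `tag_paths`, `jaccard`, `group_by_template` and the 0.8 threshold are hypothetical. Only the Python standard library is used.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagPathCollector(HTMLParser):
    """Collects the set of root-to-element tag paths, e.g. 'html/body/div/p'."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Tolerate sloppy HTML: pop back to the nearest matching open tag, if any.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass


def tag_paths(html):
    """Structural signature of a page: the frozen set of its tag paths."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return frozenset(collector.paths)


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def group_by_template(pages, threshold=0.8):
    """Greedy grouping: a page joins the first group whose first page is similar enough.

    Returns a list of groups, each a list of indices into `pages`.
    The threshold is an arbitrary starting value and would need tuning.
    """
    groups = []  # list of (representative_paths, member_indices)
    for index, html in enumerate(pages):
        paths = tag_paths(html)
        for representative, members in groups:
            if jaccard(representative, paths) >= threshold:
                members.append(index)
                break
        else:
            groups.append((paths, [index]))
    return [members for _, members in groups]
```

Calling `group_by_template(list_of_html_strings)` returns lists of page indices, one list per suspected template. Comparing only against the first page of each group keeps the sketch short; a real clustering step would probably compare against all members or use a proper clustering algorithm.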

I've found several articles, but they contain only short descriptions of the algorithms and no source code, so this task looks very difficult for me to implement. Are there any ready-made examples, source code, or libraries that I could use as a base for my program?

Which is the better approach: analyzing the DOM tree or the raw HTML text?
Are suffix arrays suitable for this task?
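
For comparison, a text-level baseline for step 2 could look like the sketch below. It assumes the pages passed in already belong to one template group, and it treats any text node that occurs on every page of the group as template, keeping the rest as content. This is an illustrative assumption, not a known best method; `text_nodes`, `strip_template` and `min_share` are hypothetical names. It does not use suffix arrays or the DOM structure, so it says nothing about which of those is preferable.

```python
from collections import Counter
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects stripped, non-empty text nodes, skipping script and style bodies."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skip_depth:
            self.texts.append(text)


def text_nodes(html):
    """Visible text nodes of a page, in document order."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    return collector.texts


def strip_template(group_pages, min_share=1.0):
    """Drop text nodes shared by at least `min_share` of the pages in the group.

    Returns one list of remaining (non-template) text nodes per page.
    Needs at least two pages: with a single page every node counts as shared.
    """
    per_page = [text_nodes(page) for page in group_pages]
    counts = Counter()
    for texts in per_page:
        counts.update(set(texts))  # count each text once per page
    cutoff = min_share * len(group_pages)
    return [[t for t in texts if counts[t] < cutoff] for texts in per_page]
```

The remaining text nodes could then be matched against the search keywords. Lowering `min_share` below 1.0 would also discard text that appears on most, but not all, pages of the group.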

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


