I have HTML pages from several websites, found via Google search using certain keywords.
My task is:
1. detect groups of pages that were generated from the same template
2. for each detected template, create rules for extracting the data: remove the template markup and preserve only the non-template content, i.e. the values matching the keywords used in the Google search.
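To make step 1 concrete, here is a rough sketch of one possible approach (not from any of the articles I found, just my own assumption): fingerprint each page by the set of tag paths in its DOM, then greedily cluster pages whose fingerprints are similar enough. All names here (`fingerprint`, `group_pages`, the 0.8 threshold) are made up for illustration.

```python
from html.parser import HTMLParser

class TagFingerprinter(HTMLParser):
    """Collects the set of tag paths (e.g. 'html/body/div/p') seen in a page."""
    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        self.paths.add("/".join(self.stack))

    def handle_endtag(self, tag):
        # Pop up to and including the matching open tag, tolerating sloppy HTML.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

def fingerprint(html):
    parser = TagFingerprinter()
    parser.feed(html)
    return frozenset(parser.paths)

def similarity(a, b):
    """Jaccard similarity of two tag-path sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def group_pages(pages, threshold=0.8):
    """Greedy clustering: add each page to the first group whose
    representative fingerprint is structurally similar enough,
    otherwise start a new group. Returns lists of page indices."""
    groups = []  # list of (representative fingerprint, member indices)
    for i, html in enumerate(pages):
        fp = fingerprint(html)
        for rep, members in groups:
            if similarity(rep, fp) >= threshold:
                members.append(i)
                break
        else:
            groups.append((fp, [i]))
    return [members for _, members in groups]
```

For example, two article pages sharing a `div/p` layout would land in one group, while a table-based listing page would start its own group. This ignores attributes and text entirely, so it is only a starting point, not a complete solution.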
I've found several articles, but they contain no source code, only short descriptions of the algorithms, so this task looks very difficult for me to implement from scratch. Are there any ready-made examples, sources, or libraries that I could use as a base for my program?
Which is better to analyze: the DOM tree, or the raw HTML text?
Are suffix arrays suitable for this task?