I have HTML pages from several websites, found via Google search using certain keywords.
My task is:
1. detect groups of pages that were generated from the same template
2. for each detected template, create rules for extracting the data: remove the template markup and preserve only the non-template content, i.e. the values matching the keywords used in the Google search.
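To make step 1 concrete, here is a rough sketch of one possible approach (not from any of the articles I found, just my own assumption): fingerprint each page by the set of tag paths in its DOM, then greedily cluster pages whose fingerprints are similar enough. All names here (`fingerprint`, `group_pages`, the 0.8 threshold) are made up for illustration.

```python
from html.parser import HTMLParser

class TagFingerprinter(HTMLParser):
    """Collects the set of tag paths (e.g. 'html/body/div/p') seen in a page."""
    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        self.paths.add("/".join(self.stack))

    def handle_endtag(self, tag):
        # Pop up to and including the matching open tag, tolerating sloppy HTML.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

def fingerprint(html):
    parser = TagFingerprinter()
    parser.feed(html)
    return frozenset(parser.paths)

def similarity(a, b):
    """Jaccard similarity of two tag-path sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def group_pages(pages, threshold=0.8):
    """Greedy clustering: add each page to the first group whose
    representative fingerprint is structurally similar enough,
    otherwise start a new group. Returns lists of page indices."""
    groups = []  # list of (representative fingerprint, member indices)
    for i, html in enumerate(pages):
        fp = fingerprint(html)
        for rep, members in groups:
            if similarity(rep, fp) >= threshold:
                members.append(i)
                break
        else:
            groups.append((fp, [i]))
    return [members for _, members in groups]
```

For example, two article pages sharing a `div/p` layout would land in one group, while a table-based listing page would start its own group. This ignores attributes and text entirely, so it is only a starting point, not a complete solution.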
I've found several articles, but they contain no source code, only short descriptions of the algorithms, so this task looks very difficult for me to implement from scratch. Are there any ready-made examples, sources, or libraries that I could use as a base for my program?
Which is better to analyze: the DOM tree, or the raw HTML text?
Are suffix arrays suitable for this task?