
HTML Tag Stripper

8 Nov 2007 | CPOL | 2 min read
A fast way to strip the HTML tags from an HTML fragment and leave only the visible text

Introduction

This article explains a simple method of stripping HTML tags, similar to the PHP strip_tags() function. This is typically useful in a CMS where you need to store a text-only version of, for example, an article so that a full-text search can run across all articles.

Background

Stripping the tags, in this problem's context, means keeping only the visible text of an HTML document or HTML fragment. This means excluding all HTML comments and all HTML <script>, <style> and <noscript> blocks.

I should also mention that the text resulting from this stripping can be processed further by replacing named HTML entities such as &quot;, &amp;, &copy;, &nbsp;, etc. and numbered HTML entities such as &#355; with their corresponding characters. Just set the method's respective parameters, i.e. replaceNamedEntities and replaceNumberedEntities, to true. Bear in mind, however, that these replacements can slow execution down significantly.
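
To make the entity replacement step concrete, here is a minimal sketch, not the article's own code. The class and method names (EntitySketch, ReplaceNumberedEntities, DecodeAllEntities) are hypothetical. The first method handles only decimal numbered entities by hand; the second delegates to System.Net.WebUtility.HtmlDecode, which assumes .NET Framework 4.0 or later and therefore would not have been available on the .NET 2.0 runtime the article targets.

```csharp
using System;
using System.Net;
using System.Text;

public static class EntitySketch
{
    // Hypothetical helper: replaces decimal numbered entities such as &#355;
    // with the character they denote. Anything malformed is copied unchanged.
    public static string ReplaceNumberedEntities(string text)
    {
        if (string.IsNullOrEmpty(text)) return text;

        StringBuilder sb = new StringBuilder(text.Length);
        int i = 0;
        while (i < text.Length)
        {
            if (text[i] == '&' && i + 2 < text.Length && text[i + 1] == '#')
            {
                int j = i + 2;
                int code = 0;
                // Read at most 7 digits so the accumulator cannot overflow.
                while (j < text.Length && j - i < 9 && text[j] >= '0' && text[j] <= '9')
                {
                    code = code * 10 + (text[j] - '0');
                    j++;
                }

                bool hasDigits = j > i + 2;
                bool terminated = j < text.Length && text[j] == ';';
                bool validCodePoint = code > 0 && code <= 0x10FFFF
                    && (code < 0xD800 || code > 0xDFFF);

                if (hasDigits && terminated && validCodePoint)
                {
                    sb.Append(char.ConvertFromUtf32(code));
                    i = j + 1; // continue after the ';'
                    continue;
                }
            }

            sb.Append(text[i]);
            i++;
        }
        return sb.ToString();
    }

    // Assumes .NET 4.0+: decodes named and numbered entities in one call.
    public static string DecodeAllEntities(string text)
    {
        return WebUtility.HtmlDecode(text);
    }
}
```

Calling EntitySketch.ReplaceNumberedEntities("&#355;") yields the single character U+0163. Because every entity costs a lookup or a parse, running this over large documents is where the slowdown mentioned above would come from.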

Using the Code

There is only one method involved in this operation. I called it, without any inspiration, HtmlStripTags. It accepts three parameters:

  • htmlContent: the HTML content to process
  • replaceNamedEntities: whether to replace the HTML named entities such as &nbsp; and others
  • replaceNumberedEntities: whether to replace the HTML numbered entities, i.e. Unicode HTML representations such as &#355;
C#
public static string HtmlStripTags(string htmlContent, 
    bool replaceNamedEntities, bool replaceNumberedEntities)
{
    if (htmlContent == null)
        return null;
    htmlContent = htmlContent.Trim();
    if (htmlContent == string.Empty)
        return string.Empty;

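    // Tag names are plain ASCII markup, so an ordinal (culture-independent)
    // comparison is both safer and faster than a culture-sensitive one.
    int bodyStartTagIdx = htmlContent.IndexOf("<body", 
        StringComparison.OrdinalIgnoreCase);
    int bodyEndTagIdx = htmlContent.IndexOf("</body>", 
        StringComparison.OrdinalIgnoreCase);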

    int startIdx = 0, endIdx = htmlContent.Length - 1;
    if (bodyStartTagIdx >= 0)
        startIdx = bodyStartTagIdx;
    if (bodyEndTagIdx >= 0)
        endIdx = bodyEndTagIdx;

    bool insideTag = false,
        insideAttributeValue = false,
        insideHtmlComment = false,
        insideScriptBlock = false,
        insideNoScriptBlock = false,
        insideStyleBlock = false;
    char attributeValueDelimiter = '"';

    StringBuilder sb = new StringBuilder(htmlContent.Length);
    for (int i = startIdx; i <= endIdx; i++)
    {

        // html comment block
        if (!insideHtmlComment)
        {
            if (i + 3 < htmlContent.Length &&
                htmlContent[i] == '<' &&
                ...
                ...
                ...
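
The listing above is truncated, so the following is a separate, simplified sketch of the same character-scanning idea rather than a reconstruction of the elided code. All names (StripSketch, StripTags and the private helpers) are hypothetical. It skips comments, <script>, <style> and <noscript> blocks, ignores a '>' inside a quoted attribute value, and always advances the index so that unterminated markup cannot loop forever. It does not restrict itself to the <body> element and does not replace entities.

```csharp
using System;
using System.Text;

public static class StripSketch
{
    // Elements whose whole content is invisible and must be dropped.
    private static readonly string[] HiddenBlocks = { "script", "style", "noscript" };

    public static string StripTags(string html)
    {
        if (string.IsNullOrEmpty(html)) return html;

        StringBuilder sb = new StringBuilder(html.Length);
        int i = 0;
        while (i < html.Length)
        {
            char c = html[i];
            if (c != '<') { sb.Append(c); i++; continue; }

            // A '<' not followed by a letter, '/', '!' or '?' is literal text.
            char next = i + 1 < html.Length ? html[i + 1] : '\0';
            if (!(char.IsLetter(next) || next == '/' || next == '!' || next == '?'))
            {
                sb.Append(c); i++; continue;
            }

            // HTML comment: jump past the closing "-->".
            if (string.CompareOrdinal(html, i, "<!--", 0, 4) == 0)
            {
                int end = html.IndexOf("-->", i + 4, StringComparison.Ordinal);
                i = end < 0 ? html.Length : end + 3;
                continue;
            }

            // Hidden block: jump past its closing tag.
            string hidden = MatchHiddenBlock(html, i);
            if (hidden != null)
            {
                int close = html.IndexOf("</" + hidden, i, StringComparison.OrdinalIgnoreCase);
                i = close < 0 ? html.Length : SkipTag(html, close);
                continue;
            }

            i = SkipTag(html, i); // any other tag
        }
        return sb.ToString();
    }

    // Returns the hidden element name opened at position i, or null.
    private static string MatchHiddenBlock(string html, int i)
    {
        foreach (string name in HiddenBlocks)
        {
            int after = i + 1 + name.Length;
            if (after < html.Length
                && string.Compare(html, i + 1, name, 0, name.Length,
                                  StringComparison.OrdinalIgnoreCase) == 0
                && (html[after] == '>' || html[after] == '/' || char.IsWhiteSpace(html[after])))
            {
                return name;
            }
        }
        return null;
    }

    // Returns the index just past the '>' closing the tag that starts at 'start',
    // ignoring any '>' inside a quoted attribute value.
    private static int SkipTag(string html, int start)
    {
        char quote = '\0';
        for (int i = start + 1; i < html.Length; i++)
        {
            char c = html[i];
            if (quote != '\0') { if (c == quote) quote = '\0'; }
            else if (c == '"' || c == '\'') quote = c;
            else if (c == '>') return i + 1;
        }
        return html.Length; // unterminated tag: consume the rest
    }
}
```

For example, StripSketch.StripTags("<p>Hello <b>world</b></p>") returns "Hello world". Using IndexOf to jump over comments and hidden blocks keeps the sketch short; the flag-per-state style of the original listing performs the same skipping one character at a time.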

Points of Interest

I avoided regular expressions in order to achieve maximum performance. Regular expressions are not yet the .NET Framework's strongest point and, moreover, I considered this task simple enough not to require such a general-purpose tool. I ran benchmark tests comparing this implementation with another one presented here at CodeProject.com, Convert HTML to Plain Text, which relies mainly on regular expressions. When parsing large HTML content (80+ KB) without replacing any HTML entities, this implementation was about 5 times faster, i.e. ~5 microseconds on an Intel Core Duo @ 1.83 GHz with 1 GB RAM. Of course, it is less elegant than using regular expressions; I simply maximized the performance.
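
A comparison of this kind can be set up roughly as follows. This is an illustrative harness only, not the benchmark that produced the figures above: the class StripBenchmarkSketch, both strippers, and the generated sample are assumptions, and neither stripper is the article's method or the code from the other article. Both handle plain tags only, and they differ on malformed input (the regex leaves an unterminated '<' in place, the scanner drops everything after it).

```csharp
using System;
using System.Diagnostics;
using System.Text;
using System.Text.RegularExpressions;

public static class StripBenchmarkSketch
{
    // Naive tag pattern, compiled once and reused for every call.
    private static readonly Regex TagRegex = new Regex("<[^>]*>", RegexOptions.Compiled);

    public static string StripWithRegex(string html)
    {
        return TagRegex.Replace(html, string.Empty);
    }

    // Single pass over the characters with one boolean of state.
    public static string StripWithScan(string html)
    {
        StringBuilder sb = new StringBuilder(html.Length);
        bool insideTag = false;
        foreach (char c in html)
        {
            if (c == '<') insideTag = true;
            else if (c == '>' && insideTag) insideTag = false;
            else if (!insideTag) sb.Append(c);
        }
        return sb.ToString();
    }

    // Average wall-clock time per call, in milliseconds.
    public static double AverageMilliseconds(Func<string, string> strip, string html, int iterations)
    {
        strip(html); // warm-up call so JIT compilation is not measured
        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            strip(html);
        }
        sw.Stop();
        return sw.Elapsed.TotalMilliseconds / iterations;
    }

    // Builds synthetic, well-formed markup of a controllable size.
    public static string BuildSample(int paragraphs)
    {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < paragraphs; i++)
        {
            sb.Append("<p class=\"c\">Paragraph ").Append(i).Append(" with <b>bold</b> text.</p>");
        }
        return sb.ToString();
    }
}
```

Passing StripBenchmarkSketch.BuildSample(2000), which is roughly 100 KB of markup, to AverageMilliseconds once with StripWithRegex and once with StripWithScan gives two numbers to compare. The absolute values depend heavily on the runtime version and hardware, so only the ratio measured on a single machine is meaningful.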

History

  • 18 July, 2007 -- Original version posted
  • 8 November, 2007 -- Article content and downloads updated

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Software Developer (Senior) IBM, Business Analytics
Romania

Comments and Discussions

 
Answer: Re: Why Not RegEx?
Andrei Ion Rînea, 22-Jul-07 11:24
.NET support for RegEx's is not poor. The performance can still be improved as far as I can see.
Why do I say this? Because a friend at work (who is a Java programmer) showed me RegEx benchmarks in which Java outperformed .NET on the same hardware.

However, as I pointed out at the end of the article, performance was the difference between my implementation and Paceman's implementation (http://www.codeproject.com/useritems/HTML_to_Plain_Text.asp). That implementation uses mainly Regular Expressions and, as I said at the end of the article, IT IS MORE ELEGANT. However, I wanted to extract as much performance as I could.

Personal blog http://andreir.wordpress.com
