
HTML Tag Stripper

By Andrei Rinea

A fast way to strip the HTML tags from an HTML fragment and leave only the visible text
C#, Windows, .NET, ASP.NET, Visual-Studio, WebForms, Dev
Posted: 18 Jul 2007
Updated: 8 Nov 2007

Introduction

This article explains a simple method of stripping HTML tags, similar to the PHP strip_tags() function. It is typically useful in content management systems, where you need to store a text-only version of, for example, an article so that a full-text search can run across all articles.

Background

Stripping the tags, in this problem's context, means keeping only the visible text of an HTML document or HTML fragment. This means excluding all HTML comments and all HTML <script>, <style> and <noscript> blocks.
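To make the definition above concrete, here is a minimal sketch of the idea. It is not the article's HtmlStripTags method: the class name VisibleTextSketch and the helpers Strip and StartsWithTag are hypothetical, and the sketch deliberately ignores details such as a '>' character inside an attribute value, which a more careful implementation has to track.

```csharp
using System;
using System.Text;

public static class VisibleTextSketch
{
    // Hypothetical helper (not the article's HtmlStripTags): drops tags,
    // HTML comments, and <script>, <style>, <noscript> blocks.
    public static string Strip(string html)
    {
        if (string.IsNullOrEmpty(html))
            return html;

        string[] hiddenBlocks = { "script", "style", "noscript" };
        var sb = new StringBuilder(html.Length);
        int i = 0;
        while (i < html.Length)
        {
            // Anything that does not open a tag is visible text.
            if (html[i] != '<')
            {
                sb.Append(html[i++]);
                continue;
            }

            // HTML comment: skip up to and including "-->".
            if (string.CompareOrdinal(html, i, "<!--", 0, 4) == 0)
            {
                int end = html.IndexOf("-->", i + 4, StringComparison.Ordinal);
                i = end < 0 ? html.Length : end + 3;
                continue;
            }

            // Hidden block: skip everything up to the end of its closing tag.
            bool skipped = false;
            foreach (string name in hiddenBlocks)
            {
                if (!StartsWithTag(html, i, name))
                    continue;
                int closeIdx = html.IndexOf("</" + name, i,
                    StringComparison.OrdinalIgnoreCase);
                int gt = closeIdx < 0 ? -1 : html.IndexOf('>', closeIdx);
                i = gt < 0 ? html.Length : gt + 1;
                skipped = true;
                break;
            }
            if (skipped)
                continue;

            // Ordinary tag: skip to the next '>' (simplification: a '>'
            // inside an attribute value would end the tag too early).
            int tagEnd = html.IndexOf('>', i);
            i = tagEnd < 0 ? html.Length : tagEnd + 1;
        }
        return sb.ToString();
    }

    // True when the text at position lt is "<name" followed by '>', '/',
    // whitespace, or the end of the input.
    private static bool StartsWithTag(string html, int lt, string name)
    {
        int after = lt + 1 + name.Length;
        if (after > html.Length)
            return false;
        if (string.Compare(html, lt + 1, name, 0, name.Length,
                StringComparison.OrdinalIgnoreCase) != 0)
            return false;
        return after == html.Length || html[after] == '>' ||
               html[after] == '/' || char.IsWhiteSpace(html[after]);
    }
}
```

For example, calling VisibleTextSketch.Strip on the fragment `<p>Hello <!-- x --><b>World</b></p><script>var a = 1;</script>!` returns `Hello World!`.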

The resulting text can be processed further by replacing named HTML entities such as &quot;, &amp; and &copy;, and numbered entities such as &#355;, with their corresponding characters. To do so, set the method's respective parameters, replaceNamedEntities and replaceNumberedEntities, to true. Bear in mind, however, that these replacements can slow execution down significantly.
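As an illustration of what entity replacement produces, the sketch below uses the framework's own System.Net.WebUtility.HtmlDecode (available from .NET Framework 4.0 onward) rather than the article's replaceNamedEntities and replaceNumberedEntities logic. The class name EntityDecodeSketch is hypothetical. Note that HtmlDecode handles named and numbered entities in one pass, so it does not offer the two separate switches the article's method has.

```csharp
using System;
using System.Net;

public static class EntityDecodeSketch
{
    // Hypothetical wrapper: decodes named entities (&quot;, &amp;, &copy;)
    // and numbered entities (&#355;) using the framework's decoder, which
    // is a stand-in here for the article's own replacement code.
    public static string Decode(string text)
    {
        return WebUtility.HtmlDecode(text);
    }
}
```

For example, Decode applied to `&quot;Tom &amp; Jerry&quot; &copy; &#355;` yields the text `"Tom & Jerry" © ţ`.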

Using the Code

Only one method is involved in this operation. I named it, rather unimaginatively, HtmlStripTags. It accepts three parameters:

  • htmlContent: the HTML content to process
  • replaceNamedEntities: whether to replace the HTML named entities, such as &amp;
  • replaceNumberedEntities: whether to replace the HTML numbered entities, i.e. numeric character references such as &#355;

public static string HtmlStripTags(string htmlContent, 
    bool replaceNamedEntities, bool replaceNumberedEntities)
{
    if (htmlContent == null)
        return null;
    htmlContent = htmlContent.Trim();
    if (htmlContent == string.Empty)
        return string.Empty;

    int bodyStartTagIdx = htmlContent.IndexOf("<body", 
        StringComparison.CurrentCultureIgnoreCase);
    int bodyEndTagIdx = htmlContent.IndexOf("</body>", 
        StringComparison.CurrentCultureIgnoreCase);

    int startIdx = 0, endIdx = htmlContent.Length - 1;
    if (bodyStartTagIdx >= 0)
        startIdx = bodyStartTagIdx;
    if (bodyEndTagIdx >= 0)
        endIdx = bodyEndTagIdx;

    bool insideTag = false,
        insideAttributeValue = false,
        insideHtmlComment = false,
        insideScriptBlock = false,
        insideNoScriptBlock = false,
        insideStyleBlock = false;
    char attributeValueDelimiter = '"';

    StringBuilder sb = new StringBuilder(htmlContent.Length);
    for (int i = startIdx; i <= endIdx; i++)
    {

        // html comment block

        if (!insideHtmlComment)
        {
            if (i + 3 < htmlContent.Length &&
                htmlContent[i] == '<' &&
                ...
                ...
                ...

Points of Interest

I avoided regular expressions in order to achieve maximum performance. Regular expressions are not yet the .NET Framework's strongest point. Moreover, I considered this task simple enough not to require such a universal tool. I ran benchmark tests comparing this implementation to another one presented here at CodeProject.com, Convert HTML to Plain Text, which relies mainly on regular expressions. When parsing large HTML content (80+ KB) without replacing any HTML entities, this implementation was about 5 times faster, i.e. ~5 microseconds on an Intel Core Duo @ 1.83 GHz with 1 GB of RAM. Of course, it is less elegant than using regular expressions; I simply maximized the performance.
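A comparison of this kind can be set up along the following lines. This is only a sketch of a timing harness, not the benchmark the figures above come from: the class StripBenchmarkSketch, both stripping methods, and the generated sample are hypothetical stand-ins, and the two strippers here only remove tags (no comment or script handling). Timings will vary with hardware and runtime, so no numbers are quoted.

```csharp
using System;
using System.Diagnostics;
using System.Text;
using System.Text.RegularExpressions;

public static class StripBenchmarkSketch
{
    // Stand-in for a regex-based approach: remove anything between '<' and '>'.
    private static readonly Regex TagRegex =
        new Regex("<[^>]*>", RegexOptions.Compiled);

    public static string StripWithRegex(string html)
    {
        return TagRegex.Replace(html, string.Empty);
    }

    // Stand-in for a hand-written loop: copy characters outside tags.
    public static string StripWithLoop(string html)
    {
        var sb = new StringBuilder(html.Length);
        bool insideTag = false;
        foreach (char c in html)
        {
            if (c == '<')
                insideTag = true;
            else if (c == '>' && insideTag)
                insideTag = false;
            else if (!insideTag)
                sb.Append(c);
        }
        return sb.ToString();
    }

    // Builds a synthetic HTML sample; 2300 repeats give about 80,000 characters.
    public static string BuildSample(int repeats)
    {
        var sb = new StringBuilder();
        for (int n = 0; n < repeats; n++)
            sb.Append("<p class=\"x\">Hello <b>world</b></p>\n");
        return sb.ToString();
    }

    // Returns the average time per call after one warm-up call.
    public static TimeSpan Measure(Func<string, string> strip, string html,
        int iterations)
    {
        strip(html); // warm-up, so JIT compilation is not part of the timing
        Stopwatch sw = Stopwatch.StartNew();
        for (int n = 0; n < iterations; n++)
            strip(html);
        sw.Stop();
        return TimeSpan.FromTicks(sw.Elapsed.Ticks / iterations);
    }

    // Prints the average time per call for both approaches.
    public static void Demo()
    {
        string html = BuildSample(2300);
        Console.WriteLine("Regex: {0}", Measure(StripWithRegex, html, 100));
        Console.WriteLine("Loop:  {0}", Measure(StripWithLoop, html, 100));
    }
}
```

Calling StripBenchmarkSketch.Demo() prints the average time per call for each approach. Before comparing timings, it is worth checking that both approaches return the same text for the sample, since a faster method that strips differently is not a fair comparison.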

History

  • 18 July, 2007 -- Original version posted
  • 8 November, 2007 -- Article content and downloads updated

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Andrei Rinea


Occupation: Software Developer (Senior)
Company: TeamNet International
Location: Romania

Last Updated: 8 Nov 2007
Editor: Genevieve Sovereign
Copyright 2007 by Andrei Rinea