
Remove all the HTML tags and display only the plain text inside (in case the XML is not well formed)


4.94/5 (17 votes)

Dec 15, 2010

CPOL


Remove the HTML tags and get the plain text from inside

Introduction

I was encouraged to write this Tip/Trick because I have received so many questions about this issue. Suppose you have a bunch of HTML strings and you want to remove all the HTML tags and keep only the plain text. A regular expression (Regex) can come to the rescue. The Regex I had developed earlier was more cumbersome; then Chris made a suggestion, so I now use the Regex he suggested: "\<[^\>]*\>". I have tested it on many cases, and it detected every type of HTML tag I tried. There may still be loopholes, so if you find any tag that this Regex does not match, kindly let me know.

Regex Definition

  • Regex: \<[^\>]*\>
    • Literal <
    • Any character that is NOT in this class: [\>], any number of repetitions
    • Literal >
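
To make the breakdown above concrete, here is a minimal console sketch that lists what the pattern actually matches in a sample string. The class name TagMatchDemo, the method FindTags, and the sample input are assumptions made for illustration; they are not part of the original tip.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagMatchDemo
{
    // Same pattern as in the tip; the verbatim string (@"...") avoids doubled backslashes.
    private static readonly Regex TagRegex = new Regex(@"\<[^\>]*\>");

    // Returns every substring the pattern treats as a tag, in order of appearance.
    public static string[] FindTags(string html)
    {
        MatchCollection matches = TagRegex.Matches(html);
        string[] tags = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
        {
            tags[i] = matches[i].Value;
        }
        return tags;
    }

    public static void Main()
    {
        // Prints three lines: <p class="x">, <br/> and </p>
        foreach (string tag in FindTags("<p class=\"x\">Hello <br/>world</p>"))
        {
            Console.WriteLine(tag);
        }
    }
}
```

Each match starts at a literal < and runs up to the first > that follows it, which is why attributes and self-closing tags are covered by the same short pattern.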

Program

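// Requires: using System.Text.RegularExpressions; (Response.Write is available inside an ASP.NET page)
string ss = "<b><i>The tag is about to be removed</i></b>";
Regex regex = new Regex("\\<[^\\>]*\\>"); // matches everything from '<' up to the next '>'
Response.Write(String.Format("<b>Before:</b>{0}", ss)); // HTML text
Response.Write("<br/>");
ss = regex.Replace(ss, String.Empty); // replace every matched tag with an empty string
Response.Write(String.Format("<b>After:</b>{0}", ss)); // plain text as output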

Program understanding

The program above finds every substring that matches the Regex and replaces it with an empty string. For example, given the HTML string "<li>Hiren</li>", it outputs just "Hiren" as plain text.
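
The same find-and-replace step can be tried outside an ASP.NET page. The following is a minimal console sketch; the class name HtmlTagStripper, the method StripTags, and the null/empty handling are assumptions added for illustration, not something the original tip specifies.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HtmlTagStripper
{
    // Same pattern as in the tip, compiled once and reused for every call.
    private static readonly Regex TagRegex = new Regex(@"\<[^\>]*\>");

    // Replaces every matched tag with an empty string.
    // Assumption: null or empty input is returned as an empty string.
    public static string StripTags(string html)
    {
        if (string.IsNullOrEmpty(html))
        {
            return string.Empty;
        }
        return TagRegex.Replace(html, string.Empty);
    }

    public static void Main()
    {
        Console.WriteLine(StripTags("<li>Hiren</li>")); // prints: Hiren
        Console.WriteLine(StripTags("<b><i>The tag is about to be removed</i></b>"));
        // prints: The tag is about to be removed
    }
}
```

Keeping the Regex in a static field is a small design choice: the pattern never changes, so there is no need to construct it on every call.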

Above sample Program OUTPUT

INPUT String : The tag is about to be removed (rendered by the browser in bold italics)
OUTPUT String : The tag is about to be removed (plain text)
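
The introduction notes that the Regex may have loopholes. As a hedged illustration, the sketch below runs the same pattern on three inputs where the result is probably not what a reader wants. The class name StripLimitsDemo, the method Strip, and the sample inputs are assumptions chosen for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StripLimitsDemo
{
    // Same pattern as in the tip.
    private static readonly Regex TagRegex = new Regex(@"\<[^\>]*\>");

    public static string Strip(string html)
    {
        return TagRegex.Replace(html, string.Empty);
    }

    public static void Main()
    {
        // A '>' inside a quoted attribute value ends the match too early,
        // so part of the attribute leaks into the output.
        Console.WriteLine(Strip("<a title=\"a>b\">link</a>")); // prints: b">link

        // Only tags are removed; HTML entities stay encoded.
        Console.WriteLine(Strip("<p>Tom &amp; Jerry</p>")); // prints: Tom &amp; Jerry

        // The tags around a script are removed, but the script body remains.
        Console.WriteLine(Strip("<script>alert(1)</script>Hi")); // prints: alert(1)Hi
    }
}
```

For input that may contain cases like these, the Regex alone is likely not enough and some additional handling would be needed.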