Extract inner text from HTML using Regex

17 Oct 2012, CPOL
How to extract the inner text from HTML using a Regular Expression.

Introduction

Use this code snippet to extract the inner text from HTML. It is lightweight, simple, and efficient, works well even with malformed HTML, and needs no extra DLL such as HtmlAgilityPack.

Note:

This method is intended to be used with simple HTML that is free of scripts, styles, and comments.
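
If the input may contain scripts, styles, or comments, one possible workaround is to delete those blocks before stripping the tags. The sketch below is not part of the original tip: the class name HtmlPreCleaner, the helper RemoveScriptsStylesComments, and both of its patterns are assumptions made for illustration, and they will not cope with every real-world page. The tag-stripping pattern is the one this tip presents under "Using the code".

```csharp
using System;
using System.Text.RegularExpressions;

public static class HtmlPreCleaner
{
    // Hypothetical helper (an assumption, not from the original tip):
    // removes <script>/<style> blocks and HTML comments, content included.
    public static string RemoveScriptsStylesComments(string html)
    {
        const RegexOptions options = RegexOptions.Singleline | RegexOptions.IgnoreCase;

        // \1 refers back to whichever of "script" or "style" opened the block.
        string withoutBlocks = Regex.Replace(
            html, @"<(script|style)\b[^>]*>.*?</\1\s*>", " ", options);

        return Regex.Replace(withoutBlocks, @"<!--.*?-->", " ", options);
    }

    // The tip's tag-stripping step, repeated here so the example is self-contained.
    public static string ExtractHtmlInnerText(string htmlText) =>
        Regex.Replace(htmlText, @"(<.*?>\s*)+", " ", RegexOptions.Singleline).Trim();

    public static void Main()
    {
        string html = "<style>p { color: red; }</style><p>Hello</p>"
                    + "<!-- note --><script>if (1 < 2) { run(); }</script><p>world</p>";

        // Without the pre-cleaning step, the CSS and JavaScript text would leak into the result.
        string cleaned = RemoveScriptsStylesComments(html);
        Console.WriteLine(ExtractHtmlInnerText(cleaned)); // Hello world
    }
}
```

For anything more complex than this, a real HTML parser is probably the safer choice.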

Background

Some tasks, especially in web scraping, require you to extract text from HTML. One popular solution is the HtmlAgilityPack DocumentNode.InnerText property; however, this requires adding an extra library to your project, and it has drawbacks in some edge cases.

One drawback I noticed is that it might concatenate two words into a single word. For example, consider the HTML string "<p>this<b>is</b> a test</p>": using HtmlAgilityPack to extract the text will result in "thisis a test".
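
The effect is easy to reproduce without any library. The sketch below does not call HtmlAgilityPack; it only mimics the concatenation by deleting tags outright, then compares that with two space-based replacements. The class and method names are made up for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagReplacementDemo
{
    // Deletes each tag outright; words separated only by tags get glued together.
    public static string DeleteTags(string html) =>
        Regex.Replace(html, "<.*?>", "", RegexOptions.Singleline).Trim();

    // Replaces each tag with one space; words stay apart, but a tag that is
    // already followed by a space leaves two spaces behind.
    public static string TagsToSpaces(string html) =>
        Regex.Replace(html, "<.*?>", " ", RegexOptions.Singleline).Trim();

    // Replaces a whole run of tags, plus any whitespace after each tag,
    // with a single space.
    public static string TagRunsToSpace(string html) =>
        Regex.Replace(html, @"(<.*?>\s*)+", " ", RegexOptions.Singleline).Trim();

    public static void Main()
    {
        string html = "<p>this<b>is</b> a test</p>";

        Console.WriteLine(DeleteTags(html));     // thisis a test
        Console.WriteLine(TagsToSpaces(html));   // this is  a test
        Console.WriteLine(TagRunsToSpace(html)); // this is a test
    }
}
```

The third variant is why the pattern in this tip groups consecutive tags and absorbs the whitespace that follows them.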

Using the code   

To use this code, import the System.Text.RegularExpressions namespace, then add the following function to your utilities class or adapt it as an extension method:

C#
public static string ExtractHtmlInnerText(string htmlText)
{
    // Match one or more consecutive HTML tags (opening or closing),
    // each followed by any amount of whitespace.
    // RegexOptions.Singleline lets '.' match newlines, so a tag that
    // spans several lines is still matched.
    Regex regex = new Regex("(<.*?>\\s*)+", RegexOptions.Singleline);

    // Replace each run of tags (and the whitespace after them) with a
    // single space, then trim the leading and trailing spaces.
    string resultText = regex.Replace(htmlText, " ").Trim();

    return resultText;
}
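
As a rough illustration of the extension-method form mentioned above, the same logic could be wrapped as follows. The class name HtmlStringExtensions and the cached TagRun field are assumptions made for this sketch, not part of the original snippet.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HtmlStringExtensions
{
    // Built once and reused across calls instead of on every invocation.
    private static readonly Regex TagRun =
        new Regex(@"(<.*?>\s*)+", RegexOptions.Singleline);

    // Same behavior as ExtractHtmlInnerText, callable directly on a string.
    public static string ExtractHtmlInnerText(this string htmlText)
    {
        return TagRun.Replace(htmlText, " ").Trim();
    }
}

public static class Program
{
    public static void Main()
    {
        // Tags separated by newlines and indentation collapse into single spaces.
        string html = "<div>\n  <h1>Title</h1>\n  <p>Body text</p>\n</div>";

        Console.WriteLine(html.ExtractHtmlInnerText()); // Title Body text
    }
}
```

Whitespace inside the text itself is left untouched; only tags and the whitespace directly after them are replaced.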

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Web Developer
Jordan Jordan

Comments and Discussions

Question: Data Scraping from Paginated Grid View (Member 10500506, 24-Jan-14 18:52)

Question: Scraping Of Data from Paginated Grid View (Member 10500506, 24-Jan-14 18:52)

Suggestion: You may try to process thes file... (Andreas Gieriet, 16-Oct-12 20:00)

Re: You may try to process thes file... (jahmani, 17-Oct-12 5:20)
Thanks for your interest. I used this method to extract text from the HTML descriptions of RSS feeds, which are usually fairly simple. Your HTML is somewhat more complex and cannot be handled by this snippet; I should have mentioned this in the tip.
Thanks again for your note. I will update the article to include it.

Jahmani.
