Extract inner text from HTML using Regex

jahmani, 17 Oct 2012
How to extract the inner text from HTML using a Regular Expression.

Introduction

Use this code snippet to extract the inner text from HTML. It is lightweight, simple, and efficient; it works well even with malformed HTML and needs no extra DLL such as HtmlAgilityPack.

Note:

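This method is intended to be used with simple HTML that is free of scripts, styles, or comments. The regex removes tags only, so the contents of script and style elements would remain in the output.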
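To see why this restriction matters, here is a minimal sketch that applies the same pattern as the function given under "Using the code" to markup containing script and style elements. The class name ScriptLeakDemo and the method name StripTags are hypothetical, chosen only for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ScriptLeakDemo
{
    // Same pattern as the article's function: one or more tags,
    // each optionally followed by whitespace.
    private static readonly Regex TagRun =
        new Regex(@"(<.*?>\s*)+", RegexOptions.Singleline);

    // Hypothetical helper: replaces every run of tags with one space.
    public static string StripTags(string html)
    {
        return TagRun.Replace(html, " ").Trim();
    }

    public static void Main()
    {
        // Plain markup: only the visible text survives.
        Console.WriteLine(StripTags("<h1>Title</h1><p>Body text</p>"));
        // prints: Title Body text

        // The <script> tags are removed, but the script body is ordinary
        // text to the regex, so it ends up in the result.
        Console.WriteLine(StripTags("<p>Hello</p><script>var x = 1;</script>"));
        // prints: Hello var x = 1;

        // The same happens with the rules inside a <style> element.
        Console.WriteLine(StripTags("<style>p { color: red; }</style><p>Hi</p>"));
        // prints: p { color: red; } Hi
    }
}
```

If your input can contain such elements, their contents have to be removed by some other means before this approach gives clean text.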

Background

Some tasks require you to extract text from HTML, especially in web scraping. One popular solution is to use HtmlAgilityPack's DocumentNode.InnerText property; however, this requires adding an extra library to your project, and it has drawbacks in some edge cases.

One drawback I noticed is that it may join two words into a single word. For example, consider the HTML string "<p>this<b>is<b/> a test</p>": using HtmlAgilityPack to extract the text results in "thisis a test".

Using the code   

To use this code, you need to import the System.Text.RegularExpressions namespace. Add the following function to your utilities class, or adapt it as an extension method:

public static string ExtractHtmlInnerText(string htmlText)
{
    // Match one or more consecutive HTML tags (opening or closing),
    // each followed by any run of whitespace.
    // RegexOptions.Singleline lets '.' match newlines, so a tag that
    // spans several lines is still matched.

    Regex regex = new Regex("(<.*?>\\s*)+", RegexOptions.Singleline);

    // Replace each run of tags (and the whitespace after them) with a single space,
    // then trim the leading and trailing spaces.

    string resultText = regex.Replace(htmlText, " ").Trim();

    return resultText;
}
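
As a rough illustration of the extension-method option mentioned above, the following sketch wraps the same pattern in a static class and runs it on the example string from the Background section. The class names HtmlTextExtensions and Demo are assumptions made for this example, and caching the Regex in a static field is one possible design choice, not part of the original function.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HtmlTextExtensions
{
    // Built once and reused across calls; same pattern as the function above.
    private static readonly Regex TagRun =
        new Regex(@"(<.*?>\s*)+", RegexOptions.Singleline);

    // Extension-method form: callable as html.ExtractHtmlInnerText().
    public static string ExtractHtmlInnerText(this string htmlText)
    {
        return TagRun.Replace(htmlText, " ").Trim();
    }
}

public static class Demo
{
    public static void Main()
    {
        string html = "<p>this<b>is<b/> a test</p>";

        // Each run of tags becomes a single space, so the words stay separated.
        Console.WriteLine(html.ExtractHtmlInnerText()); // prints: this is a test
    }
}
```

Because each run of tags is replaced by a space instead of being deleted, "this" and "is" stay separate words.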

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

jahmani
Web Developer
Jordan

Last Updated 17 Oct 2012
Article Copyright 2012 by jahmani