Getting Only The Text Displayed On A Webpage Using C#

3 May 2013
How to get only the text displayed on a webpage using C#

Editorial Note

This article appears in the Third Party Product Reviews section. Articles in this section are for the members only and must not be used by tool vendors to promote or advertise products in any way, shape or form. Please report any spam or advertising.

Introduction

After looking around for months at various ways to get only the text displayed in a web browser using C#, I found that it all boils down to a few simple lines of code. I looked at several very robust open source .NET solutions, such as the HTML Agility Pack and Majestic 12. However, for applications that only need tag-free, HTML-free text from a web page, these solutions seemed to be overkill, at least in my case.

Here are three very simple ways to get only the displayed text on a web page:

Method 1 – In Memory Cut and Paste

Use a WebBrowser control to load and render the web page, and then copy the text from the control.

Use the following code to download the web page:

//Create the WebBrowser control
WebBrowser wb = new WebBrowser();
//Add an event handler to process the document when the download is completed
wb.DocumentCompleted +=
    new WebBrowserDocumentCompletedEventHandler(DisplayText);
//Download the web page (Navigate accepts either a string or a Uri)
wb.Navigate(urlPath);

Use the following event handler to process the downloaded web page text:

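private void DisplayText(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    WebBrowser wb = (WebBrowser)sender;
    //Select everything on the rendered page and copy it to the clipboard
    //(this overwrites whatever was on the clipboard before)
    wb.Document.ExecCommand("SelectAll", false, null);
    wb.Document.ExecCommand("Copy", false, null);
    //CleanText is a helper whose implementation is not shown in this article
    textResultsBox.Text = CleanText(Clipboard.GetText());
}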
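
The CleanText helper called in the handler above is not shown in this article. As a rough sketch only, assuming its job is to tidy the whitespace that a clipboard copy of a rendered page tends to leave behind, a stand-in could look like the following. The TextCleaner class name and the specific rules (collapsing spaces, tabs, and blank lines) are assumptions for illustration, not the original implementation:

```csharp
using System;
using System.Text.RegularExpressions;

public static class TextCleaner
{
    // Hypothetical stand-in for the article's CleanText helper;
    // the real implementation is not shown in the article.
    public static string CleanText(string rawText)
    {
        if (string.IsNullOrEmpty(rawText))
        {
            return string.Empty;
        }

        // Normalize Windows and old Mac line endings to '\n'.
        string text = rawText.Replace("\r\n", "\n").Replace('\r', '\n');

        // Collapse runs of spaces and tabs into a single space.
        text = Regex.Replace(text, @"[ \t]+", " ");

        // Drop the single space left on either side of a line break.
        text = Regex.Replace(text, @" ?\n ?", "\n");

        // Collapse runs of blank lines into a single line break.
        text = Regex.Replace(text, @"\n{2,}", "\n");

        return text.Trim();
    }
}
```

Inside a form, the handler would call it as TextCleaner.CleanText(Clipboard.GetText()), or the method could be copied into the form class so that the unqualified CleanText call compiles.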

Method 2 – In Memory Selection Object

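This is a second way to process the downloaded web page text. It seems to take just a bit longer, although the difference is minimal. However, it avoids the clipboard and its limitations, such as overwriting whatever the user last copied. The IHTML* interfaces used here come from the mshtml namespace, so the project needs a reference to Microsoft.mshtml.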

private void DisplayText(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    //Get the WebBrowser control and its underlying IHTMLDocument2
    //(requires "using mshtml;" and a reference to Microsoft.mshtml)
    WebBrowser wb = (WebBrowser)sender;
    IHTMLDocument2 htmlDocument =
        wb.Document.DomDocument as IHTMLDocument2;
    //Select all the text on the page and get the selection object
    wb.Document.ExecCommand("SelectAll", false, null);
    IHTMLSelectionObject currentSelection = htmlDocument.selection;
    //Create a text range and send the range's text to your text box
    IHTMLTxtRange range = currentSelection.createRange() as IHTMLTxtRange;
    textResultsBox.Text = range.text;
}

Method 3 – The Elegant, Simple, Slower XmlDocument Approach

A good friend shared this example with me. I am a huge fan of simple, and this example wins the simplicity contest hands down. It was unfortunately very slow compared to the other two approaches.

The XmlDocument object will load and process a web page with only 3 simple lines of code, provided the page is well-formed XML (XHTML, for example):

XmlDocument document = new XmlDocument();
//Include the scheme; without "http://" the string is treated as a local file path
document.Load("http://www.yourwebsite.com");
string allText = document.InnerText;
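
One caveat worth keeping in mind: XmlDocument is an XML parser, so it only accepts markup that is well-formed XML and throws an XmlException otherwise. The minimal sketch below illustrates this with LoadXml on in-memory strings instead of a downloaded page. The XhtmlText class and TryGetText method are hypothetical names introduced for illustration only:

```csharp
using System;
using System.Xml;

public static class XhtmlText
{
    // Returns the concatenated text nodes of the given markup,
    // or null when the markup is not well-formed XML.
    public static string TryGetText(string markup)
    {
        XmlDocument document = new XmlDocument();
        try
        {
            // LoadXml parses a string; Load would read from a URL or file.
            document.LoadXml(markup);
        }
        catch (XmlException)
        {
            // Typical "tag soup" HTML (an unclosed <br>, for example) ends up here.
            return null;
        }
        return document.InnerText;
    }
}
```

With this sketch, XhtmlText.TryGetText("<html><body><p>Hello</p><p>World</p></body></html>") returns "HelloWorld", while passing "<p>Hello<br></p>" returns null because the br element is never closed. InnerText joins text nodes without inserting separators, so text from adjacent elements runs together.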

There you have it! Three simple ways to scrape only displayed text from web pages with no external “packages” involved.

Packages

I have recently used the WatiN web application testing package to get website text using C#. WatiN was not the easiest package to set up for retrieving website text from C#, as it required references to the WatiN core DLL, Microsoft.mshtml, and System.Windows.Forms, plus several additional classes included in my project. However, I still think it is worth mentioning, because I like the results it produces. The package is stable and very simple to use once it is set up. In fact, the website text can be obtained using only 3 lines of code:

var browser = new MsHtmlBrowser();
browser.GoTo("www.YourURLHere.com");
commandLog.Text = browser.Text;

I have included a simple Visual Studio ASP.NET project for download here.

License

This article, along with any associated source code and files, is licensed under The GNU General Public License (GPLv3)

About the Author

Jake Drew
Student
United States
If you would like to know more about me, please feel free to visit my website at http://www.jakemdrew.com/
 
Thanks!
 
Jake Drew

Last Updated 3 May 2013
Article Copyright 2013 by Jake Drew