Getting Only The Text Displayed On A Webpage Using C#

5.00/5 (4 votes)

May 3, 2013

GPL3

How to get only the text displayed on a webpage using C#

Introduction

After looking around for months at various ways to get only the text displayed in a web browser using C#, it all boiled down to a few simple lines of code. I looked at several very robust open source .NET solutions, such as the HTML Agility Pack and Majestic 12. However, for applications that only need tag-free, HTML-free text from a web page, these solutions seem to be overkill, at least in my case.

Here are three very simple ways to get only the displayed text on a web page:

Method 1 – In Memory Cut and Paste

Use a WebBrowser control to load the web page, and then copy the text from the control.

Use the following code to download the web page:

//Create the WebBrowser control
WebBrowser wb = new WebBrowser();
//Add an event handler to process the document when the download is completed
wb.DocumentCompleted +=
    new WebBrowserDocumentCompletedEventHandler(DisplayText);
//Download the webpage (Url takes a System.Uri; use wb.Navigate(...) for a string)
wb.Url = urlPath;
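
The snippet above does not show where urlPath comes from. As a minimal sketch, assuming the address starts out as a string typed by a user (possibly without a scheme, like the "www.yourwebsite.com" placeholder used later in this article), a helper along these lines could turn it into the System.Uri that WebBrowser.Url expects. The class and method names are hypothetical and not part of the original project:

```csharp
using System;

public static class UrlHelper
{
    // Hypothetical helper: converts user input into an absolute http/https Uri.
    // Returns null when the input cannot be interpreted as a web address.
    public static Uri ToAbsoluteUri(string input)
    {
        if (string.IsNullOrWhiteSpace(input))
            return null;

        string candidate = input.Trim();

        // Assume plain http when the user left the scheme off.
        if (!candidate.Contains("://"))
            candidate = "http://" + candidate;

        Uri uri;
        if (!Uri.TryCreate(candidate, UriKind.Absolute, out uri))
            return null;

        // Only accept web addresses, not file:, ftp:, and so on.
        bool isWeb = uri.Scheme == Uri.UriSchemeHttp
                  || uri.Scheme == Uri.UriSchemeHttps;
        return isWeb ? uri : null;
    }
}
```

With something like this in place, the last line of the snippet could become wb.Url = UrlHelper.ToAbsoluteUri(addressText); after a null check, where addressText is whatever string your application collects.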

Use the following event code to process the downloaded web page text:

private void DisplayText(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    WebBrowser wb = (WebBrowser)sender;
    //Select everything on the page and copy it to the clipboard
    wb.Document.ExecCommand("SelectAll", false, null);
    wb.Document.ExecCommand("Copy", false, null);
    //CleanText is a helper method that is not shown in this article
    textResultsBox.Text = CleanText(Clipboard.GetText());
}
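
The CleanText helper called in the handler above is not listed in the article. A minimal sketch of what such a helper might do, assuming its job is simply to tidy up whitespace in the copied text, is shown here. The class name and the exact clean-up rules are assumptions for illustration, not the original implementation:

```csharp
using System;
using System.Text.RegularExpressions;

public static class TextCleaner
{
    // Hypothetical stand-in for the article's CleanText helper:
    // normalizes line endings and collapses redundant whitespace.
    public static string CleanText(string raw)
    {
        if (string.IsNullOrEmpty(raw))
            return string.Empty;

        // Normalize Windows and old Mac line endings to "\n".
        string text = raw.Replace("\r\n", "\n").Replace('\r', '\n');

        // Collapse runs of spaces and tabs into a single space.
        text = Regex.Replace(text, @"[ \t]+", " ");

        // Drop a single space left on either side of a line break.
        text = Regex.Replace(text, @" ?\n ?", "\n");

        // Allow at most one blank line between blocks of text.
        text = Regex.Replace(text, @"\n{3,}", "\n\n");

        return text.Trim();
    }
}
```

If you use a helper like this from a Windows Forms TextBox, you may want to convert "\n" back to Environment.NewLine before assigning the result, since a multiline TextBox expects Windows line endings.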

Method 2 – In Memory Selection Object

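This is a second method of processing the downloaded web page text. It seems to take just a bit longer, although the difference is minimal. However, it avoids the clipboard and its limitations: Method 1 overwrites whatever the user had on the clipboard, and Clipboard.GetText only works on a single-threaded apartment (STA) thread.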

//Requires a reference to Microsoft.mshtml and "using mshtml;"
private void DisplayText(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    //Get the WebBrowser control and its IHTMLDocument2
    WebBrowser wb = (WebBrowser)sender;
    IHTMLDocument2 htmlDocument =
        wb.Document.DomDocument as IHTMLDocument2;
    //Select all the text on the page and create a selection object
    wb.Document.ExecCommand("SelectAll", false, null);
    IHTMLSelectionObject currentSelection = htmlDocument.selection;
    //Create a text range and send the range's text to your text box
    IHTMLTxtRange range = currentSelection.createRange() as IHTMLTxtRange;
    textResultsBox.Text = range.text;
}

Method 3 – The Elegant, Simple, Slower XmlDocument Approach

A good friend shared this example with me. I am a huge fan of simple, and this example wins the simplicity contest hands down. It was unfortunately very slow compared to the other two approaches.

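The XmlDocument object will load and process a web page with only 3 simple lines of code. Note that XmlDocument is an XML parser, so this only works when the page is well-formed XML (for example, XHTML); ordinary HTML with unclosed tags makes Load throw an XmlException. The address also needs its scheme, because a string without one is treated as a local file path:

XmlDocument document = new XmlDocument();
document.Load("http://www.yourwebsite.com");
string allText = document.InnerText;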
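
To see how this approach behaves without touching the network, here is a small sketch that feeds markup to XmlDocument from a string via LoadXml instead of Load. The class and method names are hypothetical. It illustrates two things worth knowing: InnerText simply concatenates the text of all nodes, and markup that is not well-formed XML is rejected with an XmlException:

```csharp
using System;
using System.Xml;

public static class XmlTextDemo
{
    // Returns the concatenated text of a well-formed XML/XHTML string.
    // Throws XmlException if the markup is not well-formed.
    public static string ExtractText(string xhtml)
    {
        XmlDocument document = new XmlDocument();
        document.LoadXml(xhtml);
        return document.InnerText;
    }

    // Same as above, but reports failure instead of throwing.
    public static bool TryExtractText(string markup, out string text)
    {
        try
        {
            text = ExtractText(markup);
            return true;
        }
        catch (XmlException)
        {
            // Typical for real-world HTML: unclosed tags, bare ampersands, etc.
            text = null;
            return false;
        }
    }
}
```

For example, ExtractText("<html><body><p>Hello</p><p>World</p></body></html>") returns "HelloWorld" with no separator between the paragraphs, while TryExtractText("<p>Hello<br></p>", out text) returns false because the br tag is never closed.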

There you have it! Three simple ways to scrape only displayed text from web pages with no external “packages” involved.

Packages

I have recently used the WatiN web application testing package to get website text using C#. WatiN was not the easiest package to set up for website text retrieval from C#: it required references to the WatiN core DLL, Microsoft.mshtml, and System.Windows.Forms, plus several additional classes included in my project. However, I still think it is worth mentioning, because I like the results it produces. The package is stable and very simple to use once it is set up. In fact, the website text can be obtained using only 3 lines of code:

var browser = new MsHtmlBrowser();
browser.GoTo("www.YourURLHere.com");
commandLog.Text = browser.Text;

I have included a simple Visual Studio ASP.NET project for download here.
