Getting Only The Text Displayed On A Webpage Using C#

3 May 2013 | GPL3 | 2 min read
How to get only the text displayed on a webpage using C#

Introduction

After months of looking at various ways to get only the text displayed in a web browser using C#, I found that it all boils down to a few simple lines of code. I looked at several very robust open source .NET solutions such as the HTML Agility Pack and Majestic 12. However, for applications that only need tag-free, HTML-free text from a web page, these solutions seemed to be overkill, at least in my case.

Here are three very simple ways to get only the displayed text on a web page:

Method 1 – In Memory Cut and Paste

Use a WebBrowser control to load and render the web page, and then copy the text from the control by way of the clipboard. Note that this overwrites whatever is currently on the clipboard.

Use the following code to download the web page:

C#
// Requires: using System.Windows.Forms;
// Create the WebBrowser control
WebBrowser wb = new WebBrowser();
// Register a handler that processes the document once the download completes
wb.DocumentCompleted +=
    new WebBrowserDocumentCompletedEventHandler(DisplayText);
// Download the web page (urlPath holds the page address, including "http://")
wb.Navigate(urlPath);

Use the following event code to process the downloaded web page text:

C#
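private void DisplayText(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    WebBrowser wb = (WebBrowser)sender;
    // Select the whole rendered page and copy it to the clipboard
    wb.Document.ExecCommand("SelectAll", false, null);
    wb.Document.ExecCommand("Copy", false, null);
    // Read the plain text back from the clipboard and tidy it with the CleanText helper
    textResultsBox.Text = CleanText(Clipboard.GetText());
}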
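
The handler above calls CleanText, which is not listed in the article. The following is only a minimal sketch of what such a helper might do, assuming its job is to normalize line endings and collapse extra whitespace in the copied text. The class name TextCleaner and the exact cleanup rules are assumptions for illustration, not the author's implementation.

```csharp
using System;
using System.Text.RegularExpressions;

static class TextCleaner
{
    // Hypothetical stand-in for the article's CleanText helper (the original is not shown).
    public static string CleanText(string text)
    {
        if (string.IsNullOrEmpty(text)) return string.Empty;

        // Normalize Windows (\r\n) and lone \r line endings to \n
        text = text.Replace("\r\n", "\n").Replace('\r', '\n');
        // Collapse runs of spaces and tabs into a single space
        text = Regex.Replace(text, @"[ \t]+", " ");
        // Drop a single space on either side of a line break
        text = Regex.Replace(text, @" ?\n ?", "\n");
        // Collapse three or more line breaks into one blank line
        text = Regex.Replace(text, @"\n{3,}", "\n\n");
        // Remove leading and trailing whitespace
        return text.Trim();
    }

    static void Main()
    {
        string raw = "  Hello   world \r\n\r\n\r\n\r\nSecond\tline  ";
        Console.WriteLine(CleanText(raw));
    }
}
```

With the sample input, the two pieces of text come out on separate lines with a single blank line between them. Pages that rely on indentation or tabular spacing may need gentler rules than these.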

Method 2 – In Memory Selection Object

This is a second way to process the downloaded web page text. It seems to take slightly longer, although the difference is minimal. However, it avoids the clipboard and the limitations that come with it, such as overwriting whatever the user has copied.

C#
// Requires: using mshtml; (add a project reference to Microsoft.mshtml)
private void DisplayText(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    // Get the WebBrowser that raised the event and its underlying IHTMLDocument2
    WebBrowser wb = (WebBrowser)sender;
    IHTMLDocument2 htmlDocument =
        wb.Document.DomDocument as IHTMLDocument2;
    // Select all the text on the page and get the selection object
    wb.Document.ExecCommand("SelectAll", false, null);
    IHTMLSelectionObject currentSelection = htmlDocument.selection;
    // Create a text range and send the range's text to your text box
    IHTMLTxtRange range = currentSelection.createRange() as IHTMLTxtRange;
    if (range != null)
        textResultsBox.Text = range.text;
}

Method 3 – The Elegant, Simple, Slower XmlDocument Approach

A good friend shared this example with me. I am a huge fan of simple, and this example wins the simplicity contest hands down. It was unfortunately very slow compared to the other two approaches.

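The XmlDocument object will load and process a page with only three simple lines of code, provided the page is well-formed XML (for example, valid XHTML). Markup that is not well-formed makes Load throw an XmlException: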

C#
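// Requires: using System.Xml;
XmlDocument document = new XmlDocument();
// Load needs the full URL with its scheme; a bare "www..." is treated as a local file path
document.Load("http://www.yourwebsite.com");
// InnerText concatenates the text of every node in the document
string allText = document.InnerText;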
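
XmlDocument is an XML parser, so this approach only works when the page is well-formed XML. The sketch below illustrates that on two in-memory strings instead of a live URL. The class name XmlTextDemo, the helper TryGetInnerText, and both sample strings are made up for illustration.

```csharp
using System;
using System.Xml;

static class XmlTextDemo
{
    // Hypothetical helper: returns the concatenated text of a well-formed
    // XML/XHTML string, or null when the markup cannot be parsed as XML.
    public static string TryGetInnerText(string markup)
    {
        try
        {
            var document = new XmlDocument();
            document.LoadXml(markup);
            return document.InnerText;
        }
        catch (XmlException)
        {
            // Typical for real-world HTML: unclosed tags, unquoted attributes, etc.
            return null;
        }
    }

    static void Main()
    {
        // Every tag is closed, so this parses as XML
        string wellFormed =
            "<html><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>";
        // <br> and <p> are never closed, which is legal HTML but not XML
        string tagSoup =
            "<html><body><p>Line one<br>Line two</body></html>";

        Console.WriteLine(TryGetInnerText(wellFormed) ?? "(not well-formed XML)");
        Console.WriteLine(TryGetInnerText(tagSoup) ?? "(not well-formed XML)");
    }
}
```

Note that InnerText joins the text of adjacent elements without inserting any separator, so the heading and paragraph text in the first sample run together. Pages that fail to parse here are better handled by Method 1 or Method 2, which let the browser engine deal with the markup.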

There you have it! Three simple ways to scrape only displayed text from web pages with no external “packages” involved.

Packages

I have recently used the WatiN web application testing package to get website text using C#. WatiN was not the easiest package to set up for retrieving website text from C#: it required references to the WatiN core DLL, Microsoft.mshtml, and System.Windows.Forms, plus several additional classes included in my project. However, I still think it is worth mentioning, because I like the results it produces. The package is stable and very simple to use once it is set up. In fact, the website text can be obtained using only three lines of code:

C#
var browser = new MsHtmlBrowser();
browser.GoTo("www.YourURLHere.com");
commandLog.Text = browser.Text;

I have included a simple Visual Studio ASP.NET project for download here.

License

This article, along with any associated source code and files, is licensed under The GNU General Public License (GPLv3)


Written By
Student
United States
If you would like to know more about me, please feel free to visit my website at http://www.jakemdrew.com/

Thanks!

Jake Drew

Comments and Discussions

 
Question: Webpage or Web form?
Member 151322327-Apr-21 1:31
So were you using a web form in ASP.NET? I want to do something similar, but I want the user to be able to input something like, "The employee award ceremony is @ 1 PM today. Don't miss it!" and have it show on a page/form for a group of people.