C# Get Only Certain Text From Wiki Page?

Question

OK, so now that I have gotten my program to speak something without the UI freezing up (a big thanks to Marcin Kozub for that), I need to know how to retrieve the text from a Wikipedia article.
I have figured out how to do that (I just used a web browser control, got WebBrowserControl.Document.Body.InnerText and spoke that), but when I speak the text, I hear it read the little citation links, the navigation bar links and so on. So my question is: how do I remove all of that from the text I retrieve?
I don't want the little "citation needed" links, the navigation bar links, the bookmark links or the edit links to be read.
Here is what I have so far:

C#

// WIKI is the name of my web browser control
synth.SpeakAsync(WIKI.Document.Body.InnerText);

This just gets the straight text from the page, with all the links and everything.
Here is the wiki link I am trying to read:
http://en.wikipedia.org/wiki/Forza_Horizon_2
And yes, I love to play video games and I love to play the Forza games.
Oh, and how might one do a search on Wikipedia through C#?
The end goal is to be able to ask the program a question, or ask it for information on something, and have it read a Wikipedia article to me (kind of like the Star Trek computers).
Any help is very much appreciated.

Please and Thank you,
MasterCodeon

Posted 25-Nov-14 7:39am

MasterCodeon

Updated 25-Nov-14 8:56am

v2

Comments

[no name] 25-Nov-14 15:11pm

This is not easy to do and may require two or three for loops with regex (regular expressions) to filter out the text you want to take from the page. I think I did something like this a while back. If I still have the code, I will post it as a solution, but you will need to research the methods used in my code on MSDN.

BillWoodruff 25-Nov-14 16:15pm

Perhaps the HTML Agility Pack would help ?

A quick search reveals two CodeProject articles less than two years old on using the HTML Agility Pack:

https://www.google.com/search?&q=html%20agility%20pack

[no name] 25-Nov-14 17:17pm

Yes, actually that would be helpful Bill. If you post the link as an alternative solution, I'll give you 5.

BillWoodruff 25-Nov-14 17:58pm

I do not (usually) post a links-only "solution" when I have not personally used the software I link to, and I'm not desperate for reputation here, as some are :)

But, I have added a link to a search to my comment, above. Cheers, Bill

[no name] 25-Nov-14 18:25pm

No problem. :) I'm here more so to help than to score points; the points just show participation in my view which is a good thing. No big deal really, but the link suggestion you made would be supportive to the OP. ;)

BillWoodruff 25-Nov-14 18:59pm

Hopefully, the OP will read the comment, and see the link, and then ...

[no name] 25-Nov-14 20:18pm

I put your link in the end of the solution.

Time for a sleep, have a good night Bill.

[no name] 26-Nov-14 10:01am

Does the wiki API not help in a more flexible way for your needs?

http://www.mediawiki.org/wiki/API:Main_page

Maybe see also
http://stackoverflow.com/questions/627594/is-there-a-wikipedia-api

3 solutions

Solution 1

You can try this code, quickly converted to C#, which fills in a form and invokes a button click:

C#

// Only proceed once the page has finished loading and is the page we expect.
if (WebBrowser.ReadyState == WebBrowserReadyState.Complete
	&& WebBrowser.Url.ToString() == "http://www.Yoursite.com/"
	&& WebBrowser.Document.Body.InnerText.Contains("Wiki")) {
	// Copy the textbox text into the page's input field whose HTML id is "username".
	WebBrowser.Document.GetElementById("username").SetAttribute("value", textbox1.Text);
	// Copy the textbox text into the page's input field whose HTML id is "password".
	WebBrowser.Document.GetElementById("password").SetAttribute("value", textbox2.Text);
	// Invoke a click on the button whose HTML id is "Submit".
	// You will need to change these ids to reflect the Wikipedia source code.
	WebBrowser.Document.GetElementById("Submit").InvokeMember("click");
}

You will need to look at the Wikipedia source code (HTML) and see which div tags have IDs and class names, e.g.

HTML

<div id="SomeId" class="SomeClass">Text you want</div>

It's those class names and IDs you need to loop through in WebBrowser.Document.Body.InnerHtml to get the text you need, by checking the class name of each tag.

You can use GetElementsByTagName("a") to retrieve the HTML element collection of links. That call retrieves links because it looks for tags like

HTML

<a href="#">Text to get</a>

where 'a' is the tag and 'href' is an attribute holding the link target (it is not a class name).

If you want to loop through div tags, just change 'a' to 'div', and change the attribute you check (for example the class name) to suit the tag you want to get.

You then cast each item of that collection to HtmlElement as you iterate through it and test an attribute of the 'a' tag, for example MainElement.GetAttribute("href").StartsWith("http://"), which selects the links that match.

From there, you can use an if statement to check that the .InnerText of the HTML element is not null, and assign the element to a declared variable where you can do as you please with the returned result.

I only have the code written in VB.NET and don't have time to convert it all, so you will need to convert it to C#. I have provided the pointers you need above, and you can try converting some of the code with the Telerik code converter, but my guess is you may need to change some of it manually yourself.

But I hope this quick post will give you a general insight how to approach this.

VB

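' Find the first link whose href starts with "http://" and read its text.
Dim MyString As String = Nothing
Dim myElement = (From MainElement As HtmlElement In WebBrowser.Document.GetElementsByTagName("a").Cast(Of HtmlElement)()
                 Where MainElement.GetAttribute("href").StartsWith("http://")
                 Select MainElement).FirstOrDefault()
If myElement IsNot Nothing Then
    ' Copy the link text into the variable (the original assignment was reversed).
    MyString = myElement.InnerText
End If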

Hope it helps.

Edit:

Link worth checking which might also be helpful to you with this solution.

Agility Pack Recommended by BillWoodruff

Posted 25-Nov-14 11:11am

Sheepings

Updated 25-Nov-14 14:17pm

v6

Comments

BillWoodruff 25-Nov-14 18:00pm

Looks like an interesting effort on your part. What was the C# code you show converted from ? Always a good idea to list your sources.

[no name] 25-Nov-14 18:22pm

Thanks Bill. I only had the original code in VB.NET from a project I did a while back, so I didn't want to post that code since the OP wanted it in C#. But since I've limited time, I ran it through the Telerik code converter linked in the solution above. There is enough coverage there to get started in the right direction, I think.


This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Marcin Kozub · Accepted Answer · 2014-11-25T20:14:00

Hi,

I think you can use the Wikipedia API to get what you want.
The link below returns an XML file with the content of the desired page:
http://en.wikipedia.org/w/api.php?format=xml&action=query&prop=extracts&titles=Forza%20Horizon%202&redirects=true

Notice that the content in the output XML has standard HTML tags inside, like 'p', 'i', 'b' and 'h2'. You can parse this content and take some action on specific tags, e.g. make a longer pause at an 'h2'.
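
As a rough illustration of acting on specific tags, the sketch below splits the extract HTML into one plain-text chunk per 'h2' section, so that each chunk could be spoken separately with a pause in between. The class and method names (ExtractSections, SplitAtHeadings) are hypothetical, and the list of tags treated as block-level is an assumption; regex-based tag stripping is only a rough approach and an HTML parser would be more robust.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class ExtractSections
{
    // Splits extract HTML at each <h2> and returns one plain-text chunk per section.
    public static List<string> SplitAtHeadings(string extractHtml)
    {
        var chunks = new List<string>();
        // Zero-width split: every chunk after the first starts with its <h2> heading.
        foreach (string part in Regex.Split(extractHtml, "(?=<h2[^>]*>)", RegexOptions.IgnoreCase))
        {
            // Block-level tags become a space so neighbouring words do not run together.
            string text = Regex.Replace(part, @"</?(?:p|h[1-6]|li|ul|ol|div)\b[^>]*>", " ", RegexOptions.IgnoreCase);
            // Remaining (inline) tags such as <b> and <i> are simply removed.
            text = Regex.Replace(text, "<[^>]*>", string.Empty);
            // Turn entities such as &amp; back into characters, then tidy whitespace.
            text = WebUtility.HtmlDecode(text);
            text = Regex.Replace(text, @"\s+", " ").Trim();
            if (text.Length > 0)
            {
                chunks.Add(text);
            }
        }
        return chunks;
    }
}
```

Each returned chunk could then be passed to the synthesizer in turn, which gives a natural place to insert a pause between sections.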

I have never used it before and didn't test it, but I found a similar question on StackOverflow:
http://stackoverflow.com/questions/1625162/get-text-content-from-mediawiki-page-via-api

And of course the Wikipedia API link:
http://www.mediawiki.org/wiki/API:Main_page

[Update 1]
For this solution you don't need a WebBrowser to get the XML (unless you need to display the wiki page). Simply use this code:

C#

var webClient = new WebClient();
var pageSourceCode = webClient.DownloadString("place_url_here");

Then use XmlDocument to access the nodes. There are many examples on codeproject.com :)
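
A minimal sketch of those two steps, building the request URL and reading the node with XmlDocument, might look like the following. The class and method names (WikiExtract, BuildExtractUrl, ReadExtract) are made up for illustration, and it assumes the article body sits in an element named extract, as the URL above suggests; nothing here has been checked against every possible API response.

```csharp
using System;
using System.Xml;

// Hypothetical helper, not part of any library.
public static class WikiExtract
{
    // Builds the API URL for a page title, escaping spaces and other special characters.
    public static string BuildExtractUrl(string title)
    {
        return "http://en.wikipedia.org/w/api.php?format=xml&action=query&prop=extracts&redirects=true&titles="
            + Uri.EscapeDataString(title);
    }

    // Returns the text of the first <extract> element, or null if there is none.
    public static string ReadExtract(string apiXml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(apiXml);
        XmlNode node = doc.SelectSingleNode("//extract");
        return node == null ? null : node.InnerText;
    }
}
```

The string returned by webClient.DownloadString(WikiExtract.BuildExtractUrl("Forza Horizon 2")) could be passed straight to ReadExtract; the result still contains HTML tags.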

[Update 2]
You've mentioned that you want to search the Wiki for some topics. I did some research for you, and using the Wiki API for search is fairly easy. You just need to call the API URL:
http://en.wikipedia.org/w/api.php?action=opensearch&search=forza%20horizon

It will return results in a format like this:
["forza horizon",["Forza Horizon","Forza Horizon 2"]]

This means there are two results for the search phrase. Next you can display the results and let the user specify which result he/she wants to open. Finally, retrieve the Wiki content for that result and use Speech Synthesis to output the content to the user :)
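
One way to pull the titles out of that response is sketched below. It assumes the System.Text.Json library is available (built into .NET Core 3.0 and later, a separate package on .NET Framework), which is not something the original answer used, and the class and method names (OpenSearch, ParseTitles) are hypothetical. It only relies on the shape shown above: a JSON array whose second item is the array of titles.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical helper, not part of any library.
public static class OpenSearch
{
    // Returns the titles from an opensearch response, or an empty list
    // if the response does not have the expected shape.
    public static List<string> ParseTitles(string json)
    {
        var titles = new List<string>();
        using (JsonDocument doc = JsonDocument.Parse(json))
        {
            JsonElement root = doc.RootElement;
            // Expect [ searchPhrase, [ title, title, ... ], ... ]
            if (root.ValueKind != JsonValueKind.Array || root.GetArrayLength() < 2)
            {
                return titles;
            }
            JsonElement results = root[1];
            if (results.ValueKind != JsonValueKind.Array)
            {
                return titles;
            }
            foreach (JsonElement item in results.EnumerateArray())
            {
                titles.Add(item.GetString());
            }
        }
        return titles;
    }
}
```

The returned list can be shown to the user as-is, and the chosen title used to request the page content.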

[Update 3 - Final I think ;)]
Wikipedia is multilingual, so you can do the same thing in your app. To get data in a specific language, change the short language code at the beginning of the link, e.g.:
http://pl.wikipedia.org/w/api.php?action=opensearch&search=forza%20horizon

You can change the voice in your synthesizer too. Iterate through the voices installed on your computer to get information about them and display the available languages. There is a Culture property on VoiceInfo. The sample app of mine that you downloaded contains everything you need to do that.

Cheers!

===EDIT: Fixed Broken Link===
CodingK

===EDIT: Fixed Broken Link ===
Marcin Kozub

MasterCodeon · Accepted Answer · 2014-11-26T03:52:00

OK, so here is the code I came up with from Marcin Kozub's solution:

C#

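// Requires: using System; using System.Net; using System.Text.RegularExpressions; using System.Xml;
var webClient = new WebClient();
// Escape the title so spaces and other special characters are valid in the URL.
var pageSourceCode = webClient.DownloadString("http://en.wikipedia.org/w/api.php?format=xml&action=query&prop=extracts&titles=" + Uri.EscapeDataString("Forza Horizon 2") + "&redirects=true");

XmlDocument doc = new XmlDocument();
doc.LoadXml(pageSourceCode);

// The <extract> node holds the article body as HTML text.
var fnode = doc.GetElementsByTagName("extract")[0];
string ss = fnode.InnerText;

// Strip the HTML tags to get plain text as the output.
// (Passing the text through String.Format, as before, throws if it contains braces.)
Regex regex = new Regex("<[^>]*>");
string result = regex.Replace(ss, String.Empty);

TextBox.Text += result;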

I was able to get the XML node I wanted by using (and modifying) the code from this answer:
store specific nodes from xml
Thanks for everyone's help!
