OK, so now that I have gotten my program to speak without the UI freezing up (a big thanks to Marcin Kozub for that), I need to know how to retrieve the text of a Wikipedia article.
I have figured out how to get the text (I used a WebBrowser control, took WebBrowserControl.Document.Body.InnerText and spoke that), but when it is spoken I also hear the little citation links, the navigation bar links and so on. So my question is: how do I remove all of that from the text I retrieve?
I don't want the "citation needed" links, the navigation bar links, the bookmark links or the edit links to be read.
Here is what I have so far:

C#
// WIKI is the name of my web browser control
synth.SpeakAsync(WIKI.Document.Body.InnerText);

This just gets the raw text of the page, links and all.
Here is the wiki link I am trying to read:
http://en.wikipedia.org/wiki/Forza_Horizon_2[^]
And yes, I love video games, especially the Forza games.
Oh, and how might one search Wikipedia from C#?
The end goal is to be able to ask the program a question, or ask it for information on something, and have it read a Wikipedia article to me (kind of like the Star Trek computers).
Any help is very much appreciated.

Please and Thank you,
MasterCodeon
Updated 25-Nov-14 8:56am
Comments
[no name] 25-Nov-14 15:11pm    
This is not easy to do and may require two or three for loops with regex (regular expressions) to filter out the text you want to take from the page. I think I did something like this a while back. If I still have the code, I will post it as a solution, but you will need to look up the methods used in my code on MSDN.
BillWoodruff 25-Nov-14 16:15pm    
Perhaps the HTML Agility Pack would help ?

A quick search reveals two CodeProject articles less than two years old on using the HTML Agility Pack:

https://www.google.com/search?&q=html%20agility%20pack
[no name] 25-Nov-14 17:17pm    
Yes, actually that would be helpful Bill. If you post the link as an alternative solution, I'll give you 5.
BillWoodruff 25-Nov-14 17:58pm    
I do not (usually) post a links-only "solution" when I have not personally used the software I link to, and I'm not desperate for reputation here, as some are :)

But, I have added a link to a search to my comment, above. Cheers, Bill
[no name] 25-Nov-14 18:25pm    
No problem. :) I'm here more so to help than to score points; the points just show participation in my view which is a good thing. No big deal really, but the link suggestion you made would be supportive to the OP. ;)

You can try this code, which I quickly converted to C#, to fill in a form and invoke a button click:
C#
// Only touch the page once it has finished loading and is the page we expect.
if (WebBrowser.ReadyState == WebBrowserReadyState.Complete
    && WebBrowser.Url.ToString() == "http://www.Yoursite.com/"
    && WebBrowser.Document.Body.InnerText.Contains("Wiki"))
{
    // Copy textbox1's text into the page's input field whose HTML id is "username".
    WebBrowser.Document.GetElementById("username").SetAttribute("Value", textbox1.Text);
    // Copy textbox2's text into the page's input field whose HTML id is "password".
    WebBrowser.Document.GetElementById("password").SetAttribute("Value", textbox2.Text);
    // Click the button whose HTML id is "Submit". Change these ids to match the Wikipedia page source.
    WebBrowser.Document.GetElementById("Submit").InvokeMember("click");
}


You will need to look at the Wikipedia page source (HTML) and see which div tags have IDs and class names, e.g.
HTML
<div id="SomeId" class="SomeClass">Text you want</div>


It's those class names and IDs you need to look for in WebBrowser.Document.Body.InnerHtml: iterate over the elements with that tag and check their class names to find the text you need.

You can use GetElementsByTagName("a") to retrieve the HTML element collection of links. This retrieves links because it looks for tags like
HTML
<a href="#">Text to get</a>
where 'a' is the tag and 'href' is one of its attributes (the link target, not a class name).

If you want to loop through div tags, just change 'a' to 'div', and change the attribute you check to suit the tag you want to get.

You then need to cast that collection as you iterate through it and check the 'href' attribute of each 'a' tag using MainElement.GetAttribute("href") == "http://" (put the address you are looking for in place of "http://"), which returns the links that match.

From there, you can use an if statement to check that the element's .InnerText is not null, and store it in a declared variable where you can do as you please with the result.

I only have the code written in VB.NET and don't have time to convert it all, so you will need to convert it to C# yourself. I have provided the links you need above, and you can try converting some of the code with the Telerik code converter, but my guess is you will need to change some of it manually.

But I hope this quick post gives you a general idea of how to approach this.

VB
Dim MyString As String = Nothing
' Find the first link whose href matches (replace "http://" with the address you want).
Dim myElement = (From MainElement As HtmlElement In WebBrowser.Document.GetElementsByTagName("a").Cast(Of HtmlElement)()
                 Where MainElement.GetAttribute("href") = "http://"
                 Select MainElement).FirstOrDefault()
If myElement IsNot Nothing Then
    ' Store the link's visible text in the variable.
    MyString = myElement.InnerText
End If


Hope it helps.

Edit:

Link worth checking which might also be helpful to you with this solution.

Agility Pack Recommended by BillWoodruff
 
Comments
BillWoodruff 25-Nov-14 18:00pm    
Looks like an interesting effort on your part. What was the C# code you show converted from ? Always a good idea to list your sources.
[no name] 25-Nov-14 18:22pm    
Thanks Bill. I only had the original code in VB.NET from a project I did a while back, and I didn't want to post only that since the OP wanted C#. As I have limited time, I ran it through the Telerik code converter linked in the solution above. But there is enough coverage there to get started in the right direction, I think.
Hi,

I think that you can use the Wikipedia API to get what you want.
The link below returns an XML file with the content of the desired page:
http://en.wikipedia.org/w/api.php?format=xml&action=query&prop=extracts&titles=Forza%20Horizon%202&redirects=true[^]

Notice that the content in the output XML contains standard HTML tags such as 'p', 'i', 'b' and 'h2'. You can parse this content and act on specific tags, for example making a longer pause at each 'h2'.
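
One way to act on specific tags is to split the extract into chunks and attach a pause length to each one. The following is only a rough sketch: the names SpeechSegment and ExtractSegmenter and the pause lengths are made up for illustration, and it assumes the extract consists of flat 'p', 'h2' and 'h3' blocks, which a regex can handle (nested or irregular markup would need a real HTML parser).

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

// One chunk of text to speak, plus a pause (in milliseconds) to insert before it.
// Hypothetical helper type, not part of any library.
public sealed class SpeechSegment
{
    public string Text { get; }
    public int PauseBeforeMs { get; }

    public SpeechSegment(string text, int pauseBeforeMs)
    {
        Text = text;
        PauseBeforeMs = pauseBeforeMs;
    }
}

public static class ExtractSegmenter
{
    // Matches a whole <p>, <h2> or <h3> element; \1 pairs the closing tag with the opening one.
    private static readonly Regex Block = new Regex(
        @"<(p|h2|h3)\b[^>]*>(.*?)</\1>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Matches any remaining inline tag such as <i> or <b>.
    private static readonly Regex AnyTag = new Regex("<[^>]+>");

    // Pause lengths are arbitrary defaults chosen for this sketch.
    public static List<SpeechSegment> Split(string extractHtml, int headingPauseMs = 800, int paragraphPauseMs = 300)
    {
        var segments = new List<SpeechSegment>();
        foreach (Match m in Block.Matches(extractHtml))
        {
            bool isHeading = m.Groups[1].Value.StartsWith("h", StringComparison.OrdinalIgnoreCase);
            // Strip inline tags, then decode entities such as &amp;.
            string text = WebUtility.HtmlDecode(AnyTag.Replace(m.Groups[2].Value, "")).Trim();
            if (text.Length == 0) continue;
            segments.Add(new SpeechSegment(text, isHeading ? headingPauseMs : paragraphPauseMs));
        }
        return segments;
    }
}
```

Each segment could then be spoken in turn, waiting PauseBeforeMs before it; if you build the speech with a PromptBuilder, its AppendBreak method is another way to insert the pause.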

I have never used it and didn't test it, but I found a similar question on StackOverflow:
http://stackoverflow.com/questions/1625162/get-text-content-from-mediawiki-page-via-api[^]

And of course the Wikipedia API link:
http://www.mediawiki.org/wiki/API:Main_page[^]

[Update 1]
For this solution you don't need a WebBrowser to get the XML (unless you also need to display the wiki page). Simply use this code:
C#
var webClient = new WebClient();
var pageSourceCode = webClient.DownloadString("place_url_here");


Then use XmlDocument to access the nodes. There are many examples on codeproject.com :)
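
As a minimal sketch of that step, the helper below loads the downloaded string into an XmlDocument and returns the text of the first 'extract' element. The class and method names are made up for illustration, and the sample XML in the usage note is a hand-written approximation of the response, not a captured one.

```csharp
using System;
using System.Xml;

public static class ExtractXml
{
    // Returns the inner text of the first <extract> element, or null if there is none.
    public static string GetExtract(string apiResponseXml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(apiResponseXml);

        // "//extract" finds the element wherever it sits in the response tree.
        XmlNode node = doc.SelectSingleNode("//extract");
        return node == null ? null : node.InnerText;
    }
}
```

For example, calling ExtractXml.GetExtract on a response shaped like `<api><query><pages><page><extract>&lt;p&gt;Hello&lt;/p&gt;</extract></page></pages></query></api>` gives back the HTML string `<p>Hello</p>`, which still needs its tags stripped before it is spoken.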

[Update 2]
You've mentioned that you want to search the wiki for topics. I did some research for you, and using the Wiki API for search is fairly easy. You just need to call this API URL:
http://en.wikipedia.org/w/api.php?action=opensearch&search=forza%20horizon[^]

It will return results in a format like this:
["forza horizon",["Forza Horizon","Forza Horizon 2"]]

This means the search phrase has two results. Next you can display the results and let the user pick which one to open. Finally, retrieve the wiki content for that result and use speech synthesis to read it to the user :)
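
Here is a rough sketch of reading the titles out of such a result. It assumes System.Text.Json is available (it ships with .NET Core 3.0 and later; on the older .NET Framework you would need the NuGet package or another JSON parser), and the class and method names are made up for illustration. It reads only the second element of the outer array, so it should not matter if the live response carries extra arrays after the titles.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

public static class OpenSearch
{
    // Parses an opensearch result such as
    // ["forza horizon",["Forza Horizon","Forza Horizon 2"]]
    // and returns the titles held in the second array element.
    public static List<string> ParseTitles(string json)
    {
        var titles = new List<string>();
        using (JsonDocument doc = JsonDocument.Parse(json))
        {
            JsonElement root = doc.RootElement;
            // Anything that is not an array with at least two elements yields no titles.
            if (root.ValueKind != JsonValueKind.Array || root.GetArrayLength() < 2)
                return titles;

            JsonElement second = root[1];
            if (second.ValueKind != JsonValueKind.Array)
                return titles;

            foreach (JsonElement item in second.EnumerateArray())
            {
                if (item.ValueKind == JsonValueKind.String)
                    titles.Add(item.GetString());
            }
        }
        return titles;
    }
}
```

The returned list can be shown to the user as-is, and the chosen title passed on to the extract query.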

[Update 3 - Final I think ;)]
Wikipedia is multilingual, so your app can be too. To get data in a specific language, change the short language code at the beginning of the link, e.g.:
http://pl.wikipedia.org/w/api.php?action=opensearch&search=forza%20horizon[^]
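
A small sketch of building these links in code, so that the language code and the phrase are both parameters. The class and method names are made up for illustration; the URL patterns are the ones shown in this answer, and Uri.EscapeDataString takes care of spaces and other special characters in the phrase.

```csharp
using System;

public static class WikiUrls
{
    // Builds the opensearch URL for a language code such as "en" or "pl".
    public static string BuildSearchUrl(string languageCode, string searchPhrase)
    {
        return "http://" + languageCode + ".wikipedia.org/w/api.php?action=opensearch&search="
             + Uri.EscapeDataString(searchPhrase);
    }

    // Builds the URL that returns the article extract as XML.
    public static string BuildExtractUrl(string languageCode, string title)
    {
        return "http://" + languageCode + ".wikipedia.org/w/api.php?format=xml&action=query&prop=extracts&titles="
             + Uri.EscapeDataString(title) + "&redirects=true";
    }
}
```

For instance, WikiUrls.BuildSearchUrl("pl", "forza horizon") produces the Polish link shown above.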

You can change the voice in your synthesizer too. Iterate through the voices installed on your computer to get information about them and display the available languages; VoiceInfo has a Culture property for this. The sample app you've downloaded contains everything you need to do that.

Cheers!

===EDIT: Fixed Broken Link===
CodingK

===EDIT: Fixed Broken Link ===
Marcin Kozub
 
Comments
[no name] 26-Nov-14 9:26am    
Good coverage, good solution, my 5 --Your last link was totally broken. I tried to recover it, so just check to see that it returns what you wanted it to.
Marcin Kozub 26-Nov-14 9:29am    
Thx CodingK :)
[no name] 26-Nov-14 9:31am    
You're welcome.
Marcin Kozub 26-Nov-14 9:31am    
And now you broke my link ;P Everything worked fine earlier...
[no name] 26-Nov-14 9:40am    
:p Yea blame me. I put my hands up. =)
OK, so here is the code I came up with from Marcin Kozub's solution:

C#
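// Requires: using System; using System.Net; using System.Text.RegularExpressions; using System.Xml;
string title = "Forza Horizon 2";
// Escape the title so spaces and special characters are valid in the URL.
string url = "http://en.wikipedia.org/w/api.php?format=xml&action=query&prop=extracts&titles="
             + Uri.EscapeDataString(title) + "&redirects=true";

string pageSourceCode;
using (var webClient = new WebClient())
{
    pageSourceCode = webClient.DownloadString(url);
}

XmlDocument doc = new XmlDocument();
doc.LoadXml(pageSourceCode);

// The <extract> node holds the article text as HTML.
XmlNode fnode = doc.GetElementsByTagName("extract")[0];
string ss = fnode != null ? fnode.InnerText : String.Empty;

// Strip the HTML tags to get plain text, then decode entities such as &amp;.
Regex regex = new Regex("<[^>]*>");
string result = WebUtility.HtmlDecode(regex.Replace(ss, String.Empty));

TextBox.Text += result;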

I was able to get the XML node I wanted by using (and modifying) the code from this answer:
store specific nodes from xml[^]
Thanks for everyone's help!
 

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


