Introduction
In this article I show some techniques for manipulating XML elements by
treating their values as strings.
Background
One day a friend stopped by and asked me if I knew anything about the
dictionary that came on his iPad. He told me that it gave example sentences
that didn't match the headword. I had spent some time working my way through
the WordNet database files, so I said, "I know exactly what happened!" And so
will you: but first, here's the background.
WordNet is a lexical project run
by Princeton University. They maintain a very large lexical database, which is
available for download. You will need this database if you want to
run the project. If you want to know more about WordNet,
please check their site. The key to the iPad dictionary with the wrong
example sentences is in the structure of the WordNet XML
database.
Structure
The main relation among words in WordNet is
synonymy, as between the words shut and close or car and automobile.
Synonyms--words that denote the same concept and are interchangeable in many
contexts--are grouped into unordered sets (synsets). Each of WordNet’s 117 000
synsets is linked to other synsets by means of a small number of “conceptual
relations.” Additionally, a synset contains a brief definition (“gloss”) and, in
most cases, one or more short sentences illustrating the use of the synset
members. Word forms with several distinct meanings are represented in as many
distinct synsets. Thus, each form-meaning pair in WordNet is unique.
This project is released under CPOL. However, WordNet has its own licence and the terms of its use should be understood
and honored.
Following is an excerpt of just one element in the WordNet database. It
includes all synset data described above under "structure".
<synset pos="r" ofs="00516492" id="r00516492">
<terms><term>wrongfully</term></terms>
<keys><sk>wrongfully%4:02:00::</sk></keys>
<gloss desc="orig">
<orig>in an unjust or unfair manner; "the employee claimed that she was
wrongfully dismissed"; "people who were wrongfully imprisoned should be
released"</orig>
</gloss>
<gloss desc="text">
<text>in an unjust or unfair manner ; “ the employee claimed that she was
wrongfully dismissed ” ; “ people who were wrongfully imprisoned should be
released ”</text>
</gloss>
<gloss desc="wsd">
<def id="r00516492_d">
<wf pos="IN" id="r00516492_wf1" tag="ignore" lemma="in">in</wf>
<wf pos="DT" id="r00516492_wf2" tag="ignore" lemma="an">an</wf>
<wf pos="JJ" id="r00516492_wf3" tag="man" lemma="unjust%3">
<id id="r00516492_id.6" lemma="unjust" sk="unjust%3:00:02::"/>
<id id="r00516492_id.5" lemma="unjust" sk="unjust%3:00:04::"/>
<id id="r00516492_id.4" lemma="unjust" sk="unjust%3:00:00::"/>unjust</wf>
<wf pos="CC" id="r00516492_wf4" tag="ignore" lemma="or">or</wf>
<wf pos="JJ" id="r00516492_wf5" tag="man" lemma="unfair%3">
<id id="r00516492_id.8" lemma="unfair" sk="unfair%3:00:00::"/>unfair</wf>
<wf pos="NN" id="r00516492_wf6" tag="man" lemma="manner%1" sep="">
<id id="r00516492_id.7" lemma="manner" sk="manner%1:07:02::"/>manner</wf>
<wf pos=":" id="r00516492_wf7" tag="ignore" type="punc">;</wf>
</def><ex id="r00516492_ex1"><qf rend="dq">
<wf id="r00516492_wf8" tag="ignore" lemma="the">the</wf>
<wf id="r00516492_wf9" tag="un" lemma="employee%1">employee</wf>
<wf id="r00516492_wf10" tag="un" lemma="claim%2">claimed</wf>
<wf id="r00516492_wf11" tag="ignore" lemma="that">that</wf>
<wf id="r00516492_wf12" tag="ignore" lemma="she">she</wf>
<wf id="r00516492_wf13" tag="un" lemma="be%2">was</wf>
<wf id="r00516492_wf14" tag="auto" lemma="wrongfully%4">
<id id="r00516492_id.2" lemma="wrongfully" sk="wrongfully%4:02:00::"/>wrongfully</wf>
<wf id="r00516492_wf15" tag="un" lemma="dismiss%2|dismissed%3" sep="">dismissed</wf>
</qf>
<wf id="r00516492_wf16" tag="ignore" type="punc">;</wf>
</ex><ex id="r00516492_ex2"><qf rend="dq">
<wf id="r00516492_wf17" tag="un" lemma="people%1|people%2">people</wf>
<wf id="r00516492_wf18" tag="ignore" lemma="who">who</wf>
<wf id="r00516492_wf19" tag="un" lemma="be%2">were</wf>
<wf id="r00516492_wf20" tag="auto" lemma="wrongfully%4">
<id id="r00516492_id.3" lemma="wrongfully" sk="wrongfully%4:02:00::"/>wrongfully</wf>
<wf id="r00516492_wf21" tag="un" lemma="imprison%2|imprisoned%3">imprisoned</wf>
<wf id="r00516492_wf22" tag="ignore" lemma="should">should</wf>
<wf id="r00516492_wf23" tag="un" lemma="be%2">be</wf>
<wf id="r00516492_wf24" tag="un" lemma="release%2" sep="">released</wf>
</qf>
<wf id="r00516492_wf25" tag="ignore" type="punc">;</wf>
</ex>
</gloss>
</synset>
But I just want a dictionary, not all these fancy cross-referenced elements. Extracting
just what I want from this example element programmatically would produce
a dictionary entry something like the following:
"Wrongfully: in an unjust or unfair manner; "the employee claimed that she
was wrongfully dismissed"; "people who were wrongfully imprisoned should be
released""
Nothing wrong with that. But wait, what about my friend's iPad? OK, let's
look at another synset element extracted from the XML files. As you can see
from the structure, the element <terms> has 3
<term> elements in it:
<synset id="v00384055" ofs="00384055" pos="v">
<terms>
<term>metamorphose</term>
<term>transfigure</term>
<term>transmogrify</term>
</terms>
<keys>
<sk>metamorphose%2:30:00::</sk>
<sk>transfigure%2:30:00::</sk>
<sk>transmogrify%2:30:00::</sk>
</keys>
<gloss desc="orig">
<orig>change completely the nature or appearance of; "In
Kafka's story, a person metamorphoses into a bug"; "The treatment and diet
transfigured her into a beautiful young woman"; "Jesus was transfigured after
his resurrection"</orig>
</gloss>
<gloss desc="text">
<text>change completely the nature or appearance of ; “ In
Kafka's story , a person metamorphoses into a bug ” ; “ The treatment and
diet transfigured her into a beautiful young woman ” ; “ Jesus was
transfigured after his resurrection ”</text>
</gloss>
<gloss desc="wsd">
<def id="v00384055_d">
<wf id="v00384055_wf1" lemma="change%1|change%2" pos="VB" tag="man">
<id id="v00384055_id.5" lemma="change" sk="change%2:30:01::"/>change</wf>
<wf id="v00384055_wf2" lemma="completely%4" pos="RB" tag="un">completely</wf>
<wf id="v00384055_wf3" lemma="the" pos="DT" tag="ignore">the</wf>
<wf id="v00384055_wf4" lemma="nature%1" pos="NN" tag="un">nature</wf>
<wf id="v00384055_wf5" lemma="or" pos="CC" tag="ignore">or</wf>
<wf id="v00384055_wf6" lemma="appearance%1" pos="NN" tag="man">
<id id="v00384055_id.4" lemma="appearance" sk="appearance%1:07:00::"/>appearance</wf>
<wf id="v00384055_wf7" lemma="of" pos="IN" sep="" tag="ignore">of</wf>
<wf id="v00384055_wf8" pos=":" tag="ignore" type="punc">;</wf>
</def>
<ex id="v00384055_ex1">
<qf rend="dq">
<wf id="v00384055_wf9" lemma="in" tag="ignore">In</wf>
<wf id="v00384055_wf10" lemma="Kafka%1" tag="un">Kafka's</wf>
<wf id="v00384055_wf11" lemma="story%1" sep="" tag="un">story</wf>
<wf id="v00384055_wf12" tag="ignore" type="punc">,</wf>
<wf id="v00384055_wf13" lemma="a" tag="ignore">a</wf>
<wf id="v00384055_wf14" lemma="person%1" tag="un">person</wf>
<wf id="v00384055_wf15" lemma="metamorphosis%1|metamorphose%2" tag="auto">
<id id="v00384055_id.1" lemma="metamorphose" sk="metamorphose%2:30:00::"/>metamorphoses</wf>
<wf id="v00384055_wf16" lemma="into" tag="ignore">into</wf>
<wf id="v00384055_wf17" lemma="a" tag="ignore">a</wf>
<wf id="v00384055_wf18" lemma="bug%1|bug%2" sep="" tag="un">bug</wf>
</qf>
<wf id="v00384055_wf19" tag="ignore" type="punc">;</wf>
</ex>
<ex id="v00384055_ex2">
<qf rend="dq">
<wf id="v00384055_wf20" lemma="the" tag="ignore">The</wf>
<wf id="v00384055_wf21" lemma="treatment%1" tag="un">treatment</wf>
<wf id="v00384055_wf22" lemma="and" tag="ignore">and</wf>
<wf id="v00384055_wf23" lemma="diet%1|diet%2" tag="un">diet</wf>
<wf id="v00384055_wf24" lemma="transfigure%2" tag="auto">
<id id="v00384055_id.2" lemma="transfigure" sk="transfigure%2:30:00::"/>transfigured</wf>
<wf id="v00384055_wf25" lemma="her" tag="ignore">her</wf>
<wf id="v00384055_wf26" lemma="into" tag="ignore">into</wf>
<wf id="v00384055_wf27" lemma="a" tag="ignore">a</wf>
<wf id="v00384055_wf28" lemma="beautiful%3" tag="un">beautiful</wf>
<wf id="v00384055_wf29" lemma="young%1|young%3" tag="un">young</wf>
<wf id="v00384055_wf30" lemma="woman%1" sep="" tag="un">woman</wf>
</qf>
<wf id="v00384055_wf31" tag="ignore" type="punc">;</wf>
</ex>
<ex id="v00384055_ex3">
<qf rend="dq">
<wf id="v00384055_wf32" lemma="Jesus%1" tag="un">Jesus</wf>
<wf id="v00384055_wf33" lemma="be%2" tag="un">was</wf>
<wf id="v00384055_wf34" lemma="transfigure%2" tag="auto">
<id id="v00384055_id.3" lemma="transfigure" sk="transfigure%2:30:00::"/>transfigured</wf>
<wf id="v00384055_wf35" lemma="after%3|after%4" tag="un">after</wf>
<wf id="v00384055_wf36" lemma="his" tag="ignore">his</wf>
<wf id="v00384055_wf37" lemma="resurrection%1" sep="" tag="un">resurrection</wf>
</qf>
<wf id="v00384055_wf38" tag="ignore" type="punc">;</wf>
</ex>
</gloss>
</synset>
And if I were going to programmatically
extract dictionary entries out of this one, the final text would look
something like:
metamorphose: change completely the nature or appearance of; "In
Kafka's story, a person metamorphoses into a bug"; "The treatment and diet
transfigured her into a beautiful young woman"; "Jesus was transfigured after
his resurrection"
transfigure: change completely the nature
or appearance of; "In Kafka's story, a person metamorphoses into a bug"; "The
treatment and diet transfigured her into a beautiful young woman"; "Jesus was
transfigured after his resurrection"
transmogrify: change
completely the nature or appearance of; "In Kafka's story, a person
metamorphoses into a bug"; "The treatment and diet transfigured her into a
beautiful young woman"; "Jesus was transfigured after his resurrection"
The first two almost work, as each has at least one example that matches
its headword, but the entry for "transmogrify" carries three example
sentences that don't match it. It is not even safe to swap the headword in for
the matching <term> in an example sentence. Without
visually inspecting each one, you might create example sentences like "Jesus
was transmogrified after his resurrection", which might be technically correct
but I'm sure some would take offense at it. Thus any simple query that
extracts headword, definition and example sentences will produce
errors.
Using the code
In order to run the code you will need to download the WordNet
database files as described above, extract the folder "merged" and put that folder
in the debug folder of the project. I've left the
Console.WriteLine() calls in, so running the code will display
the points and examples given in this article, but most of them will zip by
so fast you won't be able to read them. If you want it to stop at
any point, insert a Console.ReadLine() at the appropriate place. As
released, there is only one, at the end.
The code all runs in
a console application. Sub Main calls the subs that demonstrate
different ways of manipulating the XML elements' values and attributes as strings.
These subs are described below, but it will be easier to step through the code if
you need to see exactly what was done. In the article I'm only showing the
specific points that demonstrate what I'm talking about.
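Sub Main()
    Wrongfully()  ' naive extraction: every term gets every example sentence
    Rightfully()  ' keeps an example only if it contains the headword
    Wordley()     ' converts the Rightfully() output to Wordley text files
    HTML()        ' writes the numbered XML files and the wn.xsl stylesheet
End Sub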
WordNet Wrongfully() Transmogrified: shows how to
produce the wrong examples that I first described above. It produces the XML file
"wrongfully.xml". Looking through this file, you only get four entries in
before you hit one that looks goofy:
<entry>
<hw>dorsal</hw>
<orig>facing away from the axis of an organ or organism;
"the abaxial surface of a leaf is the underside or side facing away from the stem"</orig>
<pos>a</pos>
</entry>
But looking on the bright side, the first three come out OK. And
the XML for the "transmogrify" example comes out looking just like I
predicted, explaining how wrong example sentences can end up on an iPad:
<entry>
<hw>metamorphose</hw>
<orig>change completely the nature or appearance of; "In Kafka's story, a person metamorphoses into a bug";
"The treatment and diet transfigured her into a beautiful young woman";
"Jesus was transfigured after his resurrection"</orig>
<pos>v</pos>
</entry>
<entry>
<hw>transfigure</hw>
<orig>change completely the nature or appearance of; "In Kafka's story, a person metamorphoses into a bug";
"The treatment and diet transfigured her into a beautiful young woman";
"Jesus was transfigured after his resurrection"</orig>
<pos>v</pos>
</entry>
<entry>
<hw>transmogrify</hw>
<orig>change completely the nature or appearance of; "In Kafka's story, a person metamorphoses into a bug";
"The treatment and diet transfigured her into a beautiful young woman";
"Jesus was transfigured after his resurrection"</orig>
<pos>v</pos>
</entry>
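The naive extraction that leads to entries like these can be sketched as follows. This is only a minimal illustration, not the article's actual Wrongfully() sub: the module and function names are made up here, and the synset is trimmed to the elements that matter. It copies the whole <orig> gloss to every <term>, which is exactly why each synonym inherits every example.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Xml.Linq

Module NaiveExtraction

    ' Hypothetical helper: builds one <entry> per <term> and copies the
    ' entire <orig> gloss (definition plus all examples) into each entry.
    Function NaiveEntries(synset As XElement) As List(Of XElement)
        Dim entries As New List(Of XElement)
        Dim gloss As String = synset...<orig>.Value   ' first <orig> descendant
        Dim pos As String = synset.@pos               ' part-of-speech attribute
        For Each term As XElement In synset.<terms>.<term>
            entries.Add(<entry>
                            <hw><%= term.Value %></hw>
                            <orig><%= gloss %></orig>
                            <pos><%= pos %></pos>
                        </entry>)
        Next
        Return entries
    End Function

    Sub Main()
        ' Trimmed copy of the "transmogrify" synset shown earlier.
        Dim synset As XElement =
            <synset id="v00384055" ofs="00384055" pos="v">
                <terms>
                    <term>metamorphose</term>
                    <term>transfigure</term>
                    <term>transmogrify</term>
                </terms>
                <gloss desc="orig">
                    <orig>change completely the nature or appearance of; "In Kafka's story, a person metamorphoses into a bug"; "The treatment and diet transfigured her into a beautiful young woman"; "Jesus was transfigured after his resurrection"</orig>
                </gloss>
            </synset>

        For Each entry As XElement In NaiveEntries(synset)
            Console.WriteLine(entry)
        Next
    End Sub
End Module
```

Nothing in this loop ever looks at whether an example mentions the headword, so the mismatch is built in rather than being a data error.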
WordNet Rightfully() Transmogrified: What I want is an XML
document that contains the WordNet XML database reduced to entries that
each contain a headword, give the definition, list the synonyms if any,
and include an example only if it matches the headword.
Just a simple dictionary in XML format that can then be converted
for other programs or formats. Looking closely at the elements in the
database, I find that the <orig> element contains the definition that fits all
the <term> values, but after that it may or may not include
example sentences, and the examples it does include may not fit each term listed.
On top of that, I am not going to decide whether a synonym could be substituted into
an example sentence. I only want the ones that match.
Rightfully() shows how to use string manipulation of the WordNet
XML elements to convert the WordNet database to a dictionary
with entries in XML format. Here I take the
<orig> element from the database files (you can see this
element in the XML examples above), which includes the definition and
the examples that apply to the synset.
I first split the data on the semicolon, which worked
for most of the elements. I had to add string replacements to deal with
the handful of entries that were not separated by semicolons. I
simply ran it until it hit an error and worked out the string replacement
needed to deal with that error. Each replacement adds a
semicolon at the right place, so the text is the same as before once
the semicolon gets stripped out in the split. Now I have a
headword and all available example sentences. I go through the examples
and check whether each one contains the headword. If it does, it gets
matched up with that headword.
This produces entries for the "transmogrify" example where
I have achieved the desired result: example sentences that match the
headword.
<entry>
<hw>metamorphose</hw>
<pos>v</pos>
<def>change completely the nature or appearance of</def>
<term>transfigure</term>
<term>transmogrify</term>
<q> "In Kafka's story, a person metamorphoses into a bug"</q>
</entry>
<entry>
<hw>transfigure</hw>
<pos>v</pos>
<def>change completely the nature or appearance of</def>
<term>metamorphose</term>
<term>transmogrify</term>
<q> "The treatment and diet transfigured her into a beautiful young woman"</q>
<q> "Jesus was transfigured after his resurrection"</q>
</entry>
<entry>
<hw>transmogrify</hw>
<pos>v</pos>
<def>change completely the nature or appearance of</def>
<term>metamorphose</term>
<term>transfigure</term>
</entry>
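The split-and-match step described above might look something like the sketch below. It is not the code from Rightfully(): the module and function names are hypothetical, the rule "a piece that starts with a double quote is an example" is an assumption made for the illustration, and the match here ignores case, which the original may not do. It also does not include the string replacements needed for glosses that lack semicolons.

```vb
Imports System
Imports System.Collections.Generic

Module GlossMatching

    ' Splits a gloss on semicolons and returns the trimmed, non-empty pieces.
    Function Pieces(orig As String) As List(Of String)
        Dim result As New List(Of String)
        For Each piece As String In orig.Split(";"c)
            Dim trimmed As String = piece.Trim()
            If trimmed.Length > 0 Then result.Add(trimmed)
        Next
        Return result
    End Function

    ' Assumption: pieces that do not start with a double quote belong to the definition.
    Function DefinitionOf(orig As String) As String
        Dim parts As New List(Of String)
        For Each piece As String In Pieces(orig)
            If Not piece.StartsWith("""", StringComparison.Ordinal) Then parts.Add(piece)
        Next
        Return String.Join("; ", parts.ToArray())
    End Function

    ' Keeps only the quoted examples that literally contain the headword.
    Function ExamplesFor(headword As String, orig As String) As List(Of String)
        Dim matches As New List(Of String)
        For Each piece As String In Pieces(orig)
            If piece.StartsWith("""", StringComparison.Ordinal) AndAlso _
               piece.IndexOf(headword, StringComparison.OrdinalIgnoreCase) >= 0 Then
                matches.Add(piece)
            End If
        Next
        Return matches
    End Function

    Sub Main()
        Dim orig As String = "change completely the nature or appearance of; " & _
            """In Kafka's story, a person metamorphoses into a bug""; " & _
            """The treatment and diet transfigured her into a beautiful young woman""; " & _
            """Jesus was transfigured after his resurrection"""

        For Each hw As String In New String() {"metamorphose", "transfigure", "transmogrify"}
            Console.WriteLine(hw & ": " & DefinitionOf(orig))
            For Each example As String In ExamplesFor(hw, orig)
                Console.WriteLine("    " & example)
            Next
        Next
    End Sub
End Module
```

Run against this gloss, the sketch keeps one example for metamorphose, two for transfigure and none for transmogrify, the same split as the entries above. Because it is a plain substring test, an irregular form that does not contain the headword will not match, and an example that itself contains a semicolon would be cut in two.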
I did some testing on the resulting XML files and could not find
any mistakes introduced by the string replacements
in Rightfully(), but I did find one thing by accident, as I wasn't
looking for it. At least one example sentence that should be there is
lost. Some examples had "felt" as the past tense of "feel", which gets lost by
using Contains("feel"). But the objective had been achieved:
no wrongly entered example sentences. I may have lost some along the
way where the plural or past tense doesn't match up.
Wordley() is a sub that uses the XML file
created in Rightfully() to show further string manipulation,
converting it to text files in exactly the same format as the dictionary
Alan Burkhart provides in his CP article: Wordley. The only exception is that this produces the
full database converted to a dictionary, while Wordley's is a trimmed-back version. I
start with StringBuilders and IDictionary objects to build
the Wordley files. I take advantage of the fact that you can't add a duplicate
key to an IDictionary: I try to add the key in
Try, and if it already exists, execution goes to Catch,
where I look up the existing entry and add to it.
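As a rough sketch of that Try/Catch pattern (not the actual code from Wordley(); the names and the "; " separator are assumptions for illustration), the first sub below relies on Dictionary.Add throwing an ArgumentException for a duplicate key. The second sub gets the same result with TryGetValue, which avoids raising an exception for every repeated headword.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Text

Module DuplicateKeys

    ' Try/Catch style: Add throws ArgumentException when the key already exists,
    ' so the Catch block appends to the entry that is already there.
    Sub AddWithTryCatch(entries As IDictionary(Of String, StringBuilder), _
                        headword As String, text As String)
        Try
            entries.Add(headword, New StringBuilder(text))
        Catch ex As ArgumentException
            entries(headword).Append("; ").Append(text)
        End Try
    End Sub

    ' Alternative: look the key up first, so no exception is raised.
    Sub AddWithLookup(entries As IDictionary(Of String, StringBuilder), _
                      headword As String, text As String)
        Dim existing As StringBuilder = Nothing
        If entries.TryGetValue(headword, existing) Then
            existing.Append("; ").Append(text)
        Else
            entries.Add(headword, New StringBuilder(text))
        End If
    End Sub

    Sub Main()
        Dim entries As New Dictionary(Of String, StringBuilder)
        ' Placeholder texts, not real WordNet glosses.
        AddWithTryCatch(entries, "headword", "first sense")
        AddWithTryCatch(entries, "headword", "second sense")
        AddWithLookup(entries, "headword", "third sense")
        Console.WriteLine(entries("headword").ToString())
    End Sub
End Module
```

Both subs leave the dictionary in the same state; the lookup version is usually the cheaper choice when duplicate headwords are common, since thrown exceptions are comparatively slow in .NET.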
HTML() converts the file created in Rightfully()
into individual numbered XML files, similar to but not exactly the same
as the directory created for the XML dictionary in Christ Kennedy's CP article
GCIDE: A
Complete English Language Dictionary. In my version
each file carries all entries for the same headword, rather than one file per
definition. As in Wordley(), I rely on the
IDictionary refusing duplicate keys by wrapping the add in a
Try - Catch: first it tries to create a key for the headword of the next element,
and if the key already exists, the Catch looks up
the existing entry and adds to it. If you do not want to
see the XML files and stylesheet and how they work together, I recommend
commenting out the HTML() sub, as it will make 147,306 XML files
using about 600 MB of disk space. If you just want to see a few and how they
work, you can stop the project any time after the HTML() sub starts running,
because the XSL stylesheet is already in place. Then, if you double-click an
XML file it will open in your web browser, but it will be a random selection, as
the files are numbered. The stylesheet ("wn.xsl") is created programmatically
and saved in the WordnetFiles directory when the directories are being created.
OR...
Viewing the XML as HTML: The following code will make a simple
Visual Basic browser with an autocomplete textbox for viewing the XML/HTML
files. The XML file "WNdicty.xml" is created during the processing
of the XML files and saves the dictionary as key-value pairs in the form
<p><k></k><v></v></p>. The file
doesn't get saved until all the other files are saved, so if you want to try this out
you will have to run the whole sub.
1. In VS 2010, create a new Windows Forms project in Visual Basic targeting the 3.5
framework. It might work in other versions, but it is up to
you to convert it if it doesn't.
2. Add a TextBox and dock it at the
top of the form.
3. Add a WebBrowser control and set the
Dock property to "Fill" and the
ScriptErrorsSuppressed property to "True".
4. Stretch the form out to a respectable viewing size.
5. Double-click on the form (or press F7) to get the Form1 code showing. Replace the empty Form1
with the following code.
6. Copy the folder "WordnetFiles"
created in this project into the debug folder of the new project.
This code is given with only brief comments and no further explanation, as a bare-bones viewer
for looking up the files, for learning about XSL stylesheets (don't
ask me - I read XML for Dummies before I found CodeProject), or as a base for
building a better dictionary, should you care to do so. Otherwise I recommend
Wordley.
Imports System.Xml.Linq

Public Class Form1
    Public Shared AutoCompleteList As AutoCompleteStringCollection = New AutoCompleteStringCollection
    Public Shared WNDicty As IDictionary(Of String, String) = New Dictionary(Of String, String)
    ' Working directory, with a trailing backslash.
    Public Shared whereiam As String = My.Computer.FileSystem.CurrentDirectory & "\"

    ' Loads the headword -> file number pairs and fills the autocomplete list.
    Private Sub autocompletefill()
        ' whereiam already ends with "\", so no leading backslash here.
        Dim DictySource As XElement = XElement.Load(whereiam & "WordnetFiles\WNdicty.xml")
        WNDicty.Clear()
        For Each kvp In DictySource.<p>
            Dim searchkey As String = kvp.<k>.Value
            Dim ID As String = kvp.<v>.Value
            WNDicty.Add(searchkey, ID)
            AutoCompleteList.Add(searchkey)
        Next
    End Sub

    Private Sub Form1_Load(sender As System.Object, e As System.EventArgs) Handles MyBase.Load
        autocompletefill()
        Me.TextBox1.Select(0, 1)
        TextBox1.AutoCompleteSource = AutoCompleteSource.CustomSource
        TextBox1.AutoCompleteCustomSource = AutoCompleteList
        TextBox1.AutoCompleteMode = AutoCompleteMode.SuggestAppend
        ' Show a starting page.
        WebBrowser1.Navigate(whereiam & "WordnetFiles\032\032088.xml", False)
    End Sub

    ' On Enter, look up the typed headword and open its XML file.
    Private Sub TextBox1_KeyDown(sender As Object, e As System.Windows.Forms.KeyEventArgs) Handles TextBox1.KeyDown
        If e.KeyCode = Keys.Enter Then
            Dim path As String = ""
            If WNDicty.TryGetValue(TextBox1.Text, path) Then
                ' The first three digits of the file number name the subfolder.
                Dim foldername As String = path.Substring(0, 3) & "\"
                Dim makeurl As String = "file://"
                Dim filelocation As String = makeurl & whereiam & _
                    "WordnetFiles\" & foldername & path & ".xml"
                WebBrowser1.Navigate(filelocation, False)
            End If
        End If
    End Sub
End Class
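For reference, an index file in the <p><k></k><v></v></p> form that autocompletefill() reads could be produced along these lines. This is a sketch under stated assumptions, not the code that writes WNdicty.xml: the root element name "dicty", the module name and the sample pairs are all made up, since the viewer only looks at the <p> children of whatever the root is.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Xml.Linq

Module KeyValueIndex

    ' Builds <dicty><p><k>key</k><v>value</v></p>...</dicty>.
    ' "dicty" is a placeholder root name chosen for this sketch.
    Function ToKeyValueXml(index As IDictionary(Of String, String)) As XElement
        Dim root As New XElement("dicty")
        For Each pair As KeyValuePair(Of String, String) In index
            root.Add(<p><k><%= pair.Key %></k><v><%= pair.Value %></v></p>)
        Next
        Return root
    End Function

    Sub Main()
        ' Made-up pairs: headword -> numbered file name without extension.
        Dim index As New Dictionary(Of String, String)
        index.Add("first headword", "000001")
        index.Add("second headword", "000002")

        Dim saved As XElement = ToKeyValueXml(index)
        Console.WriteLine(saved)

        ' Reading it back the same way the viewer does.
        For Each kvp As XElement In saved.<p>
            Console.WriteLine(kvp.<k>.Value & " -> " & kvp.<v>.Value)
        Next
    End Sub
End Module
```

Calling Save on the returned XElement with a file path would write it to disk; XElement.Load, as used in the viewer, reads it back.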
On the point of XSL stylesheets: I added extra title
(tooltip) attributes to give hover explanations, plus colors, a link to the
WordNet site, etc. Yes, because it is in a WebBrowser control, it does follow
http addresses if they are in a link. It is intentionally a bit obnoxious,
to give you an incentive to learn to edit the XSL or to use
Wordley.
Points of Interest
In this article I have attempted to show that there is a right
way and a wrong way to do something, and that time invested at the beginning to
work out what you are going to do is time well spent.
I give a couple of examples of ways to do string
manipulation of XML
files. Rightfully() shows the string manipulation that converts the
WordNet synsets into dictionary entries with correct example sentences.
Wordley() shows further string manipulation and a way to convert the
XML to .txt files compatible with Wordley. In HTML() I show
how to convert the XML document into individual XML files, one per headword, each linked to an
XSLT stylesheet that converts it to HTML.
HTML() also shows an example of using
XDocument in real life. I did a lot of searching and couldn't find
much available on it. It is useful for including the processing
instruction that converts the XML documents to HTML with the XSL stylesheet.
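A minimal sketch of that use of XDocument follows. It is not lifted from HTML(): the function name, the sample entry and the relative path "../wn.xsl" are assumptions for illustration. The point is that an xml-stylesheet processing instruction has to sit before the root element, which XDocument allows and a bare XElement does not.

```vb
Imports System
Imports System.Xml.Linq

Module StylesheetLink

    ' Wraps an entry in a document that starts with an xml-stylesheet
    ' processing instruction pointing at the given XSL file.
    Function WithStylesheet(entry As XElement, stylesheetHref As String) As XDocument
        Dim data As String = "type=""text/xsl"" href=""" & stylesheetHref & """"
        Return New XDocument( _
            New XDeclaration("1.0", "utf-8", "yes"), _
            New XProcessingInstruction("xml-stylesheet", data), _
            entry)
    End Function

    Sub Main()
        Dim entry As XElement =
            <entry>
                <hw>transmogrify</hw>
                <pos>v</pos>
                <def>change completely the nature or appearance of</def>
            </entry>

        ' "../wn.xsl" assumes the stylesheet sits one folder above the XML file.
        Dim doc As XDocument = WithStylesheet(entry, "../wn.xsl")
        Console.WriteLine(doc.ToString())
    End Sub
End Module
```

Saving the document with doc.Save and opening the file in a browser that supports XSLT would then render it through the stylesheet.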
I attempt to show that XML is a versatile way to convert data
from one form into another.
The WordNet project is part of the subject of computational lexicology. I am using it as the base for the
main project I'm working on; the HTML() sub is a
modified part of my current working model for it. It will probably be
much different by the time I am done with it. The more I study the subject,
the more I find there is to learn, but the one thing I have
not yet seen is a definition of computational
lexicographer. So I would like to propose one: someone
who applies both computer programming and lexicology in order to build a
computer program that can help build a better dictionary. You know, not just
the hack who tries to interpret what the lexi-guy wants, but someone actually studying
and applying it from both ends. Thanks to Princeton for their WordNet
project!
This is my article.single for CodeProject but I hope to make it
my article.first. If I get a favorable response then maybe I can show
it to the Personnel Director to support my claim that I would be more
valuable in IT than in
Maintenance...
History
Released 5 November 2012.
19 November 2012: minor typos and clarifications in article; fixed a point in the
stylesheet that occasionally rendered the wrong part of speech for a word.
I am a maintenance man. I started studying computer programming a few years ago when I got the idea of a computer program that would build a better dictionary. I'm using Visual Basic and XML. Soon I will be adding Web and JScript to my toolbelt.
I'm having great fun building my first real project in that direction. WordNet Rightfully Transmogrified was a sideline project based on my main direction.