Spanish stemmer in C#





This is a Spanish language stemmer written in C#. It is my first attempt and is based on the rules defined on the Snowball website.
Introduction
«In linguistic morphology and information retrieval, stemming is the process for reducing inflected (or sometimes derived) words to their stem, base or root form—generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root. Algorithms for stemming have been studied in computer science since the 1960s. Many search engines treat words with the same stem as synonyms as a kind of query expansion, a process called conflation.» Wikipedia
I was searching for C# code for stemming in the Spanish language, but I found nothing. OK, let's code! This stemmer is the result. It processes a word by following the steps defined below:
The stemming algorithm
Letters in Spanish include the following accented forms:
- á é í ó ú ü ñ
The following letters are vowels:
- a e i o u á é í ó ú ü
R1 and R2 are defined as follows:
R1 is the region after the first non-vowel following a vowel, or is the null region at the end of the word if there is no such non-vowel.
R2 is the region after the first non-vowel following a vowel in R1, or is the null region at the end of the word if there is no such non-vowel.
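The two definitions above can be sketched in a few lines of C#. This is only an illustration, not the code from the download: the class name RegionSketch and the methods RegionStart, R1Start and R2Start are names I made up here, and the sketch assumes the word is already lowercase. Each method returns the index where the region begins, so the region itself is word.Substring(index).

```csharp
using System;

// Hypothetical helper, not part of the SpanishStemmer library.
public static class RegionSketch
{
    private const string Vowels = "aeiouáéíóúü";

    private static bool IsVowel(char c) => Vowels.IndexOf(c) >= 0;

    // Index just after the first non-vowel that follows a vowel, searching
    // from 'start'. Returns word.Length (the null region at the end of the
    // word) when there is no such non-vowel.
    public static int RegionStart(string word, int start)
    {
        for (int i = start + 1; i < word.Length; i++)
        {
            if (!IsVowel(word[i]) && IsVowel(word[i - 1]))
                return i + 1;
        }
        return word.Length;
    }

    // R1 is searched from the beginning of the word.
    public static int R1Start(string word) => RegionStart(word, 0);

    // R2 applies the same rule again, starting inside R1.
    public static int R2Start(string word) => RegionStart(word, R1Start(word));
}
```

For trabajo, R1Start returns 4 and R2Start returns 6, so R1 is ajo and R2 is o.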
RV is defined as follows:
If the second letter is a consonant, RV is the region after the next following vowel; if the first two letters are vowels, RV is the region after the next consonant; otherwise (consonant-vowel case), RV is the region after the third letter. RV is the end of the word if these positions cannot be found.
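The three RV cases can be sketched as follows. Again this is an illustration under my own assumptions: RvSketch and RvStart are hypothetical names, and the word is expected in lowercase with precomposed accented characters.

```csharp
using System;

// Hypothetical helper, not part of the SpanishStemmer library.
public static class RvSketch
{
    private const string Vowels = "aeiouáéíóúü";

    private static bool IsVowel(char c) => Vowels.IndexOf(c) >= 0;

    // Returns the index where RV starts, or word.Length when the
    // required position cannot be found.
    public static int RvStart(string word)
    {
        if (word.Length < 2) return word.Length;

        if (!IsVowel(word[1]))
        {
            // Second letter is a consonant: RV starts after the next vowel.
            for (int i = 2; i < word.Length; i++)
                if (IsVowel(word[i])) return i + 1;
            return word.Length;
        }

        if (IsVowel(word[0]))
        {
            // First two letters are vowels: RV starts after the next consonant.
            for (int i = 2; i < word.Length; i++)
                if (!IsVowel(word[i])) return i + 1;
            return word.Length;
        }

        // Consonant-vowel case: RV starts after the third letter.
        return Math.Min(3, word.Length);
    }
}
```

For macho, oliva, trabajo and áureo, RvStart returns 3 in every case, which leaves ho, va, bajo and eo as RV.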
For example, RV is the bracketed part of each of these words:
- mac[ho]
- oli[va]
- tra[bajo]
- áur[eo]
Always do steps 0 and 1.
Step 0: Attached pronoun
Search for the longest among the following suffixes
- me se sela selo selas selos la le lo las les los nos
and delete it if it comes after one of
- (a) iéndo ándo ár ér ír
- (b) ando iendo ar er ir
- (c) yendo following u
in RV. In the case of (c), yendo must lie in RV, but the preceding u can be outside it.
In the case of (a), deletion is followed by removing the acute accent (for example, haciéndola -> haciendo).
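A minimal sketch of Step 0, assuming the caller has already computed where RV starts and passes it in as rvStart. Step0Sketch and RemoveAttachedPronoun are hypothetical names chosen for this example; the real Stemmer class may organise this differently.

```csharp
using System;

// Hypothetical helper, not part of the SpanishStemmer library.
public static class Step0Sketch
{
    // Sorted longest first, so the first match is the longest suffix.
    private static readonly string[] Pronouns =
    {
        "selas", "selos", "sela", "selo", "las", "les", "los", "nos",
        "me", "se", "la", "le", "lo"
    };

    // Group (a): accented endings with their unaccented replacements.
    private static readonly string[,] Accented =
    {
        { "iéndo", "iendo" }, { "ándo", "ando" },
        { "ár", "ar" }, { "ér", "er" }, { "ír", "ir" }
    };

    // Group (b): unaccented endings.
    private static readonly string[] Plain = { "iendo", "ando", "ar", "er", "ir" };

    // True when 's' ends with 'ending' and the ending lies entirely in RV.
    private static bool EndsInRv(string s, string ending, int rvStart) =>
        s.EndsWith(ending, StringComparison.Ordinal) &&
        s.Length - ending.Length >= rvStart;

    public static string RemoveAttachedPronoun(string word, int rvStart)
    {
        foreach (string pronoun in Pronouns)
        {
            if (!word.EndsWith(pronoun, StringComparison.Ordinal)) continue;
            string rest = word.Substring(0, word.Length - pronoun.Length);

            // (a) delete the pronoun and drop the acute accent.
            for (int i = 0; i < Accented.GetLength(0); i++)
            {
                string ending = Accented[i, 0];
                if (EndsInRv(rest, ending, rvStart))
                    return rest.Substring(0, rest.Length - ending.Length) + Accented[i, 1];
            }

            // (b) delete the pronoun only.
            foreach (string ending in Plain)
                if (EndsInRv(rest, ending, rvStart)) return rest;

            // (c) yendo must lie in RV; the preceding u may be outside it.
            if (EndsInRv(rest, "yendo", rvStart) &&
                rest.EndsWith("uyendo", StringComparison.Ordinal))
                return rest;

            return word; // only the longest matching pronoun is considered
        }
        return word;
    }
}
```

With RV starting at index 3, RemoveAttachedPronoun("haciéndola", 3) returns haciendo, which matches the example above.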
Step 1: Standard suffix removal
Search for the longest among the following suffixes, and perform the action indicated.
- anza anzas ico ica icos icas ismo ismos able ables ible ibles ista istas oso osa osos osas amiento amientos imiento imientos
  - delete if in R2
- adora ador ación adoras adores aciones ante antes ancia ancias
  - delete if in R2
  - if preceded by ic, delete if in R2
- logía logías
  - replace with log if in R2
- ución uciones
  - replace with u if in R2
- encia encias
  - replace with ente if in R2
- amente
  - delete if in R1
  - if preceded by iv, delete if in R2 (and if further preceded by at, delete if in R2), otherwise,
  - if preceded by os, ic or ad, delete if in R2
- mente
  - delete if in R2
  - if preceded by ante, able or ible, delete if in R2
- idad idades
  - delete if in R2
  - if preceded by abil, ic or iv, delete if in R2
- iva ivo ivas ivos
  - delete if in R2
  - if preceded by at, delete if in R2
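To show the shape of these rules, here is a sketch that covers only the three "replace with ... if in R2" groups (logía, ución and encia with their plurals). It is deliberately partial: the delete rules and the "if preceded by" refinements are left out, and Step1Sketch, ReplaceInR2 and the r2Start parameter are names I invented for the example.

```csharp
using System;

// Hypothetical helper covering a subset of Step 1 only.
public static class Step1Sketch
{
    // { suffix, replacement } pairs, longest suffixes first.
    private static readonly string[,] Rules =
    {
        { "uciones", "u" }, { "logías", "log" }, { "encias", "ente" },
        { "logía", "log" }, { "ución", "u" }, { "encia", "ente" }
    };

    // r2Start is the index where R2 begins in 'word'.
    public static string ReplaceInR2(string word, int r2Start)
    {
        for (int i = 0; i < Rules.GetLength(0); i++)
        {
            string suffix = Rules[i, 0];
            if (!word.EndsWith(suffix, StringComparison.Ordinal)) continue;

            int suffixStart = word.Length - suffix.Length;

            // The whole suffix must lie inside R2, otherwise nothing changes.
            if (suffixStart < r2Start) return word;

            return word.Substring(0, suffixStart) + Rules[i, 1];
        }
        return word;
    }
}
```

For antropología, R2 starts at index 6 and logía starts at index 7, so the result is antropolog. For biología, R2 also starts at index 6 but logía starts at index 3, so the word is left unchanged.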
Step 2a: Verb suffixes beginning y
Search for the longest among the following suffixes in RV, and if found, delete if preceded by u.
- ya ye yan yen yeron yendo yo yó yas yes yais yamos
(Note that the preceding u need not be in RV.)
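Step 2a is small enough to sketch in full. The names Step2aSketch and TryRemoveYSuffix are hypothetical; the method returns a bool because the later steps need to know whether anything was removed, and rvStart is assumed to be computed by the caller.

```csharp
using System;

// Hypothetical helper, not part of the SpanishStemmer library.
public static class Step2aSketch
{
    private static readonly string[] YSuffixes =
    {
        "yeron", "yendo", "yamos", "yais", "yan", "yen",
        "yas", "yes", "ya", "ye", "yo", "yó"
    };

    // Returns true and sets 'result' to the shortened word when a suffix
    // was removed; otherwise returns false and 'result' is the input word.
    public static bool TryRemoveYSuffix(string word, int rvStart, out string result)
    {
        result = word;
        foreach (string suffix in YSuffixes)
        {
            if (!word.EndsWith(suffix, StringComparison.Ordinal)) continue;

            int suffixStart = word.Length - suffix.Length;
            bool inRv = suffixStart >= rvStart;

            // The preceding u does not have to be inside RV.
            bool afterU = suffixStart > 0 && word[suffixStart - 1] == 'u';

            if (inRv && afterU)
            {
                result = word.Substring(0, suffixStart);
                return true;
            }
            return false;
        }
        return false;
    }
}
```

For incluye, RV starts at index 5, the suffix ye starts there too and follows a u, so the result is inclu. For playa the suffix ya is in RV but follows an a, so nothing is removed.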
Step 2b: Other verb suffixes
Search for the longest among the following suffixes in RV, and perform the action indicated.
- en es éis emos
  - delete, and if preceded by gu delete the u (the gu need not be in RV)
- arían arías arán arás aríais aría aréis aríamos aremos ará aré erían erías erán erás eríais ería eréis eríamos eremos erá eré irían irías irán irás iríais iría iréis iríamos iremos irá iré aba ada ida ía ara iera ad ed id ase iese aste iste an aban ían aran ieran asen iesen aron ieron ado ido ando iendo ió ar er ir as abas adas idas ías aras ieras ases ieses ís áis abais íais arais ierais aseis ieseis asteis isteis ados idos amos ábamos íamos imos áramos iéramos iésemos ásemos
  - delete
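The sketch below covers only the first rule (en es éis emos, with the gu adjustment). Note the limitation: because Step 2b asks for the longest suffix over both groups, a complete implementation has to check the longer suffixes of the second group first (aremos before emos, asen before en, and so on). Step2bSketch and RemoveESuffix are hypothetical names.

```csharp
using System;

// Hypothetical helper covering the first rule of Step 2b only.
public static class Step2bSketch
{
    // Longest first.
    private static readonly string[] Suffixes = { "emos", "éis", "en", "es" };

    public static string RemoveESuffix(string word, int rvStart)
    {
        foreach (string suffix in Suffixes)
        {
            if (!word.EndsWith(suffix, StringComparison.Ordinal)) continue;

            int suffixStart = word.Length - suffix.Length;

            // The suffix must lie in RV.
            if (suffixStart < rvStart) return word;

            string stem = word.Substring(0, suffixStart);

            // If preceded by gu, also delete the u (the gu need not be in RV).
            if (stem.EndsWith("gu", StringComparison.Ordinal))
                stem = stem.Substring(0, stem.Length - 1);

            return stem;
        }
        return word;
    }
}
```

With RV starting at index 3, paguen becomes pag and comes becomes com.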
Step 3: Residual suffix
Search for the longest among the following suffixes in RV, and perform the action indicated.
- os a o á í ó
  - delete if in RV
- e é
  - delete if in RV, and if preceded by gu with the u in RV delete the u
And finally:
- Remove acute accents
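Step 3 and the final accent removal can be sketched like this. Step3Sketch, RemoveResidualSuffix and RemoveAcuteAccents are hypothetical names, and rvStart is assumed to be supplied by the caller. Only the five acute vowels are replaced; ü and ñ are left alone.

```csharp
using System;

// Hypothetical helper, not part of the SpanishStemmer library.
public static class Step3Sketch
{
    private static readonly string[] DeleteOnly = { "os", "a", "o", "á", "í", "ó" };
    private static readonly string[] DeleteWithGu = { "e", "é" };

    public static string RemoveResidualSuffix(string word, int rvStart)
    {
        foreach (string suffix in DeleteOnly)
        {
            if (!word.EndsWith(suffix, StringComparison.Ordinal)) continue;
            int start = word.Length - suffix.Length;
            return start >= rvStart ? word.Substring(0, start) : word;
        }

        foreach (string suffix in DeleteWithGu)
        {
            if (!word.EndsWith(suffix, StringComparison.Ordinal)) continue;
            int start = word.Length - suffix.Length;
            if (start < rvStart) return word;

            string stem = word.Substring(0, start);

            // If preceded by gu with the u in RV, delete the u as well.
            if (stem.EndsWith("gu", StringComparison.Ordinal) && stem.Length - 1 >= rvStart)
                stem = stem.Substring(0, stem.Length - 1);

            return stem;
        }
        return word;
    }

    // Replaces á é í ó ú with their unaccented forms.
    public static string RemoveAcuteAccents(string word) =>
        word.Replace('á', 'a').Replace('é', 'e').Replace('í', 'i')
            .Replace('ó', 'o').Replace('ú', 'u');
}
```

With RV starting at index 3, trabajo becomes trabaj and llegue becomes lleg. RemoveAcuteAccents turns canción into cancion.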
Using the code
There are two projects:
- The SpanishStemmer class library
- A test project (I will not discuss it because it is very simple)
The class library has two classes, Stemmer and Specials. Specials contains the lists of terms that Stemmer uses to parse input words, along with a list of special Spanish words. Stemmer is the class that does the work. Its Execute method receives a word and returns the stem. The method optionally receives a bool parameter (useStopWords) that controls whether these special words are taken into account during stemming: when it is true and the word is in the Specials.stop_words list, the method returns the whole word unchanged.
When useStopWords is false or the input word is not in Specials.stop_words, the process is:
- Computes R1, R2 and RV as defined above
- Runs Step 0
- Runs Step 1
- Runs Step 2a if Step 1 did not remove any ending
- Runs Step 2b if Step 2a was executed and did not remove any suffix
- Runs Step 3
- Removes acute accents
- Returns the resulting stem of the input word
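The order of the steps above can be sketched as a small driver. This is not the actual Execute method: PipelineSketch and Run are hypothetical names, and each step is passed in as a delegate so that the sketch stays self-contained. It assumes a step "removed" something whenever its result differs from its input, and that the delegates already know the R1, R2 and RV positions.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical driver illustrating the control flow only.
public static class PipelineSketch
{
    public static string Run(
        string word, bool useStopWords, ISet<string> stopWords,
        Func<string, string> step0, Func<string, string> step1,
        Func<string, string> step2a, Func<string, string> step2b,
        Func<string, string> step3, Func<string, string> removeAccents)
    {
        // Stop words are returned whole when the option is enabled.
        if (useStopWords && stopWords.Contains(word)) return word;

        // Steps 0 and 1 always run.
        string current = step0(word);

        string afterStep1 = step1(current);
        bool step1Removed = afterStep1 != current;
        current = afterStep1;

        if (!step1Removed)
        {
            // Step 2a runs only if Step 1 removed nothing.
            string afterStep2a = step2a(current);
            bool step2aRemoved = afterStep2a != current;
            current = afterStep2a;

            // Step 2b runs only if Step 2a ran and removed nothing.
            if (!step2aRemoved) current = step2b(current);
        }

        // Step 3 and accent removal always run.
        current = step3(current);
        return removeAccents(current);
    }
}
```

Passing the steps as delegates is just a convenience for the example; in a real class they would more naturally be private methods sharing the region positions as fields.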
Points of Interest
I used the stemmer to generate a stemmed version of the ICD-9 catalog, and the result is cool. Enjoy it!