Spanish stemmer in C#

Jesús Utrera, 23 May 2014, CPOL
This is a Spanish-language stemmer written in C#. It is my first attempt and is based on the rules defined on the Snowball website.

Introduction

«In linguistic morphology and information retrieval, stemming is the process of reducing inflected (or sometimes derived) words to their stem, base or root form, generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root. Algorithms for stemming have been studied in computer science since the 1960s. Many search engines treat words with the same stem as synonyms as a kind of query expansion, a process called conflation.» (Wikipedia)

I was looking for C# code for stemming Spanish text but found nothing. OK, let's code!... and this is the result. The stemmer follows the rules defined on the Snowball website and processes each word through the steps described below.


The stemming algorithm

Letters in Spanish include the following accented forms,
á é í ó ú ü ñ
The following letters are vowels:
a e i o u á é í ó ú ü

R1 and R2 are defined as follows:

R1 is the region after the first non-vowel following a vowel, or is the null region at the end of the word if there is no such non-vowel.

R2 is the region after the first non-vowel following a vowel in R1, or is the null region at the end of the word if there is no such non-vowel.

RV is defined as follows:

If the second letter is a consonant, RV is the region after the next following vowel; if the first two letters are vowels, RV is the region after the next consonant; otherwise (the consonant-vowel case) RV is the region after the third letter. RV is the end of the word if these positions cannot be found.

For example,

m a c h o     o l i v a     t r a b a j o     á u r e o
     |...|         |...|         |.......|         |...|
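
The region definitions above can be sketched in a few lines. This is a minimal Python sketch, not the article's C# code; the names VOWELS, r1_r2 and rv are made up for illustration, and each function returns the index where the region starts (the word length stands for the null region at the end of the word).

```python
VOWELS = set("aeiouáéíóúü")

def r1_r2(word):
    """Return the start indices of R1 and R2; len(word) means the null region."""
    def region_after(start):
        # The region begins right after the first non-vowel that follows a vowel.
        for i in range(start + 1, len(word)):
            if word[i - 1] in VOWELS and word[i] not in VOWELS:
                return i + 1
        return len(word)
    r1 = region_after(0)
    return r1, region_after(r1)

def rv(word):
    """Return the start index of RV; len(word) if the position cannot be found."""
    if len(word) < 3:
        return len(word)
    if word[1] not in VOWELS:
        # Second letter is a consonant: RV starts after the next vowel.
        hits = (i for i in range(2, len(word)) if word[i] in VOWELS)
    elif word[0] in VOWELS:
        # First two letters are vowels: RV starts after the next consonant.
        hits = (i for i in range(2, len(word)) if word[i] not in VOWELS)
    else:
        # Consonant-vowel case: RV starts after the third letter.
        return 3
    return next(hits, len(word) - 1) + 1

for w in ("macho", "oliva", "trabajo", "áureo"):
    print(w, w[rv(w):])   # macho ho / oliva va / trabajo bajo / áureo eo
```

The printed RV parts match the regions marked in the example above.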

Always do steps 0 and 1.

Step 0: Attached pronoun
Search for the longest among the following suffixes
me se sela selo selas selos la le lo las les los nos

and delete it if it comes after one of
(a) iéndo ándo ár ér ír
(b) ando iendo ar er ir
(c) yendo following u

in RV. In the case of (c), yendo must lie in RV, but the preceding u can be outside it.

In the case of (a), deletion is followed by removing the acute accent (for example, haciéndola -> haciendo).
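
Step 0 can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the article's C# code: the function name step0 and the parameter rv_start (the index where RV begins) are hypothetical, and the sketch assumes the longest pronoun is chosen first and only then is the preceding ending checked.

```python
PRONOUNS = "me se sela selo selas selos la le lo las les los nos".split()
# Group (a): accented endings and their unaccented replacements.
ACCENTED = {"iéndo": "iendo", "ándo": "ando", "ár": "ar", "ér": "er", "ír": "ir"}
# Group (b): plain endings.
PLAIN = ["ando", "iendo", "ar", "er", "ir"]

def step0(word, rv_start):
    """Strip an attached pronoun; rv_start is the index where RV begins."""
    pronoun = max((p for p in PRONOUNS if word.endswith(p)), key=len, default=None)
    if pronoun is None:
        return word
    base = word[:-len(pronoun)]
    endings = sorted(list(ACCENTED) + PLAIN + ["yendo"], key=len, reverse=True)
    ending = next((e for e in endings if base.endswith(e)), None)
    if ending is None or len(base) - len(ending) < rv_start:
        return word  # nothing suitable precedes the pronoun, or it is outside RV
    if ending in ACCENTED:
        # Case (a): delete the pronoun and drop the acute accent.
        return base[:-len(ending)] + ACCENTED[ending]
    if ending == "yendo" and not base[:-5].endswith("u"):
        return word  # case (c) requires a preceding u (which may be outside RV)
    return base

print(step0("haciéndola", 3))  # haciendo
```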

Step 1: Standard suffix removal
Search for the longest among the following suffixes, and perform the action indicated.
anza anzas ico ica icos icas ismo ismos able ables ible ibles ista istas oso osa osos osas amiento amientos imiento imientos
    delete if in R2
adora ador ación adoras adores aciones ante antes ancia ancias
    delete if in R2
    if preceded by ic, delete if in R2
logía logías
    replace with log if in R2
ución uciones
    replace with u if in R2
encia encias
    replace with ente if in R2
amente
    delete if in R1
    if preceded by iv, delete if in R2 (and if further preceded by at, delete if in R2), otherwise,
    if preceded by os, ic or ad, delete if in R2
mente
    delete if in R2
    if preceded by ante, able or ible, delete if in R2
idad idades
    delete if in R2
    if preceded by abil, ic or iv, delete if in R2
iva ivo ivas ivos
    delete if in R2
    if preceded by at, delete if in R2
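
The Step 1 rules above lend themselves to a table-driven sketch. The following Python sketch is an illustration only, not the article's C# code; RULES, TABLE, strip_in and step1 are hypothetical names, and r1 and r2 are assumed to be the start indices of R1 and R2.

```python
# (suffixes, replacement, strings also removed when they precede the suffix inside R2)
RULES = [
    ("anza anzas ico ica icos icas ismo ismos able ables ible ibles ista istas "
     "oso osa osos osas amiento amientos imiento imientos", "", ""),
    ("adora ador ación adoras adores aciones ante antes ancia ancias", "", "ic"),
    ("logía logías", "log", ""),
    ("ución uciones", "u", ""),
    ("encia encias", "ente", ""),
    ("mente", "", "ante able ible"),
    ("idad idades", "", "abil ic iv"),
    ("iva ivo ivas ivos", "", "at"),
]
TABLE = {s: (repl, pre.split()) for group, repl, pre in RULES for s in group.split()}

def strip_in(word, suffix, region):
    """Drop suffix if the word ends with it at or after index `region`."""
    if word.endswith(suffix) and len(word) - len(suffix) >= region:
        return word[:-len(suffix)]
    return word

def step1(word, r1, r2):
    suffix = max((s for s in list(TABLE) + ["amente"] if word.endswith(s)),
                 key=len, default=None)
    if suffix is None:
        return word
    if suffix == "amente":
        stem = strip_in(word, suffix, r1)          # amente is tested against R1
        if stem == word:
            return word
        if stem.endswith("iv"):
            shorter = strip_in(stem, "iv", r2)
            return strip_in(shorter, "at", r2) if shorter != stem else stem
        for pre in ("os", "ic", "ad"):
            if stem.endswith(pre):
                return strip_in(stem, pre, r2)
        return stem
    replacement, preceding = TABLE[suffix]
    stem = strip_in(word, suffix, r2)              # every other suffix must be in R2
    if stem == word:
        return word
    for pre in preceding:
        if stem.endswith(pre):
            stem = strip_in(stem, pre, r2)
            break
    return stem + replacement

print(step1("antropología", 2, 6))  # antropolog
print(step1("comunicación", 3, 5))  # comun
```

A caller can tell whether Step 1 removed an ending by comparing the result with the input word.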
Do step 2a if no ending was removed by step 1.

Step 2a: Verb suffixes beginning y
Search for the longest among the following suffixes in RV, and if found, delete if preceded by u.
ya ye yan yen yeron yendo yo yó yas yes yais yamos

(Note that the preceding u need not be in RV.)
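
As a rough illustration of Step 2a, here is a minimal Python sketch (not the article's C# code). The names Y_SUFFIXES and step2a are hypothetical, and rv_start is assumed to be the index where RV begins.

```python
Y_SUFFIXES = "ya ye yan yen yeron yendo yo yó yas yes yais yamos".split()

def step2a(word, rv_start):
    """Delete the longest y-suffix found in RV, but only if a u precedes it."""
    in_rv = [s for s in Y_SUFFIXES
             if word.endswith(s) and len(word) - len(s) >= rv_start]
    if not in_rv:
        return word
    stem = word[:-len(max(in_rv, key=len))]
    # The preceding u is checked on the whole word, so it may lie outside RV.
    return stem if stem.endswith("u") else word

print(step2a("construyendo", 3))  # constru
print(step2a("playa", 3))         # playa (no preceding u)
```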
Do Step 2b if step 2a was done, but failed to remove a suffix.

Step 2b: Other verb suffixes
Search for the longest among the following suffixes in RV, and perform the action indicated.
en es éis emos
    delete, and if preceded by gu delete the u (the gu need not be in RV)
arían arías arán arás aríais aría aréis aríamos aremos ará aré erían erías erán erás eríais ería eréis eríamos eremos erá eré irían irías irán irás iríais iría iréis iríamos iremos irá iré aba ada ida ía ara iera ad ed id ase iese aste iste an aban ían aran ieran asen iesen aron ieron ado ido ando iendo ió ar er ir as abas adas idas ías aras ieras ases ieses ís áis abais íais arais ierais aseis ieseis asteis isteis ados idos amos ábamos íamos imos áramos iéramos iésemos ásemos
    delete
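
Step 2b follows the same pattern with two suffix groups. This is a minimal Python sketch, not the article's C# code; GROUP_E, GROUP_OTHER and step2b are hypothetical names, and rv_start is assumed to be the index where RV begins.

```python
GROUP_E = "en es éis emos".split()
GROUP_OTHER = (
    "arían arías arán arás aríais aría aréis aríamos aremos ará aré "
    "erían erías erán erás eríais ería eréis eríamos eremos erá eré "
    "irían irías irán irás iríais iría iréis iríamos iremos irá iré "
    "aba ada ida ía ara iera ad ed id ase iese aste iste an aban ían aran "
    "ieran asen iesen aron ieron ado ido ando iendo ió ar er ir as abas adas "
    "idas ías aras ieras ases ieses ís áis abais íais arais ierais aseis "
    "ieseis asteis isteis ados idos amos ábamos íamos imos áramos iéramos "
    "iésemos ásemos"
).split()

def step2b(word, rv_start):
    """Delete the longest verb suffix found in RV."""
    in_rv = [s for s in GROUP_E + GROUP_OTHER
             if word.endswith(s) and len(word) - len(s) >= rv_start]
    if not in_rv:
        return word
    suffix = max(in_rv, key=len)
    stem = word[:-len(suffix)]
    # First group only: a preceding gu loses its u (the gu need not be in RV).
    if suffix in GROUP_E and stem.endswith("gu"):
        stem = stem[:-1]
    return stem

print(step2b("cantaban", 3))  # cant
print(step2b("lleguen", 3))   # lleg
```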
Always do step 3.

Step 3: Residual suffix
Search for the longest among the following suffixes in RV, and perform the action indicated.
os a o á í ó
    delete if in RV
e é
    delete if in RV, and if preceded by gu with the u in RV delete the u
And finally, remove the acute accents.
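
Step 3 and the final accent removal can be sketched like this. It is a minimal Python sketch, not the article's C# code; step3, remove_acute and ACUTE are hypothetical names, and rv_start is assumed to be the index where RV begins. Only acute accents are mapped, so ü and ñ stay as they are.

```python
def step3(word, rv_start):
    """Remove a residual suffix that lies in RV."""
    for suffix in ("os", "a", "o", "á", "í", "ó"):
        if word.endswith(suffix) and len(word) - len(suffix) >= rv_start:
            return word[:-len(suffix)]
    for suffix in ("e", "é"):
        if word.endswith(suffix) and len(word) - 1 >= rv_start:
            stem = word[:-1]
            # Drop the u of a preceding gu, but only when that u is inside RV.
            if stem.endswith("gu") and len(stem) - 1 >= rv_start:
                stem = stem[:-1]
            return stem
    return word

ACUTE = str.maketrans("áéíóú", "aeiou")

def remove_acute(word):
    """Replace acute-accented vowels with their plain forms."""
    return word.translate(ACUTE)

print(step3("macho", 3))          # mach
print(step3("llegue", 3))         # lleg
print(remove_acute("canción"))    # cancion
```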

Using the code

There are two projects.

  1. Class library of SpanishStemmer
  2. Test project (I will not discuss it here because it is very simple)

The class library has two classes, Stemmer and Specials. Specials contains the lists of terms that the Stemmer class uses to parse input words, along with a list of special Spanish words (stop words). Stemmer is the class that does the work: its Execute method receives a word and returns the stem. Execute optionally takes a bool parameter (useStopWords) that controls whether those special words are taken into account. When useStopWords is true and the word is in the Specials.stop_words list, the whole word is returned unchanged.

When useStopWords is false, or the input word is not in Specials.stop_words, the process is:

  1. Computes R1, R2 and RV as defined above.
  2. Runs Step 0.
  3. Runs Step 1.
  4. Runs Step 2a if Step 1 did not remove any ending.
  5. Runs Step 2b if Step 2a was executed and did not remove any suffix.
  6. Runs Step 3.
  7. Removes acute accents.
  8. Returns the resulting stem of the input word.
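
The control flow in the list above can be sketched as follows. This is a Python sketch with the individual steps passed in as plain functions, not the Execute method itself; stem_word, the steps dictionary and its keys are hypothetical names. Region computation is left out, so each step function is assumed to already know the regions of the word it receives.

```python
def stem_word(word, steps, stop_words=frozenset(), use_stop_words=False):
    """Apply the steps in order; `steps` maps a step name to a function str -> str."""
    if use_stop_words and word in stop_words:
        return word                      # stop words are returned whole
    word = steps["0"](word)
    after1 = steps["1"](word)
    if after1 != word:
        word = after1                    # Step 1 removed an ending: skip 2a and 2b
    else:
        after2a = steps["2a"](word)
        # Step 2b runs only when Step 2a ran and removed nothing.
        word = after2a if after2a != word else steps["2b"](word)
    word = steps["3"](word)
    return steps["accents"](word)
```

Comparing each step's output with its input is one simple way to detect whether an ending was removed; a step could equally return an explicit flag.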


Points of Interest

I used the stemmer to generate a stemmed version of the ICD-9 catalog, and the result looks good. Enjoy it!

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Jesús Utrera
Software Developer (Senior), AtSistemas
Spain
Works at AtSistemas in Jerez de la Frontera (Cádiz)

Last updated 23 May 2014
Article Copyright 2014 by Jesús Utrera