Click here to Skip to main content
65,938 articles
CodeProject is changing. Read more.
Articles / Languages / C++

an efficient C++/STL library for word puzzles and spell checking

1.80/5 (6 votes)
14 Sep 2005CPOL1 min read 2   611  
DicoLib stores words in lists of anagrams indexed by their length and a 26 bits bitset which describe which letters are present in the words. This makes it extremely fast to search for words which contain specified letters, and to search for words which are "close" for spell checking appli

Introduction

This project started from a puzzle : There is a common English word that is nine letters long. Each time you remove a letter from it, it still remains an English word - from nine letters right down to a single letter. What is the original word, and what are the words that it becomes after removing one letter at a time? I wrote DicoLib to find the word, but since I stupidely translated "nine" by "8" in the code, I didn't find THE solution, but some others listed here DicoLib uses an original data structure that optimises the search of words which are "close" to a given one or contain the same letters. It is therefore suitable for spell checking, although this application hasn't (yet) been developed.

Data Structure

DicoLib's indexes words:

  • by their length
  • and by a bitset of 26 bits which describe which letters are present in the word

Actually, DicoLib stores a list of anagrams for each length/bitset pair. Finding anagrams of a given word therefore takes 0 time!

Finding words that are made of specified letters for Scrabble, or of a given length and contianing given letters is extremely fast as it requires mainly bitset operations and table lookup.

The whole data structure used in DicoLib is depicted in the schema below:

Image 1

Source code

DicoLib is available on SourceForge You're welcome to contribute to the development of this project towards a very efficient spell-checker or any other application you may think of.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)