12,063,278 members (64,954 online)
Article
alternative version

8.2K views
12 bookmarked
Posted

# Arabic/English Crossword Generation using Progressive Search

, 4 Dec 2013 CPOL
 Rate this:
The article describes an ASP.NET web application for generating Crossword (Arabic and English) puzzles using progressive search algorithm

## Introduction

The purpose of this post is to describe a Crossword generation web application and its associated progressive search algorithm.

The reader can browse the English version or the Arabic version.

Progressive search is a randomized iterative search algorithm that the author has developed and found to be effective for a variety of optimization problems; more information on this can be found at my activities page and my book on algorithms, Algorithms: Development and Programming.

Crossword puzzles are popular word games that serve to challenge, entertain and educate. Crossword puzzles owe their popularity to their being a stable feature in almost every newspaper worldwide. Educators have also found crossword puzzles an effective means for measuring a person’s knowledge on a particular subject.

In its basic form, the crossword generation problem is stated as follows:

Given a rectangular (or square) grid of certain dimensions and a word list (lexicon), produce a set of horizontal words and a set of vertical words along with their row and column locations (cells) for filling of the empty cells. For unconstrained crossword, the algorithm is to determine the positions of the black cells — some of which are needed to separate adjacent words. For constrained crossword, the positions of black cells are fixed at the start.

The algorithm presented here is for unconstrained crosswords (that is, it includes sub-algorithms for determining black cells as explained later). The crossword generation problem is different from (and seems easier than) crossword solving problem. The latter problem calls for finding word answers consistent with the given clues.

Crosswords puzzles can be about numbers too. Using a programmatically generated lexicon of numbers (i.e., for a number z=x*y, z is a word and x*y is a word definition), it is possible to produce a crossword puzzle of grid size = 10 with about 40 interrelated multiplication problems that can serve as an exercise for students in the third grade. Our Crossword application generates such crosswords using the Numeric option (see above figure).

In this article, we explain our progressive-search algorithm for generating crosswords and its encapsulation into a program component (a class library that can be compiled into a DLL), along with its web-based application. The above figure shows the basic structure of the interface and a sample crossword generated by the component.

## How to use the Crossword component

To create a crossword, as illustrated by the following ASPX script, create an instance of the Crossword class, set its attributes (`DBPath`, `GridSize`, ... etc.) and call `CreateCrossword()`.

```<%@ Language="C#" debug="true" %>
<html>
<title>Test Component</title>```

The preceding script (when accessed by browser) produces an output similar to the following screen capture.

## Crossword Generation using Progressive Search

The proposed crossword generation algorithm starts by filling the grid with horizontal words and then computes a cost-measure of this solution — for example, the number of unknown (not found in the lexicon) vertical words — which represents the objective function to be minimized. Then as a mutation of the current solution, the algorithm chooses a horizontal word and replaces it by a randomly-selected lexicon word of the same length. Using this kind of mutation guarantees that all solutions generated during the progress toward the final solution, have valid horizontal words; hence, the cost of a solution can be computed by scanning the grid columns and counting unknown or scrambled words. The coupling of this idea to progressive search can be outlines as follows.

1. Generate an initial solution X, through horizontal fill and compute its measure of goodness, f(X) as some expression related to the number of unknown or scrambled words in the vertical view. Save X as the current best solution and enqueue X.
2. Progressive Search: Repeatedly dequeue (or choose one at a random) a solution S, generate a neighborhood of solutions from S by mutation and enqueue those solutions whose cost fall within some preset threshold. Terminate if an optimal solution is found (i.e., cost = 0) or the elapsed time has exceeded some preset time-period.

As explained in the author’s Sudoku article, we will use an implementation of progressive search where the items are enqueued by negative of cost to avoid threshold calculation.

The Crossword component consists of two classes: `Crossword` class that handles input and output (including the loading of a lexicon from a specified database), and `Solution` class that encodes any solution generated during the exploration of the search space. The `Solution` class is subordinate to the `Crossword` class. To facilitate access to members defined in the `Crossword` class, the `Solution` class include a member `cw` that points to this instance of the `Crossword` class.

The following listing shows parts of the program code for the `Crossword` class.

```public string CreateCrossWord(Solution s1)
{
if (Numeric)
{   MaxWordLength = GenWords(ref WordList, ref WordDescList);
FreqChar = new char[] { '1', '2', '3', '4', '5', '6', '7', '8', '9' };
MinWordLength = 2;
goto cont;
}

if (Arabic)
FreqChar = new char[] { 'ع', 'ب', 'ت', 'ي', 'ن', 'م', 'ل', 'و' };
else FreqChar = new char[] { 'e', 'r', 'a', 'i', 'n', 't', 'o', 's' };

if (s1 != null)
{  this.WordList = s1.cw.WordList;
this.WordDescList = s1.cw.WordDescList;
this.WordSigList = s1.cw.WordSigList;
this.MaxWordLength = s1.cw.MaxWordLength;
goto cont;
}

MaxWordLength = GetLexicon(DBPath, ref WordList, ref WordDescList, ref WordSigList);
// check for gaps in Wordlength
int max = MaxWordLength;
if (GridSize < MaxWordLength) max = GridSize;
for (int i = MinWordLength; i <= max; i++)
if (WordList[i].Count == 0)
return "Maximum Word Length:" + MaxWordLength + " Supply Words of Length:" + i;

cont:
if (s1 == null) s1 = new Solution(this, false);
if (TimeToStop == -1) return ComputeFillMeasure();

if (TimeToStop == 0) // used to report starting solution
{   BestSolution = s1;
StatusInfo = "";
}
else BestSolution = ProgressiveSearch(s1);

return ""; // indicate success
}
```

The following listing shows parts of the program code for the `Solution` class. The basic information here is that any filling of the crossword grid represents a solution (node) in our search space. Thus, a solution is completely characterized by its filled grid or, alternatively, its `ItemList` (a list of horizontal words and their respective locations). However, for clarity and to speed up the process of mutation, we have chosen to keep both structures, which is evident from the initial lines of the following partial listing of the Solution class:

```struct SolutionItem
{  internal int row, col;
internal string word;
internal SolutionItem(int r, int c, string w)
{ row = r; col = c; word = w; }
}

public class Solution
{  internal Crossword cw; // a reference to Crossword instance
Random rand;

// A solution is a grid and an ItemList (a list of horizontal words)
internal char[,] grid; // The crossword grid
internal int GridSize;
ArrayList ItemList = new ArrayList(); // a list of "SoultionItem" objects

public float Cost; // Cost measure
public int InvalidWordCount, ScrambledWordCount;

public Solution(Crossword cw, bool empty)
{ //  empty is passed as true to generate a solution object with undefined grid or ItemList
this.cw = cw;
GridSize = cw.GridSize;
this.rand = cw.rand;
if (empty) return;

grid = new char[GridSize, GridSize];
if (cw.VPlacement || (cw.MaxVWordLength == -1))
Fill_2();
else Fill_1();

Cost = ComputeMeasure(); // compute and set Solution's cost
}

// ... some other methods omitted

void PlaceWord(string wrd, int row, int col)
{   if (wrd.Length == 1)
{ grid[row, col] = wrd[0]; return; }

SolutionItem it = new SolutionItem(row, col, wrd);

if (cw.Arabic)
for (int i = col; i < wrd.Length; i++)
{ grid[row, col+i] = wrd[wrd.Length - 1 - i]; }
else
for (int i = col; i < wrd.Length; i++)
{ grid[row, col+i] = wrd[i]; }
}
}```

Most of the methods in the `Solution` class are language neutral. The method `PlaceWord()` (shown above) is language dependent. The function modifies a solution by placing a word (specified as `wrd` parameter) horizontally at the location specified by the `row` and `col` parameters.

A key method in the `Solution` class is `NewSolution()`, which is defined by the following:

```internal Solution NewSolution()
{  Solution s1 = this.CopySolution(); // copy grid only to s1; s1 will have empty ItemList
// Select a horizontal word
int r = rand.Next(ItemList.Count);
SolutionItem it = (SolutionItem) ItemList[r];

// Get a random lexicon word of a specified length
string wrd = cw.GetWord(it.word.Length);
s1.PlaceWord(wrd, it.row, it.col);// Place word in grid and add it to ItemList
// Build the list of horizontal words for s1
foreach (SolutionItem itx in ItemList)
if ( (itx.row!=it.row) || (itx.col!=it.col))

s1.ComputeMeasure(); // Compute and set Solution’s cost
return s1;
}
```

This works as follows. We have to mutate a solution `S` into a new solution `S1`. We begin by having `S1` as a copy of `S`. Then we randomly pick a horizontal word w from the grid (equivalent to picking an item it from `ItemList`). We then select a random lexicon word `wrd` and have it replace the word `w`. This last action is reflected in the grid of `S1` via the call to `PlaceWord()`. We then use the `foreach` loop a build the `ItemList` of `S1`. Finally, we issue the call `ComputeMeasure()` to compute and set the cost of `S1`.

## What Objective Function to Use?

This is concerning the expression used by `ComputeMeasure()` for the solution's cost (i.e.,how far it is from an optimal (with cost=0) solution).

Given a solution S to a crossword puzzle, let u denote the number of unknown vertical words and w denote the number of scrambled vertical words. As a simple cost function for a solution S, we define f(S) = u. However, such a function causes termination when the number unknown vertical words reaches zero, yet there may be plenty of scrambled vertical words, which is generally undesirable. To account for scrambled words, the expression f(S) = u+0.5*w is a better choice (unknown words have twice the weight of scrambled words). Further reflection reveals that such a function does not distinguish among solutions that have the same number of unknown and scrambled words but where the length of such words vary sharply for different solutions. Intuition suggests that by moving along the direction that reduces the length of unknown words [scrambled words], we eventually reduce the number of unknown words [scrambled words]. Thus, a better expression for f(S) = len(u) + 0.2*len(w), where len(u) is the combined length of unknown vertical words and len(w) is the combined length of scrambled vertical words (Note: we have found that a factor of 0.2 works better than 0.5).

## Filling Strategies

A basic strategy for the filling (determining the positions of black cells) of a grid is the following:

process the grid row-wise, and for each row place randomly chosen words.

However, there are few problems that must be addressed. First, we like to have the possibility of having word slots of length one as well as the possibility of a leftmost (or a rightmost) cell in any row be black. This suggests that the determination of word slots (equivalently word lengths) be based on generating random numbers between zero and maximum word length `MaxWordLength` (as determined from input lexicon) inclusive, where a word length of zero means that the current cell, that will be normally the start of a word, will be blackened. A word of length one is selected at random from the set of the eight most frequent English letters. Such a strategy has still one minor problem relating to the vertical view; a column may end up having many adjacent white cells, making it difficult to match with a 9-letter word later for instance. To address this problem, and to vary the density of black cells, the filling strategy is augmented with a parameter `MaximumVWordLength` (the maximum length for any vertical word).

The implementation of this strategy is shown as `Fill_1()` method in the following listing.

```void Fill_1()
{   int k, wlen, remlen; int[] cc = new int[GridSize];
for (int i = 0; i < GridSize; i++)
{  cc[i] = 0;
for (int j = 0; j < GridSize; j++)
{ grid[i, j] = '*'; }
}

for (int i = 0; i < GridSize; i++)
{  int j = 0;
while (j < GridSize)
{  remlen = GridSize - j;
if (remlen > cw.MaxWordLength) remlen = cw.MaxWordLength;
if (remlen > 3)
wlen = rand.Next(remlen + 1); // allow a word-length of 0 to allow * to appear in column 0
else wlen = remlen;
for (k = j; k < j + wlen; k++)
{  if (cc[k] >= cw.MaxVWordLength) break;
cc[k]++;
}

wlen = k - j; // shorten word length in order not to exceed MaxVWordLength
if (k < GridSize) cc[k] = 0;

if (wlen > 0) PlaceWord(cw.GetWord(wlen), i, j);
j = j + wlen + 1;
}
}
}
```

We utilize a vector `cc[0..GridSize-1]`, where `cc[k]` is the count of white cells for column k. Also, note that, first, the word length `wlen` is chosen as a random number between zero and minimum of (`MaxWordLength`, `remlen`), where `remlen` is the number of remaining unfilled cells in the current row. If `remlen` is ≤ 3 then `wlen` is reset to `remlen`. Then `wlen` is further reduced to ensure that `cc[k] ≤ MaxVWordLength` for the columns spanned by the current word.

An alternative filling strategy is given by `Fill_2()` method in the following listing.

``` void Fill_2()
{
for (int i = 0; i < GridSize; i++)
for (int j = 0; j < GridSize; j++)
{ grid[i, j] = 'X'; }

for (int col = 0; col < GridSize; col++)
{   int row = 0;
while (row < GridSize)
{   // allow a wordlength of 0 to allow * to appear in column 0
int wlen = rand.Next(cw.MaxVWordLength + 1);
row = row + wlen;
if (row >= GridSize) break;
if ((row > 0) && (grid[row - 1, col] == '*')) continue;
grid[row, col] = '*';
}
}

for (int i = 0; i < GridSize; i++)
{   int wlen = 0; int startpos = 0;
for (int j = 0; j <= GridSize; j++)
{   if (j == GridSize)
{   if (wlen > 0) PlaceWord(cw.GetWord(wlen), i, startpos);
break;
}
if ((grid[i, j] == 'X') && (wlen < cw.MaxWordLength)) wlen++;
else //found * or wlen= MaxWordLen
{   if (wlen > 0) PlaceWord(cw.GetWord(wlen), i, startpos);
grid[i, j] = '*'; //needed if wlen= MaxWordLength
wlen = 0;
startpos = j + 1;
}
}
}
}```

In this strategy, the black cells are chosen based on a vertical view, by processing the grid column-wise and generating random word lengths (up to `MaximumVWordLength`). The grid is then processed row-wise for the purpose of horizontal word filling.

## Conclusions

Crossword generation is a good example demonstrating the effectiveness of progressive search. This application was tested with popular browsers and found to work properly.

## History

• 4th December, 2013: Version 1.0.

## Share

 Instructor / Trainer KFUPM Saudi Arabia

Nasir Darwish is an associate professor with the Department of Information and Computer Science, King Fahd University of Petroleum and Minerals (KFUPM), Saudi Arabia.

Developed some practical tools including COPS (Cooperative Problem Solving), PageGen (a tool for automatic generation of web pages), and an English/Arabic full-text search engine. The latter tools were used for the Global Arabic Encyclopedia and various other multimedia projects.

Recently, developed TilerPro which is a web-based software for construction of symmetric curves and their utilization in the design of aesthetic tiles. For more information, visit TilerPro web site.