This is my first article on this great site which helps me so much. My article is about the preprocessing steps for string matching. This field or string matching have a lot of applications specially in genetic research, searching within texts, and many other applications.
I tried so hard to find an algorithm that deals with this issue and can be built by C#, but I didn't find any code. So I decided to build my own code and put it up on this great site in the hands of great programmers like you. I hope my first step pleases you.
Suppose that you have this
cal.. is a prefix (start of the
string must begin with the first character)
nia.. is a suffix (must have the end character of the
orni,…. are substrings
string matching and analysis algorithms are able to efficiently skip comparisons by first spending "modest" time learning about the internal structure of either the pattern ( for example the word I search for), or the text (which I search in). This part of the overall algorithm is called the "Preprocessing stage".
string S and a position
i > 1:
Zi(S) = length of the longest substring of
S that starts at
i and matches a prefix of
S = aabcaabxaaz
Z<sub>5</sub>(s) =3 (aabc …..aabx)
Z<sub>6</sub>(s) =1 (aa …..ab)
Z<sub>7</sub>(s) = Z<sub>8</sub>(s) =0
For any position
i > 1 where
Zi > 0, the
i is defined as the interval starting at
i and ending at position
i + Zi – 1.
i > 1,
ri is the right-most endpoint of the
Z-boxes that begin at or before position
i. Another way to state this:
ri is the largest value of
j+Zj-1 over all
1<j= i such that
Step 1: Find
Z2 by explicitly comparing, the characters of
S[1…|S|] until a mismatch is found.
If Z2 > 0, then
r = r2 = Z2 + 1 and
l = l2 is set to
Step 3: Use
l to compute next
i > r, then
i is outside of a Zbox, repeat step 1.
i = r, then
i is inside a Zbox.
k is in a Zbox:
k is part of a previous prefix of
The character at
k appears at the previous position
The sub-string starting at
k (beta) matches a previous beta.
Zk > 0, then it also follows that:
k must match a prefix of
S of length at least the minimum of
Zand length beta.
Zk’ < length beta,
l remain unchanged.
Zk’ = length beta….
Zk’ = length beta….
Entire substring beta is a prefix of
Zk may be equal to length beta.
Zk may be larger than beta, so compare characters from prefix to characters from
S until a mismatch occurs.
Zk is then set depending upon where the first mismatch occurs and
These are the general steps of the Z algorithm.
Using the Code
I attached two *.zip files , the first "STRING_Matching" is the class which contains the function used, while the other "ZAlgorithm.zip" is the implementation or the test of the class
Exact_Matching S = new Exact_Matching();
string pattern = "abcdaaabchoabcd";
char Stest = pattern.ToCharArray();
int M = S.Zblock(Stest);
for (int i = 0; i < M.Length; i++)
- 15th March, 2008: Initial post