Click here to Skip to main content
12,696,092 members (30,418 online)
Click here to Skip to main content
Add your own
alternative version


167 bookmarked

N-gram and Fast Pattern Extraction Algorithm

, 31 Oct 2007 GPL3
Rate this:
Please Sign up or sign in to vote.
This article demonstrates N-gram construction and Fast Text Pattern Extraction using a modified LZW algorithm.
Sample Image - Patterns.jpg



In this article, I introduce a very fast algorithm to extract text patterns from large size text and give statistical information about patterns' frequency and length. Actually, the idea of this algorithm came to me when one of my friends asked me to give him an idea for extracting patterns from text. I told him immediately that he could use the LZW compression algorithm, take the final dictionary and drop the compressed buffer, and then he could have a dictionary containing all text patterns with each pattern frequency. I don't know if he understood me or not, but I decided to do it later. If pattern word count is fixed (N), then it is a generation for N-gram of the input sequence.



An n-gram is a sub-sequence of n items from a given sequence. n-grams are used in various areas of statistical natural language processing and genetic sequence analysis. The items in question can be characters, words or base pairs according to the application. For example, the sequence of characters "Hatem mostafa helmy" has a 3-gram of ("Hat", "ate", "tem", "em ", "m m", ...), and has a 2-gram of ("Ha", "at", "te", "em", "m ", " m", ...). This n-gram output can be used for a variety of R&D subjects, such as Statistical machine translation and Spell checking.

Pattern Extraction

Pattern extraction is the process of parsing a sequence of items to find or extract a certain pattern of items. Pattern length can be fixed, as in the n-gram model, or it can be variable. Variable length patterns can be directives to certain rules, like regular expressions. They can also be random and depend on the context and pattern repetition in the patterns dictionary.

Algorithm Idea for Variable Length Pattern Extraction

The algorithm introduced here is derived from the LZW compression algorithm, which includes a magic idea about generating dictionary items at compression time while parsing the input sequence. If you have no idea about LZW, you can check it out at my article, Fast LZW compression. And of course, the algorithm inherits the speed of my implementation to LZW, plus extra speed for two reasons:

  1. The parsing item is a word, not a letter
  2. There's no destination buffer, as there is no need for a compressed buffer

The algorithm uses a binary tree to keep extracted patterns that give the algorithm excellent speed at run-time to find and fetch new items to the dictionary. Let us discuss the algorithm pseudo code. We have some figures to clarify the idea with an algorithm flow chart and a simple example.

Example: the input words sequence w0w1w2w3w4w0w1w5w6w7w2w3w4w0w1w5w6

Assume n equals 2; then the initial pattern will be w0w1. After applying the algorithm steps, the resultant dictionary would be as in the fourth column:

Input Sequence Pattern Step Dictionary Frequency
w0w1w2w3w4w0w1w5w6w7w2w3w4w0w1w5w6 w0w1 Words available?
w0w1w2w3w4w0w1w5w6w7w2w3w4w0w1w5w6 w0w1 Add to dictionary w0w1 3
w0w1w2w3w4w0w1w5w6w7w2w3w4w0w1w5w6 w0w1 Pattern exists?
w1w2w3w4w0w1w5w6w7w2w3w4w0w1w5w6 w1w2 Take new pattern
w1w2w3w4w0w1w5w6w7w2w3w4w0w1w5w6 w1w2 Words available?
w1w2w3w4w0w1w5w6w7w2w3w4w0w1w5w6 w1w2 Add to dictionary w1w2 1
w1w2w3w4w0w1w5w6w7w2w3w4w0w1w5w6 w1w2 Pattern exists?
w2w3w4w0w1w5w6w7w2w3w4w0w1w5w6 w2w3 Take new pattern
w2w3w4w0w1w5w6w7w2w3w4w0w1w5w6 w2w3 Words available?
w2w3w4w0w1w5w6w7w2w3w4w0w1w5w6 w2w3 Add to dictionary w2w3 2
w2w3w4w0w1w5w6w7w2w3w4w0w1w5w6 w2w3 Pattern exists?
w3w4w0w1w5w6w7w2w3w4w0w1w5w6 w3w4 Take new pattern
w3w4w0w1w5w6w7w2w3w4w0w1w5w6 w3w4 Words available?
w3w4w0w1w5w6w7w2w3w4w0w1w5w6 w3w4 Add to dictionary w3w4 2
w3w4w0w1w5w6w7w2w3w4w0w1w5w6 w3w4 Pattern exists?
w4w0w1w5w6w7w2w3w4w0w1w5w6 w4w0 Take new pattern
w4w0w1w5w6w7w2w3w4w0w1w5w6 w4w0 Words available?
w4w0w1w5w6w7w2w3w4w0w1w5w6 w4w0 Add to dictionary w4w0 2
w4w0w1w5w6w7w2w3w4w0w1w5w6 w4w0 Pattern exists?
w0w1w5w6w7w2w3w4w0w1w5w6 w0w1 Take new pattern
w0w1w5w6w7w2w3w4w0w1w5w6 w0w1 Words available?
w0w1w5w6w7w2w3w4w0w1w5w6 w0w1 Add to dictionary
w0w1w5w6w7w2w3w4w0w1w5w6 w0w1 Pattern exists?
w0w1w5w6w7w2w3w4w0w1w5w6 w0w1w5 Add word to pattern
w0w1w5w6w7w2w3w4w0w1w5w6 w0w1w5 Words available?
w0w1w5w6w7w2w3w4w0w1w5w6 w0w1w5 Add to dictionary w0w1w5 2
w0w1w5w6w7w2w3w4w0w1w5w6 w0w1w5 Pattern exists?
w1w5w6w7w2w3w4w0w1w5w6 w1w5 Take new pattern
w1w5w6w7w2w3w4w0w1w5w6 w1w5 Words available?
w1w5w6w7w2w3w4w0w1w5w6 w1w5 Add to dictionary w1w5 2
w1w5w6w7w2w3w4w0w1w5w6 w1w5 Pattern exists?
w5w6w7w2w3w4w0w1w5w6 w5w6 Take new pattern
w5w6w7w2w3w4w0w1w5w6 w5w6 Words available?
w5w6w7w2w3w4w0w1w5w6 w5w6 Add to dictionary w5w6 1
w5w6w7w2w3w4w0w1w5w6 w5w6 Pattern exists?
w6w7w2w3w4w0w1w5w6 w6w7 Take new pattern
w6w7w2w3w4w0w1w5w6 w6w7 Words available?
w6w7w2w3w4w0w1w5w6 w6w7 Add to dictionary w6w7 1
w6w7w2w3w4w0w1w5w6 w6w7 Pattern exists?
w7w2w3w4w0w1w5w6 w7w2 Take new pattern
w7w2w3w4w0w1w5w6 w7w2 Words available?
w7w2w3w4w0w1w5w6 w7w2 Add to dictionary w7w2 1
w7w2w3w4w0w1w5w6 w7w2 Pattern exists?
w2w3w4w0w1w5w6 w2w3 Take new pattern
w2w3w4w0w1w5w6 w2w3 Words available?
w2w3w4w0w1w5w6 w2w3 Add to dictionary
w2w3w4w0w1w5w6 w2w3 Pattern exists?
w2w3w4w0w1w5w6 w2w3w4 Add word to pattern
w2w3w4w0w1w5w6 w2w3w4 Words available?
w2w3w4w0w1w5w6 w2w3w4 Add to dictionary w2w3w4 1
w2w3w4w0w1w5w6 w2w3w4 Pattern exists?
w3w4w0w1w5w6 w3w4 Take new pattern
w3w4w0w1w5w6 w3w4 Words available?
w3w4w0w1w5w6 w3w4 Add to dictionary
w3w4w0w1w5w6 w3w4 Pattern exists?
w3w4w0w1w5w6 w3w4w0 Add word to pattern
w3w4w0w1w5w6 w3w4w0 Words available?
w3w4w0w1w5w6 w3w4w0 Add to dictionary w3w4w0 1
w3w4w0w1w5w6 w3w4w0 Pattern exists?
w4w0w1w5w6 w4w0 Take new pattern
w4w0w1w5w6 w4w0 Words available?
w4w0w1w5w6 w4w0 Add to dictionary
w4w0w1w5w6 w4w0 Pattern exists?
w4w0w1w5w6 w4w0w1 Add word to pattern
w4w0w1w5w6 w4w0w1 Words available?
w4w0w1w5w6 w4w0w1 Add to dictionary w4w0w1 1
w4w0w1w5w6 w4w0w1 Pattern exists?
w0w1w5w6 w0w1 Take new pattern
w0w1w5w6 w0w1 Words available?
w0w1w5w6 w0w1 Add to dictionary
w0w1w5w6 w0w1 Pattern exists?
w0w1w5w6 w0w1w5 Add word to pattern
w0w1w5w6 w0w1w5 Words available?
w0w1w5w6 w0w1w5 Add to dictionary
w0w1w5w6 w0w1w5 Pattern exists?
w0w1w5w6 w0w1w5w6 Add word to pattern
w0w1w5w6 w0w1w5w6 Words available?
w0w1w5w6 w0w1w5w6 Add to dictionary w0w1w5w6 1
w0w1w5w6 w0w1w5w6 Pattern exists?
w1w5w6 w1w5 Take new pattern
w1w5w6 w1w5 Add to dictionary
w1w5w6 w1w5 Pattern exists?
w1w5w6 w1w5w6 Add word to pattern
w1w5w6 w1w5w6 Words available?
w1w5w6 w1w5w6 Add to dictionary w1w5w6 1
w1w5w6 w1w5w6 Pattern exists?
w5w6 w5w6 Take new pattern
w5w6 w5w6 Words available?
w5w6 w5w6 Add to dictionary
w5w6 w5w6 Pattern exists?
Take new pattern
Words available?

Algorithm Pseudo Code

ConstructPatterns(src, delimiters, n, fixed)
    des = AllocateBuffer()
    Copy(des, src)
    DiscardDelimiters(des, delimiters)
    dic = InitializePatternsDictionary()

    pattern = InitializeNewPattern(des)
        node = dic.Insert(pattern)
        if(!fixed AND node.IsRepeated)
            AddWordToPattern(des, pattern)
            pattern = InitializeNewPattern(des)

Code Description


This function receives the input buffer, copies it to a destination buffer, and parses it to add found patterns to the dictionary. The constructed dictionary is a binary tree template CBinaryTree<CPattern, CPattern*, int, int> m_alpDic with a key of type CPattern.

void CPatternAlaysis::ConstructPatterns(BYTE *pSrc, int nSrcLen,
    LPCSTR lpcsDelimiters /*= NULL*/,
    int nMinPatternWords /*= 2*/,
    bool bFixedNGram /*= false*/)
    // discard initial spaces
    // allocate destination buffer
    // discard delimiters
    // discard repeated spaces
    // initialize dictionary
    // tree node to keep last success search to start with
    CBinaryTreeNode<CPattern, int>* pNode = m_alpDic.Root;
    // left m_alpDic Samples points to the source buffer
    int nPrevLength;
    CPattern node(m_pDes, GetPatternLength(
        m_pDes, nPrevLength, nMinPatternWords));
    // scan the input buffer
    while(node.m_pBuffer < m_pDes+nDesLen)
        pNode = m_alpDic.Insert(&node, -1, pNode);
        pNode->Key.m_nFrequency = pNode->Count;
        if(bFixedNGram == false && pNode->Count > 1)
        // (repeated pattern), increment node length
        // by a new word length
        node.m_nLength += AddWordToPattern(node.m_pBuffer+node.m_nLength);
        {   // initialize node to next entity
            node.m_pBuffer += nPrevLength;
            node.m_nLength = GetPatternLength(node.m_pBuffer,
                nPrevLength, nMinPatternWords);
            // initialize binary tree search root
            pNode = m_alpDic.Root;

Note: The first good point in this function is that it allocates one buffer for the dictionary and all dictionary nodes point to their start buffer, keeping their buffer length in the class CPattern. So, no allocation or reallocation is done during the algorithm.

class CPattern
    CPattern()    {}
    CPattern(BYTE* pBuffer, int nLength)
    m_pBuffer = pBuffer, m_nLength = nLength;
    CPattern(const CPattern& buffer)
    *this = buffer;

    BYTE* m_pBuffer;
    int m_nLength;
    int m_nFrequency;
    inline int compare(const CPattern* buffer);
    {    ...  }
    inline void operator=(const CPattern* buffer)
    m_pBuffer = buffer->m_pBuffer;
    m_nLength = buffer->m_nLength;

The function does the steps of the pseudo code. The second good point is the usage of a binary tree CBinaryTree to keep the dictionary, with a very good trick here:

The function Insert() of the tree takes a third parameter to start the search from. In normal cases, this parameter should be the tree Root. However, if a pattern is found and a new word is added to it, then we can start the search for the pattern from the current node, as it must be under current node. This is because it is only the previous pattern plus a new word.

In the case of a new pattern, we should start the search from the tree root, so we have this line at the bottom of the function:

// initialize binary tree search root
pNode = m_alpDic.Root;


This function doesn't construct patterns. It just retrieves the constructed patterns with three types of sort: Alphabetical, Frequency and Pattern length. The returned patterns are stored in a vector of patterns (OUT vector<CPattern*>& vPatterns).

  1. Alphabetical

    CBinaryTreeNode<CPattern, int>* pAlpNode = m_alpDic.Min(m_alpDic.Root);
        if(pAlpNode->Count > 1 || !bIgnoreUniquePatterns)
            // ignore unique pattern
        pAlpNode = m_alpDic.Successor(pAlpNode);
  2. Frequency

    // construct a new dictionary to sort stored patterns
    // depending on frequency
    CBinaryTree<CValue<int>, int, vector<CPattern*>,
        vector<CPattern*>* >  displayDic;
    CBinaryTreeNode<CPattern, int>* pAlpNode = m_alpDic.Min(m_alpDic.Root);
    while(pAlpNode != NULL)
        if(pAlpNode->Count >    1 || !bIgnoreUniquePatterns)
            // ignore unique pattern
        pAlpNode = m_alpDic.Successor(pAlpNode);
    // iterate through the binary tree to get sorted pattern
    // (depend on frequency)
    CBinaryTreeNode<CValue<int>, vector<CPattern*>* pNode =
        for(vector<CPattern*>::iterator i = pNode->Data.begin(),
            end = pNode->Data.end(); i != end; i++)
        pNode = displayDic.Predecessor(pNode);
  3. Pattern length

    // construct a new dictionary to sort stored patterns
    // depending on Pattern length
    CBinaryTree<CValue<int>, int, vector<CPattern*>,
        vector<CPattern*>* >  displayDic;
    CBinaryTreeNode<CPattern, int>* pAlpNode = m_alpDic.Min(m_alpDic.Root);
    while(pAlpNode != NULL)
        if(pAlpNode->Count >    1 || !bIgnoreUniquePatterns)
           // ignore unique pattern
        pAlpNode = m_alpDic.Successor(pAlpNode);
    // iterate through the binary tree to get sorted pattern
    // (depend on Pattern length)
    CBinaryTreeNode<CValue<int>, vector<CPattern*>* pNode =
        for(vector<CPattern*>::iterator i = pNode->Data.begin(),
            end = pNode->Data.end(); i != end; i++)
        pNode = displayDic.Predecessor(pNode);


This function retrieves the stored patterns count.

int CPatternAlaysis::GetPatternCount()
    return m_alpDic.Count;

Points of Interest

  1. Algorithm Accuracy

    The algorithm doesn't give accuracy about pattern frequency in the case of variable length patterns (not n-gram with fixed n). That is because the algorithm constructs patterns while parsing the sequence and checks each constructed pattern with the dynamic dictionary. So, if any pattern is first added to the dictionary, a new pattern is constructed starting from the second word of the previous pattern with length n (min pattern length).

  2. Cross-Document Co-reference

    Cross-Document Co-reference is the process of finding a relation between documents. In other words, we can say that two documents are related if the two documents contain similar patterns. This subject is studied in many articles, but I found that the best one is "A Methodology for Cross-Document Coreference" by Amit Bagga and Alan W.Biermann. My algorithm may be helpful to generate patterns that can be taken to find a relation between patterns. The good point here is the very good speed of the algorithm, so it can be used for large numbers of documents like the web. However, directed patterns are better than random patterns to find documents' co-reference.

  3. DIPRE: Dual Iterative Pattern Relation Expansion

    This is an algorithm introduced by Sergey Brin to collect related information from scattered web sources. I like this idea very much and invite all of you to read his article and search the web for its implementation or even its flowchart. My algorithm can't be used here, as it collects patterns without any guided information about retrieved patterns. However, the FIPRE algorithm is a semi-directed algorithm, as it guides the initial pattern search with regular expressions to identify the required pattern, like "book title" and "author name." Alternatively, if the algorithm collects mails, it will include all regular expressions for mails like that:

    [^ \:\=@;,,$$**++\t\r\n""'<>/\\%??()&]+@[Hh]
    [^ \:\=@;,,$$**++\t\r\n""'<>/\\%??()&]+@[Yy]
    [^ \:\=@;,,$$**++\t\r\n""'<>/\\%??()&]+@(AOL|aol|Aol).com


  • 10/09/2007: Posted version v0.9000
  • 23/09/2007: Updated the source file vector.cpp to initialize the vector buffer with zeros
  • 31/10/2007: Updated the header file vector.h to solve "insert" function bug


Thanks to...



This article, along with any associated source code and files, is licensed under The GNU General Public License (GPLv3)


About the Author

You may also be interested in...


Comments and Discussions

GeneralMy vote of 5 Pin
manoj kumar choubey28-Mar-12 0:55
membermanoj kumar choubey28-Mar-12 0:55 
QuestionHow fast is your fast? Pin
Sanmayce7-Jan-12 10:06
memberSanmayce7-Jan-12 10:06 
QuestionArticle "Fast LZW compression" not found Pin
JohnPool23-Sep-11 4:58
memberJohnPool23-Sep-11 4:58 
AnswerRe: Article "Fast LZW compression" not found Pin
Hatem Mostafa24-Sep-11 1:59
memberHatem Mostafa24-Sep-11 1:59 
GeneralRe: Article "Fast LZW compression" not found Pin
Joao Araujo10-Jan-12 9:50
memberJoao Araujo10-Jan-12 9:50 
GeneralRe: Article "Fast LZW compression" not found Pin
Hatem Mostafa11-Jan-12 21:37
memberHatem Mostafa11-Jan-12 21:37 

Please, if you have a copy of my LZW article send it to me, as I don't have a backup of it.

My mail:

Many thanks,
GeneralMy vote of 5 Pin
RedDK1-Jun-11 10:51
memberRedDK1-Jun-11 10:51 
General~CPatternAlaysis(); // had to add this desctructor declaration to PE.h to avoid VS2010 compile error C2600 Pin
wycoder7-Apr-11 2:25
memberwycoder7-Apr-11 2:25 
GeneralC# implementation Pin
dde21-Jul-09 17:52
memberdde21-Jul-09 17:52 
GeneralMessage Removed Pin
Katekortez25-Oct-08 10:30
memberKatekortez25-Oct-08 10:30 
GeneralC# implementation Pin
navbas19-Sep-08 7:54
membernavbas19-Sep-08 7:54 
GeneralRe: C# implementation Pin
dde21-Jul-09 17:44
memberdde21-Jul-09 17:44 
GeneralBorn Analyst!! Pin
vamsidhar sunkari22-Feb-08 7:15
membervamsidhar sunkari22-Feb-08 7:15 
GeneralRe: Born Analyst!! Pin
Hatem Mostafa7-Aug-11 12:59
memberHatem Mostafa7-Aug-11 12:59 
GeneralGreat! Pin
Abu Abdillah30-Dec-07 9:27
memberAbu Abdillah30-Dec-07 9:27 
Questionmissing n+1 length patterns? Pin
b-yond18-Nov-07 7:58
memberb-yond18-Nov-07 7:58 
AnswerRe: missing n+1 length patterns? Pin
Hatem Mostafa18-Nov-07 9:24
memberHatem Mostafa18-Nov-07 9:24 
GeneralRe: missing n+1 length patterns? Pin
b-yond18-Nov-07 16:14
memberb-yond18-Nov-07 16:14 
GeneralRe: missing n+1 length patterns? Pin
yulin113-Dec-07 18:42
memberyulin113-Dec-07 18:42 
Generalplease help me!!!!!!!!!!!!11 Pin
allooba8-Nov-07 9:43
memberallooba8-Nov-07 9:43 
GeneralNice Pin
Paul Conrad4-Nov-07 15:06
memberPaul Conrad4-Nov-07 15:06 
QuestionDo you see a use for code breaking? Pin
D N Harris1-Oct-07 3:05
memberD N Harris1-Oct-07 3:05 
AnswerRe: Do you see a use for code breaking? !!! Pin
Hatem Mostafa1-Oct-07 23:02
memberHatem Mostafa1-Oct-07 23:02 
GeneralRe: Do you see a use for code breaking? Pin
D N Harris2-Oct-07 3:53
memberD N Harris2-Oct-07 3:53 
Generalvery good ! Pin
gogac26-Sep-07 12:25
membergogac26-Sep-07 12:25 
AnswerRe: very good ! Pin
Hatem Mostafa26-Sep-07 19:29
memberHatem Mostafa26-Sep-07 19:29 
Generallose file : ..\export\include\BinaryTree.h Pin
Stoney Tian11-Sep-07 5:17
memberStoney Tian11-Sep-07 5:17 
GeneralRe: lose file : ..\export\include\BinaryTree.h Pin
Hatem Mostafa11-Sep-07 7:17
memberHatem Mostafa11-Sep-07 7:17 
GeneralRhâââ... Hatem Mostafa ! Again... Pin
Kochise10-Sep-07 21:32
memberKochise10-Sep-07 21:32 
GeneralRe: Rhâââ... Hatem Mostafa ! Again... Pin
Steve_pqr19-Sep-07 6:16
memberSteve_pqr19-Sep-07 6:16 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.

| Advertise | Privacy | Terms of Use | Mobile
Web02 | 2.8.170118.1 | Last Updated 31 Oct 2007
Article Copyright 2007 by Hatem Mostafa
Everything else Copyright © CodeProject, 1999-2017
Layout: fixed | fluid