Implementing a std::map Replacement that Never Runs Out of Memory and Instructions on Producing an ARPA Compliant Language Model to Test the Implementation

Roy, Philippe


14 Dec 2008 | CPOL | 9 min read


An article on extending STL containers to cache their contents to disk so that they are no longer limited by available memory.

arpalanguagemodelgenerator.zip
- ARPALanguageModelGenerator
  - ARPALanguageModelGenerator.sln
  - ARPALanguageModelGenerator
    - ARPALanguageModelGenerator.cpp
    - ARPALanguageModelGenerator.vcproj
    - ClassDiagram1.cd
    - IndexStructure.h
    - LMEngine.cpp
    - LMEngine.h
    - shared_auto_ptr.h
    - SimpleTokenizer.cpp
    - SimpleTokenizer.h
    - Small test.txt
    - stdafx.cpp
    - stdafx.h
    - targetver.h
    - Tokenizer.cpp
    - Tokenizer.h
    - udis.txt
  - small.txt

#ifndef __TOKENIZER_H__
#define __TOKENIZER_H__

#include <istream>
#include <string>

using namespace std;

// CTokenizer is a base class defining how tokenization should proceed.

class CTokenizer
{

public:

	// Constructor:

	// REQUIREMENTS:
	// An istream successfully opened for reading.
	// PROMISES:
	// The object is ready to return tokens through GetNextToken() whenever HasMoreToken() returns true.

	CTokenizer(istream &inputStream) throw();

	// Destructor:

	// REQUIREMENTS:
	// None.
	// PROMISES:
	// None.

	virtual ~CTokenizer() throw();

	// GetNextToken():

	// REQUIREMENTS:
	// HasMoreToken() must have returned true for this call to return an actual token; otherwise, it returns an empty string.
	// PROMISES:
	// The next token from the input stream.

	virtual string GetNextToken() throw() = 0;

	// HasMoreToken():

	// REQUIREMENTS:
	// None.
	// PROMISES:
	// If the return value is true, GetNextToken() will return a token; otherwise, no more tokens are available from the input stream.

	virtual bool HasMoreToken() throw() = 0;

protected:

	istream &m_stream;
};

#endif
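
The contract documented in the header above (HasMoreToken() tells the caller whether GetNextToken() will yield a token, and GetNextToken() returns an empty string otherwise) can be illustrated with a small derived class. What follows is only a sketch under stated assumptions, not the SimpleTokenizer shipped in the download: CWhitespaceTokenizer and CollectTokens are hypothetical names, the base class is restated in condensed form so the sketch compiles on its own (it is not meant to be compiled together with Tokenizer.h), and noexcept stands in for the header's throw() specifications so that current compilers accept it.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Condensed restatement of the CTokenizer interface, so this sketch is self-contained.
// Assumption: noexcept replaces the original throw() exception specifications.
class CTokenizer
{
public:
    explicit CTokenizer(std::istream &inputStream) noexcept : m_stream(inputStream) {}
    virtual ~CTokenizer() noexcept {}
    virtual std::string GetNextToken() noexcept = 0;
    virtual bool HasMoreToken() noexcept = 0;

protected:
    std::istream &m_stream;
};

// Hypothetical subclass (not part of the download): tokens are whitespace-separated words.
class CWhitespaceTokenizer : public CTokenizer
{
public:
    explicit CWhitespaceTokenizer(std::istream &inputStream) noexcept
        : CTokenizer(inputStream), m_hasPending(false) {}

    // Reads one token ahead, so a true result guarantees that
    // the next GetNextToken() call has something to return.
    bool HasMoreToken() noexcept override
    {
        if (!m_hasPending && (m_stream >> m_pending))
            m_hasPending = true;
        return m_hasPending;
    }

    // Returns the buffered token, or an empty string once the stream is exhausted.
    std::string GetNextToken() noexcept override
    {
        if (!HasMoreToken())
            return std::string();
        m_hasPending = false;
        return m_pending;
    }

private:
    std::string m_pending; // token read ahead by HasMoreToken()
    bool m_hasPending;     // true while m_pending holds an unconsumed token
};

// Hypothetical helper: drains a tokenizer over an in-memory string.
std::vector<std::string> CollectTokens(const std::string &text)
{
    std::istringstream input(text);
    CWhitespaceTokenizer tokenizer(input);
    std::vector<std::string> tokens;
    while (tokenizer.HasMoreToken())
    {
        std::string token = tokenizer.GetNextToken();
        assert(!token.empty()); // the contract: true from HasMoreToken() means a real token
        tokens.push_back(token);
    }
    return tokens;
}
```

Reading one token ahead inside HasMoreToken() is one simple way to honour the promise in the header: checking the stream's end-of-file state alone would report true for input that contains only trailing whitespace, and the following GetNextToken() call would then have nothing to return.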


License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL).

Written By

Roy, Philippe

Software Developer (Senior)

Canada

Philippe Roy was a key contributor throughout his 20+ year career at many high-profile companies, including Nuance Communications, IBM (ViaVoice and ProductManager), and VoiceBox Technologies. He is creative and proficient in OO coding and design, knowledgeable about the intellectual-property world (he holds many patents), trilingual, and passionate about being part of a team that creates great solutions.

Oh yes, I almost forgot to mention, he has a special thing for speech recognition and natural language processing... The magic of first seeing a computer transform something as chaotic as sound and natural language into intelligible and useful output has never left him.
