Click here to Skip to main content
Licence GPL3
First Posted 1 Mar 2004
Views 55,645
Bookmarked 13 times

Fast Textual Data Processing using Hashed Strings

By | 6 Mar 2004 | Article
Managing a textual numbering system for hashing words into numbers

Introduction

Strings are considered heavy nesting loop generators. Comparisons, searching, and sorting strings are considered time consuming operations. Indexing textual data is a good step, however setting searching key as string yields bad performance for search algorithms.

In this article we introduce an ADT abstracting strings with hash-codes. Comparisons will no more be done with byte comparisons. Each words in the alphabet will be assigned an invertible hash-code.

Hashing Function

Hashing functions aim to represent a collection of data in a single code. Codes may overlap, so a good hash function aims to minimise overlapping between codes. Systems using hashing should also manage overflow problems. What we represent here could be called as hashing. However, it is invertible and assures uniquness. Actually, it manages a numbering system for textual data.

Consider the word "baby", calculating hash-code is simply as converting from binary to decimal numbering systems. In this case, we convert from 26th numbring system to decimal, where 26 is the number of english alphbet characters.

("baba")26=0x260+1x261+0x262+1x263=(17626)10

("babb")26=1x260+1x261+0x262+1x263=(17627)10

This means that "baba" is less than "babb". Comparison is done with numbers. No more nesting loops is needed for comparisons.

Class hashedstring

class hashedstring

{

public:
    hashedstring():_pdata(0), _hashkey(0), _size(0){}
    hashedstring(const hashedstring& hstr):_pdata(strdup(hstr._pdata)), _
        hashkey(hstr._hashkey), _size(hstr._size){}
    hashedstring(const char* pdata):_pdata(strdup(pdata)), 
        _hashkey(0), _size(strlen(pdata))
    {
        _computehashkey();
    }
   
    hashedstring operator+(const hashedstring& s) const 
    {
        hashedstring rtn=*this;
        strcat(rtn._pdata, s._pdata);
        rtn._hashkey+=_hashkey*pow(26, s._size);
        rtn._size+=s._size;
        return rtn;
    }

 
    hashedstring operator+=(const hashedstring& s)
    { 
        strcat(_pdata, s._pdata);
        _hashkey=_hashkey*pow(26, s._size)+s._hashkey;
        _size+=s._size;
        return *this;
    }
    
    hashedstring operator=(const hashedstring& s) 
    {
        clear();
        _pdata=strdup(s._pdata);
        _hashkey+=s._hashkey;
        _size=s._size;
        return *this;
    }

 
    void clear()
    {
        if(_pdata)
            delete []_pdata;
        _pdata=0;
        _size=0;
        _hashkey=0;
    }

 
    ~hashedstring(){}
    
    operator const double() const {return _hashkey;}
    
    
protected:
    

    double _computehashkey()
    {
        unsigned int i=0;
        _hashkey=0;
        for(i=_size-1; i!=-1; i--)
            _hashkey+=(tolower(_pdata[i])-'a')*pow(26, _size-1-i);
        return _hashkey;
    }
    
    char* _pdata;
    double _hashkey;
    unsigned int _size;

};

Conclusions

Maintaining alphabet numbering system provides faster string processing. Calculating hash codes, an offline process, should be added in indexing huge textual data. Then the system will enjoy fast comparisons, search, and sort routines.

    
    hashedstring s1("baby");
    hashedstring s2("babybaba");
    hashedstring s3("babybabb");
    
    vprint(s1);
    
    s1+="baba";
    
    vprint(s1);
    vprint(s2);
    vprint(s3);
    
    vprint(s1<s2);
    vprint(s2<s3);
    vprint(s2==s3);
    
    vprint(s1==s2);


Results

    s1= 17626
    s1= 8054676578
    s2= 8054676578
    s3= 8054676579
    s1<s2= 0
    s2<s3= 1
    s2==s3= 0
    s1==s2= 1

License

This article, along with any associated source code and files, is licensed under The GNU General Public License (GPLv3)

About the Author

Mohammed Hossny

Other

Australia Australia

Member



Sign Up to vote   Poor Excellent
Add a reason or comment to your vote: x
Votes of 3 or less require a comment

Comments and Discussions

 
You must Sign In to use this message board. (secure sign-in)
 
Search this forum  
 FAQ
    Noise  Layout  Per page   
  Refresh
GeneralVERY Very bad article PinmemberSB20035:07 29 Dec '09  
GeneralRe: VERY Very bad article PinmemberMohammed Hossny15:49 28 Sep '10  
GeneralTurkmenistana = Turkmenistanb PinmemberZegee1:05 12 Jul '07  
QuestionWhere can I find a working implementation? PinsussRobert F.0:13 23 Sep '04  
AnswerRe: Where can I find a working implementation? PinmemberMohammed Hossny0:22 23 Sep '04  
GeneralRe: Where can I find a working implementation? PinsussRobert F0:38 23 Sep '04  
GeneralRe: Where can I find a working implementation? PinmemberMohammed Hossny1:15 23 Sep '04  
GeneralRe: Where can I find a working implementation? PinsussRobert F.3:23 23 Sep '04  
GeneralRe: Where can I find a working implementation? PinmemberMohammed Hossny12:16 23 Sep '04  
GeneralOptimizator : "Does it speaks to you ?" PinmemberKochise21:20 18 Mar '04  
GeneralRe: Optimizator : "Does it speaks to you ?" PinmemberMohammed Hossny20:56 20 Mar '04  
GeneralYes, that's it, hashing at compile time... PinmemberKochise4:32 21 Mar '04  
Generalaaa = aa = a = 0 PinmemberChris Roberts7:49 8 Mar '04  
GeneralRe: aaa = aa = a = 0 PinmemberMohammed Hossny23:27 8 Mar '04  
QuestionCould you fix the horizontal scrolling annoyance? PinmemberWREY10:35 3 Mar '04  
AnswerRe: Could you fix the horizontal scrolling annoyance? PinmemberMohammed Hossny11:45 7 Mar '04  
GeneralRe: Could you fix the horizontal scrolling annoyance? Pinmemberjongman17:09 7 Mar '04  
GeneralNice, but... it does not work. Pinmemberanickol21:57 2 Mar '04  
GeneralRe: Nice, but... it does not work. PinmemberJon23:02 2 Mar '04  
GeneralRe: Nice, but... it does not work. PinmemberMohammed Hossny9:40 5 Mar '04  
QuestionCollision space? Pinmembersdoyle6:25 2 Mar '04  
AnswerRe: Collision space? PinmemberMohammed Hossny7:20 2 Mar '04  
GeneralNice concept... not practical.. PinmemberPeter Mares2:53 2 Mar '04  
GeneralRe: Nice concept... not practical.. PinmemberMohammed Hossny3:04 2 Mar '04  

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.

Permalink | Advertise | Privacy | Mobile
Web02 | 2.5.120517.1 | Last Updated 7 Mar 2004
Article Copyright 2004 by Mohammed Hossny
Everything else Copyright © CodeProject, 1999-2012
Terms of Use
Layout: fixed | fluid