 |
|
 |
Have you thought that pow(26, sizeof) or whatever you did there in that fked up code of urs will just die on you when you try to index megabytes of data?
Why write an article that only works with words like "abc" "adb"
Come on, this is a serious website. This is not a KinderGarden here!
Real programmers visit this site, you're just embarrasing yourself!
Create a 10 or 100MB text file and try to work with it in less than a sec. Have you done a search on google? Have you seen their result times?
A HASH is no way closer that what you did here!
This article is good for polynomial demos or something.
Speed is what HASH is all about. You did not even considered a MODULO (%) HASH_SIZE to loop around the table that contains your hash values.
PLEASE! Write your stupid articles somewhere else, man!
Or learn a bit a programming first.
|
|
|
|
 |
|
 |
Well. I am so sorry you did not like the article but where have you been since 2004. You are commenting on an almost 6 years old article. Anyway, i agree with you. The only explanation i can give you is that this module was part of a bigger project for indexing and mainly analyzing Arabic literature. As you know (I am assuming here that you know so much about search engines and textual processing) Arabic (and probably Persian, Urdu, and Hebrew but i am not sure) words are always derived from 3 letters roots and the rest are modifications, morphology, and additives that can alter the meaning of the word slightly but in the same time preserves its meaning categorization under the same 3-letters root. This will allow us to derive equations representing phrases, sentences, paragraphs, and even books with a numerical signals. Note that textual morphology in Arabic is not as simple as adding an "ed" for past tense. Anyway, this project was about mapping texts from characters into the 28th numbering system (i used the 26th numbering system because code project was not supporting Arabic text within code snippets back in 2004) in order to analyze Arabic literature numerically by means of numbering theory.
I know the code is not perfect and is not the fastest, but it does the trick for that research.
Thanks for the constructive comments though.
Regards.
Yours, M. Hossny
|
|
|
|
 |
|
 |
Does not work with long words
|
|
|
|
 |
|
 |
This method seems to be interesting but obviously the presented code does not work .
Are there any sites on Internet or other resources where I can find a correct implementation (and/or discussion of the method)?
Thank you!
R.F.
|
|
|
|
 |
|
 |
hi, can u just specify the problem!!
i downloaded it now and it works very well!!
thx alot for time, salam.
Yours, M. Hossny
|
|
|
|
 |
|
 |
Mohammed Hossny wrote:
hi, can u just specify the problem!!
1. There are 256 different char[acter]s, not 26.
2. How is overflow handled? A hash-key is usually a long not a double.
Best wishes,
R.F.
P.S. Any references?
|
|
|
|
 |
|
 |
Well,
Robert F wrote:
1. There are 256 different char[acter]s, not 26.
u r right. Actually this is a preprocessing step to representing a text file as a collection of numbers. This idea aims to minimize effort per word.
Representing the whole paragraph with a single number is not prctical nor the objective of this idea.
I m working on textual computations, i need to represent a corpuse of text as a vector of values. Mathematics could help then searching text neumerically.
It may be extended to represent meening too .
Robert F wrote:
2. How is overflow handled?
this hashing is applied to textual data only!! can u get a word with length larger than 10 characters!?
Robert F wrote:
A hash-key is usually a long not a double.
i agree u!! i used double as a temporary solution!! very long integers should be used later!!
Thx, alot for comment.
salam.
Yours, M. Hossny
|
|
|
|
 |
|
 |
Thank you for your quick answers!
After having thought about it: All you can do is to convert the first 4 characters of a char-string into a long value and compare these values. This is not much when your strings all start with e.g. "http"
Best wishes,
R.F.
|
|
|
|
 |
|
 |
hi, the hash value is computed for the whole string (not the first 4 characters only).
when this class is used instead of CString and std::string it led to amazing speedup!!
this class is already used in a "Document Indexing for Full-Text Search" with amazing results!! check this http://www.sigmasquared.com/files/difts_sdk_and_demo.zip[^]
thx alot for time, salam.
Yours, M. Hossny
|
|
|
|
 |
|
 |
int IPOW(int x, int n) {return (!n)?1: (n&1)?x*IPOW(x,n-1):IPOW(x*x,n>>1);}
int HASH(char* c, int s=0){return (*c!=0)?HASH(c+1,s+1)+(((*c|0x20)-'a'+1)*IPOW(26,s)):0; }
Kochise aka Optimizator
REQUEST : Someone that will turn this into meta-template, like CRC ( http://www.codeproject.com/cpp/crc_meta.asp ). I already started one, but it fail as we cannot give a pointer into a template :
#define _HASH(S) (H<(char*)S,0>::R) #define _TX ((*c|0x20)-'a'+1)*P<26,s>::R template<int x, int n> class P { public: enum { R=(!n)?1: (n&1)?x*P<x,n-1>::R:P<x*x,n/2>::R }; };
template<char* c, int s> class H { public: enum { R=(*c!=0)?H<c+1,s+1>::R+_TX: 0 }; };
class H<NULL,0>{ public: enum { R=0 };};
The idea is to hash strings at compile time for string comparaison in loops or so :
int l_nHash = HASH(l_oStrString.GetBuffer(0));
if(l_nHash == _HASH("Test"))...
You get the idea ?
In Code we trust !
|
|
|
|
 |
|
 |
gor it, so u wanna hash the string at compile time .
well i ve some trials in template meta-programming, i ll post a new article within days .
thx alot, salam.
Yours, M. Hossny
|
|
|
|
 |
|
 |
I written a data parser with multiple if(l_oStrData.CompareNoCase("Test") == 0)It was not really slow, reading and parsing a file of 250 MB took 2 minutes and 22 seconds (my parsing algorithm is already VERY neat, so it's quite good with even CString).
I then just done a little array of int with an enum list for each entry, and hashed all the text inside each cell. Then, instead of a CompareNoCase, I just hashed the command line, and done a if(l_anHashedString[eTEST] == m_nHashedCmdLine) test. Result : 1 minute and 34 seconds !
33% faster, just on the tests ! And using a correct CString replacement would even add another boost !
Thanks for the idea I'll see if a can also use CRC with lowering the case first
Kochise
In Code we trust !
|
|
|
|
 |
|
 |
Thanks for the work, this is an intersting piece of code, but, as a = 0, it works for everything except when the text is all a's.
I've modified a local version which uses _size as part of it's comparisons as well. I've had to override all the comparison operators to handle this, but at least it works consistantly.
-cpr
|
|
|
|
 |
|
 |
well, u r right man .
i guess we shall start from 1..26 and working for base 27 , and consider the white space as 0. Hashkey calculation becomes as follows,
double _computehashkey()
{
unsigned int i=0;
_hashkey=0;
for(i=_size-1; i!=-1; i--)
_hashkey+=(tolower(_pdata[i])-'a'+1)*pow(27, _size-1-i);
return _hashkey;
}
thx for time, salam .
Yours, M. Hossny
|
|
|
|
 |
|
 |
It would be much better and easier, reading your stimulating article if the horizontal scrolling annoyance could be fixed.
William
Fortes in fide et opere!
|
|
|
|
 |
|
 |
sorry, cant get ur point
Yours, M. Hossny
|
|
|
|
 |
|
 |
I guess he means you should break up your example code into lines that don't exceed 80 characters.
Willem.
|
|
|
|
 |
|
 |
In your example s2 and s1 have the same string : "babybaba"
but their hash cdes are different. Just add vprint(s1==s2); to the end.
I think, there's some problem with algo.
|
|
|
|
 |
|
 |
There's a mistake in the operator+ code:
_hashkey+=s._hashkey*pow(26, _size);
Adds the new string's hash to the start of the existing string - not the end...
Jon
|
|
|
|
 |
|
 |
well man,
u r right , i ve just corrected this as follows
hashedstring operator+(const hashedstring& s) const
{
hashedstring rtn=*this;
strcat(rtn._pdata, s._pdata);
rtn._hashkey+=_hashkey*pow(26, s._size); // fast hashkey recalculation
rtn._size+=s._size;
return rtn;
}
hashedstring operator+=(const hashedstring& s)
{
strcat(_pdata, s._pdata);
_hashkey=_hashkey*pow(26, s._size)+s._hashkey; // fast hashkey recalculation
_size+=s._size;
return *this;
}
thx for time, salam.
Yours, M. Hossny
|
|
|
|
 |
|
 |
Have you worked out the collision space / characteristics for this hashing algorithm?
|
|
|
|
 |
|
 |
this is not a hashing function. it is a conversion from textual numbering system into decimal.
it is guaranteed to be invertible with no collesions .
thx, salam
Yours, M. Hossny
|
|
|
|
 |
|
 |
Hey there...
Very nice concept... and I think it go somewhere with a little more thought on how to handle longer strings, or even paragraphs of text. Sooner or later you are going to overflow.
Any thoughts on that?
www.kinkycode.com
[Glossary Manager] [AfterThought Backup Lite]
99 little bugs in the code, 99 little bugs,
Fix 1 bug, recompile....
101 little bugs in the code...
|
|
|
|
 |
|
 |
Well,
u r right. Actually this is a preprocessing step to representing a text file as a collection of numbers. This idea aims to minimize effort per word.
Representing the whole paragraph with a number is not prctical nor the objective of this idea.
I m working on textual computations, i need to represent a corpuse of text as a vector of values. Mathematics could help then searching text neumerically.
It may be extended to represent meening too .
Anyway thx, alot for comment.
salam.
Yours, M. Hossny
|
|
|
|
 |