Comments and Discussions
I have a text file for indexing with the content:
hello my dear
The GenerateWordFreq function returned a Dictionary with three elements:
hello
my
dea
The last word is missing a letter.
The hOOt version is 2.0.
|
|
|
|

|
This work is very impressive and I look forward to experimenting with it. When the end user has selected a document, it appears that the only way an application using this engine can do highlighting is to highlight based upon individual words - is that correct?
|
|
|
|

|
If I understand your question correctly, then yes: hOOt does not do the highlighting; that is up to the application.
Its the man, not the machine - Chuck Yeager
If at first you don't succeed... get a better publicist
If the final destination is death, then we should enjoy every second of the journey.
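Since highlighting is left to the application, one possible approach is to wrap each query word found in the document text in markers. The following is only a minimal sketch: `Highlighter`, `HighlightWords` and the `[[ ]]` markers are hypothetical names chosen for illustration and are not part of hOOt. It assumes plain words with optional `+`/`-` prefixes and treats wildcard characters literally.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class Highlighter
{
    // Wraps every whole-word, case-insensitive occurrence of a query term in [[ ]].
    public static string HighlightWords(string text, string query)
    {
        string[] terms = query
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(t => t.TrimStart('+', '-'))   // drop query operators
            .Where(t => t.Length > 0)
            .Select(t => Regex.Escape(t))         // treat terms as literal text
            .ToArray();
        if (terms.Length == 0)
            return text;

        string pattern = @"\b(" + string.Join("|", terms) + @")\b";
        return Regex.Replace(text, pattern, m => "[[" + m.Value + "]]", RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        // prints: [[Hello]] my [[dear]], [[hello]] again
        Console.WriteLine(HighlightWords("Hello my dear, hello again", "+hello dear"));
    }
}
```

An application would substitute its own markup (for example HTML tags) for the `[[ ]]` markers.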
|
|
|
|

|
cool!
Can I perform searches on documents stored in a database?
|
|
|
|

|
Yes you can!
This currently works in the document version of RaptorDB.
|
|
|
|

|
Hi again Mehdi,
I finally decided to fix the "not" functionality and many other problems with the query in hoot.cs. It only took about half an hour to figure out, and in my opinion it now works flawlessly. Here are the functions you need to change in hoot.cs:
private WAHBitArray ExecutionPlan(string filter, int maxsize)
{
    DateTime dt = FastDateTime.Now;
    string[] words = filter.Split(' ');
    // If the query contains any '+' or '-', unprefixed terms default to AND.
    bool defaulttoand = false;
    if (filter.IndexOfAny(new char[] { '+', '-' }, 0) >= 0)
        defaulttoand = true;
    WAHBitArray bits = null;
    foreach (string s in words)
    {
        int c;
        string word = s;
        if (s == "") continue;
        OPERATION op = OPERATION.OR;
        if (defaulttoand)
            op = OPERATION.AND;
        if (s.StartsWith("+"))
        {
            op = OPERATION.AND;
            word = s.Replace("+", "");
        }
        if (s.StartsWith("-"))
        {
            op = OPERATION.ANDNOT;
            word = s.Replace("-", "");
        }
        if (word.Contains("*") || word.Contains("?"))
        {
            // Wildcard term: OR together the bitmaps of all matching index words.
            WAHBitArray wildbits = null;
            Regex reg = new Regex("^" + word.Replace("*", ".*").Replace("?", "."), RegexOptions.IgnoreCase);
            foreach (string key in _words.Keys())
            {
                if (reg.IsMatch(key))
                {
                    _words.TryGetValue(key, out c);
                    WAHBitArray ba = _bitmaps.GetBitmap(c);
                    wildbits = DoBitOperation(wildbits, ba, OPERATION.OR, maxsize);
                }
            }
            if (wildbits != null)
            {
                bits = DoBitOperation(bits, wildbits, op, maxsize);
            }
        }
        else if (_words.TryGetValue(word.ToLowerInvariant(), out c))
        {
            WAHBitArray ba = _bitmaps.GetBitmap(c);
            bits = DoBitOperation(bits, ba, op, maxsize);
        }
    }
    if (bits == null)
        return new WAHBitArray();
    // In document mode, exclude deleted documents from the result.
    WAHBitArray ret;
    if (_docMode)
        ret = bits.AndNot(_deleted.GetBits());
    else
        ret = bits;
    return ret;
}
private static WAHBitArray DoBitOperation(WAHBitArray bits, WAHBitArray c, OPERATION op, int maxsize)
{
    // NOT is handled by inverting the operand and then AND-ing it.
    if (op == OPERATION.ANDNOT)
    {
        c = c.Not(maxsize);
    }
    if (bits != null)
    {
        switch (op)
        {
            case OPERATION.ANDNOT:
            case OPERATION.AND:
                bits = c.And(bits);
                break;
            case OPERATION.OR:
                bits = c.Or(bits);
                break;
        }
    }
    else
        bits = c;
    return bits;
}
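The effect of the `+` and `-` prefixes in the two functions above can be illustrated with ordinary sets of record numbers standing in for WAH bitmaps. This is a simplified sketch rather than hOOt's implementation: `QuerySketch`, `Execute` and the dictionary-based index are hypothetical, wildcards are left out, and `recordCount` plays the role of `maxsize`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class QuerySketch
{
    // index maps a lower-case word to the record numbers that contain it.
    // Records are assumed to be numbered 0 .. recordCount - 1.
    public static SortedSet<int> Execute(Dictionary<string, HashSet<int>> index, string filter, int recordCount)
    {
        // As in the posted code: any '+' or '-' in the query makes AND the default.
        bool defaultToAnd = filter.IndexOfAny(new[] { '+', '-' }) >= 0;
        HashSet<int> result = null;

        foreach (string s in filter.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
        {
            bool not = s[0] == '-';
            bool and = defaultToAnd || s[0] == '+';
            string word = s.TrimStart('+', '-').ToLowerInvariant();

            HashSet<int> hits;
            if (!index.TryGetValue(word, out hits))
                continue; // unknown words are skipped

            // NOT: invert the operand against all records, then AND it.
            HashSet<int> operand = not
                ? new HashSet<int>(Enumerable.Range(0, recordCount).Except(hits))
                : new HashSet<int>(hits);

            if (result == null) result = operand;
            else if (and) result.IntersectWith(operand);
            else result.UnionWith(operand);
        }
        return new SortedSet<int>(result ?? new HashSet<int>());
    }

    public static void Main()
    {
        var index = new Dictionary<string, HashSet<int>>
        {
            { "god",  new HashSet<int> { 0, 1, 2 } },
            { "love", new HashSet<int> { 1, 2, 3 } },
            { "fear", new HashSet<int> { 2 } }
        };
        Console.WriteLine(string.Join(",", Execute(index, "god love", 4)));   // 0,1,2,3
        Console.WriteLine(string.Join(",", Execute(index, "+god +love", 4))); // 1,2
        Console.WriteLine(string.Join(",", Execute(index, "+god -fear", 4))); // 0,1
    }
}
```

Note that, as in the posted code, a hyphen anywhere in the query (not only at the start of a term) switches the default to AND.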
Then don't forget to fix how the text is split and to convert all text to lower case.
private static char[] punctuation = new char[]
{
    ' ', '.', ',', ';', ':', '?', '!', '\'', '-', '_', '(', ')', '[', ']',
    '{', '}', '/', '\\', '\"', '¿', '*', '¡', '«', '»', '=', '@', '¤', '%',
    '&', '+', '§', '|', '^', '¨', '~', '$', '“', '”', '’', '‘'
};
private void AddtoIndex(int recnum, string text)
{
    if (text == "" || text == null)
        return;
    string[] keys;
    if (_docMode)
    {
        //_log.Debug("text size = " + text.Length);
        Dictionary<string, int> wordfreq = GenerateWordFreq(text);
        //_log.Debug("word count = " + wordfreq.Count);
        var kk = wordfreq.Keys;
        keys = new string[kk.Count];
        kk.CopyTo(keys, 0);
    }
    else
    {
        // Database mode: split on punctuation as well as space.
        keys = text.Split(punctuation);
    }
    foreach (string key in keys)
    {
        if (key == "")
            continue;
        // Index every word in lower case so that searches are case-insensitive.
        var keyLower = key.ToLower();
        int bmp;
        if (_words.TryGetValue(keyLower, out bmp))
        {
            (_bitmaps.GetBitmap(bmp)).Set(recnum, true);
        }
        else
        {
            bmp = _bitmaps.GetFreeRecordNumber();
            _bitmaps.SetDuplicate(bmp, recnum);
            _words.Add(keyLower, bmp);
        }
    }
}
Thanks again for the wonderful code. By the way, you can test my code (your hOOt!) by downloading my program from www.cross-connect.se, but wait a week or so if you want to test the "NOT" functionality, because I haven't released that fix yet.
modified 24 Feb '13 - 9:13.
|
|
|
|

|
Has this update been included in version 2.0?
|
|
|
|

|
Hi Mehdi. I would like to submit to you my version of hOOt that is compiled for WP8 and "Windows Store". You will then have the ONLY FTS library available for these environments. Converting hOOt to these environments was not trivial, as you will see if you look at the code I have written. In fact, it was rather drastic: I had to remove a lot of code to get it to work. The only reason I succeeded was that your code was so simple.
Let me know if you are interested...
|
|
|
|

|
Yes, thanks!
I would very much like to see the changes on WP8 since I don't have the resources to test myself.
|
|
|
|

|
You can download it from my server. Please let me know when you have got the files so that I can remove them from my server.
Here is the Windows Store version, which you can test if you have Windows 8 and VS2012:
http://cross-connect.se/bibles/mehdi/HootWindowsStore.zip
Here is the WP8 code:
http://cross-connect.se/bibles/mehdi/HootWp8.zip
I believe the only difference between the two is a single line of code. Actually, they might be exactly the same code with different project files. The WP8 code could be simpler without the "async-await" system, but since I already had it for "Windows Store" it is just as well to leave it in. That way the two codebases are basically the same.
|
|
|
|

|
One thing I forgot to mention is that I could not get "OptimizeIndex" to work, so I just ignored it, since my use case does not seem to need it.
|
|
|
|

|
Thanks, I will check them out.
|
|
|
|

|
I'm using the V2 code in doc mode and it doesn't appear to ever save the JSON file to disk. There are no DOCS files produced.
Have I missed a step? I have tried it with the Demo app and looked through the code...
Thanks
|
|
|
|

|
Strange!
I will check this, thanks!
|
|
|
|

|
I have finally implemented hOOt in the very restrictive "Windows Store" environment, and it works fine after some important changes to the logic. I use hOOt to index Bibles in my program. It is important to know the application to understand the problems I encountered and fixed.
I use only the "database" functionality, as opposed to the "document" functionality. That was critical for me because I could not port the fastJSON code and could therefore remove it. In the indexing process, you split the text only on spaces. This does not work in real life: any word touching a period does not get indexed properly. I now split on about 25 punctuation marks as well as space. It slows down indexing, but that is life; you have to take a hit in the statistics to make things work. Also, strangely, you don't convert words to lower case in the index, yet you claim to be case-insensitive. That makes it impossible to search for names, because in some cases you lowercase and in others you don't.
Lastly, the logic of your query code is wrong, primarily because of one statement that does not work as you would expect: "if (filter.IndexOfAny(new char[] { '+', '-' }, 0) > 0)" should use ">=". As a result, the "OR" statement will never work. Check it out yourself. I have tested this with your unmodified code in the normal Windows environment and get the same bug. Every time I look at the query logic I see another bug. It really needs a major overhaul.
Basically, I am terribly grateful that you have given me such simple and beautiful code. Thanks. Let me know if you want my code...
modified 6 Feb '13 - 5:36.
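The difference between `> 0` and `>= 0` matters because `String.IndexOfAny` returns the zero-based position of the first match and -1 when nothing matches, so a query that begins with an operator yields 0. A small sketch (the class and method names are hypothetical, used only to compare the two conditions):

```csharp
using System;

public static class OperatorDetection
{
    private static readonly char[] Operators = { '+', '-' };

    // Original condition: misses an operator at position 0.
    public static bool HasOperatorBuggy(string filter)
    {
        return filter.IndexOfAny(Operators, 0) > 0;
    }

    // Corrected condition: -1 is the only "not found" value.
    public static bool HasOperator(string filter)
    {
        return filter.IndexOfAny(Operators, 0) >= 0;
    }

    public static void Main()
    {
        Console.WriteLine(HasOperatorBuggy("+hello world")); // False
        Console.WriteLine(HasOperator("+hello world"));      // True
        Console.WriteLine(HasOperator("hello world"));       // False
    }
}
```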
|
|
|
|

|
Hi, thanks for the details on this. I have the same problem with it splitting only on the space character. Do you think you could post the change you made to pick up the punctuation marks as well?
You also mentioned that it does not convert to lowercase. Did you fix this? If so, could you share that too?
thanks
|
|
|
|

|
For the punctuation and lowercase, in hoot.cs:
private static char[] punctuation = new char[]
{
    ' ', '.', ',', ';', ':', '?', '!', '\'', '-', '_', '(', ')', '[', ']',
    '{', '}', '/', '\\', '\"', '¿', '*', '¡', '«', '»', '=', '@', '¤', '%',
    '&', '+', '§', '|', '^', '¨', '~', '$', '“', '”', '’', '‘'
};
private void AddtoIndex(int recnum, string text)
{
    if (text == "" || text == null)
        return;
    string[] keys;
    if (_docMode)
    {
        //_log.Debug("text size = " + text.Length);
        Dictionary<string, int> wordfreq = GenerateWordFreq(text);
        //_log.Debug("word count = " + wordfreq.Count);
        var kk = wordfreq.Keys;
        keys = new string[kk.Count];
        kk.CopyTo(keys, 0);
    }
    else
    {
        // Database mode: split on punctuation as well as space.
        keys = text.Split(punctuation);
    }
    foreach (string key in keys)
    {
        if (key == "")
            continue;
        // Index every word in lower case so that searches are case-insensitive.
        var keyLower = key.ToLower();
        int bmp;
        if (_words.TryGetValue(keyLower, out bmp))
        {
            (_bitmaps.GetBitmap(bmp)).Set(recnum, true);
        }
        else
        {
            bmp = _bitmaps.GetFreeRecordNumber();
            _bitmaps.SetDuplicate(bmp, recnum);
            _words.Add(keyLower, bmp);
        }
    }
}
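To see why splitting on space alone is a problem, compare the two approaches on a short sentence. This is a standalone sketch with a shortened, illustrative delimiter list; `Tokenizer` and `Tokenize` are hypothetical names and not part of hOOt.

```csharp
using System;
using System.Linq;

public static class Tokenizer
{
    // A shortened delimiter list for illustration; the post above uses a longer one.
    private static readonly char[] Delimiters = { ' ', '.', ',', ';', ':', '?', '!', '(', ')', '"' };

    // Splits on punctuation as well as space and lower-cases every word.
    public static string[] Tokenize(string text)
    {
        return text
            .Split(Delimiters, StringSplitOptions.RemoveEmptyEntries)
            .Select(w => w.ToLowerInvariant())
            .ToArray();
    }

    public static void Main()
    {
        string text = "In the beginning, God created.";
        // Space-only split keeps punctuation attached to the words:
        Console.WriteLine(string.Join("|", text.Split(' '))); // In|the|beginning,|God|created.
        // Punctuation-aware split with lower-casing:
        Console.WriteLine(string.Join("|", Tokenize(text)));  // in|the|beginning|god|created
    }
}
```

With the space-only split, a search for "beginning" or "created" would not match the indexed keys "beginning," and "created.".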
Then what I have done to the query:
private WAHBitArray ExecutionPlan(string filter, int maxsize)
{
    //_log.Debug("query : " + filter);
    DateTime dt = FastDateTime.Now;
    // query indexes
    string[] words = filter.Split(' ');
    bool defaulttoand = false;
    if (filter.IndexOfAny(new char[] { '+', '-' }, 0) >= 0)
        defaulttoand = true;
    WAHBitArray bits = null;
    foreach (string s in words)
    {
But like I said, the query needs help. The "not" function does not work at all for me...
modified 7 Feb '13 - 11:21.
|
|
|
|

|
Great, thank you for sharing these. I will give them a go.
Si
|
|
|
|

|
Love the project - much easier to use compared with other feature rich systems!
I seem to have a problem with index.mgbmp and index.mgbmr being left open after I call hoot.Shutdown.
I modified Shutdown to also call _bitmaps.Shutdown and the problem seems to be solved. But I am very new to this code, so I might have missed something. I'm using doc mode with version 2.
public void Shutdown()
{
    Save();
    _deleted.Shutdown();
    if (_docMode)
    {
        // Added: also close the bitmap index files.
        _bitmaps.Shutdown();
        _docs.Shutdown();
    }
}
Thanks, keep up the good work.
|
|
|
|

|
Thanks!
I must have missed it! I will put it in the next release.
|
|
|
|

|
Just wanted to say "Thanks." I have been pulling my hair out with Lucene trying to port it to 3 different C# environments. This is going to be a breeze with hoot. I'll let you know when I am done!
Some people just don't understand how important it is to keep things simple. I like that even your documentation is simple and easy to understand...
Less is MORE!!
|
|
|
|
|

|
Nice!
Although 128bit is a bit over the top for this use case, but still
|
|
|
|

|
When words are generated from the string and added to the dictionary, there is a check, if (char.IsLetter(word[l - 1]) == false), after which a new word is created with new string(word.ToCharArray(), 0, l - 2);
In my case the word is, say, 'team"', and the new word becomes 'tea'. Is this a bug, so that it should be new string(word.ToCharArray(), 0, l - 1);, or is there some reason to drop the last two characters?
private void AddDictionary(Dictionary<string, int> dic, string word)
{
    int l = word.Length;
    if (l > MaxStringLengthIgnore)
        return;
    if (l < 2)
        return;
    if (char.IsLetter(word[l - 1]) == false)
        word = new string(word.ToCharArray(), 0, l - 2);
    if (word.Length < 2)
        return;
    int cc = 0;
    if (dic.TryGetValue(word, out cc))
        dic[word] = ++cc;
    else
        dic.Add(word, 1);
}
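The third argument of the `new string(char[], int, int)` constructor is a length, not an end index, so passing `l - 2` keeps two characters fewer than the original word. A minimal sketch of the behaviour the question proposes, dropping only the single trailing non-letter (`WordTrim` and `TrimTrailingNonLetter` are hypothetical names, not hOOt code):

```csharp
using System;

public static class WordTrim
{
    // Removes exactly one trailing non-letter character, if there is one.
    public static string TrimTrailingNonLetter(string word)
    {
        int l = word.Length;
        if (l > 0 && !char.IsLetter(word[l - 1]))
            return word.Substring(0, l - 1);
        return word;
    }

    public static void Main()
    {
        string word = "team\"";
        int l = word.Length; // 5
        Console.WriteLine(new string(word.ToCharArray(), 0, l - 2)); // tea  (length l - 2 drops two characters)
        Console.WriteLine(TrimTrailingNonLetter(word));               // team
    }
}
```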
|
|
|
|

|
Nice catch!
Probably a mistake on my part, I will check it out and post an update soon.
Smallest full text search engine (lucene replacement) built from scratch using inverted WAH bitmap index, highly compact storage, operating in database and document modes
| Type | Article |
| Licence | CPOL |
| First Posted | 12 Jul 2011 |
| Views | 122,682 |
| Bookmarked | 242 times |