Comments and Discussions
|
|
 |

|
I have a text file to index with this content:
hello my dear
The GenerateWordFreq function returned a Dictionary with three elements:
hello
my
dea
The last word is missing a letter.
The hOOt version is 2.0.
|
|
|
|

|
This work is very impressive and I look forward to experimenting with it. When the end user has selected a document, it appears that the only way an application using this engine can do highlighting is to highlight based upon individual words - is that correct?
|
|
|
|

|
If I understand your question correctly, then yes: hOOt does not do the highlighting; that is up to the application.
It's the man, not the machine - Chuck Yeager
If at first you don't succeed... get a better publicist
If the final destination is death, then we should enjoy every second of the journey.
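Since highlighting is left to the application, here is a minimal sketch of word-based highlighting done on the application side. The `Highlight` helper and the square-bracket markers are hypothetical and not part of hOOt; the sketch assumes the application already has the document text and the query terms.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class HighlightSketch
{
    // Hypothetical helper (not part of hOOt): wraps every whole-word,
    // case-insensitive match of any query term in square brackets.
    public static string Highlight(string text, string[] terms)
    {
        if (string.IsNullOrEmpty(text) || terms == null)
            return text;

        // Escape each term so characters such as '+' or '.' are matched literally.
        string[] escaped = terms
            .Where(t => !string.IsNullOrEmpty(t))
            .Select(t => Regex.Escape(t))
            .ToArray();
        if (escaped.Length == 0)
            return text;

        string pattern = @"\b(" + string.Join("|", escaped) + @")\b";
        return Regex.Replace(text, pattern, m => "[" + m.Value + "]", RegexOptions.IgnoreCase);
    }
}
```

For example, `HighlightSketch.Highlight("Hello my dear", new[] { "dear", "hello" })` gives `[Hello] my [dear]`. A real application would swap the brackets for whatever markup its UI uses.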
|
|
|
|

|
Cool!
Can I perform searches on documents stored in a database?
|
|
|
|

|
Yes you can!
This currently works in the document version of RaptorDB.
|
|
|
|

|
Hi again Mehdi,
I finally decided to fix the "not" functionality and several other problems with the query code in hoot.cs. It took only about half an hour to figure out, and as far as I can tell it now works without flaws. Here are the functions you need to change in hoot.cs:
private WAHBitArray ExecutionPlan(string filter, int maxsize)
{
    DateTime dt = FastDateTime.Now;
    string[] words = filter.Split(' ');
    // if any '+' or '-' operator is present, unprefixed words default to AND
    bool defaulttoand = false;
    if (filter.IndexOfAny(new char[] { '+', '-' }, 0) >= 0)
        defaulttoand = true;
    WAHBitArray bits = null;
    foreach (string s in words)
    {
        int c;
        string word = s;
        if (s == "") continue;
        OPERATION op = OPERATION.OR;
        if (defaulttoand)
            op = OPERATION.AND;
        if (s.StartsWith("+"))
        {
            op = OPERATION.AND;
            word = s.Replace("+", "");
        }
        if (s.StartsWith("-"))
        {
            op = OPERATION.ANDNOT;
            word = s.Replace("-", "");
        }
        if (word.Contains("*") || word.Contains("?"))
        {
            // wildcard term: OR together the bitmaps of all matching index words
            WAHBitArray wildbits = null;
            Regex reg = new Regex("^" + word.Replace("*", ".*").Replace("?", "."), RegexOptions.IgnoreCase);
            foreach (string key in _words.Keys())
            {
                if (reg.IsMatch(key))
                {
                    _words.TryGetValue(key, out c);
                    // synchronous call: ExecutionPlan is not declared async
                    WAHBitArray ba = _bitmaps.GetBitmap(c);
                    wildbits = DoBitOperation(wildbits, ba, OPERATION.OR, maxsize);
                }
            }
            if (wildbits != null)
            {
                bits = DoBitOperation(bits, wildbits, op, maxsize);
            }
        }
        else if (_words.TryGetValue(word.ToLowerInvariant(), out c))
        {
            // synchronous call: ExecutionPlan is not declared async
            WAHBitArray ba = _bitmaps.GetBitmap(c);
            bits = DoBitOperation(bits, ba, op, maxsize);
        }
    }
    if (bits == null)
        return new WAHBitArray();
    WAHBitArray ret;
    if (_docMode)
        ret = bits.AndNot(_deleted.GetBits());
    else
        ret = bits;
    return ret;
}

private static WAHBitArray DoBitOperation(WAHBitArray bits, WAHBitArray c, OPERATION op, int maxsize)
{
    // ANDNOT is handled as AND with the inverted bitmap
    if (op == OPERATION.ANDNOT)
    {
        c = c.Not(maxsize);
    }
    if (bits != null)
    {
        switch (op)
        {
            case OPERATION.ANDNOT:
            case OPERATION.AND:
                bits = c.And(bits);
                break;
            case OPERATION.OR:
                bits = c.Or(bits);
                break;
        }
    }
    else
        bits = c;
    return bits;
}
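To illustrate the idea behind the ANDNOT branch of DoBitOperation above (invert the excluded word's bitmap, then AND it with the running result), here is a small sketch. It uses `System.Collections.BitArray` as a stand-in for WAHBitArray, which is an assumption made purely for illustration; the `AndNotSketch` class is hypothetical.

```csharp
using System.Collections;

public static class AndNotSketch
{
    // Returns the documents that are set in 'included' but not in 'excluded'.
    // Both arrays must have the same length, otherwise BitArray.And throws.
    public static BitArray AndNot(BitArray included, BitArray excluded)
    {
        // BitArray.Not and BitArray.And modify the instance they are called on,
        // so work on copies to leave the inputs untouched.
        BitArray result = new BitArray(included);
        BitArray mask = new BitArray(excluded).Not();
        return result.And(mask);
    }
}
```

With `included = {1,1,0,1}` and `excluded = {0,1,0,0}`, the result is `{1,0,0,1}`: document 1 is dropped because it contains the excluded word.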
Then don't forget to fix the text splitting and to convert all text to lower case.
private static char[] punctuation = new char[]
{
    ' ', '.', ',', ';', ':', '?', '!', '\'', '-', '_', '(', ')', '[', ']',
    '{', '}', '/', '\\', '\"', '¿', '*', '¡', '«', '»', '=', '@', '¤', '%',
    '&', '+', '§', '|', '^', '¨', '~', '$', '“', '”', '’', '‘'
};

private void AddtoIndex(int recnum, string text)
{
    if (text == "" || text == null)
        return;
    string[] keys;
    if (_docMode)
    {
        //_log.Debug("text size = " + text.Length);
        Dictionary<string, int> wordfreq = GenerateWordFreq(text);
        //_log.Debug("word count = " + wordfreq.Count);
        var kk = wordfreq.Keys;
        keys = new string[kk.Count];
        kk.CopyTo(keys, 0);
    }
    else
    {
        // database mode: split on punctuation as well as spaces
        keys = text.Split(punctuation);
    }
    foreach (string key in keys)
    {
        if (key == "")
            continue;
        // culture-invariant lowercasing keeps index keys stable across locales
        var keyLower = key.ToLowerInvariant();
        int bmp;
        if (_words.TryGetValue(keyLower, out bmp))
        {
            (_bitmaps.GetBitmap(bmp)).Set(recnum, true);
        }
        else
        {
            bmp = _bitmaps.GetFreeRecordNumber();
            _bitmaps.SetDuplicate(bmp, recnum);
            _words.Add(keyLower, bmp);
        }
    }
}
Thanks again for the wonderful code. By the way, you can test my code (your hOOt!) by downloading my program from www.cross-connect.se, but wait a week or so if you want to test the "NOT" functionality, because I haven't released that fix yet.
modified 24 Feb '13 - 9:13.
|
|
|
|

|
Has this update been incorporated into version 2.0?
|
|
|
|

|
Hi Mehdi. I would like to submit to you my version of hOOt, compiled for WP8 and "Windows Store". You will then have the ONLY FTS library available for these environments. Converting hOOt to these environments was not trivial, as you will see if you look at the code I have written. In fact, it was rather drastic: I had to remove a lot of code to get it to work. The only reason I succeeded was that your code was so simple.
Let me know if you are interested...
|
|
|
|

|
Yes, thanks!
I would very much like to see the changes on WP8 since I don't have the resources to test myself.
|
|
|
|

|
You can download it from my server. Please let me know when you have downloaded the files so I can remove them from my server.
Here is the Windows Store version, which you can test if you have Windows 8 and VS2012:
http://cross-connect.se/bibles/mehdi/HootWindowsStore.zip
Here is the WP8 code:
http://cross-connect.se/bibles/mehdi/HootWp8.zip
The only difference between the two is one line of code, I believe. Actually, they might be exactly the same code-wise, with only a different project file. The WP8 code could be simpler without the "async-await" system, but since I already had it for "Windows Store" it is just as well to leave it. That way the two codebases are basically the same.
|
|
|
|

|
One thing I forgot to mention is that I could not get "OptimizeIndex" to work, so I just left it out; in my case it does not seem to be needed.
|
|
|
|

|
Thanks, I will check them out.
|
|
|
|

|
I'm using the V2 code in doc mode and it doesn't appear to ever save the JSON file to disk. There are no DOCS files produced.
Have I missed a step? I have tried it with the Demo app and looked through the code...
Thanks
|
|
|
|

|
Strange!
I will check this, thanks!
|
|
|
|

|
Finally, I have implemented hOOt in the very restrictive "Windows Store" environment, and it works fine after some important changes to the logic. I use hOOt to index bibles in my program. It is important to know the application to understand the problems I encountered and fixed.
I use only the "database" functionality, as opposed to the "document" functionality. That was critical for me because I could not port the fastJSON code, so I removed it. In the indexing process, you split the text only on spaces. This does not work in real life: any word touching a period does not get indexed properly. I now split on about 25 punctuation marks as well as the space. It slows down indexing, but that is life; you have to take hits in the statistics to make things work. Then, strangely, you don't convert things to lower case in the index, yet you claim to be case-insensitive. That makes it impossible to look for names, because in some cases you lowercase and in others you don't.
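A small sketch of the splitting problem described here: splitting only on spaces leaves punctuation attached to words, while splitting on a separator set and lowercasing gives clean keys. The separator list is deliberately shortened, and the class and method names are hypothetical; this is not the hOOt code itself.

```csharp
using System;

public static class TokenizeSketch
{
    // Shortened, illustrative separator set; a real list would be longer.
    private static readonly char[] Separators = { ' ', '.', ',', ';', ':', '?', '!' };

    // Space-only splitting: "beginning," and "created." keep their punctuation.
    public static string[] SplitOnSpace(string text)
    {
        return text.Split(' ');
    }

    // Lowercase first, then split on all separators and drop empty entries.
    public static string[] SplitAndLower(string text)
    {
        return text.ToLowerInvariant()
                   .Split(Separators, StringSplitOptions.RemoveEmptyEntries);
    }
}
```

For the input `"In the beginning, God created."`, `SplitOnSpace` yields `In`, `the`, `beginning,`, `God`, `created.`, whereas `SplitAndLower` yields `in`, `the`, `beginning`, `god`, `created`.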
Lastly, your query code's logic is wrong, primarily because of a statement that does not work as you would expect: "if (filter.IndexOfAny(new char[] { '+', '-' }, 0) > 0)" simply doesn't work properly. It should be ">=". As a result, the "OR" statement never works. Check it out yourself. I have tested this with your unmodified code in the normal Windows environment and get the same bug. Every time I look at the query logic I see another bug. It really needs a major overhaul.
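The difference between "> 0" and ">= 0" comes down to what String.IndexOfAny returns: the zero-based index of the first match, or -1 when nothing matches. An operator in the very first position therefore returns 0, which "> 0" treats as "not found". The following sketch isolates the two checks; the class and method names are hypothetical.

```csharp
public static class OperatorCheckSketch
{
    private static readonly char[] Operators = { '+', '-' };

    // Check as quoted in the post: misses an operator at index 0.
    public static bool CheckWithGreaterThan(string filter)
    {
        return filter.IndexOfAny(Operators, 0) > 0;
    }

    // Check with the proposed change: -1 is the only "not found" value.
    public static bool CheckWithGreaterOrEqual(string filter)
    {
        return filter.IndexOfAny(Operators, 0) >= 0;
    }
}
```

For `"+hello world"` the first check returns false and the second returns true; for `"hello -world"` both return true; for `"hello world"` both return false.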
Basically, I am terribly grateful that you have given me such simple and beautiful code. Thanks. Let me know if you want my code...
modified 6 Feb '13 - 5:36.
|
|
|
|

|
Hi, thanks for the details on this. I have the same problem with it splitting only on the space character. Do you think you could post the change you made to pick up the punctuation marks as well?
Also, you mentioned that it does not convert to lowercase. Did you fix this? If so, could you share that as well?
Thanks
|
|
|
|

|
For the punctuation and lowercase, in hoot.cs:
private static char[] punctuation = new char[]
{
    ' ', '.', ',', ';', ':', '?', '!', '\'', '-', '_', '(', ')', '[', ']',
    '{', '}', '/', '\\', '\"', '¿', '*', '¡', '«', '»', '=', '@', '¤', '%',
    '&', '+', '§', '|', '^', '¨', '~', '$', '“', '”', '’', '‘'
};

private void AddtoIndex(int recnum, string text)
{
    if (text == "" || text == null)
        return;
    string[] keys;
    if (_docMode)
    {
        //_log.Debug("text size = " + text.Length);
        Dictionary<string, int> wordfreq = GenerateWordFreq(text);
        //_log.Debug("word count = " + wordfreq.Count);
        var kk = wordfreq.Keys;
        keys = new string[kk.Count];
        kk.CopyTo(keys, 0);
    }
    else
    {
        // database mode: split on punctuation as well as spaces
        keys = text.Split(punctuation);
    }
    foreach (string key in keys)
    {
        if (key == "")
            continue;
        // culture-invariant lowercasing keeps index keys stable across locales
        var keyLower = key.ToLowerInvariant();
        int bmp;
        if (_words.TryGetValue(keyLower, out bmp))
        {
            (_bitmaps.GetBitmap(bmp)).Set(recnum, true);
        }
        else
        {
            bmp = _bitmaps.GetFreeRecordNumber();
            _bitmaps.SetDuplicate(bmp, recnum);
            _words.Add(keyLower, bmp);
        }
    }
}
Then what I have done to the query:
private WAHBitArray ExecutionPlan(string filter, int maxsize)
{
    //_log.Debug("query : " + filter);
    DateTime dt = FastDateTime.Now;
    // query indexes
    string[] words = filter.Split(' ');
    bool defaulttoand = false;
    if (filter.IndexOfAny(new char[] { '+', '-' }, 0) >= 0)
        defaulttoand = true;
    WAHBitArray bits = null;
    foreach (string s in words)
    {
But like I said, the query code needs help. The "not" function does not work at all for me...
modified 7 Feb '13 - 11:21.
|
|
|
|

|
Great thank you for sharing these - I will give them a go.
Si
|
|
|
|

|
Love the project - much easier to use compared with other feature-rich systems!
I seem to have a problem with index.mgbmp and index.mgbmr being left open after I call hoot.Shutdown.
I modified Shutdown to also call _bitmaps.Shutdown and the problem seems to be solved. But I am very new to this code, so I might have missed something. I'm using doc mode with version 2.
public void Shutdown()
{
    Save();
    _deleted.Shutdown();
    if (_docMode)
    {
        _bitmaps.Shutdown();
        _docs.Shutdown();
    }
}
Thanks, keep up the good work.
|
|
|
|

|
Thanks!
I must have missed it! I will put it in the next release.
|
|
|
|

|
Just wanted to say "Thanks." I have been pulling my hair out with Lucene trying to port it to 3 different C# environments. This is going to be a breeze with hoot. I'll let you know when I am done!
Some people just don't understand how important it is to keep things simple. I like that even your documentation is simple and easy to understand...
Less is MORE!!
|
|
|
|
|

|
Nice!
Although 128bit is a bit over the top for this use case, but still
|
|
|
|

|
When words are generated from a string and added to the dictionary, there is a check: if (char.IsLetter(word[l - 1]) == false), a new word is created with new string(word.ToCharArray(), 0, l - 2);
In my case the word is, say, 'team"', and the new word becomes 'tea'. Is this a bug, so that it should be new string(word.ToCharArray(), 0, l - 1); or is there some reason to drop the last two characters?
private void AddDictionary(Dictionary<string, int> dic, string word)
{
    int l = word.Length;
    if (l > MaxStringLengthIgnore)
        return;
    if (l < 2)
        return;
    if (char.IsLetter(word[l - 1]) == false)
        word = new string(word.ToCharArray(), 0, l - 2);
    if (word.Length < 2)
        return;
    int cc = 0;
    if (dic.TryGetValue(word, out cc))
        dic[word] = ++cc;
    else
        dic.Add(word, 1);
}
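The effect of the two length arguments can be checked in isolation. The third argument of new string(char[], int, int) is the number of characters to copy, so l - 2 drops two characters and l - 1 drops one. The sketch below compares the posted expression with the alternative suggested in the question; the class and method names are hypothetical.

```csharp
public static class TrimSketch
{
    // The expression as posted: copies l - 2 characters when the last one is not a letter.
    public static string TrimAsPosted(string word)
    {
        int l = word.Length;
        if (l >= 2 && !char.IsLetter(word[l - 1]))
            return new string(word.ToCharArray(), 0, l - 2);
        return word;
    }

    // The alternative from the question: copies l - 1 characters, dropping only the last one.
    public static string TrimLastOnly(string word)
    {
        int l = word.Length;
        if (l >= 2 && !char.IsLetter(word[l - 1]))
            return new string(word.ToCharArray(), 0, l - 1);
        return word;
    }
}
```

For the five-character input `team"`, `TrimAsPosted` returns `tea` and `TrimLastOnly` returns `team`; a word ending in a letter is returned unchanged by both.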
|
|
|
|

|
Nice catch!
Probably a mistake on my part, I will check it out and post an update soon.
|
|
|
|

|
Can we expect more Lucene/Solr-like features?
Especially stemming, relevance, and faceted search?
Thanks
|
|
|
|

|
Not in the near future, hOOt was meant for fulltext searching in RaptorDB.
|
|
|
|

|
Hi,
I don't want to push you. However, did you find some time already for the new version or do you already know when it will be available? Many Thanks.
|
|
|
|

|
Finally updated!
|
|
|
|

|
Great article
Have been able to consider the above.
|
|
|
|

|
Thanks!
These kinds of things are usually a separate layer on top of the search indexes and are currently out of scope for hOOt.
|
|
|
|

|
Hello, a really excellent project.
However, "Free Memory" causes an exception in the test application. After debugging it for almost a week, I have not solved the problem and still don't know why it happens. I look forward to your new version.
modified 4 Dec '12 - 2:29.
|
|
|
|

|
I'm freeing up time this week for long overdue overhauls
|
|
|
|

|
After searching for words in one directory, I changed the Search Folder but didn't delete the files generated in the Index Storage Directory. I then found that the index was not actually rebuilt. I think the project will be perfect once this problem is solved. Thank you!
|
|
|
|

|
Hi,
what a great tool. Thanks.
I tried to index RTF files and TEXT files. The IFilters are loaded by your tool. However, no RTF or TEXT files get indexed. The files appear during the scan, but produce no results.
|
|
|
|

|
Make sure the IFilters are working and that they output the text strings. hOOt needs an upgrade, which will be ready in a couple of weeks.
|
|
|
|

|
Thanks for your quick reply. By the way, is there a way to vary the sensitivity of the search routine for found strings/entries? Quite a lot of words that the Windows search routine finds are missing from the index. The IFilters are loaded OK and the text output of the files is OK as well.
|
|
|
|

|
Any word of more than 2 characters and fewer than 60 should be in the index.
|
|
|
|

|
That is what I would expect from your code. I did not check the index dlls yet. Is it possible that the index routines have problems with German chars?
|
|
|
|

|
Probably not, try debugging the code and look at what LoadWords() is doing.
|
|
|
|

|
Hi, meanwhile I did a little debugging, and the words in the RTF and TXT files are found and indexed correctly. All words are in the words file and are loaded correctly. However, when I run a search, the words are found but the referenced files are wrong. Any ideas? I hope I'm not bothering you, but I would really like to use the tool.
|
|
|
|

|
If you can hang on for a short time, I am freeing up some time to update hOOt this week, hopefully the fixes in the queue will be helpful.
|
|
|
|

|
Great news. Thank you so much.
|
|
|
|

|
Hi, first, sorry for my terribly bad English.
I appreciate your project, but it seems to be buggy.
I ran your app and found a bug. My only file contains "canada canada". I start indexing, and after it finishes I click the "count words" button, which tells me there are 2 words: "canada" and "canad"! The bug seems to be in the function GenerateWordFreq, or possibly in the function ParseString.
Have you tested your index thoroughly?
|
|
|
|

|
hOOt needs a long overdue overhaul...
I will add this to the list of things to look into.
Thanks.
|
|
|
|

|
Yes, but when will that be? Are we talking about the next month or two, or about June/July next year, for example? Is there a timeframe at all?
|
|
|
|

|
Hopefully in the next couple of weeks.
|
|
|
|

|
Excellent as I have been waiting for an updated version before I start getting stuck in. I will leave any further questions until I see what you end up producing but already I think that this is an excellent project.
|
|
|
|
|

|
All you need to do is register/install the IFilter handler in your system and hOOt will automatically use it.
|
|
|
|
Smallest full text search engine (lucene replacement) built from scratch using inverted WAH bitmap index, highly compact storage, operating in database and document modes
| Type | Article |
| Licence | CPOL |
| First Posted | 12 Jul 2011 |
| Views | 122,489 |
| Bookmarked | 242 times |
|
|