Comments and Discussions

You never cease to amaze me with all your contributions... nice work, Mehdi

Hi Mehdi,
Firstly, thanks for a wonderful project; it makes our job of indexing and searching very easy. I have been trying to index .msg files for my project but have not been very successful. I have downloaded and installed the MSG iFilter from the Microsoft site: http://www.microsoft.com/en-us/download/details.aspx?id=1111 [MSG iFilter].
I debugged and found that the program picks up all the values from the registry correctly. It only throws an exception when calling FilterReader.Read. Could you please guide me on how to fix this?
Thanks and Regards,
Mohd Arabi

I have a text file for indexing with the content:
hello my dear
The GenerateWordFreq function returned a Dictionary with three elements:
hello
my
dea
The last word is missing a letter.
The hOOt version is 2.0.

This work is very impressive and I look forward to experimenting with it. When the end user has selected a document, it appears that the only way an application using this engine can do highlighting is to highlight based upon individual words - is that correct?

cool!
Can I perform searches on documents stored in a database?

Hi again Mehdi,
I finally decided to fix the "not" functionality and many other problems with the query in hoot.cs. It only took about half an hour to figure out, and now it works without flaw, in my opinion. Here are the functions you need to change in hoot.cs:
private WAHBitArray ExecutionPlan(string filter, int maxsize)
{
    DateTime dt = FastDateTime.Now;
    string[] words = filter.Split(' ');
    // If the filter contains any '+' or '-' operator, plain words default to AND instead of OR.
    bool defaulttoand = false;
    if (filter.IndexOfAny(new char[] { '+', '-' }, 0) >= 0)
        defaulttoand = true;
    WAHBitArray bits = null;
    foreach (string s in words)
    {
        int c;
        string word = s;
        if (s == "") continue;
        OPERATION op = OPERATION.OR;
        if (defaulttoand)
            op = OPERATION.AND;
        if (s.StartsWith("+"))
        {
            op = OPERATION.AND;
            word = s.Replace("+", "");
        }
        if (s.StartsWith("-"))
        {
            op = OPERATION.ANDNOT;
            word = s.Replace("-", "");
        }
        if (word.Contains("*") || word.Contains("?"))
        {
            // Wildcard term: OR together the bitmaps of every indexed word that matches.
            WAHBitArray wildbits = null;
            Regex reg = new Regex("^" + word.Replace("*", ".*").Replace("?", "."), RegexOptions.IgnoreCase);
            foreach (string key in _words.Keys())
            {
                if (reg.IsMatch(key))
                {
                    _words.TryGetValue(key, out c);
                    // No await here: the method is not declared async.
                    WAHBitArray ba = _bitmaps.GetBitmap(c);
                    wildbits = DoBitOperation(wildbits, ba, OPERATION.OR, maxsize);
                }
            }
            if (wildbits != null)
            {
                bits = DoBitOperation(bits, wildbits, op, maxsize);
            }
        }
        else if (_words.TryGetValue(word.ToLowerInvariant(), out c))
        {
            WAHBitArray ba = _bitmaps.GetBitmap(c);
            bits = DoBitOperation(bits, ba, op, maxsize);
        }
    }
    if (bits == null)
        return new WAHBitArray();
    WAHBitArray ret;
    if (_docMode)
        ret = bits.AndNot(_deleted.GetBits());
    else
        ret = bits;
    return ret;
}

private static WAHBitArray DoBitOperation(WAHBitArray bits, WAHBitArray c, OPERATION op, int maxsize)
{
    // ANDNOT is implemented as AND with the inverted bitmap.
    if (op == OPERATION.ANDNOT)
    {
        c = c.Not(maxsize);
    }
    if (bits != null)
    {
        switch (op)
        {
            case OPERATION.ANDNOT:
            case OPERATION.AND:
                bits = c.And(bits);
                break;
            case OPERATION.OR:
                bits = c.Or(bits);
                break;
        }
    }
    else
        bits = c;
    return bits;
}
Then don't forget to fix the splitting of the text and to convert all text to lower case.
private static char[] punctuation = new char[]
{
    ' ', '.', ',', ';', ':', '?', '!', '\'', '-', '_', '(', ')', '[', ']',
    '{', '}', '/', '\\', '\"', '¿', '*', '¡', '«', '»', '=', '@', '¤', '%',
    '&', '+', '§', '|', '^', '¨', '~', '$', '“', '”', '’', '‘'
};

private void AddtoIndex(int recnum, string text)
{
    if (string.IsNullOrEmpty(text))
        return;
    string[] keys;
    if (_docMode)
    {
        //_log.Debug("text size = " + text.Length);
        Dictionary<string, int> wordfreq = GenerateWordFreq(text);
        //_log.Debug("word count = " + wordfreq.Count);
        var kk = wordfreq.Keys;
        keys = new string[kk.Count];
        kk.CopyTo(keys, 0);
    }
    else
    {
        // Database mode: split on punctuation as well as spaces.
        keys = text.Split(punctuation);
    }
    foreach (string key in keys)
    {
        if (key == "")
            continue;
        // Use the invariant culture so index keys match the query side, which calls ToLowerInvariant().
        var keyLower = key.ToLowerInvariant();
        int bmp;
        if (_words.TryGetValue(keyLower, out bmp))
        {
            (_bitmaps.GetBitmap(bmp)).Set(recnum, true);
        }
        else
        {
            bmp = _bitmaps.GetFreeRecordNumber();
            _bitmaps.SetDuplicate(bmp, recnum);
            _words.Add(keyLower, bmp);
        }
    }
}
Thanks again for the wonderful code. By the way, you can test my code (your hOOt!) by downloading my program from www.cross-connect.se, but wait a week or so if you want to test the "NOT" functionality, because I haven't released that fix yet.
modified 24-Feb-13 9:13am.

Hi Mehdi. I would like to submit to you my version of hOOt that is compiled for WP8 and "Windows Store". You will then have the ONLY FTS library available for these environments. Converting hOOt to these environments was not trivial, as you will see if you look at the code I have written. In fact, it was rather drastic. I had to remove a lot of code to get it to work. The only reason I succeeded was that your code was so simple.
Let me know if you are interested...

I'm using the V2 code in doc mode and it doesn't appear to ever save the JSON file to disk. There are no DOCS files produced.
Have I missed a step? I have tried it with the Demo app and looked through the code...
Thanks

Finally I have implemented hOOt in the very restrictive "Windows Store" environment and it works fine after some important changes to the logic. I use hOOt to index bibles in my program. It is important to know the application to understand the problems I encountered and fixed.
I use only the "database" functionality, as opposed to the "document" functionality. That was critical for me because I could not port the fastJSON code, and this way I could remove it. In the indexing process, you only split the text on spaces. This does not work in real life: any word touching a period does not get indexed properly. I now split on about 25 punctuation marks as well as spaces. It slows down indexing, but that is life. You have to take hits in statistics to make things work. Then, strangely, you don't put things into lower case in the index, yet you claim to be case-insensitive. That makes it impossible to look for names, because in some cases you lower-case and in others you do not.
Lastly, the logic of your query code is wrong, primarily because of a statement that does not work as you would expect: "if (filter.IndexOfAny(new char[] { '+', '-' }, 0) > 0)" simply doesn't work properly. It should be ">=". As a result, the "OR" statement will never work. Check it out yourself. I have tested this with your unmodified code in the normal Windows environment and get the same bug. Every time I look at the query logic I see another bug. Really, it needs a major overhaul.
Basically, I am terribly grateful that you have given me such simple and beautiful code. Thanks. Let me know if you want my code...
modified 6-Feb-13 5:36am.
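A minimal sketch of the `>` versus `>=` point raised in this comment. It is not hOOt code: the class and method names are hypothetical and exist only to show how string.IndexOfAny reports a match at the first character.

```csharp
using System;

// Hypothetical helper class for illustration only; not part of hOOt.
public static class OperatorCheck
{
    private static readonly char[] Operators = { '+', '-' };

    // IndexOfAny returns -1 when no character is found and 0 when the
    // match is the very first character, so "> 0" misses a leading operator.
    public static bool HasOperatorStrict(string filter)
    {
        return filter.IndexOfAny(Operators, 0) > 0;
    }

    // ">= 0" treats every found position, including index 0, as a match.
    public static bool HasOperator(string filter)
    {
        return filter.IndexOfAny(Operators, 0) >= 0;
    }
}
```

With the filter "+hello world", HasOperatorStrict returns false while HasOperator returns true; both return false for "hello world".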

Love the project - much easier to use compared with other feature rich systems!
I seem to have a problem with the index.mgbmp and index.mgbmr being left open after I called hoot.shutdown
I modified Shutdown to also call _bitmaps.Shutdown, and the problem seems to be solved. But I am very new to this code, so I might have missed something. I'm using doc mode with version 2.
public void Shutdown()
{
    Save();
    _deleted.Shutdown();
    if (_docMode)
    {
        _bitmaps.Shutdown();
        _docs.Shutdown();
    }
}
Thanks, keep up the good work.

Just wanted to say "Thanks." I have been pulling my hair out with Lucene trying to port it to 3 different C# environments. This is going to be a breeze with hoot. I'll let you know when I am done!
Some people just don't understand how important it is to keep things simple. I like that even your documentation is simple and easy to understand...
Less is MORE!!

When generating words from a string and adding them to the dictionary, there is a check: if (char.IsLetter(word[l - 1]) == false), a new word is created with new string(word.ToCharArray(), 0, l - 2);
In my case the word is, say, 'team"', and the new word will be 'tea'. Is this a bug, so that it should be new string(word.ToCharArray(), 0, l - 1); or is there some reason to skip the last two characters?
private void AddDictionary(Dictionary<string, int> dic, string word)
{
    int l = word.Length;
    if (l > MaxStringLengthIgnore)
        return;
    if (l < 2)
        return;
    if (char.IsLetter(word[l - 1]) == false)
        word = new string(word.ToCharArray(), 0, l - 2);
    if (word.Length < 2)
        return;
    int cc = 0;
    if (dic.TryGetValue(word, out cc))
        dic[word] = ++cc;
    else
        dic.Add(word, 1);
}
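A minimal sketch of the off-by-one asked about here, under the assumption that the intent is to drop only one trailing non-letter character. The third argument of the string(char[], int, int) constructor is a length, so l - 2 removes the last letter along with the trailing character. WordTrim and TrimTrailingNonLetter are hypothetical names, not hOOt's own fix.

```csharp
using System;

// Hypothetical helper class for illustration only; not part of hOOt.
public static class WordTrim
{
    // Drops a single trailing non-letter character, if there is one.
    public static string TrimTrailingNonLetter(string word)
    {
        int l = word.Length;
        if (l > 0 && !char.IsLetter(word[l - 1]))
            return new string(word.ToCharArray(), 0, l - 1); // keep l - 1 characters
        return word;
    }
}
```

For the input team" this returns team, whereas a length of l - 2 yields tea.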

Can we expect more Lucene/Solr-like features?
Especially stemming, relevance, and faceted search?
Thanks

Hi,
I don't want to push you. However, have you found time for the new version yet, or do you already know when it will be available? Many thanks.

Great article
Have been able to consider the above.

Hello, a really excellent project.
However, "Free Memory" causes an exception in the test application. After debugging it for almost a week, I couldn't solve the problem and don't even know why it happens. Looking forward to your new version.
modified 4-Dec-12 2:29am.

Hi,
what a great tool. Thanks.
I tried to index RTF files and TEXT files. The IFilters are loaded by your tool. However, no RTF or TEXT files get indexed. The files appear during the scan but produce no results.

Hi, first, sorry for my terribly bad English.
I appreciate your project, but it seems to be buggy.
I ran your app and found a bug. My only file contains "canada canada". I launch indexing and, after it finishes, if I click the "count words" button, it tells me there are 2 words. These words are "canada" and "canad"!!! The bug seems to be in the function GenerateWordFreq, or more probably in the function ParseString.
Have you tested your index thoroughly?

Hello. I have a web application on Windows Server 2003 and I use MS Indexing Service for searching in it.
In my ASP code I use this for MS Indexing Service:
Conn.ConnectionString = "provider=msidxs;data source=F:\ MeIndexCataog; "
How do I query hOOt from ASP if I want to replace MS Indexing Service with hOOt?
Thanks.

Excellent replacement for Lucene.
I am trying to index 60,000 documents and I run out of memory when the application reaches 1.4 GB of memory used.
Is there any easy way to reduce the memory footprint?
Thanks
Alan

Terrific work. Is this on CodePlex or GitHub?
Thanks

Just wondering when you expect to update this project. I am interested in using hOOt, but I am reluctant at the moment because, as you mentioned in a reply to somebody, there are a lot of changes to be made.

thank you for your reply!

Hi Mehdi,
Is it a good idea to use your program to perform searches based on file names without the content search, or even to index just the file names?
If so, what should I do?
Have a nice day,
Iruka

FileLogger is not thread safe: it throws an exception ("Source array was not long enough. Check srcIndex and length, and the array's lower bounds.") while logging.
You might consider using ConcurrentQueue instead of Queue in FileLogger.
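A minimal sketch of the ConcurrentQueue suggestion. QueueLogger is a hypothetical stand-in, not hOOt's FileLogger, and it only collects text in memory instead of writing a file. It shows that ConcurrentQueue<T> lets many threads enqueue without an explicit lock.

```csharp
using System;
using System.Collections.Concurrent;
using System.Text;

// Hypothetical stand-in for a queue-backed logger; not hOOt's FileLogger.
public sealed class QueueLogger
{
    private readonly ConcurrentQueue<string> _queue = new ConcurrentQueue<string>();

    // Safe to call from many threads at once without a lock.
    public void Log(string message)
    {
        _queue.Enqueue(message);
    }

    // Number of messages waiting to be flushed.
    public int Pending
    {
        get { return _queue.Count; }
    }

    // Drains whatever is queued right now and returns it as one block of text.
    public string Flush()
    {
        var sb = new StringBuilder();
        string line;
        while (_queue.TryDequeue(out line))
            sb.AppendLine(line);
        return sb.ToString();
    }
}
```

A writer thread could call Flush periodically and append the result to the log file, while any number of other threads call Log.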

How can I rebuild the indexes for updated documents? Let's say I have finished indexing and, after a while, one of the documents is updated. I know the path of the updated document; how can I find a document by path so that I can replace it and rebuild the index?

FYI, your related project links (except RaptorDB) are broken.
Fortunately the resulting CodeProject error pages show the new correct links.

thank you
modified 23-Mar-12 7:10am.

Can I save documents in the database and index fields of type blob with hOOt?

Excellent work!
However, until the "user defined document fields" feature is implemented, I won't really be able to use it.
Any idea of when this might be done?
Thanks again!

How can I modify the hOOt search engine to also index words that start with or contain numbers?
keep it in your mind, Code Project is the place to be!
modified 22-Jan-12 9:30am.

Very nice job!
Do you plan to release a new version that includes relevance? That would be great!
If so, do you have a rough idea when?
Thanks in advance.

Excellent work. Actually, I voted for your article (along with 2 other articles) in the survey but forgot to vote here.

With the test application, "Free Memory" causes an exception in hOOt. It is not clear why you use the "Load hOOt" button. Couldn't you load everything you need on demand, in the style of the lazy pattern? Or am I missing something?
Also, the test application is not polished enough: TAB navigation goes in an unexpected order, there are no keyboard shortcuts on controls, and the sizes are not accurate (at least set a correct MinimumSize to avoid hiding controls). You see, this is important: users judge the quality of the library by your test application.
Also, I would recommend including fastJSON in the solution as source code. Not everyone is ready to trust a pre-compiled DLL (I don't), and adding the source from a different article is a bit of a hassle.
Could you fix all that?
Thank you very much for sharing this interesting code,
—SA
Sergey A Kryukov

How hard would it be to add relevance to the search? I think that, instead of storing the document, you could store an array of "word positions" (along with the document filename or document ID, so you always have access to it), recording which word from the word index is at which position in the document, and then use that information to do a fast scan by relevance.
Not sure if that would be efficient, as you will have to scan every document returned by the search and then sort based on relevance. Could this be done directly on the query? Maybe creating another type of index?

Nice

How does it handle numeric comparison? If I have a column whose values are all of integer type, will it be treated as numeric or as a string?

Very nice work (5)
I developed such an engine in 1996-1998 with C++, and I understand every part of your work. But I changed all my work after reading this document:
http://infolab.stanford.edu/~backrub/google.html
My engine was very fast thanks to the bitmaps, but it was suitable only for small-scale search engines. Reasons for those limitations include:
- The headache of decompressing large bitmaps and doing Boolean operations when some of the bitmaps may include only one or two hits.
- The need to store other values alongside the bit for each document, such as the rank of the relevance of the word to the document and the positions of the word in that document.
- You have to parse each document to get the relevance of your query to it, which is not feasible for large-scale engines. So we keep the initial positions of each word in the document and do some fast intersections during the first document filtering. Anyway, you can check that in the link above.
...
The magic solution is to use vectors instead of bitmaps. Bitmaps should be used only where we need to keep a single piece of information about the key and we expect a homogeneous distribution of bits between the bitmaps in the Boolean operation. A good example is the wildcard indexing of the word lexicon.
You can check my article about query execution at:
Database Virtual Cursor
Anyway, I gave you my 5, as I understand from your introduction that it is small scale.
"hOOt is a extremely small size and fast embedded full text search engine"
I don't know whether you mean "small scale" by "small size" or not.
If not, please let me know so I can change my 5 to 4.
Many thanks
Hatem

A well written and interesting article.
Just because the code works, it doesn't mean that it is good code.

Is it just me or does everyone else have a problem visiting the RaptorDB page? I've been trying for a couple of days now and CP says "Page not Found". I thought it was just a temporary problem, but this is the only one of Mehdi's articles that I can't see.

I can imagine this being very useful for a trace parser that creates an index for each line, containing the process, thread ID, method name and the actual payload.
What I am not sure about is whether it is possible to handle time with a BitArray. If I want to search for a time range, I would need to create a filter representing all possible times, which quickly becomes a problem. Do you know of any way around this, or should I simply keep the 64-bit DateTime value for each line?
Yours,
Alois Kraus

Hi Mehdi,
the project is very impressive, congratulations. I am thinking about using it for the full text search feature of the project I am working on now. The problems I face are as follows:
1. I need stemmers for European languages: German, Italian, Romanian. How can this be accomplished?
2. There are some special text constructs like code/name/year, identifiable with, say, a regex, which should not be separated but found as a single entity. Is that possible using hOOt, or by extending it? This is more of a nice-to-have feature, but very useful.
Thanks in advance and great job again!
Ioan

Smallest full text search engine (lucene replacement) built from scratch using inverted WAH bitmap index, highly compact storage, operating in database and document modes
| Type | Article |
| Licence | CPOL |
| First Posted | 12 Jul 2011 |
| Views | 131,704 |
| Bookmarked | 266 times |