If only a field of the record (the name) needed to be checked,
I understood that each record would contain a (single) filter field; that's why I suggested a radix tree. A hash would likely be more appropriate if you wanted to match several fields.
Member 4194593 wrote:
wouldn't that reduce the effort to create a hash
Yes. Sounds like you're in favour of using a hash table.
Member 4194593 wrote:
in fact wouldn't a Radix search be faster?
Yes, based on my OP. Sounds like you're in favour of using a radix tree.
Original post: 'based on record names'
I assume this to mean strings, in which case my OP stands.
The idea that I can be presented with a problem, set out to logically solve it with the tools at hand, and wind up with a program that could not be legally used because someone else followed the same logical steps some years ago and filed for a patent on it is horrifying.
- John Carmack
Good point. Some other pros and cons: A radix tree would require more memory accesses. But on the other hand, a radix tree would take less memory than a hash table, so would be more likely to fit entirely in a cache.
The hash-value calculation wouldn't necessarily be slow; it would be all integer operations on a key that could be in a single cache line. Also the number of collisions can be reduced in two ways:
1. Randomly-hashed values follow a Poisson distribution, so the expected number of collisions can be calculated for a given hash-table size. The table size can be increased to achieve an acceptable collision rate.
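To give a back-of-the-envelope feel for the numbers, here is a small sketch of that Poisson calculation (the key count and table sizes are purely illustrative):

#include <math.h>
#include <stdio.h>

/* With n keys hashed into m buckets, bucket occupancy is roughly Poisson with
   mean lambda = n/m.  The expected fraction of buckets holding two or more
   keys (i.e. a collision) is 1 - e^-lambda - lambda*e^-lambda. */
int main(void)
{
    double n = 1e6;                                /* illustrative key count */
    for (double m = 1e6; m <= 8e6; m *= 2) {
        double lambda = n / m;
        double collide = 1.0 - exp(-lambda) - lambda * exp(-lambda);
        printf("m = %.0f buckets: ~%.1f%% of buckets see a collision\n",
               m, 100.0 * collide);
    }
    return 0;
}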
If it's necessary to check billions of records per day, that sounds like enough computer work--regardless of algorithm--that profiling will be necessary to achieve optimal performance. Radix trees can be good if the records to be searched for differ from other records in the characters that are examined first. They can be bad if records often differ only in the characters that are examined much later.
With hash functions, there are often going to be trade-offs involving the time required to compute a hash, the hash-table size, and the frequency of hash collisions and false matches. Increasing the complexity of a hash function so that its cost grows by more than half the cost of a direct record comparison won't make much sense unless it saves, on average, one direct record comparison for every two records checked. Having a hash table so large that every access entails a cache miss may slow things down more than the occasional hash collision or false match would.
One approach that may be helpful would be to compute a somewhat long hash of a record, hash that to produce an index into a hash table, and have the table itself store the 'long' hash in addition to the actual record. One could thus at very little expense test the long hash of the record to be checked against the long hash in the table. This would eliminate a large fraction of the direct record comparisons that would otherwise be needed.
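A rough sketch of that layout (the names and hash width here are illustrative, not a prescription):

#include <string.h>

/* Each slot keeps the full "long" hash next to the record, so most non-matches
   are rejected by a cheap 64-bit compare rather than a direct record comparison. */
typedef struct {
    unsigned long long long_hash;   /* the longer hash of the stored record */
    const char *record;             /* the stored record itself (NULL if empty) */
} Slot;

int slot_matches(const Slot *slot, const char *key, unsigned long long key_long_hash)
{
    if (slot->record == NULL) return 0;               /* empty slot */
    if (slot->long_hash != key_long_hash) return 0;   /* cheap rejection, no walk of the record */
    return strcmp(slot->record, key) == 0;            /* rare full comparison */
}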
BTW, how fast are modern machines at 32x32 integer multiplication? If the strings are zero-padded as well as zero-terminated and multiplies are cheap, it might be good to do something like:
unsigned long *ptr;
unsigned long hash = 0, temp;
const unsigned long some_big_prime_number = 16777619UL; /* e.g. the 32-bit FNV prime */
/* Point ptr at the start of the data, preferably word aligned */
do {
    temp = *ptr++;                      /* consume 32 bits of string data per iteration */
    hash += hash >> 23;
    hash ^= temp;
    hash *= some_big_prime_number;
} while (temp != 0);                    /* assumes the zero padding ends with a full zero word */
That would gobble up 32 bits of string data per loop iteration; I would expect the multiply, even if it takes multiple cycles, could be done concurrently with the next word fetch. Not the world's most sophisticated hash function by any stretch, but the execution time should be pretty minimal. If you want to use the hash function to yield a table index, use the upper bits. Taking the hash modulo a prime number (distinct from the one used for the multiply) may give a better spread, but modulus operations are more expensive than shifts.
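For instance, deriving the table index from the upper bits of that 32-bit hash might look like this (TABLE_BITS is purely illustrative):

#define TABLE_BITS 20                              /* a 2^20-slot table, for example */
unsigned long index = hash >> (32 - TABLE_BITS);   /* keep only the upper TABLE_BITS bits */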
Interesting points. Your comment about better performance if the records differ in the characters examined first reminded me of the ID3 algorithm http://en.wikipedia.org/wiki/ID3_algorithm[^]. Basically it builds a decision tree that first examines the attributes of a record that differ the most. This avoids wasting time examining attributes that don't differ much among the records.
Your point about cache misses is good; for this volume of data the best data structure is one that fits in the cache.
I need help from those of you who have experience with the Wumpus problem.
I would like to ask: at a given step, how should we choose which adjacent field to move to?
I know one method: choose the adjacent OK field (one without mummies or pits) that has been visited the fewest times. My teacher said that it's not the most optimal method...
Would you mind helping me? Thanks in advance.
I think it was a way to indicate to you that you should look to other currently available sources before you ask others to spend their time trying to help you. Did you google "Wumpus"? Many available references. If you have checked those and still do not have an answer, then ask for individual help, but maybe on a gaming forum.
Use an "if" "elseif" structure since words like "tooth" and "teeth" are exceptions in English. Basically you'll need a lookup dictionary. To handle other cases you test if the last (and second to last) letter of the word is a vowel or not.
Looks quite useful, Richard. I know when I write papers the biggest challenge is trying not to be repetitive. Too many phrases begin: "the model suggests that...", or "so and so, in their paper find that...."
Any resource (free!) that can improve the quality of academic publications is good value in my opinion. In fact, any resource that can improve the quality of writing in university courses is well worth it - even more so if the product is good quality and freely available. Others differ, but I don't mind open source/free software for educational purposes.
I would guess that something more like SED is what you need. As I remember it from way back (many years ago), it would search for differences down to single characters. Its name means Stream EDitor.
I'm not too sure whether this would work or not; it depends on the implementation. Usually DIFF algorithms are based on line differences, and the whole line is marked as different. Depending on how large the files are, this could take a lot of time (order O(nm)). The method I developed talking to Skippums (in the referenced thread) sped this up with some tricks.
Later, Skippums had another problem. He wanted an archive method for huge files (exceeding 512 MB), where the files were prior versions of documents, programs, etc.; what it amounted to was that he needed a small SED-type file created in order to save only the changes. Once you get into massive files, comparisons like this are extremely difficult and time consuming. I developed a method to minimize this time and was testing it. I asked for a sample file or two of this massive size (I could only come up with 9 and 11 MB libraries). I had sent him this data (and the algorithm; my code was in MASM), and he finally responded to my email about a month later explaining that he had been on vacation and would get back to me. That was in July. I have checked his message log, and that is about when his last message was posted here.
I have also worked with DIFF algos.
If any of this algo info would be of use to you, I would be happy to help you.
There is a question I have been wondering about for ages, and I was hoping someone could give me an answer to put my mind at rest.
Let's assume I have an input stream (like a file/socket/pipe) and want to parse the incoming data. Let's assume each block of incoming data is delimited by a newline, like most common internet protocols; this could just as well be parsing HTML/XML or any other structured data. The point is that the data is split into logical blocks by a delimiter rather than by a fixed length. How can I buffer the data while waiting for the delimiter to appear?
The question seems simple enough: just have a large enough byte/char array to fit the entire thing.
But what if the delimiter comes after the fixed-size buffer is full? This is really a question about how to fit a dynamic block of data into a fixed-size block. I can only really think of a few alternatives:
1) increase the buffer size when needed. This may require heavy memory reallocation, and may lead to resource exhaustion from specially crafted streams (or perhaps even denial of service in the case of sockets where we want to protect ourselves against exhaustion attacks and drop connections that try to exhaust resources...and an attacker starts sending fake, oversized, packets to trigger the protection)
2) start overwriting old data by using a circular buffer. Perhaps not an ideal method since the logical block would become incomplete
3) dump new data when the buffer is full. However, this way the delimiter will never be found so this choice is obviously not a good option
4) just make the fixed-size buffer damn large and assume all incoming logical data blocks are within its bounds...and if it ever fills up, just interpret the full buffer as a logical block...
In any case, I feel we must just assume that the logical blocks will never exceed a certain size...
Any thoughts on this topic? Obviously there must be a way, since the higher-level languages offer some sort of buffering mechanism with their readLine() stream methods.
Is there any "best way" to solve this, or is there always a tradeoff? I really appreciate all thoughts and ideas on this topic, since this question has been haunting me every time I have needed to write a parser of some sort.
You're right, reallocation can be very costly. One solution I've seen to be quite effective is to have a memory pool (a set of buffers of common length) and have your stream maintain the chaining: as more memory is required, chain in more blocks of memory from the pool (using a linked-list-like approach).
Finally (in the socket example), when the frame, packet or token has arrived, provide per-item (usually per-char) iterator access to the chain of buffers so as to make the chain transparent to the end user.
This obviously has its limits as well; specifically, the largest block you can read will be the size of the pool of memory. However, thanks to the design choice, if you ever reach that limit all you need to do is create more memory blocks in the pool, which is much cheaper than realloc'ing everything accumulated so far.
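A minimal sketch of the chained-buffer idea (all names and the block size are illustrative; a real version would draw blocks from a pre-allocated pool rather than calloc):

#include <stdlib.h>
#include <string.h>

#define BLOCK_SIZE 4096

typedef struct Block {
    struct Block *next;
    size_t used;
    char data[BLOCK_SIZE];
} Block;

typedef struct {
    Block *head, *tail;
} Chain;

/* Append incoming bytes, chaining in a new block whenever the tail fills up.
   Nothing already received ever has to be copied again. */
int chain_append(Chain *c, const char *buf, size_t len)
{
    while (len > 0) {
        if (c->tail == NULL || c->tail->used == BLOCK_SIZE) {
            Block *b = calloc(1, sizeof(Block));   /* in real code: take from the pool */
            if (b == NULL) return -1;
            if (c->tail) c->tail->next = b; else c->head = b;
            c->tail = b;
        }
        size_t n = BLOCK_SIZE - c->tail->used;
        if (n > len) n = len;
        memcpy(c->tail->data + c->tail->used, buf, n);
        c->tail->used += n;
        buf += n;
        len -= n;
    }
    return 0;
}

The reader would then scan the chain with the per-char iterator for the delimiter, and hand the completed logical block back to the caller once it is found.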
It sounds like you need to make a policy decision that's algorithm-independent: How large can a block be before you decide it's a denial-of-service attack?
This is domain-dependent. It depends on the knowledge you have of blocks' contents from your particular domain, and on the resources of your system. If you have lots of memory, you can afford to accept a big block on the chance it may be real.
The memory-reallocation approach wastes time because each reallocation requires you to recopy all the data received for that block so far. This starts to approach an O(n^2) running time for what should be a linear algorithm.
It seems like a buffer pool would use up memory waiting for big blocks that may never come. It would also limit the maximum size of a block you could accept.
The linked-list approach is the standard way of dealing with this type of situation; don't preallocate anything, and dynamically allocate blocks as you need them.
I know that I am opening Pandora's box, and I fully expect to be met with an angry horde of rabid Bats, but I have an idea for constructing a convex hull from a data set. What I was looking for was a "Best" algorithm (here come the Bats) so that I might code it as a baseline to see if my algorithm is any faster. I was actually looking for triangulation, but wanted to start with the convex hull first.
I have already googled convex hull, but there is not too much info about relative performance.
The data will be just 2D (x,y) pairs: 64K-1 unsigned shorts for both x and y, no repeats (randomly select an integer from the set 1 to 64K-1, remove it from the set, decrement the set size, and randomly select the next entry from the remainder). x and y will be randomized independently. The two values of a pair will not be equal (if they are, swap the y value with that of the immediately following or previously defined pair, whichever does not violate x != y for either resulting pair). The set of points will be saved to a file as binary pairs; each algorithm will then read the points and create the convex hull.
I will code my algorithm and the base algorithm in MASM, single threaded (the method should be able to be implemented as multi threaded, but for simplicity and timing, I wanted single threaded). I will use the MASM32 timing macros to set the CPU into Real Time Priority to try to get as accurate a timing as possible.
I'm not looking for code, just a detailed algorithm which would allow me to implement this sought after "Best" algorithm to use as a baseline.
Anyone have a guess how long this 64K-1 set might take to encircle? Should I reduce the size of the set?
Yes, I have read all of the CP articles and Wikis related to "Convex Hull" and read all of their references, but there is not too much reference to relative speeds under the same set of conditions, the touted advantages are mainly about how the algorithm extends to three or more dimensions, etc.
My attempt will be along the lines of "divide and conquer" with several twists I dreamed up (literally - went to bed thinking about the problem and woke up with what I thought was a good solution).
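For concreteness, one commonly cited O(n log n) baseline is Andrew's monotone chain; a rough C sketch of it (purely illustrative, since my real code will be in MASM):

#include <stdlib.h>

typedef struct { int x, y; } Point;

static int cmp_point(const void *a, const void *b)
{
    const Point *p = a, *q = b;
    if (p->x != q->x) return p->x - q->x;   /* sort by x, then by y */
    return p->y - q->y;
}

/* Cross product of (a-o) x (b-o); <= 0 means a clockwise or collinear turn. */
static long long cross(Point o, Point a, Point b)
{
    return (long long)(a.x - o.x) * (b.y - o.y) - (long long)(a.y - o.y) * (b.x - o.x);
}

/* Builds the hull into out[] (caller supplies room for 2*n points); returns the hull size. */
size_t convex_hull(Point *pts, size_t n, Point *out)
{
    if (n < 3) { for (size_t i = 0; i < n; i++) out[i] = pts[i]; return n; }
    qsort(pts, n, sizeof(Point), cmp_point);

    size_t k = 0;
    for (size_t i = 0; i < n; i++) {                       /* lower hull */
        while (k >= 2 && cross(out[k-2], out[k-1], pts[i]) <= 0) k--;
        out[k++] = pts[i];
    }
    for (size_t i = n - 1, t = k + 1; i-- > 0; ) {         /* upper hull */
        while (k >= t && cross(out[k-2], out[k-1], pts[i]) <= 0) k--;
        out[k++] = pts[i];
    }
    return k - 1;                                          /* last point repeats the first */
}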
Sorry for the double post, but please forgive me. I did not thank you for your reply (I did vote you for a good answer to my question). I actually had expected you to be the first to respond, and my initial reply was an attempt to be humorous (think back to 1941 and "Casablanca").
The limit for a particular class of functions or an arbitrary function?
If you want to enter any given function and return the mathematical limit, then you'll need a parser and some kind of symbolic math engine in order to carry out operations like L'Hopital's rule, etc...
I am wondering what kind of answer you are looking for. Are you looking for a numerical answer such as 3.14159 or a symbolic answer like PI? Also, do you want this to work for any function? For example, the limit as x goes to 0 of log(x) is not defined. I believe I could give you a better answer if I had a better idea of the real problem you are solving.
If you already have something for drawing, you should be able to adapt that. Lim x->c = slope of a function at point c, right? (change in y over change in x) If you add some boundary checking, you might come close without too much code required.
I think you need to be more clear about what, exactly, you want. For certain classes of functions calculating the limit is quite involved and requires differentiation. If you want to do this for any input function then you'll need to get into symbolic manipulation. That's a project in itself. Calculating mathematical limits is not an easy task.
Thank you all
In fact, yes, I want to write a program for calculating all limits, for example Lim(sin(2/x)) when x->0, or even x->Infinity,
so I think I must write a parser,
but maybe I can find something that helps me get started!
The problem is not just the parser. You will also need both a symbolic manipulation and numeric component. The symbolic package will be needed to calculate derivatives for certain limits that require l'Hopital's rule[^]. Other methods used in computer-based limit calculations are Wynn's epsilon algorithm[^] and generalized Euler transforms.
As you can see, it is a non-trivial project and requires an understanding of some rather advanced mathematics.
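To give a flavour of the numeric side, here is a rough sketch of Wynn's epsilon algorithm in C (the fixed array size and the lack of any guard against dividing by a near-zero difference are deliberate simplifications):

#include <stdio.h>

/* Accelerate the convergence of a sequence s[0..n-1] (e.g. partial sums) and
   return an estimate of its limit.  Assumes 1 <= n <= 63. */
double wynn_epsilon(const double *s, int n)
{
    double e[64][64];
    for (int j = 0; j < n; j++) {
        e[0][j] = 0.0;      /* the eps_{-1} column */
        e[1][j] = s[j];     /* the eps_0 column: the sequence itself */
    }
    for (int k = 2; k <= n; k++)
        for (int j = 0; j < n - k + 1; j++)
            e[k][j] = e[k - 2][j + 1] + 1.0 / (e[k - 1][j + 1] - e[k - 1][j]);
    int last = (n % 2 == 1) ? n : n - 1;   /* only the even eps columns (odd k here) estimate the limit */
    return e[last][0];
}

For example, feeding it the partial sums 1, 1.5, 1.75 of the series 1 + 1/2 + 1/4 + ... already returns 2.0 exactly.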
Yeah, but don't black bodies absorb and then re-emit photons of the same frequency? Melanin absorbs UV and then re-emits about 99.9% of it as infra-red. How does it do THAT? Where does the rest of the energy go?
Richard A. Abbott wrote:
Are you querying this because of interest in Anti-Cancer agents or ???
If I can remember back 40 years to my Chemistry/Physics classes, the energetic UV hits the atoms of the melanin and causes the electrons to jump up several states (all jumps are discrete jumps between energy levels), and then the electrons drop back down one level at a time, releasing the energy in the form of IR (lower frequency, lower energy). There is no extra energy, but the IR will cause an increase in temperature in the local area.
I'm not expecting a wodge of code handed to me on a plate (though I won't say no...), more a strong hint on where to go looking.
I have a set of data points (successive Y values evenly spaced along the X axis). This is, crudely, a flat line at 0 with a bump part of the way along, and then some noise added. The bump could well have a flat top to it.
By eye, it's easy to work out where the peak of the underlying curve should be, but a lot less easy to just grab a crude maximum value and expect it to be useful.
From experience, the curve is pretty close to a gaussian that someone has sliced the top off of. I'm trying to find some method of writing / reusing / blatantly stealing a black box function I can give a series of Y values, and get the parameters (centre X, magnitude and 'sharpness') of a gaussian curve.
Can you give me any pointers to useful web pages you've found?
I'm not averse to writing my own, wiggling values and doing fit comparisons, but I'm sure someone made a wonderful algorithm with three names in the 70s that would be vastly superior.
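For what it's worth, a crude moment-based starting guess (nothing like the properly named fitters I'm hoping someone will point me at, and the zero-baseline handling is only an assumption) would be along these lines:

#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

typedef struct { double center, sigma, amplitude; } GaussGuess;

/* Estimate centre, width and height of the bump from the moments of y[0..n-1],
   assuming the baseline is roughly zero and the noise averages out. */
GaussGuess gauss_moments(const double *y, int n)
{
    double sum = 0.0, sum_x = 0.0, sum_xx = 0.0;
    for (int i = 0; i < n; i++) {
        double w = y[i] > 0.0 ? y[i] : 0.0;      /* ignore negative noise excursions */
        sum    += w;
        sum_x  += w * i;
        sum_xx += w * i * (double)i;
    }
    GaussGuess g = { 0.0, 0.0, 0.0 };
    if (sum <= 0.0) return g;
    g.center = sum_x / sum;                              /* weighted centroid */
    g.sigma  = sqrt(sum_xx / sum - g.center * g.center); /* second central moment */
    if (g.sigma <= 0.0) return g;
    g.amplitude = sum / (g.sigma * sqrt(2.0 * M_PI));    /* area = amplitude * sigma * sqrt(2*pi) */
    return g;
}

These numbers would then just be the starting point for whatever iterative fit refinement ends up being used.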