|
prasadbuddhika wrote: however at the moment i'm expecting suport
Expect away; however, the relevant variation of the old saying is: sh*t in one hand and expect in the other, and see which one fills up first.
Save an endangered species. The American Engineer.
|
|
|
|
|
You only have to write a simple recursive function.
The input is the list of numbers.
It takes the rightmost number that is different from 1 (call it N) and splits it into (N-1, 1); then the function passes the new list to itself again (recursively).
Stop when the list contains only 1s.
prasadbuddhika wrote: 3 1
2 2
Something strange happens here, ... so probably the rules are more complicated.
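The simple rule as described could be sketched like this (a Python sketch; the function name and the choice of returning every intermediate list are my own, since the original problem statement isn't shown):

```python
def decompose(nums):
    """Repeatedly split the rightmost entry N != 1 into (N-1, 1)
    until the list contains only 1s; return every intermediate list."""
    steps = [list(nums)]
    cur = list(nums)
    while any(n != 1 for n in cur):
        # rightmost entry that is different from 1
        i = max(j for j, n in enumerate(cur) if n != 1)
        cur = cur[:i] + [cur[i] - 1, 1] + cur[i + 1:]
        steps.append(cur)
    return steps
```

For example, `decompose([3])` walks [3] -> [2, 1] -> [1, 1, 1]. Note this simple rule does not produce the 3 1 -> 2 2 step quoted above, so the real rules may indeed be more complicated.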
good luck
Russell
|
|
|
|
|
Anyone know how I can implement this in code?
Trying to synchronize data samples from two different hardware devices. One device is sampling around 10pt/s and the other device is sampling at 50pt/s. These samples are displayed on a graph in realtime.
|
|
|
|
|
I think you can simply 'decimate' (factor = 5) the second signal so that both have the same rate, and then synchronize them. To do so you can use correlation on the whole signal, or on a small part of it (when large shifts are not expected) to correct small offsets.
If it is in real time, you can implement a dynamic delay that shifts one of the two signals to achieve synchronization before recording starts. And of course it can also work during recording, to adjust the shift value.
This is a big chapter of signal processing.
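A minimal sketch of the idea (Python with plain lists rather than a DSP library; the 5:1 rate ratio comes from the post above, the function names and the brute-force correlation search are my own):

```python
def decimate(signal, factor=5):
    """Keep every factor-th sample, e.g. 50 pt/s -> 10 pt/s."""
    return signal[::factor]

def best_lag(a, b, max_lag):
    """Return the shift of b (in samples) that maximizes the
    cross-correlation with a over the overlapping region."""
    best, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = sum(a[i] * b[i + lag]
                    for i in range(len(a)) if 0 <= i + lag < len(b))
        if score > best_score:
            best_score, best = score, lag
    return best
```

Applying `best_lag` to the decimated fast signal and the slow signal gives the offset to feed into the dynamic delay.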
Russell
|
|
|
|
|
I need an algorithm that returns the longest exact matching substring from two strings. For example, passing in the strings "abacadaba" and "dbaadabb" would return "adab". The best way that I can think of would be to construct a boolean table of size m by n, and check each character for a match and put a true value if they match and false if they don't. Then I could look along each diagonal for the largest consecutive set of true values. However, this algorithm takes O(n*m) memory, O(n*m) time to build the table, then O(n*m) time to look for the longest diagonal. There is definitely an algorithm that runs faster, but I cannot figure out what it is. I was wondering if anyone has a better idea as to how I can accomplish this task. Thanks
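For reference, the table approach described above can be written with a single rolling row, which drops the memory to O(min(n, m)) while keeping the O(n*m) time (a Python sketch; the function name is mine):

```python
def longest_common_substring(s1, s2):
    """Classic dynamic program: cur[j] is the length of the common
    suffix of s1[:i] and s2[:j]; track the best one seen."""
    best = ""
    prev = [0] * (len(s2) + 1)
    for i in range(1, len(s1) + 1):
        cur = [0] * (len(s2) + 1)
        for j in range(1, len(s2) + 1):
            if s1[i - 1] == s2[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > len(best):
                    best = s1[i - cur[j]:i]
        prev = cur
    return best
```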
-- modified at 12:19 Wednesday 29th August, 2007
|
|
|
|
|
s1 vs s2
(s1 is the shorter)
It seems that the longest matching string can be found by comparing
* s1 inside s2
* then s1(0:end-1) vs s2
* then s1(0:end-2) vs s2
* ...
* Stop the search at the first match: it is the longest one.
This could be quite slow if you compare two very long strings (like two books), but it needs almost no memory: only m+n (or none at all if you use pointers into the original strings).
Skippums wrote: Levenshtein table
That is also very slow.
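A sketch of this longest-first search in Python (note that it has to try every starting position at each length, not only prefixes of s1; Python's `in` operator does the inner scan):

```python
def longest_match_shrinking(s1, s2):
    """Try the longest candidate lengths first; the first hit wins.
    Checks every substring of the shorter string against the longer."""
    if len(s1) > len(s2):
        s1, s2 = s2, s1
    for length in range(len(s1), 0, -1):
        for start in range(len(s1) - length + 1):
            if s1[start:start + length] in s2:
                return s1[start:start + length]
    return ""
```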
Russell
|
|
|
|
|
Doesn't that algorithm assume that s2 contains the start of the string s1? This is not actually the case. The longest substring could occur anywhere in either string. So for that algorithm to work, I would have to try s1, then s1 without the first character, then s1 without the second character, then s1 without the third character, etc. I think it would actually be faster (not really sure) to do it the way I proposed with the boolean table.
|
|
|
|
|
Hi,
I can offer a few ideas that may reduce the size of the problem:
1. remove in string1 all chars that don't occur in string2 and vice versa;
this process typically shortens the strings, and reduces the problem to a
larger number of much smaller problems.
In your example it replaces ("abacadaba", "dbaadabb")
by ("aba", "dbaadabb") and ("adaba", "dbaadabb") based on 'c'
and the first of these can be reduced further based on 'd'.
2. search for small patterns that occur in one and not in the other string.
Again in your example that would be "aa", allowing a further reduction (here
you can split in between the two a's).
3. Once you have sufficiently reduced you could consider your quadratic cost
approach.
4. but as an alternative you could do a kind of binary search on the match length:
- start with an arbitrary length, in the shorter string, select some substring
with that length, and try to find it in the other string. If you find one,
the match can only be that size or larger. If you don't find one, you could
try all other substrings or fall back to a smaller length.
It all depends on the kind of data you want to handle well. Your example is not
a real indication I guess.
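Idea 1, the alphabet reduction, is cheap to sketch (Python; cutting on characters absent from the other string, as in the ("aba", ...) / ("adaba", ...) example above):

```python
def split_on_missing(s, other):
    """Cut s at every character that does not occur in other;
    each fragment is a smaller independent subproblem."""
    common = set(s) & set(other)
    # replace foreign characters by a separator, then split on it
    frags = "".join(c if c in common else "\x00" for c in s).split("\x00")
    return [f for f in frags if f]
```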
Luc Pattyn [Forum Guidelines] [My Articles]
this weeks tips:
- make Visual display line numbers: Tools/Options/TextEditor/...
- show exceptions with ToString() to see all information
- before you ask a question here, search CodeProject, then Google
|
|
|
|
|
OK. I have thought about it some more, and have come up with the following solution:
Let lstr be the longer string and sstr be the shorter string...
First, I will build a dictionary with characters as keys and a singly linked list of integer indexes as the value, which will represent lstr. This table will take O(lstr.len) memory and O(lstr.len) time to build, but will be keyed in O(1) time.
Then I will make the following recursive function, which is sent the lookup table, the two strings, and an integer representing the longest string found thus far:
1. If the longest string found so far is longer than the sstr passed in, return the empty string.
2. Initialize result to be the empty string.
3. Choose character c from sstr to be the center character.
4. For each index in the dictionary for c, find the matching substring (match).
5. If match.len from (4) > result.len initialized in (2), result = match. If more indexes occur for c in the dictionary, go back to (4).
6. If result.len > sstr.len / 2, return result.
Else
resultLow = recursively call this using sstr.substr(0, sstr.len / 2) and
resultHigh = recursively call this using sstr.substr(sstr.len / 2)
return the longest of (result, resultLow, resultHigh)
This will limit my searches so I don't have to waste my time comparing things that are already found. I think this algorithm will turn out to be O(lstr.len * log(lstr.len)). Any thoughts?
-- modified at 14:28 Wednesday 29th August, 2007
|
|
|
|
|
I very much like the idea of building the dictionary! But I mean a dictionary of all the words inside s1 and s2 (be careful with separators: " ", ",", ".", ...).
That way you can convert the long array of chars into a short array of integers:
the integers are simply the words replaced by their index in the dictionary (... the dictionary need not really exist: we don't need to get back to the original text).
I think this approach is also used by some compression algorithms.
Now you can apply whatever algorithm you want, but the comparison will surely run faster (it only depends on how much time you spend building the dictionary).
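A sketch of that word-to-integer substitution (Python; tokenizing on whitespace only, which sidesteps the punctuation caveat above, and growing the vocabulary on the fly):

```python
def to_ids(text, vocab):
    """Replace each word by its index in vocab, extending vocab as we go.
    Comparing the id arrays is cheaper than comparing raw characters."""
    ids = []
    for word in text.split():
        if word not in vocab:
            vocab[word] = len(vocab)
        ids.append(vocab[word])
    return ids
```

Because the same `vocab` is shared, equal words in both strings map to equal integers, so any matching algorithm can run on the shorter id arrays instead.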
Russell
|
|
|
|
|
It is not a dictionary of words, but a dictionary of characters. That way, I know there are only about 40 characters that I have to worry about (a-z (case insensitive), 0-9, and a few other special characters). I am in this instance doing character matching, not word matching, and I am only building the dictionary for the longer of the two strings s1 or s2.
|
|
|
|
|
Skippums wrote: It is not a dictionary of words, but a dictionary of characters
My dictionary is different from yours: if you have little strings that are repeated, you can treat them as 'words' and proceed with my method (just to reduce the length of the array).
Only you can decide; I don't know what kind of strings you are working on and what kind of simplification can be done.
Russell
|
|
|
|
|
Sounds like part of doing a diff of two files.
|
|
|
|
|
I get the feeling it's going to end up O(n^2)-ish no matter how you code around it. How big are your strings?
Brute force without the RAM overhead would be to scan index pointers through A and B, looking ahead in A as you get matches in B and keeping the longest match, then incrementing the A pointer to the next character, etc.
Depending on what your text looks like, you might be able to hack in some optimisations (like checking if A[ia + longestmatchlength] != B[ib + longestmatchlength] -> skip this ib).
In fact that small optimisation could speed up an individual scan of the B array over time, possibly even flattening the O(n^2) back down to closer to O(n).
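A sketch of that pointer scan with the rejection trick (Python; the observation is that to beat the current record, the characters at offset len(best) must also match, so mismatching start pairs are skipped without an inner scan):

```python
def longest_match_scan(a, b):
    """Brute-force scan of all start pairs, with a cheap rejection test."""
    best = ""
    for ia in range(len(a)):
        for ib in range(len(b)):
            k = len(best)
            # a longer match would need the chars at offset k to agree too
            if ia + k >= len(a) or ib + k >= len(b) or a[ia + k] != b[ib + k]:
                continue
            n = 0
            while ia + n < len(a) and ib + n < len(b) and a[ia + n] == b[ib + n]:
                n += 1
            if n > len(best):
                best = a[ia:ia + n]
    return best
```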
|
|
|
|
|
There has been a lot of work done on this for diff'ing files, and many algorithms have been published. Just google for diff algorithms; there are a couple of articles on CP, e.g. here[^] and here[^]
Peter
"Until the invention of the computer, the machine gun was the device that enabled humans to make the most mistakes in the smallest amount of time."
|
|
|
|
|
The second link that you gave proposes the exact same idea that I came up with yesterday. I will only perform step #1, but that page does confirm that my idea on how to tackle this is a valid one. Using that algorithm, I think I will eliminate most of the checks, which will yield an O(n*log(n)) algorithm (I think). Thank you very much for posting this so I could confirm that I am not going down a dead end! Thanks to everyone else who posted responses as well!
|
|
|
|
|
Here is the code (in C#) that I finally ended up implementing. As I said before, I think it is O(n*log(n))...
public static string LongestMatchingSubstring(string s1, string s2, int minMatchLength, bool caseSensitive) {
    if (string.IsNullOrEmpty(s1) || string.IsNullOrEmpty(s2))
        return string.Empty;
    string shorter, longer;
    if (s1.Length > s2.Length) {
        shorter = s2;
        longer = s1;
    } else {
        shorter = s1;
        longer = s2;
    }
    if (shorter.Length < minMatchLength)
        return string.Empty;
    Dictionary<char, IList<int>> hashTable = new Dictionary<char, IList<int>>(
        Math.Min(caseSensitive ? 96 : 64, longer.Length));
    if (caseSensitive) {
        for (int i = 0; i < longer.Length; ++i) {
            char c = longer[i];
            if (!hashTable.ContainsKey(c))
                hashTable.Add(c, new List<int>());
            hashTable[c].Add(i);
        }
    } else {
        for (int i = 0; i < longer.Length; ++i) {
            char c = char.ToLower(longer[i]);
            if (!hashTable.ContainsKey(c))
                hashTable.Add(c, new List<int>());
            hashTable[c].Add(i);
        }
    }
    string result = string.Empty;
    --minMatchLength;
    FindLongestMatchingSubstring(shorter, longer,
        hashTable, ref result, ref minMatchLength, caseSensitive);
    return result;
}

private static void FindLongestMatchingSubstring(string str1, string str2,
        Dictionary<char, IList<int>> ht, ref string result, ref int minMatchLength, bool caseSensitive) {
    if (str1.Length <= minMatchLength)
        return;
    int midpoint = str1.Length / 2;
    char key = caseSensitive ? str1[midpoint] : char.ToLower(str1[midpoint]);
    if (ht.ContainsKey(key)) {
        foreach (int index in ht[key]) {
            string temp = GetFullMatch(str1, midpoint, str2, index, caseSensitive);
            if (temp.Length > minMatchLength) {
                minMatchLength = temp.Length;
                result = temp;
            }
        }
    }
    FindLongestMatchingSubstring(str1.Substring(0, midpoint),
        str2, ht, ref result, ref minMatchLength, caseSensitive);
    FindLongestMatchingSubstring(str1.Substring(midpoint + 1),
        str2, ht, ref result, ref minMatchLength, caseSensitive);
}

private static string GetFullMatch(string str1, int str1Idx, string str2, int str2Idx, bool caseSensitive) {
    int minOffset = -Math.Min(str1Idx, str2Idx);
    int maxOffset = Math.Min(str1.Length - str1Idx, str2.Length - str2Idx) - 1;
    int lowOffset = -1, highOffset = 1;
    if (caseSensitive) {
        for (; lowOffset >= minOffset && str1[str1Idx + lowOffset] ==
               str2[str2Idx + lowOffset]; --lowOffset)
            ;
        for (; highOffset <= maxOffset && str1[str1Idx + highOffset] ==
               str2[str2Idx + highOffset]; ++highOffset)
            ;
    } else {
        for (; lowOffset >= minOffset && char.ToLower(str1[str1Idx + lowOffset]) ==
               char.ToLower(str2[str2Idx + lowOffset]); --lowOffset)
            ;
        for (; highOffset <= maxOffset && char.ToLower(str1[str1Idx + highOffset]) ==
               char.ToLower(str2[str2Idx + highOffset]); ++highOffset)
            ;
    }
    ++lowOffset;
    --highOffset;
    return str2.Substring(str2Idx + lowOffset, highOffset - lowOffset + 1);
}
Hope this helps some people!
Jeff
|
|
|
|
|
Skippums wrote: I think it is O(n*log(n))...
I did not look at the code, but why don't you measure it?
It suffices to perform three runs at sizes c, 10*c and 100*c,
where the run at c should take a measurable time (say 1 second).
From the measured times, you can figure out the curve that fits...
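That curve fitting reduces to one line: on a log-log plot, the growth exponent is the slope between two runs (a Python sketch; the helper name is mine):

```python
import math

def growth_exponent(n1, t1, n2, t2):
    """Estimate k in t ~ c * n^k from two (size, time) measurements."""
    return math.log(t2 / t1) / math.log(n2 / n1)
```

Times of 1 s and 100 s at sizes c and 10*c give an exponent of 2, i.e. quadratic; a bit over 10 s at 10*c would instead suggest n*log(n).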
Luc Pattyn [Forum Guidelines] [My Articles]
|
|
|
|
|
It is slightly more difficult, though, because the time it takes depends on the data. For example, if the longest substring matches immediately and is longer than half the length of the shorter string, then it takes only O(n) time. If none of the characters from either string match, it also takes only O(n) time. If the smaller string is shorter than the minimum result you are willing to accept, it takes O(1) time to inform me that I am an idiot for trying. I would like to test it for the worst-case scenario, but I am having trouble figuring out what that is. Anyone have any insight into a case where this will perform worse than O(n)?
|
|
|
|
|
Seems like you need a lot of test cases (random data could be considered) and a good
definition of what your primary measure of size "n" is.
Lacking these, your "I think it is O(n*log(n))..." seems a bit premature.
Luc Pattyn [Forum Guidelines] [My Articles]
|
|
|
|
|
Worst case that I can think of is a string like 0101010...10 of length m = 2^i - 1, where i is a positive integer, and another of all 1s of length n, where n is any positive integer > m. Then this algorithm will perform approximately (m / 2) * 2 * n * log2(m + 1) = m*n*i operations, or O(n*n*log(n)). So I guess that settles whether or not it is O(n*log(n))! It should still outperform almost any other algorithm when 1) the alphabet over which you are performing the search is reasonably large, and 2) the expected overlap between the two strings is reasonably large. Ultimately, I don't really care what the final time of this algorithm is; I am only looking for advice on how I can improve it. Thanks,
Jeff
|
|
|
|
|
Sorry for this late posting, but I was just reading old forums and ran into this.
If you are still interested in this, I have an algorithm that runs pretty quickly, albeit written in MASM, but the method could be written in C# (I think it could; I have never coded in C#). I have a complete working program, source and EXE, and a way to enter test cases via a file containing the strings. It is quite lengthy, but mostly commentary. The source with comments is about 2250 lines; the MaxSubstring function is 472 lines with comments. Assembled with MASM 6.15.
Dave Augustine
|
|
|
|
|
It's been a long time since I did any really hard maths, and a friend of mine just asked me the following question:
"How can I easily find all possible solutions for
3125r = 1024x - 8404
where r and x are both integers, and r is divisible by 5"
Any very intelligent people out there who know of an answer?
|
|
|
|
|
Hi,
there is an infinite number of solutions, given by:
r=1020+5120t
x=3121+15625t
where any value of t will do.
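A quick sanity check of that parametric solution (Python; the function name is mine, and the assertion confirms both the equation and the divisibility condition for any t):

```python
def solution(t):
    """General solution of 3125*r = 1024*x - 8404 with r divisible by 5."""
    r = 1020 + 5120 * t
    x = 3121 + 15625 * t
    assert 3125 * r == 1024 * x - 8404 and r % 5 == 0
    return r, x
```

The step sizes 5120 = 5*1024 and 15625 = 5*3125 arise because gcd(3125, 1024) = 1 gives base steps (1024, 3125), and the extra factor 5 keeps r divisible by 5.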
Luc Pattyn [Forum Guidelines] [My Articles]
|
|
|
|
|
I thought there would be an infinite number of solutions (with no particular grounding or proof). But that's certainly a faster way of finding them...
|
|
|
|
|