OK, it seems I've hit a wall, because I can't wrap my head around the following:

I have a set of data y(x). Basically, x is discrete time (seconds) and y is a data value that depends on that x. Scaling y is simple: ScaledY = ln(y - min(y) + 1). If I need to compress the data, say by minute, that's also easy: calculate the average y for each minute, then calculate ScaledY using the formula above (a quick sketch of that part is at the end of this question, for reference). Now, the problem I'm trying to solve is to scale X logarithmically instead of Y. The only idea I have so far is this:

- divide my data set into two parts; calculate the average y for all x in the left part; then divide the right part in two and repeat this divide/average logic until everything is processed (rough sketch just below).

The reason I don't like this approach is the "average" over the left part, because it isn't really correct from a math point of view.
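If it helps to picture it, the halving idea looks roughly like this (a rough illustration only, assuming index 0 is the oldest sample; the function name is just mine):

C++
#include <numeric>
#include <vector>

// The halving idea: the left (older) half collapses into a single average,
// then the right (newer) half is split again, and so on.
void halveAndAverage(const std::vector<double>& y, size_t begin, size_t end,
                     std::vector<double>& out)
{
   if (end - begin <= 1)
   {
      if (end > begin)
         out.push_back(y[begin]);   // a single remaining point is kept as-is
      return;
   }
   size_t mid = begin + (end - begin) / 2;
   double leftAvg = std::accumulate(y.begin() + begin, y.begin() + mid, 0.0)
                    / (mid - begin);
   out.push_back(leftAvg);          // one value for the whole (older) left half
   halveAndAverage(y, mid, end, out);
}

Called as halveAndAverage(data, 0, data.size(), result), it produces roughly log2(N) points.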

Is there any other approach, or maybe a standard algorithm for this?
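For reference, the "easy" Y-scaling and per-minute compression mentioned above look roughly like this in code (simplified sketch; function and variable names are just illustrative):

C++
#include <algorithm>
#include <cmath>
#include <vector>

// Scale Y logarithmically: ScaledY = ln(y - min(y) + 1)
std::vector<double> scaleY(const std::vector<double>& y)
{
   double minY = *std::min_element(y.begin(), y.end());
   std::vector<double> scaled(y.size());
   for (size_t i = 0; i < y.size(); ++i)
      scaled[i] = std::log(y[i] - minY + 1.0);
   return scaled;
}

// Compress by minute: average every 60 consecutive one-second values
std::vector<double> compressByMinute(const std::vector<double>& y)
{
   std::vector<double> result;
   for (size_t i = 0; i < y.size(); i += 60)
   {
      size_t end = std::min(i + 60, y.size());
      double sum = 0.0;
      for (size_t j = i; j < end; ++j)
         sum += y[j];
      result.push_back(sum / (end - i));
   }
   return result;
}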
Comments
CPallini 26-Sep-13 12:33pm    
Why do you need to average?
Kosta Cherry 26-Sep-13 14:16pm    
Because in (any) left part the y values could be very different, yet that whole part collapses into a single y value. Basically, imagine an ln graph, but rotated by 90 degrees.
Sergey Alexandrovich Kryukov 26-Sep-13 15:54pm    
The question is not 100% clear, but what's the problem, school algebra? Then remember one thing: if you have some equation, you can take the logarithm of its left and right sides; the expression stays equivalent at the points where the expression under the logarithm is not 0. :-)
—SA
Kosta Cherry 26-Sep-13 16:25pm    
Nope, it's not school algebra, and not an assignment either :)
It's a real-world problem. I have a data set - a very big one. It can be presented as an array where the index is a second and the value is the data value for that second. I need to compress the data for future analysis, but in a special way - older data is less "influential" and newer data is more "important". So I decided on logarithmic compression: the older the data, the more it gets compressed. For example, for the most recent minute I want to keep all 60 values, but for a minute that was an hour ago I need just one value. For data from a day ago I need one value for a whole hour, and so forth. The thing is, this is not exactly logarithmic, and averaging is not exactly the way to do it either, because within one compression interval the values are also not equally important - the closer data is to the present, the more "weight" it should have. I mean, I could go with the idea I currently have, but what bugs me is this: it should be a pretty standard scaling algorithm, yet I'm already 20 years past university :(, never needed it before, and just don't know how to formulate the problem properly to google it. If someone could just point me to the right resource (preferably with a readily available algorithm; language is not important), that would be very helpful.
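To make that concrete, the kind of resolution I have in mind looks roughly like this (just the example breakpoints from above; names are illustrative, and the real thing should be smoother and more log-like):

C++
#include <cstddef>

// Example breakpoints only: full resolution for the last minute,
// per-minute values for the last hour, per-hour values beyond that.
size_t bucketWidthForAge(size_t ageSeconds)
{
   if (ageSeconds < 60)    return 1;     // last minute: keep every second
   if (ageSeconds < 3600)  return 60;    // last hour: one value per minute
   return 3600;                          // older: one value per hour
}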
CPallini 26-Sep-13 16:58pm    
I guess you are looking for a distribution. Suppose your data is continuous, say y(x).
Then you could choose, rather arbitrarily, a weight w(x) in such a way that the contribution of an interval {x0,x1} is given by the integral of w(x)y(x)dx computed over {x0,x1}, divided by the integral of w(x)dx computed over the whole x range (the latter integral is a normalization factor). That would work provided the integrals converge. Of course, in your actual algorithm you have to discretize, using sums instead of integrals, but I suppose you get the idea.
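In discrete form that boils down to something like this (a sketch only; the weight w and the name are arbitrary, and here the normalizing sum runs over the same interval, i.e. an ordinary weighted average):

C++
#include <vector>

// Discrete version of the hint above: the contribution of index range
// [i0, i1] is sum(w(i) * y(i)) / sum(w(i)), with an arbitrary weight w.
template <typename WeightFn>
double weightedContribution(const std::vector<double>& y,
                            size_t i0, size_t i1, WeightFn w)
{
   double num = 0.0;   // discretized integral of w(x) * y(x) dx
   double den = 0.0;   // discretized normalization integral of w(x) dx
   for (size_t i = i0; i <= i1 && i < y.size(); ++i)
   {
      num += w(i) * y[i];
      den += w(i);
   }
   return num / den;
}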

1 solution

Thanks to CPallini, who gave me a good hint, I came up with the function below, in case anyone is interested. (This is rough, first-pass code, but it produced exactly what I was looking for.)

C++
#include <cmath>
#include <vector>

void scaleXLn(std::vector<double>& data, std::vector<double>& result)
{
   size_t oldSize = data.size();
   size_t newSize = log((double)oldSize);
   ++newSize; // because it is always truncated during conversion
   result.resize(newSize, 0.0);
   // now we have "newSize" number of "new" points.
   // Let's loop by them:
   size_t leftIdx = 0; 
   for (int i=newSize - 1; i>=0; --i)
   {
      // get the right index:
      size_t rightIdx = i ? oldSize - exp(i) : oldSize - 1;
      // now we have "old" interval "left" to "right".
      // Our w(x) is 1/log(x + 1)
      // Let's calculate weighted average:
      double wxSum = 0.0f;
      double avg = 0.0f;
      for (size_t j = leftIdx; j<=rightIdx; ++j)
      {
         double wx = 1.0f/log(oldSize - j + 1);
         wxSum += wx;
         avg += wx*data[j];
      }
      result[newSize - i - 1] = avg/wxSum;
      // next
      leftIdx = rightIdx + 1;
   }
}
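For a quick sanity check, a throwaway driver like this can be pasted below the function (the data is just a ramp of one hour of fake one-second samples):

C++
#include <iostream>
#include <vector>

int main()
{
   // One hour of fake one-second samples (a simple ramp), just to try it out.
   std::vector<double> data(3600);
   for (size_t i = 0; i < data.size(); ++i)
      data[i] = static_cast<double>(i);

   std::vector<double> result;
   scaleXLn(data, result);   // log(3600) ~ 8.19, so we get 9 output points

   for (size_t i = 0; i < result.size(); ++i)
      std::cout << "point " << i << ": " << result[i] << "\n";
   return 0;
}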
 
Comments
Philippe Mori 26-Sep-13 21:10pm    
Why are you initializing a double with a float literal (e.g. double wxSum = 0.0f;)? Here it does not matter, as the fractional part is 0, but otherwise you might lose some precision.
Kosta Cherry 26-Sep-13 22:27pm    
No difference in this case, and, as I mentioned, this is fast, rough first code. I also forgot the "const" on the first argument, the first "for" loop can be improved, etc., etc. By no means is this the final version; it's posted here just to illustrate the idea of the algorithm.
Philippe Mori 27-Sep-13 8:23am    
It is a bad habit to add an f at the end of floating-point literals that are not of float type. If you do it for 0.0f and 1.0f, I'm pretty sure you would do it for 0.1 as well.

By the way, with VC++ (debug), if you write double x = 0.1f - 0.1; std::cout << x;, you will get 1.49012e-009, which indeed proves that the extra f causes a loss of precision for any number that cannot be represented exactly.

On the positive side, you haven't added an f to the 1 in the expression log(oldSize - j + 1), which is good, because using 1.0f there would cause the compiler to use the float overload. But it would be even better to be explicit and write 1.0.

For the missing const, you are right, but since many users are new programmers, I think it is nice to avoid those little mistakes, as they might copy the code verbatim into their own application.
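For anyone who wants to reproduce that check, the complete snippet is simply:

C++
#include <iostream>

int main()
{
   // 0.1f rounds 0.1 to float precision before the subtraction,
   // so the result is not exactly zero.
   double x = 0.1f - 0.1;
   std::cout << x << '\n';   // prints roughly 1.49012e-009
   return 0;
}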
CPallini 27-Sep-13 3:20am    
You did it the other way around, fine. My 5.
