C++

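```cpp
#include <cmath>
#include <cstddef>
#include <vector>

void scaleXLn(const std::vector<double>& data, std::vector<double>& result) {
    const std::size_t oldSize = data.size();
    if (oldSize == 0) {  // log(0) is undefined, so there is nothing to compress
        result.clear();
        return;
    }
    // floor(ln(oldSize)) + 1, because the value is always truncated during conversion
    const std::size_t newSize =
        static_cast<std::size_t>(std::log(static_cast<double>(oldSize))) + 1;
    result.resize(newSize, 0.0);
    // Now we have "newSize" number of "new" points. Let's loop over them:
    std::size_t leftIdx = 0;
    for (int i = static_cast<int>(newSize) - 1; i >= 0; --i) {
        // Get the right index:
        const std::size_t rightIdx =
            i ? static_cast<std::size_t>(oldSize - std::exp(i)) : oldSize - 1;
        // Now we have the "old" interval from "left" to "right".
        // Our w(x) is 1/log(x + 1).
        // Let's calculate the weighted average:
        double wxSum = 0.0;
        double avg = 0.0;
        for (std::size_t j = leftIdx; j <= rightIdx; ++j) {
            const double wx = 1.0 / std::log(static_cast<double>(oldSize - j + 1));
            wxSum += wx;
            avg += wx * data[j];
        }
        result[newSize - i - 1] = avg / wxSum;
        // Next interval
        leftIdx = rightIdx + 1;
    }
}
```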

—SA

It's a real-world problem. I have a data set, a very big one. It can be represented as an array, where the index is a second and the value is the data value for that second. I need to compress the data for future analysis, but in a special way: older data is less "influential", and newer data is more "important". So I decided to do logarithmic compression: the older the data is, the more it gets compressed. For example, for the most recent minute I want to keep all 60 values, but for a minute that was an hour ago I need just one value. For data from a day ago I need one value for the whole hour, and so forth.

The thing is, this is not exactly logarithmic, and plain averaging is not exactly the way to do it, because within one compression interval the values are also not equally important: the closer data is to the present, the more "weight" it should have.

I mean, I can go with the idea I currently have, but what bugs me is this: this should be a pretty standard scaling algorithm. It's just that I'm already 20 years past university :( and never needed this before, and I just don't know how to formulate the problem properly to google it. If someone can point me to the right resource (preferably with a readily available algorithm; the language is not important), that would be very helpful.

... distribution. Suppose your data is continuous, say `y(x)`. Then you could choose, rather arbitrarily, `w(x)` in such a way that the contribution of the interval `{x0,x1}` is given by the integral of `w(x)y(x)dx`, computed over the range `{x0,x1}`, divided by the integral of `w(x)dx`, computed over the whole `x` range (the latter integral is a normalization factor). That would work provided the integrals converge. Of course, in your actual algorithm you have to discretize, using sums instead of integrals, but I suppose you got the idea. If `x` is 'how much time has elapsed', that is, events far in the past have a large positive `x`, then `w(x)=1/x` looks like a good candidate to me.

I'll post the resulting function in a solution in case anyone is interested.

Come on, I hope you are almost there already, make some tiny effort... :-)

—SA
