Click here to Skip to main content
15,893,923 members
Articles / Programming Languages / Visual C++ 10.0

Precise and safe calculation method for the average of large integer arrays

Rate me:
Please Sign up or sign in to vote.
4.91/5 (12 votes)
13 Aug 2014CPOL4 min read 25.8K   11   15
How to avoid overflow and still retain precision on large integer calculations

Introduction

Integer calculations can be tricky, if done excessively. Even with values in an unproblematic range, repeated additions and other operations can lead to integer overflow. When this problem arises, even trivial algorithms can be difficult to implement.

The usual solution is to do the calculations using a higher precision data type for the intermediate values, such as long long or double.

This article offers an alternate algorithm to calculate an average for a large number of integer values under the premise that the calculation needs to be exact, and, therefore, using doubles for the intermediate values is not an option. It is inspired by a question posted here: http://www.codeproject.com/Messages/4878957/Calculute-average-value-of-large-array.aspx.

Background

A link posted by user k5054 in the thread mentioned points to the Cumulative Moving Average method: http://en.wikipedia.org/wiki/Moving_average#Cumulative_moving_average. This method provides an alternate way to calculate the average of a series of numbers. While the method is not concerned with the aspect of integer overflow, it offers a way to incrementally determine averages of a part of the series, and thus obviates the need for an intermediary value holding a considerably larger value, which may by subject to an overflow.

Modified Algorithm

The algorithm incrementally calculates the cumulative moving average (CMA) of all values xi up to a point. To avoid having to revisit previous values for each iteration, the calculation is using an alternate formula based on the previous average:

CMAn+1 = CMAn + (xn+1 - CMAn) / (n+1)

In this equation, all values and intermediate results are on the same order of magnitude as the original values. The largest intermediate value you can get is the difference between the largest and smallest input value: e. g. if these two values are stored in x1 and x2, then the above formula will calculate the difference x2 - x1 on the second iteration.

This formula does assume an exact calculation however: if the division of the second summand is an integer division, some fractions will be cut off, and these can add up to a significant amount. Therefore we will need to keep track of the cumulated remainder (CMR) of the division:

CMAn+1 = CMAn + (xn+1 - CMAn + CMRn) div (n+1)

CMRn+1 = (xn+1 - CMAn + CMRn) mod (n+1)

The biggest intermediate value here is the divisor used in both formulas, (xn+1 - CMAn + CMRn) . The subtraction will at most deliver the difference between the smallest and largest value of the xi, and CMR can be any value between -n and n (the remainder can be negative if the divisor is). Thus there will be no overflow as long as the range of input values doesn't cover most of the integer range to start with, or the number of elements gets close to the maximum integer value of the integer type being used.

Using the code

The code uses the C++11 features decltype and std::remove_reference. I have implemented and tested it in VS2010. Note that I deliberately did not use a size type for n_values, even though it can't be negative. The reason is that the division and modulo operators will not always provide the expected results for mixed signed/unsigned arguments. Besides, n_values must be in the base type range to avoid overflow.

C++
#include <utility>

template <class container_iterator>
auto average(const container_iterator start, container_iterator end)
   -> typename std::remove_reference<decltype(*start)>::type
{
   typedef std::remove_reference<decltype(*start)>::type basetype;

   basetype cumulated_average = 0;
   basetype cumulated_remainder = 0;
   basetype addendum = 0;
   basetype n_values = 0;
   for (auto pvalue = start; pvalue != end; ++pvalue) {
      ++n_values;
      addendum = *pvalue - cumulated_average + cumulated_remainder;
      cumulated_average += addendum / n_values;
      cumulated_remainder = addendum % n_values;
   }
   return cumulated_average;
}

The code provides a single function that can be used like many of the algorithms of the STL library. It takes two arguments that define a range of values within a container. The function will automatically derive the type of the values inside the container from those arguments. If your compiler does not support the C++11 features decltype and std::remove_reference, you can add the value type as an argument to the template argument list instead.

You can call this function by simply passing it two iterators or pointers that point to the first and beyond the last elements of the container, that you wish to calculate the average of. Here is an example:

C++
void test_average() {
   char cvalues[] = { 13,7,-27, 34, -3, 22, 33, -1, 18, 29,
                      13,7,-27, 34, -3, 22, 33, -1, 18, 29,
                      13,7,-27, 34, -3, 22, 33, -1, 18, 29,
                      13,7,-27, 34, -3, 22, 33, -1, 18, 29,
                      13,7,-27, 34, -3, 22, 33, -1, 18, 29
                    };
   auto cavg = average(cvalues, cvalues+50);
}

While debugging this example I've found the range of values to be:

  • cvalues: -27 to +34
  • cumulated_remainder: -28 to +25
  • cumulated_average: -2 to +13

All values were in the predicted range. This is only one test case, but the original code does contain more debug code that helps assert that the values of cumulated_average and cumulated_remainder were always accurate at all times.

Points of Interest

This is the first time I've actually used decltype, and, while trying to figure out how to use it correctly, the first time I've even heard of std::remove_reference. These new language features can be quite useful, and it's well worth reading up on them, if you haven't done so already. In this example, using them saved me one template argument. but, more importantly, it saves the caller the effort to provide it, and me the effort to verify that the iterators passed to the function actually point to values of the exact same type.

I'd like to point out as well that some other algorithms relevant to statistics also exist in iterative form, e. g. for determining the standard deviation. However, I can't imagine a reason why anyone would want to restrict himself on integer arithmetic for these. But the iterative formulas can still be useful when you're facing huge amounts of data.

History

2014-08-13 Article first version

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Software Developer (Senior)
Switzerland Switzerland
Graduated at TU Darmstadt in Math & CS, with a heavy focus on CAD/CAM

Programming and designing applications in C++ in the areas AI, real-time programming, client-server applications and CAD/CAM since 1985.

Personal interests: AI, computer graphics, games, reading

Comments and Discussions

 
GeneralMy vote of 5 Pin
Shao Voon Wong3-Jan-20 17:09
mvaShao Voon Wong3-Jan-20 17:09 
GeneralMy vote of 1 Pin
Yoldas Askan15-Sep-14 7:05
Yoldas Askan15-Sep-14 7:05 
GeneralRe: My vote of 1 Pin
Stefan_Lang15-Sep-14 20:01
Stefan_Lang15-Sep-14 20:01 
GeneralRe: My vote of 1 Pin
Yoldas Askan15-Sep-14 20:12
Yoldas Askan15-Sep-14 20:12 
GeneralRe: My vote of 1 Pin
Stefan_Lang15-Sep-14 20:25
Stefan_Lang15-Sep-14 20:25 
GeneralMessage Closed Pin
15-Sep-14 20:30
Yoldas Askan15-Sep-14 20:30 
GeneralRe: My vote of 1 Pin
Stefan_Lang15-Sep-14 21:10
Stefan_Lang15-Sep-14 21:10 
GeneralRe: My vote of 1 Pin
Yoldas Askan15-Sep-14 21:16
Yoldas Askan15-Sep-14 21:16 
GeneralRe: My vote of 1 Pin
Yoldas Askan15-Sep-14 21:27
Yoldas Askan15-Sep-14 21:27 
GeneralRe: My vote of 1 Pin
Yoldas Askan15-Sep-14 21:41
Yoldas Askan15-Sep-14 21:41 
GeneralRe: My vote of 1 Pin
Stefan_Lang16-Sep-14 0:33
Stefan_Lang16-Sep-14 0:33 
As you should be able to see in my profile I live (and work) in Switzerland, so a personal meeting wouldn't make sense for practical reasons alone.

As for sharing further contact and personal info, that is for me to decide: your spamming all of my contributions on CP with unjustified votes of 1 is abusive, and your aggressive tone indicates that at this time it is not possible to lead a meaningful discussion with you.


Yoldas Askan wrote:
I do not owe you anything

In fact you do. You did write an article and thereby abided by the rules on CP. You should therefore be able to accept criticism gracefully and not insult commentors. You therefore, at the very least, owe me an apology. I would ask for an explanation of the things I asked about as well, but since the article is marked as deleted now, that would be pointless.

Btw., you did not state whether you deleted the article yourself, or whether it was deleted. Either way, I have no idea why. I definitely did not report your article or in any way cause or request it's deletion. And in spite of the things I criticised, I do not see a reason why I or anyone should.
GOTOs are a bit like wire coat hangers: they tend to breed in the darkness, such that where there once were few, eventually there are many, and the program's architecture collapses beneath them. (Fran Poretto)

GeneralMessage Closed Pin
16-Sep-14 17:36
Yoldas Askan16-Sep-14 17:36 
GeneralRe: My vote of 1 Pin
Yoldas Askan16-Sep-14 22:19
Yoldas Askan16-Sep-14 22:19 
GeneralMy vote of 1 Pin
Yoldas Askan15-Sep-14 6:59
Yoldas Askan15-Sep-14 6:59 
GeneralMy vote of 1 Pin
Yoldas Askan15-Sep-14 6:57
Yoldas Askan15-Sep-14 6:57 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.