Floating-point error mitigation


Floating-point error mitigation is the minimization of errors caused by the fact that real numbers cannot, in general, be exactly represented in a fixed amount of space. By definition, floating-point error cannot be eliminated; at best it can only be managed.
H. M. Sierra noted the problem as early as 1956, in his patent "Floating Decimal Point Arithmetic Control Means for Calculator".
The Z1, developed by Konrad Zuse in 1936, was the first computer with floating-point arithmetic and was thus susceptible to floating-point error. Early computers, however, with operation times measured in milliseconds, were incapable of solving large, complex problems and thus were seldom plagued with floating-point error. Today, however, with supercomputer performance measured in petaflops (10^15 floating-point operations per second), floating-point error is a major concern for computational problem solvers.

Floating-point error comes in two kinds: cancellation and rounding. Rounding occurs when significant bits cannot be retained and are rounded or truncated away; cancellation occurs when two nearly equal numbers are subtracted, so that their leading digits cancel and earlier rounding errors are promoted to the most significant digits of the result. Cancellation can therefore inflate the relative error by many orders of magnitude compared with rounding alone.
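Both effects are easy to reproduce. The following Python sketch (illustrative only; the values are arbitrary) shows rounding error in a simple sum and catastrophic cancellation in a subtraction of nearly equal values:

```python
# Rounding: 0.1 and 0.2 have no exact binary representation, so the
# stored operands are already rounded and the sum inherits the error.
print(0.1 + 0.2)        # 0.30000000000000004, not 0.3

# Cancellation: subtracting nearly equal numbers cancels the leading
# digits, promoting the tiny rounding error of the operands to the
# leading digits of the result.
x = 1.0 + 1e-15         # rounds to the nearest double, 1 + 5 * 2**-52
print(x - 1.0)          # ~1.11e-15 instead of 1e-15: ~11% relative error
```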
The following sections describe the strengths and weaknesses of various means of mitigating floating-point error.

Numerical error analysis

Though not the primary focus of numerical analysis, numerical error analysis can be used to analyze and minimize floating-point rounding error. Numerical error analysis generally does not account for cancellation error.

Monte Carlo arithmetic

Error analysis by Monte Carlo arithmetic is accomplished by repeatedly injecting small errors into an algorithm's data values and determining the relative effect on the results.
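As an illustration, the sketch below perturbs only the inputs of a cancellation-prone function and uses the spread of the outputs to estimate how many digits of the result are significant; full Monte Carlo arithmetic implementations inject noise into every intermediate operation as well. The function f and the perturbation size here are arbitrary choices for the demonstration:

```python
import math
import random
import statistics

def perturb(x: float, rel: float = 2**-50) -> float:
    # Inject a small random relative error (a few ulps) into a value.
    return x * (1.0 + random.uniform(-rel, rel))

def f(x: float) -> float:
    # Cancellation-prone for large x; exact value is 1/(sqrt(x+1) + sqrt(x)).
    return math.sqrt(x + 1.0) - math.sqrt(x)

samples = [f(perturb(1.0e12)) for _ in range(1000)]
mean = statistics.mean(samples)
spread = statistics.stdev(samples)
# The relative spread of the outputs estimates how many significant
# digits survived the computation (here only a few of the ~16).
print(f"mean = {mean:.6e}, stdev = {spread:.1e}, "
      f"significant digits ~ {-math.log10(spread / abs(mean)):.1f}")
```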

Extension of precision

Extension of precision is the use of larger representations of real values than the one initially considered. The IEEE 754 standard defines precision as the number of digits available to represent real numbers. A programming language can include single precision (32 bits), double precision (64 bits), and quadruple precision (128 bits). While extension of precision makes the effects of error less likely or less important, the true accuracy of the result is still unknown.
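A sketch of the idea, using NumPy's float32 and float64 types (the specific values are illustrative):

```python
import numpy as np

big = np.float32(2.0**24)            # 16777216, the largest integer range
                                     # in which float32 is exact
print(big + np.float32(1.0) - big)   # 0.0: the added 1 is rounded away

# The same computation in double precision succeeds...
print(np.float64(big) + 1.0 - np.float64(big))   # 1.0

# ...but extending precision only postpones the problem: doubles
# lose integer exactness at 2**53.
d = 2.0**53
print(d + 1.0 - d)                   # 0.0 again
```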

Variable length arithmetic

Variable-length arithmetic represents numbers as a string of digits of variable length, limited only by the available memory. Variable-length arithmetic operations are considerably slower than fixed-length floating-point instructions. When high performance is not a requirement, but high precision is, variable-length arithmetic can prove useful, though the actual accuracy of the result may not be known.
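Python's built-in integers and its standard fractions module already provide variable-length (arbitrary-precision) arithmetic, which makes the trade-off easy to demonstrate:

```python
from fractions import Fraction

# Exact rational arithmetic: size is limited only by available memory.
tenth = Fraction(1, 10)
print(sum([tenth] * 10) == 1)     # True: no rounding ever occurs

# The equivalent double-precision sum is off by one ulp.
print(sum([0.1] * 10))            # 0.9999999999999999

# Arbitrary-precision integers are built in as well:
print(2**200)                     # exact 61-digit result, no overflow
```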

Use of the error term of a floating-point operation

The floating-point algorithm known as TwoSum or 2Sum, due to Knuth and Møller, and its simpler but restricted version FastTwoSum or Fast2Sum, allow one to obtain the error term of a floating-point addition rounded to nearest. The error term of a floating-point multiplication rounded to nearest can likewise be obtained in 2 operations with an FMA (fused multiply-add), or in 17 operations if an FMA is not available. These error terms can be used in algorithms to improve the accuracy of the final result, e.g. with floating-point expansions or compensated algorithms.
Operations giving the result of a floating-point addition or multiplication rounded to nearest with its error term have been standardized and recommended in the IEEE 754-2019 standard.
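A sketch of both addition algorithms in Python (whose float is an IEEE 754 binary64 with round-to-nearest, as the algorithms require):

```python
def two_sum(a: float, b: float) -> tuple[float, float]:
    """2Sum (Knuth/Møller): returns (s, e) where s is the rounded
    sum fl(a + b) and e its exact error term, so a + b = s + e."""
    s = a + b
    b_virtual = s - a          # the part of s that came from b
    a_virtual = s - b_virtual  # the part of s that came from a
    b_err = b - b_virtual
    a_err = a - a_virtual
    return s, a_err + b_err    # 6 floating-point operations in total

def fast_two_sum(a: float, b: float) -> tuple[float, float]:
    """Fast2Sum: same result in 3 operations, but requires |a| >= |b|."""
    s = a + b
    z = s - a
    return s, b - z

s, e = two_sum(1.0, 1e-20)
print(s, e)   # 1.0 1e-20: the addend lost to rounding is recovered exactly
```

Compensated algorithms such as compensated summation accumulate these error terms to keep long chains of additions accurate.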

Choice of a different radix

Changing the radix, in particular from binary to decimal, can help to reduce the error and better control the rounding in some applications, such as financial applications.
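Python's standard decimal module illustrates the difference (the literals are arbitrary):

```python
from decimal import Decimal

# In binary, neither 0.10 nor 0.20 is exactly representable:
print(0.10 + 0.20)                          # 0.30000000000000004

# In radix 10 the same quantities are exact, so sums of currency
# amounts round the way the application expects:
print(Decimal("0.10") + Decimal("0.20"))    # 0.30
```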

Interval arithmetic

Interval arithmetic is a technique for bounding rounding and measurement errors. Each value is represented by two floating-point numbers giving the minimum and maximum limits of an interval within which the true value is guaranteed to lie.
"Instead of using a single floating-point number as approximation for the value of a real variable in the mathematical model under investigation, interval arithmetic acknowledges limited precision by associating with the variable
a set of reals as possible values. For ease of storage and computation, these sets are restricted to intervals."
The evaluation of an interval arithmetic expression may produce a wide range of values, and may seriously overestimate the true error bounds.
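A minimal sketch of the idea in Python. A real implementation would use directed rounding; here each computed bound is simply widened outward by one ulp with math.nextafter, which is enough to guarantee enclosure under the default round-to-nearest mode. Only addition is implemented, and the Interval and around names are hypothetical:

```python
import math

class Interval:
    """A toy interval type supporting only addition."""
    def __init__(self, lo: float, hi: float):
        self.lo, self.hi = lo, hi

    def __add__(self, other: "Interval") -> "Interval":
        # Widen each computed bound outward by one ulp so that the
        # true sum is guaranteed to lie inside the result.
        return Interval(math.nextafter(self.lo + other.lo, -math.inf),
                        math.nextafter(self.hi + other.hi, math.inf))

    def __repr__(self) -> str:
        return f"[{self.lo!r}, {self.hi!r}]"

def around(x: float) -> Interval:
    # An interval one ulp either side of x, enclosing any real
    # value that rounds to the double x.
    return Interval(math.nextafter(x, -math.inf),
                    math.nextafter(x, math.inf))

tenth = around(0.1)
print(tenth + tenth + tenth)   # a narrow interval guaranteed to contain 0.3
```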

Gustafson's unums

Unums ("universal numbers") are an extension of variable-length arithmetic proposed by John Gustafson. Unums have variable-length exponent and significand fields, and error information is carried in a single bit, the ubit, representing possible error in the least significant bit of the significand.
The efficacy of unums has been questioned by William Kahan.