Floating Point Arithmetic#
In this chapter, we will introduce some basics on the real number system for modern computers and discuss the arithmetic operations of the number system.
Representation of Real Numbers#
Any nonzero real number \(x\in \mathbb{R}\) can be accurately represented with an infinite sequence of digits. This can be understood as the consequence that rational numbers are dense on any interval. Therefore, with the binary representation, we can write
where \(e\) is an integer exponent and \(d_1=1\), the other binary digits \(d_i\in \{0, 1\}\). The mantissa part
Note
In order to guarantee the uniqueness of the above representation, we need further assumption that there exists an infinite subset \(S\subset \mathbb{N}\) that \(d_j\neq 1\) for \(j\in S\). For example, under binary representation
then we will take the latter representation.
Floating Point Numbers#
The floating point numbers generally refer to a set of real numbers with finite mantissa length. More precisely, we consider the set of real numbers \(\mathbb{F} = \mathbb{F}(t, e_{\min}, e_{\max})\subset \mathbb{R}\) that
It can be seen that there are only finite numbers in \(\mathbb{F}\) with the smallest positive element \(x_{\min} = 2^{e_{\min}-1}\) and the largest element \(x_{\max} = ( 1- 2^{-t} )\times 2^{e_{\max} }\). Therefore
Note
The elements in \(\mathbb{F}\) are called normalized. If we allow \(d_1 = 0\) in the definition of \(\mathbb{F}\), then the numbers in the set are called denormalized.
Theorem 1 (Distribution of Floating Numbers)
For any \(e_{\min} \le e\le e_{\max}\), the distribution of the floating point number system \(\mathbb{F}\) on interval \([2^{e-1}, 2^e]\) is equidistant with distances of length \(2^{e-t}\).
Proof. For any \(x \in \mathbb{F}\cap [2^{e-1}, 2^e]\), it can be represented by
where \(d_1 = 1\). The mantissa is equidistantly distributed with distance \(2^{-t}\), therefore the floating point numbers are equidistantly distributed with distances of length \(2^{e-t}\).
To understand the approximation to real numbers by the floating point number system \(\mathbb{F}\), it is important to consider the maximal relative distance between the numbers in \(\overline{\mathbb{F}}\) and their respective closest element in \(\mathbb{F}\), which is the following quantity:
The following holds:
Theorem 2 (Machine Precision)
The number \(\mathrm{u} := 2^{-t}\) is also called rounding unit or machine precision.
Proof. Without loss of generality, we only need to consider the positive numbers in \(\overline{\mathbb{F}}\), then one can represent any nonzero \(x\in [x_{\min}, x_{\max}]\) by
Since the floating point numbers are equidistantly distributed on \([2^{e-1}, 2^e]\) from Theorem 1, one can find \(z^{\ast}\in\mathbb{F}\) such that
therefore
Note
On modern computers, the following two floating point number systems
are supported, they are often called single precision and double precision, respectively.
Rounding#
The rounding operation \(\textrm{fl}\) is to map any real numbers of \(\overline{\mathbb{F}}\) into the floating point number system \(\mathbb{F}\) with smallest error. Such rounding operation can be written out explicitly, let \(x = \pm (0.d_1 d_2\dots d_t d_{t+1}\dots )\times 2^e\), then
It is clear that rounding \(\textrm{fl}\) is monotone and idempotent, which means
\(x\le y \Rightarrow \textrm{fl}(x) \le \textrm{fl}(y)\).
\(\textrm{fl}(z) = z\) if \(z\in \mathbb{F}\).
Theorem 3
For any \(x\in \overline{\mathbb{F}}\), \(|\textrm{fl}(x) - x| = \min_{z\in \mathbb{F}} |z - x|\). If \(x\neq 0\), then
Proof. The special case that \(x = 0\) is trivial, we only consider \(x\in [x_{\min}, x_{\max}]\), it can be seen that
where \(\tilde{d}_t\) is the rounding bit, therefore
Corollary 1
For any \(x\in\overline{\mathbb{F}}\), \(\textrm{fl}(x) = x(1+\delta)\) with \(|\delta|\le \mathrm{u}\).
Arithmetic Operations#
Let \(\mathbb{F} = \mathbb{F}(t, e_{\min}, e_{\max})\) be a given floating point number system and we consider the basic binary operation \(\circ\in \{ +, -, *, /\}\) on \(\mathbb{F}\), in order to represent the outcome in \(\mathbb{F}\), a straightforward realization is to define the binary operation \(\boxed{\circ}\) as the following (for the case of division, we assume \(y\neq 0\)):
then for any \(x, y\in\mathbb{F}\), if \(x\circ y\in \overline{\mathbb{F}}\), then \(x \,\boxed{\circ}\, y = (x\circ y)(1 + \delta)\) with \(|\delta| \le \mathrm{u}\) from the Corollary 1.
Remark 1 (Catastrophic Cancellation)
If \(x, y\in\mathbb{R}\), then the relative error from the following binary operation \(\textrm{fl}(x)\,\boxed{+}\,\textrm{fl}(y)\) can be estimated by
When \(x\) and \(y\) are close in magnitude but with opposite signs, the cancellation error will be significant. Sometimes the catastrophic cancellation can be avoided in computation.
Example 1
The subtraction of two square numbers \(x^2 - y^2\), where \(x\approx y\in \mathbb{F}\), since
it will possibly lead to catastrophic cancellation when there is a rounding error in calculating \(x^2\) or \(y^2\). However, if we write
then the catastrophic cancellation can be avoided. Because
where \(|\delta|\le \mathrm{u}\). Therefore the relative error is bounded by \(3\mathrm{u}\).
Error Accumulation: Multiplication#
For complicated computations on modern computers, the errors from arithmetic operations will accumulate towards the final result (we do not consider techniques such as fused multiply-add (FMA) here). To quantify the accumulation effect, we will need the following lemma.
Lemma 1
For real numbers \(a_1, a_2, \dots, a_n\) with \(|a_k|\le \delta\) for \(k=1,\dots, n\), then for \(n\delta < 1\), the following holds
where \(|b_n| \le \frac{n\delta}{1 - n\delta}\).
Proof. The proof is quite easy with induction. When \(n=1\), \(|b_1| = |a_1|\le \delta \le \frac{\delta}{1 - \delta}\). Suppose the claim holds for \(n = m\), then for \(n = m +1\), we could see that
which implies that \(b_{m+1} = b_m + a_{m+1} + a_{m+1}b_m\), with the given bounds on \(a_{m+1}\) and \(b_m\), we can estimate
In the following, we consider the naive floating point product \(P_n\) of \(n\) real numbers \(\{x_j\}_{j=1}^n \subset \mathbb{R}\) with assumption that \((2n-1)\textrm{u} < 1\) by the following iteration
Let \(\fl{x_k} = x_k(1 + \tau_k)\), then \(|\tau_k|\le \textrm{u}\). From the \(n\)-th iteration step
such that \(|\delta_n|\le \textrm{u}\). Since \(P_{n-1}\in \mathbb{F}\), \(\fl{P_{n-1}} = P_{n-1}\), then
where \(|\eta_n|\le \frac{(2n-1)\textrm{u}}{1 - (2n-1)\textrm{u}}\).
Error Accumulation: Addition#
For the naive floating point summation \(S_n\) of \(n\) real numbers \(\{x_j \}_{j=1}^n\) by the iteration
we can carry out a similar analysis. Let \(S^{\ast}_j = \sum_{k=1}^j x_k\) and \(\fl{x_k} = x_k ( 1 + \tau_k)\) for \(|\tau_k|\le \textrm{u}\), denote \(\Delta S_j = S_j^{\ast} - S_j\), then
where \(|\delta_j|\le \textrm{u}\). Therefore
Using the previous lemma will provide an estimate of \( ((1 + \textrm{u})^j - 1)\) as long as \(j\textrm{u} < 1\).
Exercises#
Assume \(n\in \mathbb{N}\) and \(n\textrm{u} < 1\), let \(\{x_j\}_{j=1}^n\) be a sequence of real numbers, \(\textrm{fl}:\overline{\mathbb{F}}\mapsto \mathbb{F}\) is the rounding operation, \(\textrm{u}\) is the machine epsilon. In the following, we briefly discuss the rounding error for floating point summation (sum reduction).
Definition 1
A reduction \(\Pi\) of the floating point summation
is an evaluation order for the \(~\boxed{+}~\) operations. For example, \(n = 4\), then the reduction \(\Pi = (3,1,2)\) is corresponding to the following calculation
where \(y_i\) denotes the result of \(i\)-th \(\boxed{+}\) operation in the reduction. Obviously the final summation will be \(y_{n-1}\). We denote \(T_n(\Pi)\) be the result of floating point summation with reduction order \(\Pi\).
Theoretical Part#
Problem 1
Prove that the native summation has following error bound
where \(S_n = \sum_{j=1}^n x_j\) and \(\Pi = (1,2,\dots, (n-1))\).
Problem 2
Prove that
where \(H = \lceil\log_2 n \rceil\) and \(T_n(\Pi)\) is the floating point summation with reduction order \(\Pi\).
Problem 3 (Horner’s scheme)
The evaluation of polynomial
is mostly using Horner’s scheme, which writes the polynomial in ‘nested’ form:
Please find an upper bound of the rounding error for this scheme.
Computational Part#
Problem 4 (Archimedes’ formula for \(\pi\))
Archimedes’ formula for \(\pi\) is given by calculating the perimeters of regular polygons inscribing and circumscribing a circle of unit diameter. Starting from hexagon, \(P_0 = \frac{1}{\sqrt{3}}\). The iterative formula can be written in two equivalent forms:
and
where \(P_n\) is the length of each side of the regular polygon with \(6 \times 2^n\) sides. The approximation of \(\pi\) is computed by \(\lim_{n\to\infty} 6\times 2^n\times P_n\).
Implement both iterative formula and compare the results with the exact value of \(\pi\) at different \(n\). Explain the difference.
Problem 5 (Pairwise summation)
Based on theoretical part, implement an algorithm for the summation \(\sum_{j=1}^n x_j\) which has \(\mathcal{O}(\textrm{u}\log_2 n)\) rounding error.
Problem 6 (Kahan compensated summation)
Suppose \(a, b\in\mathbb{R}\), the rounding error for the sum \(s = \fl{\fl{a}+\fl{b}}\) (\(a\ge b\)) can be computed using
Based on this property, one can keep tracking of the rounding error.
Algorithm 1 (Kahan Compensated Summation)
Inputs \(\{x_j\}_{j=1}^n\subset \mathbb{R}\)
Outputs \(s_n = \sum_{j=1}^n x_j\)
\(j\gets 1\), \(e_j \gets 0\), \(s_j \gets x_j\) // initialization;
While \(j < n\):
\(j\gets j + 1\)
\(y_j = x_j - e_{j-1}\) //remove compensated error;
\(s_j = s_{j-1} + y_j\) //perform summation;
\(e_j = (s_j - s_{j-1}) - y_j\) //restore the rounding error;
Implement the Algorithm 1 described above and compare the accuracy with naive summation and pairwise summation with test cases.
Problem 7
Suppose the inputs \(\{x_j\}_{j=1}^n\subset \mathbb{R}\) are randomly distributed (say \(x_j\sim U(0,1)\) i.i.d), what is growth of the expected rounding error with respect to total number of inputs \(n\) for the naive summation and pairwise summation? Please provide an explanation of your result. You can use Kahan sum as the accurate result approximately.
Extended Reading#
See [Higham, 1993, Higham and Mary, 2019, Muller, 2006].
Nicholas J Higham. The accuracy of floating point summation. SIAM Journal on Scientific Computing, 14(4):783–799, 1993.
Nicholas J Higham and Theo Mary. A new approach to probabilistic rounding error analysis. SIAM Journal on Scientific Computing, 41(5):A2815–A2835, 2019.
Jean-Michel Muller. Elementary functions. Springer, 2006.