Finite Precision & Error... in General
Thomas J. Kennedy
We have examined numbers in base 2:
- determined the smallest and largest numbers that can be represented
- derived an upper bound for the absolute error $|x - x^{*}|$.
- derived an upper bound for the relative error $\frac{|x - x^{*}|}{|x|}$.
Now… we need to generalize. Why? It turns out that these analyses are also useful in base 10. Think back to significant digits, rounding, and scientific notation from your prior math and science coursework.
1 Representing Numbers… in General
In any base $\beta \ge 2$, the digits $b_i$ ($i$th mantissa digit/bit) and $s_i$ ($i$th exponent digit/bit) will fall in the range $[0, \beta-1]$. It is useful to use ordered set notation:
$b_i, s_i \in <0, 1, 2, \ldots, \beta-1>$
Consider base 2 and base 10:
- base 2 - $<0, 1>$
- base 10 - $<0, 1, 2, 3, 4, 5, 6, 7, 8, \beta-1>$ where $\beta-1 = 9$.
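The digit-set pattern can be sketched in a few lines of Python (the function name `digit_set` is illustrative, not from any library):

```python
def digit_set(beta):
    """Return the ordered digit set <0, 1, ..., beta - 1> for base beta."""
    if beta < 2:
        raise ValueError("a positional base must be at least 2")
    return list(range(beta))

print(digit_set(2))   # base 2: [0, 1]
print(digit_set(10))  # base 10: the largest digit is beta - 1 = 9
```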
1.1 Smallest Mantissa
We are interested in the smallest possible mantissa for any base, $\beta \ge 2$. Naively, one might write $0.0$. However, we are interested in the smallest non-zero mantissa. This leads to
$\frac{1}{\beta} = 0.10000000000\ldots$
The normalization constraint requires that the first digit (i.e., $b_{-1}$) after the decimal place be non-zero. For all numeric bases two or greater, the smallest digit is $1$.
1.2 Largest Mantissa
Think about the largest number in…
- Base 2 - $0.1111\ldots$
- Base 8 - $0.7777\ldots$
- Base 10 - $0.9999\ldots$
- Base 16 - $0.FFFF\ldots$
Do you notice a pattern? The largest digit in any base is always the base minus 1 (i.e., $\beta - 1$).
1.3 Representing Real Numbers
Let $x^{*}$ (read as “x-star” or “x-chop”) represent a number that is subject to finite precision (i.e., must be stored using a finite number of digits/bits).
$x^{*} = \pm \left( \sum\limits_{i=1}^t b_{-i} \beta^{-i} \right) * \beta^{e^{*}}$ where $e^{*} = \pm \sum\limits_{i=1}^s s_{i} \beta^{i-1}$
Let us break this notation down…
$$\sum\limits_{i=1}^t b_{-i} \beta^{-i}$$
represents the mantissa (i.e., the digits to the right of the decimal place). The number $t$ represents the number of mantissa digits.
$$\pm \sum\limits_{i=1}^s s_{i} \beta^{i-1} $$
represents the exponent. The number $s$ represents the number of exponent digits. Keep in mind that, as in traditional base-10 scientific notation, the sign (positive or negative) of the exponent determines whether a left or right shift occurs.
1.4 Convenient Notation
We can use a convenient notation to capture $x$ and $x^{*}$.
$$ x^{*} \in \mathbb{R}(t,s) $$
$x^{*}$ is a real number that can be represented by $t$ mantissa digits and $s$ exponent digits. $x$ is more interesting…
$$ x \in \mathbb{R}(\infty,\infty) $$
$x$ is a real number for which we have access to infinite precision (i.e., infinite digits).
2 Smallest Number and Largest Number
We know that $x^{*}$ is defined as
$x^{*} = \pm \left( \sum\limits_{i=1}^t b_{-i} \beta^{-i} \right) * \beta^{e^{*}}$ where $e^{*} = \pm \sum\limits_{i=1}^s s_{i} \beta^{i-1}$
Our goal is to determine the lower limit (smallest number) and upper limit (largest number) that can be represented. Let us find four (4) values:
- smallest possible mantissa
- largest possible mantissa
- largest possible exponent, $max(|e|)$
- smallest possible exponent, $-max(|e|)$
2.1 The Mantissa Pieces
The smallest possible non-zero mantissa is
$$ 1 \cdot \beta^{-1} = \frac{1}{\beta} $$
by the normalization constraint. That is the first piece of the puzzle. The largest possible mantissa is next. Let us start with…
$$ \sum\limits_{i=1}^t b_{-i} \beta^{-i} $$
To obtain the largest number… we need the largest possible digit in every position. That leads to $b_{-i} = (\beta - 1) \;\; \forall i$… which leads to…
$$(\beta - 1 ) \left(\sum\limits_{i=1}^t \beta^{-i}\right)$$
The sum can be evaluated using the geometric series formula for the sum of the first $k$ terms (first term $\alpha$, common ratio $r$):
$$ \frac{\alpha(1-r^k)}{(1-r)} $$
where
- $k = t$
- $r = \beta^{-1}$
- $\alpha = \beta^{-1}$
Applying the geometric series formula results in…
$$ \begin{eqnarray} (\beta - 1 ) \left(\sum\limits_{i=1}^t \beta^{-i}\right) &=& \frac{\beta^{-1}(1- \beta^{-t})(\beta - 1 )}{1 - \beta^{-1}}\\ &=& \frac{\beta^{-1}(\beta - 1)(1- \beta^{-t})}{1 - \beta^{-1}}\\ &=& \frac{(1 - \beta^{-1})(1- \beta^{-t})}{1 - \beta^{-1}}\\ &=& (1- \beta^{-t}) \end{eqnarray} $$
2.2 Mantissa Bounds
The mantissa ($f$) bounds can be written succinctly.
$$ \beta^{-1} \le f \le 1 - \beta^{-t} $$
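The upper bound can be sanity-checked with exact rational arithmetic (a sketch; `largest_mantissa` is a hypothetical helper name):

```python
from fractions import Fraction

def largest_mantissa(beta, t):
    """Sum (beta - 1) * beta**(-i) for i = 1..t, computed exactly."""
    return sum(Fraction(beta - 1, beta**i) for i in range(1, t + 1))

# The digit-by-digit sum matches the closed form 1 - beta**(-t).
for beta, t in [(2, 10), (8, 5), (10, 4), (16, 6)]:
    assert largest_mantissa(beta, t) == 1 - Fraction(1, beta**t)
```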
2.3 The Exponent Pieces
The exponent portion of a real number is defined as…
$\beta^{e^{*}}$ where $e^{*} = \pm \sum\limits_{i=1}^s s_{i} \beta^{i-1} $
We are interested in $max(|e^{*}|)$. Let us set $s_i = \beta - 1$ for all $i$ (or, using math notation… let $s_i = \beta - 1$ $\forall i$).
$$ \begin{eqnarray} |e^{*}| &=& \left|\pm \sum\limits_{i=1}^s s_{i} \beta^{i-1}\right| \\ |e^{*}| &=& \sum\limits_{i=1}^s s_{i} \beta^{i-1} \\ |e^{*}| &\le& \sum\limits_{i=1}^s (\beta - 1) \beta^{i-1} \\ |e^{*}| &\le& \frac{(\beta - 1)(1 - \beta^{s})}{1 - \beta} \\ |e^{*}| &\le& \frac{(\beta - 1)(1 - \beta^{s})}{-(\beta - 1)} \\ |e^{*}| &\le& -(1 - \beta^{s})\\ |e^{*}| &\le& \beta^{s} - 1 \\ \end{eqnarray} $$
Note how the geometric series formula was applied:
- $k = s$
- $r = \beta$
- $\alpha = \beta - 1$
- $1 - \beta = -(\beta - 1)$, which cancels the leading $(\beta - 1)$
Note the result…
$$ max(|e|) = \beta^{s} - 1 $$
This is the largest magnitude for an exponent. This leads to…
$$ min(\beta^{e^*}) = \beta^{-(\beta^{s} - 1)} $$
and
$$ max(\beta^{e^*}) = \beta^{\beta^{s} - 1} $$
as the smallest and largest possible $\beta^{e^*}$ terms, respectively.
2.4 Exponent Bounds
The exponent ($\beta^{e^*})$ bounds can be written succinctly.
$$ \beta^{-(\beta^{s} - 1)} \le \beta^{e^*} \le \beta^{\beta^{s} - 1} $$
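Setting every exponent digit to its maximum should reproduce $\beta^s - 1$; a small check (assuming place values $1, \beta, \beta^2, \ldots$ for the $s$ digits):

```python
def max_exponent(beta, s):
    """Largest exponent magnitude: every one of the s digits set to beta - 1."""
    return sum((beta - 1) * beta**i for i in range(s))

# The digit sum matches the closed form beta**s - 1.
for beta, s in [(2, 8), (10, 3), (16, 4)]:
    assert max_exponent(beta, s) == beta**s - 1
```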
2.5 Putting the Pieces Together
Combining the minimum and maximum terms from the mantissa analysis and exponent analysis leads to…
$$ \beta^{-1}\beta^{-(\beta^{s} - 1)} \le x^{*} \le (1 - \beta^{-t})\beta^{\beta^{s} - 1} $$
If we let $\beta = 2$… the result matches our previous base-2 analysis.
$$ 2^{-1}2^{-(2^{s} - 1)} \le x^{*} \le (1 - 2^{-t})2^{2^{s} - 1} $$
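Plugging small values of $t$ and $s$ into the combined bound gives concrete ranges (a sketch; the helper name `representable_range` is illustrative):

```python
def representable_range(beta, t, s):
    """Smallest and largest positive x* given t mantissa and s exponent digits."""
    max_e = beta**s - 1
    smallest = beta**(-1) * beta**(-max_e)
    largest = (1 - beta**(-t)) * beta**max_e
    return smallest, largest

lo, hi = representable_range(2, t=4, s=2)
print(lo, hi)  # 2**-1 * 2**-3 = 0.0625 and (1 - 2**-4) * 2**3 = 7.5
```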
3 Bounding Relative Error
Now… it is time to derive an upper bound for the error between $x$ and $x^*$. We know that (by definition):
$$ x^{*} = \pm \left( \sum\limits_{i=1}^t b_{-i} \beta^{-i} \right) * \beta^{e^{*}} $$
where
$$ e^{*} = \pm \sum\limits_{i=1}^s s_{i} \beta^{i-1} $$
and
$$ x = \pm \left( \sum\limits_{i=1}^{\infty} b_{-i} \beta^{-i} \right) * \beta^{e} $$
where
$$e = \pm \sum\limits_{i=1}^{\infty} s_{i} \beta^{i-1} $$
The difference comes down to finite (i.e., limited) precision, i.e., $x^* \in \mathbb{R}(t, s)$ vs $x \in \mathbb{R}(\infty, \infty)$.
3.1 Absolute Error
Relative error is defined as $\frac{|x-x^*|}{|x|}$. Let us start with the numerator… which happens to be absolute error!
$$ \left| x - x^{*} \right| = \left| \sum\limits_{i=1}^{\infty} b_{-i}\beta^{-i} * \beta^{e} - \sum\limits_{i=1}^{k} b^{*}_{-i}\beta^{-i} * \beta^{e^{*}} \right| $$
where $k$ is the number of retained mantissa digits (i.e., $k = t$).
Let…
- $b_{-i} = b_{-i}^{*}$ for $i \le k$ (truncation)
- $e = e^{*}$ (same exponent)
These two “small” observations lead to…
$$ \begin{eqnarray} \left| x - x^{*} \right| &=& \left| \sum\limits_{i=1}^{k} b_{-i}\beta^{-i} * \beta^{e} + \sum\limits_{i=k+1}^{\infty} b_{-i}\beta^{-i} * \beta^{e} - \sum\limits_{i=1}^{k} b_{-i}\beta^{-i}* \beta^{e} \right| \\ &=& \left| \sum\limits_{i=1}^{k} b_{-i}\beta^{-i} + \sum\limits_{i=k+1}^{\infty} b_{-i}\beta^{-i} - \sum\limits_{i=1}^{k} b_{-i}\beta^{-i} \right| \beta^{e} \\ &=& \left| \sum\limits_{i=k+1}^{\infty} b_{-i}\beta^{-i} \right| \beta^{e} \\ \end{eqnarray} $$
Now… we need to bound the error by using the largest possible mantissa. Letting $b_{-i} = (\beta - 1)$ for all $i$ leads to
$$ \begin{eqnarray} \left| x - x^{*} \right| &=& \left| \sum\limits_{i=k+1}^{\infty} b_{-i}\beta^{-i} \right| \beta^{e} \\ &\le& \left| \sum\limits_{i=k+1}^{\infty} (\beta - 1) \beta^{-i} \right| \beta^{e} \\ &\le& \left| (\beta - 1) \sum\limits_{i=k+1}^{\infty}\beta^{-i} \right| \beta^{e} \\ &\le& \left| (\beta - 1) \beta^{-k} \sum\limits_{i=1}^{\infty}\beta^{-i} \right| \beta^{e} \\ &\le& \left| (\beta - 1) \beta^{-k - 1} \sum\limits_{i=0}^{\infty}\beta^{-i} \right| \beta^{e} \\ \end{eqnarray} $$
Now we need to tackle the sum
$$ \sum\limits_{i=0}^{\infty}\beta^{-i} $$
We can use the geometric series formula for a convergent infinite series.
$S_{\infty} = \frac{1}{1-r}$ iff $|r| < 1$
In this case… $r = \beta^{-1}$.
$$ \begin{eqnarray} S_{\infty} &=& \frac{1}{1-r} \\ &=& \frac{1}{1 - \frac{1}{\beta}} \cdot \frac{\beta}{\beta} \\ &=& \frac{\beta}{\beta - 1} \end{eqnarray} $$
Using this result leads to…
$$ \begin{eqnarray} \left| x - x^{*} \right| &\le& \left| (\beta - 1) \beta^{-k - 1} \left(\frac{\beta}{\beta - 1}\right) \right| \beta^{e} \\ &\le& \left| \beta^{-k - 1} \beta \right| \beta^{e} \\ &\le& \left| \beta^{-k} \right| \beta^{e} \\ &\le& \beta^{-k}\beta^{e} \\ \end{eqnarray} $$
The worst case error (i.e., upper bound for error) can be written as…
$$ \left| x - x^{*} \right| \le \beta^{-k}\beta^{e} $$
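The bound can be exercised numerically by chopping a value to $k$ base-$\beta$ mantissa digits (a sketch in base 10; `chop` is a hypothetical helper, and the normalization step assumes positive $x$):

```python
import math

def chop(x, beta, k):
    """Chop positive x to k base-beta mantissa digits, normalized so f in [1/beta, 1)."""
    e = math.floor(math.log(x, beta)) + 1   # x = f * beta**e
    f = x / beta**e
    return math.floor(f * beta**k) / beta**k * beta**e, e

x = math.pi
for k in range(1, 8):
    x_star, e = chop(x, beta=10, k=k)
    assert 0 <= x - x_star <= 10**-k * 10**e   # |x - x*| <= beta**-k * beta**e
```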
3.2 Relative Error
We know that relative error is defined as
$$ \frac{|x - x^*|}{|x|} $$
From the absolute error derivation, we know that
$$ \left| x - x^{*} \right| \le \beta^{-k}\beta^{e} $$
That leads to…
$$ \frac{|x - x^*|}{|x|} \le \frac{\left|\beta^{-k}\beta^{e}\right|}{min(|x|) * \beta^{e}} $$
Notice how $|x|$ became $min(|x|)$. We are bounding the bound: as $|x|$ gets smaller, the relative error gets larger. The smallest legal non-zero mantissa is $\beta^{-1}$.
$$ \begin{eqnarray} \frac{|x - x^*|}{|x|} &\le& \frac{\left|\beta^{-k}\beta^{e}\right|}{min(|x|) * \beta^e} \\ &\le& \frac{\left|\beta^{-k}\beta^{e}\right|}{\beta^{-1}\beta^e} \\ &\le& \frac{\left|\beta^{-k}\right|}{\beta^{-1}} \\ &\le& \beta\left|\beta^{-k}\right| \\ &\le& \beta\beta^{-k} \\ &\le& \beta^{-k + 1} \\ \end{eqnarray} $$
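The final bound $\beta^{-k+1}$ can be checked across magnitudes (a sketch in base 10; the chopping helper is hypothetical and assumes positive $x$):

```python
import math

def chop_relative_error(x, beta, k):
    """Relative error from chopping positive x to k base-beta mantissa digits."""
    e = math.floor(math.log(x, beta)) + 1   # normalize: x = f * beta**e
    f = x / beta**e
    x_star = math.floor(f * beta**k) / beta**k * beta**e
    return abs(x - x_star) / abs(x)

# The bound beta**(-k + 1) holds regardless of the magnitude of x.
for x in (math.pi, 0.000123456, 987654.321):
    for k in range(1, 7):
        assert chop_relative_error(x, beta=10, k=k) <= 10**(-k + 1)
```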