Finite Precision & Error... in General
Thomas J. Kennedy
We have examined numbers in base 2:
- determined the smallest and largest numbers that can be represented
- derived an upper bound for the absolute error $|x - x^{*}|$.
- derived an upper bound for the relative error $\frac{|x - x^{*}|}{|x|}$.
Now… we need to generalize. Why? It turns out that these analyses are also useful in base 10. Think back to significant digits, rounding, and scientific notation from your prior math and science coursework.
1 Representing Numbers… in General
In any base $\beta \ge 2$, the digits $b_i$ ($i$th mantissa digit/bit) and $s_i$ ($i$th exponent digit/bit) will fall in the range $[0, \beta-1]$. It is useful to use ordered set notation:
$b_i, s_i \in <0, 1, 2, \ldots, \beta-1>$
Consider base 2 and base 10:
- base 2 - $<0, 1>$
- base 10 - $<0, 1, 2, 3, 4, 5, 6, 7, 8, \beta-1>$ where $\beta-1 = 9$.
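The digit-set pattern can be sketched in a few lines of Python (the function name `digit_set` is illustrative, not from any library):

```python
def digit_set(beta):
    """Return the ordered digit set <0, 1, ..., beta - 1> for base beta."""
    if beta < 2:
        raise ValueError("a positional base must be at least 2")
    return list(range(beta))

print(digit_set(2))   # base 2: [0, 1]
print(digit_set(10))  # base 10: the largest digit is beta - 1 = 9
```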
1.1 Smallest Mantissa
We are interested in the smallest possible mantissa for any base, $\beta \ge 2$. Naively, one might write $0.0$. However, we are interested in the smallest non-zero mantissa. This leads to
$\frac{1}{\beta} = 0.10000000000\ldots$
The normalization constraint requires that the first digit (i.e., $b_{-1}$) after the decimal place be non-zero. For all numeric bases two or greater, the smallest digit is $1$.
1.2 Largest Mantissa
Think about the largest number in…
- Base 2 - $0.1111\ldots$
- Base 8 - $0.7777\ldots$
- Base 10 - $0.9999\ldots$
- Base 16 - $0.FFFF\ldots$
Do you notice a pattern? The largest digit in any base is always the base minus 1 (i.e., $\beta - 1$).
1.3 Representing Real Numbers
Let $x^{*}$ (read as “x-star” or “x-chop”) represent a number that is subject to finite precision (i.e., must be stored using a finite number of digits/bits).
$x^{*} = \pm \left( \sum\limits_{i=1}^t b_{-i} \beta^{-i} \right) * \beta^{e^{*}}$ where $e^{*} = \pm \sum\limits_{i=1}^s s_{i} \beta^{i-1}$
Let us break this notation down…
$$\sum\limits_{i=1}^t b_{-i} \beta^{-i}$$
represents the mantissa (i.e., the digits to the right of the decimal place). The number $t$ represents the number of mantissa digits.
$$\pm \sum\limits_{i=1}^s s_{i} \beta^{i-1} $$
represents the exponent. The number $s$ represents the number of exponent digits. Keep in mind that, as in traditional base-10 scientific notation, the sign (positive or negative) of the exponent determines whether a left or right shift occurs.
1.4 Convenient Notation
We can use a convenient notation to capture $x$ and $x^{*}$.
$$ x^{*} \in \mathbb{R}(t,s) $$
$x^{*}$ is a real number that can be represented by $t$ mantissa digits and $s$ exponent digits. $x$ is more interesting…
$$ x \in \mathbb{R}(\infty,\infty) $$
$x$ is a real number for which we have access to infinite precision (i.e., infinite digits).
2 Smallest Number and Largest Number
We know that $x^{*}$ is defined as
$x^{*} = \pm \left( \sum\limits_{i=1}^t b_{-i} \beta^{-i} \right) * \beta^{e^{*}}$ where $e^{*} = \pm \sum\limits_{i=1}^s s_{i} \beta^{i-1}$
Our goal is to determine the lower limit (smallest number) and upper limit (largest number) that can be represented. Let us find four (4) values:
- smallest possible mantissa
- largest possible mantissa
- largest possible exponent, $max(|e|)$
- smallest possible exponent, $-max(|e|)$
2.1 The Mantissa Pieces
The smallest possible non-zero mantissa is
$$ 1 \cdot \beta^{-1} = \frac{1}{\beta} $$
by the normalization constraint. That is the first piece of the puzzle. The largest possible mantissa is next. Let us start with…
$$ \sum\limits_{i=1}^t b_{-i} \beta^{-i} $$
To obtain the largest number… we need the largest possible digit in every position. That leads to $b_{-i} = (\beta - 1) \;\; \forall i$… which leads to…
$$(\beta - 1 ) \left(\sum\limits_{i=1}^t \beta^{-i}\right)$$
The sum can be evaluated using the geometric series formula for the sum of the first $k$ terms (first term $\alpha$, common ratio $r$):
$$ \frac{\alpha(1-r^k)}{(1-r)} $$
where
- $k = t$
- $r = \beta^{-1}$
- $\alpha = \beta^{-1}$
Applying the geometric series formula results in…
$$ \begin{eqnarray} (\beta - 1 ) \left(\sum\limits_{i=1}^t \beta^{-i}\right) &=& \frac{\beta^{-1}(1- \beta^{-t})(\beta - 1 )}{1 - \beta^{-1}}\\ &=& \frac{\beta^{-1}(\beta - 1)(1- \beta^{-t})}{1 - \beta^{-1}}\\ &=& \frac{(1 - \beta^{-1})(1- \beta^{-t})}{1 - \beta^{-1}}\\ &=& (1- \beta^{-t}) \end{eqnarray} $$
2.2 Mantissa Bounds
The mantissa ($f$) bounds can be written succinctly.
$$ \beta^{-1} \le f \le 1 - \beta^{-t} $$
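The upper bound can be sanity-checked with exact rational arithmetic (a sketch; `largest_mantissa` is a hypothetical helper name):

```python
from fractions import Fraction

def largest_mantissa(beta, t):
    """Sum (beta - 1) * beta**(-i) for i = 1..t, computed exactly."""
    return sum(Fraction(beta - 1, beta**i) for i in range(1, t + 1))

# The digit-by-digit sum matches the closed form 1 - beta**(-t).
for beta, t in [(2, 10), (8, 5), (10, 4), (16, 6)]:
    assert largest_mantissa(beta, t) == 1 - Fraction(1, beta**t)
```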
2.3 The Exponent Pieces
The exponent portion of a real number is defined as…
$\beta^{e^{*}}$ where $e^{*} = \pm \sum\limits_{i=1}^s s_{i} \beta^{i-1} $
We are interested in $max(|e^{*}|)$. Let us set $s_i = \beta - 1$ for all $i$ (or, using math notation… let $s_i = \beta - 1$ $\forall i$).
$$ \begin{eqnarray} |e^{*}| &=& \left|\pm \sum\limits_{i=1}^s s_{i} \beta^{i-1}\right| \\ |e^{*}| &=& \sum\limits_{i=1}^s s_{i} \beta^{i-1} \\ |e^{*}| &\le& \sum\limits_{i=1}^s (\beta - 1) \beta^{i-1} \\ |e^{*}| &\le& \frac{(\beta - 1)(1 - \beta^{s})}{1 - \beta} \\ |e^{*}| &\le& \frac{(\beta - 1)(1 - \beta^{s})}{-(\beta - 1)} \\ |e^{*}| &\le& -(1 - \beta^{s})\\ |e^{*}| &\le& \beta^{s} - 1 \\ \end{eqnarray} $$
Note how the geometric series formula was applied:
- $k = s$
- $r = \beta$
- $\alpha = \beta - 1$
- $1 - \beta = -(\beta - 1)$, which cancels the leading $(\beta - 1)$
Note the result…
$$ max(|e|) = \beta^{s} - 1 $$
This is the largest magnitude for an exponent. This leads to…
$$ min(\beta^{e^*}) = \beta^{-(\beta^{s} - 1)} $$
and
$$ max(\beta^{e^*}) = \beta^{\beta^{s} - 1} $$
as the smallest and largest possible $\beta^{e^*}$ terms, respectively.
2.4 Exponent Bounds
The exponent ($\beta^{e^*})$ bounds can be written succinctly.
$$ \beta^{-(\beta^{s} - 1)} \le \beta^{e^*} \le \beta^{\beta^{s} - 1} $$
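Setting every exponent digit to its maximum should reproduce $\beta^s - 1$; a small check (assuming place values $1, \beta, \beta^2, \ldots$ for the $s$ digits):

```python
def max_exponent(beta, s):
    """Largest exponent magnitude: every one of the s digits set to beta - 1."""
    return sum((beta - 1) * beta**i for i in range(s))

# The digit sum matches the closed form beta**s - 1.
for beta, s in [(2, 8), (10, 3), (16, 4)]:
    assert max_exponent(beta, s) == beta**s - 1
```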
2.5 Putting the Pieces Together
Combining the minimum and maximum terms from the mantissa analysis and exponent analysis leads to…
$$ \beta^{-1}\beta^{-(\beta^{s} - 1)} \le x^{*} \le (1 - \beta^{-t})\beta^{\beta^{s} - 1} $$
If we let $\beta = 2$… the result matches our previous base-2 analysis.
$$ 2^{-1}2^{-(2^{s} - 1)} \le x^{*} \le (1 - 2^{-t})2^{2^{s} - 1} $$
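Plugging small values of $t$ and $s$ into the combined bound gives concrete ranges (a sketch; the helper name `representable_range` is illustrative):

```python
def representable_range(beta, t, s):
    """Smallest and largest positive x* given t mantissa and s exponent digits."""
    max_e = beta**s - 1
    smallest = beta**(-1) * beta**(-max_e)
    largest = (1 - beta**(-t)) * beta**max_e
    return smallest, largest

lo, hi = representable_range(2, t=4, s=2)
print(lo, hi)  # 2**-1 * 2**-3 = 0.0625 and (1 - 2**-4) * 2**3 = 7.5
```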
3 Bounding Relative Error
Now… it is time to derive an upper bound for the error between $x$ and $x^*$. We know that (by definition):
$$ x^{*} = \pm \left( \sum\limits_{i=1}^t b_{-i} \beta^{-i} \right) * \beta^{e^{*}} $$
where
$$ e^{*} = \pm \sum\limits_{i=1}^s s_{i} \beta^{i-1} $$
and
$$ x = \pm \left( \sum\limits_{i=1}^{\infty} b_{-i} \beta^{-i} \right) * \beta^{e} $$
where
$$e = \pm \sum\limits_{i=1}^{\infty} s_{i} \beta^{i-1} $$
The difference comes down to finite (i.e., limited) precision, i.e., $x^* \in \mathbb{R}(t, s)$ vs $x \in \mathbb{R}(\infty, \infty)$.
3.1 Absolute Error
Relative error is defined as $\frac{|x-x^*|}{|x|}$. Let us start with the numerator… which happens to be absolute error!
$$ \left| x - x^{*} \right| = \left| \sum\limits_{i=1}^{\infty} b_{-i}\beta^{-i} * \beta^{e} - \sum\limits_{i=1}^{k} b^{*}_{-i}\beta^{-i} * \beta^{e^{*}} \right| $$
where $k$ is the number of retained mantissa digits (i.e., $k = t$).
Let…
- $b_{-i} = b_{-i}^{*}$ for $i \le k$ (truncation)
- $e = e^{*}$ (same exponent)
These two “small” observations lead to…
$$ \begin{eqnarray} \left| x - x^{*} \right| &=& \left| \sum\limits_{i=1}^{k} b_{-i}\beta^{-i} * \beta^{e} + \sum\limits_{i=k+1}^{\infty} b_{-i}\beta^{-i} * \beta^{e} - \sum\limits_{i=1}^{k} b_{-i}\beta^{-i}* \beta^{e} \right| \\ &=& \left| \sum\limits_{i=1}^{k} b_{-i}\beta^{-i} + \sum\limits_{i=k+1}^{\infty} b_{-i}\beta^{-i} - \sum\limits_{i=1}^{k} b_{-i}\beta^{-i} \right| \beta^{e} \\ &=& \left| \sum\limits_{i=k+1}^{\infty} b_{-i}\beta^{-i} \right| \beta^{e} \\ \end{eqnarray} $$
Now… we need to bound the error by using the largest possible mantissa. Letting $b_{-i} = (\beta - 1)$ for all $i$ leads to
$$ \begin{eqnarray} \left| x - x^{*} \right| &=& \left| \sum\limits_{i=k+1}^{\infty} b_{-i}\beta^{-i} \right| \beta^{e} \\ &\le& \left| \sum\limits_{i=k+1}^{\infty} (\beta - 1) \beta^{-i} \right| \beta^{e} \\ &\le& \left| (\beta - 1) \sum\limits_{i=k+1}^{\infty}\beta^{-i} \right| \beta^{e} \\ &\le& \left| (\beta - 1) \beta^{-k} \sum\limits_{i=1}^{\infty}\beta^{-i} \right| \beta^{e} \\ &\le& \left| (\beta - 1) \beta^{-k - 1} \sum\limits_{i=0}^{\infty}\beta^{-i} \right| \beta^{e} \\ \end{eqnarray} $$
Now we need to tackle the sum
$$ \sum\limits_{i=0}^{\infty}\beta^{-i} $$
We can use the geometric series formula for a convergent infinite series.
$S_{\infty} = \frac{1}{1-r}$ iff $|r| < 1$
In this case… $r = \beta^{-1}$.
$$ \begin{eqnarray} S_{\infty} &=& \frac{1}{1-r} \\ &=& \frac{1}{1 - \frac{1}{\beta}} \cdot \frac{\beta}{\beta} \\ &=& \frac{\beta}{\beta - 1} \end{eqnarray} $$
Using this result leads to…
$$ \begin{eqnarray} \left| x - x^{*} \right| &\le& \left| (\beta - 1) \beta^{-k - 1} \left(\frac{\beta}{\beta - 1}\right) \right| \beta^{e} \\ &\le& \left| \beta^{-k - 1} \beta \right| \beta^{e} \\ &\le& \left| \beta^{-k} \right| \beta^{e} \\ &\le& \beta^{-k}\beta^{e} \\ \end{eqnarray} $$
The worst case error (i.e., upper bound for error) can be written as…
$$ \left| x - x^{*} \right| \le \beta^{-k}\beta^{e} $$
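The bound can be exercised numerically by chopping a value to $k$ base-$\beta$ mantissa digits (a sketch in base 10; `chop` is a hypothetical helper, and the normalization step assumes positive $x$):

```python
import math

def chop(x, beta, k):
    """Chop positive x to k base-beta mantissa digits, normalized so f in [1/beta, 1)."""
    e = math.floor(math.log(x, beta)) + 1   # x = f * beta**e
    f = x / beta**e
    return math.floor(f * beta**k) / beta**k * beta**e, e

x = math.pi
for k in range(1, 8):
    x_star, e = chop(x, beta=10, k=k)
    assert 0 <= x - x_star <= 10**-k * 10**e   # |x - x*| <= beta**-k * beta**e
```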
3.2 Relative Error
We know that relative error is defined as
$$ \frac{|x - x^*|}{|x|} $$
From the absolute error derivation, we know that
$$ \left| x - x^{*} \right| \le \beta^{-k}\beta^{e} $$
That leads to…
$$ \frac{|x - x^*|}{|x|} \le \frac{\left|\beta^{-k}\beta^{e}\right|}{min(|x|) * \beta^{e}} $$
Notice how $|x|$ became $min(|x|)$. We are bounding the bound: as $|x|$ gets smaller, the relative error gets larger. The smallest legal non-zero mantissa is $\beta^{-1}$.
$$ \begin{eqnarray} \frac{|x - x^*|}{|x|} &\le& \frac{\left|\beta^{-k}\beta^{e}\right|}{min(|x|) * \beta^e} \\ &\le& \frac{\left|\beta^{-k}\beta^{e}\right|}{\beta^{-1}\beta^e} \\ &\le& \frac{\left|\beta^{-k}\right|}{\beta^{-1}} \\ &\le& \beta\left|\beta^{-k}\right| \\ &\le& \beta\beta^{-k} \\ &\le& \beta^{-k + 1} \\ \end{eqnarray} $$
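The final bound $\beta^{-k+1}$ can be checked across magnitudes (a sketch in base 10; the chopping helper is hypothetical and assumes positive $x$):

```python
import math

def chop_relative_error(x, beta, k):
    """Relative error from chopping positive x to k base-beta mantissa digits."""
    e = math.floor(math.log(x, beta)) + 1   # normalize: x = f * beta**e
    f = x / beta**e
    x_star = math.floor(f * beta**k) / beta**k * beta**e
    return abs(x - x_star) / abs(x)

# The bound beta**(-k + 1) holds regardless of the magnitude of x.
for x in (math.pi, 0.000123456, 987654.321):
    for k in range(1, 7):
        assert chop_relative_error(x, beta=10, k=k) <= 10**(-k + 1)
```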