Review of Machine Precision

Thomas J. Kennedy

Contents:

1 A Familiar Problem

2 Estimating Machine Precision

3 Scientific Notation

In Chapter 1, we will cover four main topics:

Finite Precision
Arithmetic Error
Cancellation Error
Condition of a Problem

1 A Familiar Problem

The first topic, Finite Precision, is one you have encountered before. Consider a short C++ code snippet:


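#include <iostream>

using namespace std;

int main(int argc, char** argv)
{
    // 1/3 has no finite binary expansion, so this value is rounded
    double one_third = 1.0 / 3.0;
    double one = one_third + one_third + one_third;

    cout << one << "\n";

    return 0;
}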

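In a math course, with infinite precision, we would end up with 1 as our output. However, from previous coursework we know that `one_third` cannot be stored exactly, so the computed sum is only guaranteed to be close to 1. (With IEEE 754 doubles and the default rounding mode, the final addition happens to round to exactly 1, and `cout` prints 1 at its default precision either way; the stored `one_third` is still an approximation.)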
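One way to see the approximation in `one_third` is to print more digits than `cout` shows by default. The following is a minimal sketch; the helper name `format_fixed` is an assumption introduced here and is not part of the original snippet.

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper: format `value` with `places` digits after the
// decimal point, so that digits beyond cout's default are visible.
std::string format_fixed(double value, int places)
{
    std::ostringstream out;
    out << std::fixed << std::setprecision(places) << value;
    return out.str();
}
```

Calling `format_fixed(1.0 / 3.0, 20)` on a platform with IEEE 754 doubles should show a run of 3s that stops after roughly sixteen places, which is where the stored value departs from the true 1/3.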

Let us imagine that `one_third` and `one` store values in base-10 with user-specified precision. Consider this table:

| Variable \ Precision | 1 | 2 | 3 | 4 | 8 | $\infty$ |
|---|---|---|---|---|---|---|
| `one_third` | 0.3 | 0.33 | 0.333 | 0.3333 | 0.33333333 | 0.33333333$\ldots$ |
| `one` | 0.9 | 0.99 | 0.999 | 0.9999 | 0.99999999 | 1.0 |

This table should be reminiscent of a math course, especially if the phrase “Round to n places.” still reverberates in your mind.
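The table can be reproduced with exact integer arithmetic by counting in units of $10^{-n}$. This is a minimal sketch of the imagined base-10 machine, assuming truncation (not rounding) to $n$ places; all function names are hypothetical and introduced only for illustration.

```cpp
#include <cstdint>

// 10^digits as an integer; valid for 0 <= digits <= 18.
std::uint64_t power_of_ten(int digits)
{
    std::uint64_t scale = 1;
    for (int i = 0; i < digits; ++i) {
        scale *= 10;
    }
    return scale;
}

// one_third truncated to `digits` decimal places, in units of 10^-digits.
std::uint64_t one_third_units(int digits)
{
    return power_of_ten(digits) / 3;  // integer division drops the tail
}

// one = one_third + one_third + one_third, in the same units.
std::uint64_t one_units(int digits)
{
    const std::uint64_t third = one_third_units(digits);
    return third + third + third;
}

// How far `one` falls short of 1, in units of 10^-digits.
std::uint64_t shortfall_units(int digits)
{
    return power_of_ten(digits) - one_units(digits);
}
```

For example, `one_third_units(3)` is 333 (that is, 0.333) and `one_units(3)` is 999 (that is, 0.999). For every precision in the table, the shortfall is exactly one unit in the last place.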

2 Estimating Machine Precision

When writing code, you have used `float` and `double` in C++ and, possibly, `f32` and `f64` in Rust. There is a well-known algorithm, attributed to Cleve Moler, for estimating the machine precision of such types.

Example 1: Precision Estimation Pseudocode
let a ← 4.0 / 3.0
let b ← a - 1
let c ← b + b + b

return |1 - c|

Let us stick with our example base-10 computer (i.e., pencil and paper). Let us select a few precisions (i.e., number of mantissa digits).

| Step \ Mantissa Digits | 1 | 2 | 4 | $\ldots$ | 128 | $\ldots$ | $\infty$ |
|---|---|---|---|---|---|---|---|
| `let a ← 4.0 / 3.0` | 1.3 | 1.33 | 1.3333 | $\ldots$ | | $\ldots$ | 1.3333$\ldots$ |
| `let b ← a - 1` | 0.3 | 0.33 | 0.3333 | $\ldots$ | | $\ldots$ | 0.3333$\ldots$ |
| `let c ← b + b + b` | 0.9 | 0.99 | 0.9999 | $\ldots$ | $1 - 1\times 10^{-128}$ | $\ldots$ | 1 |
| `return \|1 - c\|` | 0.1 | 0.01 | 0.0001 | $\ldots$ | $1\times 10^{-128}$ | $\ldots$ | 0 |
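The steps of Example 1 can also be tried on real binary hardware. The following is a minimal sketch, not a definitive implementation; the function name `estimate_machine_precision` and the template parameter `Real` are assumptions introduced here.

```cpp
#include <cmath>

// Sketch of the precision estimate from Example 1.
// `Real` is expected to be a floating-point type such as float or double.
template <typename Real>
Real estimate_machine_precision()
{
    const Real a = Real(4) / Real(3);  // 4/3 is rounded to fit in Real
    const Real b = a - Real(1);        // roughly 1/3, carrying a's rounding error
    const Real c = b + b + b;          // roughly 1, off by the rounding error
    return std::abs(Real(1) - c);      // what remains is the estimate
}
```

On typical IEEE 754 hardware, `estimate_machine_precision<double>()` should agree with `std::numeric_limits<double>::epsilon()` (about 2.2e-16), and the `float` version should come out near 1.2e-7. Compiler settings that relax floating-point rules may change the result.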

Using a paper-and-pencil approach, we would not write out an infinite number of decimal places; we would use fractions. If we wanted to use fractions for this algorithm, the C++ `std::ratio` library (which performs rational arithmetic at compile time) and the Python `fractions` module are the most readily available. However, I would end up playing with the Rust `fraction` crate.
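As a rough illustration of the fraction-based approach, the same steps can be written with `std::ratio`, which keeps every value as an exact numerator and denominator. This is a sketch under the assumption that compile-time evaluation is acceptable; the alias names and the helper `exact_residual_numerator` are hypothetical.

```cpp
#include <cstdint>
#include <ratio>

// Each step of the algorithm as an exact fraction (evaluated at compile time).
using StepA = std::ratio<4, 3>;                                   // a = 4/3
using StepB = std::ratio_subtract<StepA, std::ratio<1>>;          // b = a - 1 = 1/3
using StepC = std::ratio_add<std::ratio_add<StepB, StepB>, StepB>;  // c = b + b + b
using Residual = std::ratio_subtract<std::ratio<1>, StepC>;       // 1 - c

// With exact fractions nothing is lost, so the residual is zero.
static_assert(Residual::num == 0, "exact arithmetic leaves no residual");

// Hypothetical helper so the result can also be inspected at run time.
constexpr std::intmax_t exact_residual_numerator()
{
    return Residual::num;
}
```

This matches the $\infty$ column of the table: with no rounding at any step, the estimate is 0.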

3 Scientific Notation

In this discussion, we have used base-10 scientific notation (i.e., scientific notation as used in math and science). We need not be so restrictive. We can generalize scientific notation to any base, $\beta$, where $\beta > 1$. Let us define some notation:

$$ x \times \beta^e $$

In this expression:

$x$ represents some real number.
$\beta$ represents the base (e.g., 2 or 10).
$e$ represents an integer exponent (positive, negative, or zero).

If we were working in base 10, we could have:

$$ 0.5 \times 10^{0} = 0.5 = \frac{1}{2} $$

In base 2, this would become:

$$ 1_{2} \times 2^{-1} = 0.1_{2} $$

Note the subscript $2$. When working with bases other than $10$, such subscripts specify the base.
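The base-2 form $x \times 2^e$ of a `double` can be recovered with the standard functions `std::ilogb` and `std::scalbn`. This is a minimal sketch, assuming a finite, nonzero input and a binary floating-point type (`FLT_RADIX == 2`); the struct `BaseTwoForm` and the function `to_base_two_form` are hypothetical names introduced here.

```cpp
#include <cmath>

// x * 2^e, with the significand x normalized so that 1 <= |x| < 2.
struct BaseTwoForm {
    double significand;  // x
    int exponent;        // e
};

// Assumes `value` is finite and nonzero, and that FLT_RADIX == 2.
BaseTwoForm to_base_two_form(double value)
{
    const int e = std::ilogb(value);                // unbiased exponent of value
    return BaseTwoForm{std::scalbn(value, -e), e};  // value / 2^e, computed exactly
}
```

For example, `to_base_two_form(0.5)` yields a significand of 1 and an exponent of -1, which is the $1_{2} \times 2^{-1}$ shown above.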