Review of Machine Precision

Thomas J. Kennedy

In Chapter 1, we will cover four main topics:

  1. Finite Precision
  2. Arithmetic Error
  3. Cancellation Error
  4. Condition of a Problem

1 A Familiar Problem

The first topic, Finite Precision, is one you have encountered before. Consider a short C++ code snippet:


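    #include <iostream>

    using std::cout;

    int main(int argc, char** argv)
    {
        // 1/3 has no finite binary expansion, so the stored value is rounded
        double one_third = 1.0 / 3.0;
        double one = one_third + one_third + one_third;

        cout << one << "\n";

        return 0;
    }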

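In a math course, with infinite precision, we would end up with 1 as our output. However, from previous coursework we know that one_third cannot store $\frac{1}{3}$ exactly, so the computed sum is only guaranteed to be close to 1. With IEEE 754 doubles, this particular sum happens to round back to exactly 1, but the stored value of one_third is still inexact.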
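The default stream formatting shows only six significant digits, which hides the rounding. The following is a minimal sketch for inspecting the stored value, assuming IEEE 754 doubles (the usual case on current hardware). The helper name `show_digits` is hypothetical and introduced only for illustration.

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper: format a double with 17 significant digits,
// enough to distinguish any two distinct IEEE 754 double values.
std::string show_digits(double value)
{
    std::ostringstream out;
    out << std::setprecision(17) << value;
    return out.str();
}
```

Calling `show_digits(1.0 / 3.0)` yields `0.33333333333333331`: the trailing digits show that the stored value is close to, but not exactly, one third.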

 

Let us imagine that one_third and one store values in base-10 with user-specified precision. Consider this table:

| Variable / Precision | 1 | 2 | 3 | 4 | 8 | $\infty$ |
| --- | --- | --- | --- | --- | --- | --- |
| one_third | 0.3 | 0.33 | 0.333 | 0.3333 | 0.33333333 | 0.33333333$\ldots$ |
| one | 0.9 | 0.99 | 0.999 | 0.9999 | 0.99999999 | 1.0 |
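The finite-precision columns can be reproduced with a small model of the base-10 machine. This is a sketch under stated assumptions: values are truncated (not rounded) to the chosen number of digits, and each value is held as an integer count of units of $10^{-n}$. The function names `one_third_units` and `as_decimal` are hypothetical.

```cpp
#include <cstdint>
#include <string>

// Assumption: an n-digit base-10 value in [0, 1) is modeled as an integer
// count of units of 10^-n. Valid for 1 <= digits <= 18.
std::int64_t one_third_units(int digits)
{
    std::int64_t scale = 1;
    for (int i = 0; i < digits; ++i) {
        scale *= 10;
    }
    return scale / 3;  // integer division truncates, e.g. 1000 / 3 == 333
}

// Render a unit count below 10^digits as "0.ddd", padding with leading zeros.
std::string as_decimal(std::int64_t units, int digits)
{
    std::string body = std::to_string(units);
    if (body.size() < static_cast<std::size_t>(digits)) {
        body.insert(0, digits - body.size(), '0');
    }
    return "0." + body;
}
```

For example, `as_decimal(3 * one_third_units(3), 3)` gives `0.999`, the entry for one at a precision of 3.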

This table should be reminiscent of a math course, especially if the phrase “Round to n places.” still reverberates in your mind.

2 Estimating Machine Precision

When writing code, you have used float and double in C++ and, possibly, f32 and f64 in Rust. Each of these types stores a fixed, finite number of digits. There is a well-known algorithm, attributed to Cleve Moler, for estimating machine precision.

Example 1: Precision Estimation Pseudocode
let a ← 4.0 / 3.0
let b ← a - 1
let c ← b + b + b

return |1 - c|
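Example 1 translates almost line for line into C++. The following is a minimal sketch, not a definitive implementation. It assumes every operation is rounded to the type `T` (no extended-precision intermediates and no unsafe math optimizations). The names `estimate_precision` and `matches_library_epsilon` are hypothetical.

```cpp
#include <cmath>
#include <limits>

// Sketch of the estimate from Example 1 for any floating-point type T.
template <typename T>
T estimate_precision()
{
    const T a = T(4) / T(3);
    const T b = a - T(1);
    const T c = b + b + b;

    return std::abs(T(1) - c);
}

// Compare the estimate against the value the standard library reports.
template <typename T>
bool matches_library_epsilon()
{
    return estimate_precision<T>() == std::numeric_limits<T>::epsilon();
}
```

On typical IEEE 754 hardware, `estimate_precision<double>()` is expected to agree with `std::numeric_limits<double>::epsilon()`. The second function offers a quick way to check that on a given machine.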

Let us stick with our example base-10 computer (i.e., pencil and paper) and select a few precisions (i.e., numbers of mantissa digits, counted here as digits after the decimal point).

| Step / Mantissa Digits | 1 | 2 | 4 | $\ldots$ | 128 | $\ldots$ | $\infty$ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| let a ← 4.0 / 3.0 | 1.3 | 1.33 | 1.3333 | $\ldots$ | | $\ldots$ | 1.3333$\ldots$ |
| let b ← a - 1 | 0.3 | 0.33 | 0.3333 | $\ldots$ | | $\ldots$ | 0.3333$\ldots$ |
| let c ← b + b + b | 0.9 | 0.99 | 0.9999 | $\ldots$ | $1 - 1\times 10^{-128}$ | $\ldots$ | 1 |
| return $\lvert 1 - c \rvert$ | 0.1 | 0.01 | 0.0001 | $\ldots$ | $1\times 10^{-128}$ | $\ldots$ | 0 |
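The pencil-and-paper steps can also be checked mechanically. The sketch below models the base-10 machine with scaled integers. It assumes the machine truncates and keeps `digits` digits after the decimal point. The function name `base10_error_units` is hypothetical.

```cpp
#include <cstdint>

// Assumption: values are integer counts of units of 10^-digits, and the
// machine truncates. Valid for 1 <= digits <= 18 (4 * 10^18 fits in 64 bits).
std::int64_t base10_error_units(int digits)
{
    std::int64_t one = 1;
    for (int i = 0; i < digits; ++i) {
        one *= 10;
    }

    const std::int64_t a = (4 * one) / 3;  // 4/3 truncated, e.g. 13333
    const std::int64_t b = a - one;        // e.g. 3333, i.e. 0.3333
    const std::int64_t c = b + b + b;      // e.g. 9999, i.e. 0.9999

    return one - c;  // |1 - c| in units of 10^-digits
}
```

The function returns 1 for every supported digit count. One unit of $10^{-d}$ corresponds to the 0.1, 0.01, and 0.0001 entries in the last row: the error shrinks in step with the number of digits kept.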

Using a paper-and-pencil approach, we would not write out an infinite number of decimal places; we would use fractions. If we wanted to use fractions for this algorithm, the C++ std::ratio library (which works at compile time) and the Python fractions module are the most readily available options. However, I would end up playing with the Rust fraction crate.
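As an illustration of the fraction-based approach, here is a sketch of Example 1 using std::ratio, which performs exact rational arithmetic at compile time. The alias names `A`, `B`, `C`, and `Error` are hypothetical and simply mirror the variables in the pseudocode.

```cpp
#include <ratio>

// Example 1 carried out with exact compile-time rational arithmetic.
using A = std::ratio<4, 3>;
using B = std::ratio_subtract<A, std::ratio<1>>;
using C = std::ratio_add<std::ratio_add<B, B>, B>;
using Error = std::ratio_subtract<std::ratio<1>, C>;

static_assert(B::num == 1 && B::den == 3, "b is exactly 1/3");
static_assert(C::num == 1 && C::den == 1, "c is exactly 1");
static_assert(Error::num == 0, "no error remains with exact fractions");
```

Because every intermediate value is an exact fraction, the result matches the $\infty$ column: the error is 0.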

3 Scientific Notation

In this discussion, we have used base-10 scientific notation (i.e., scientific notation as used in math and science). We need not be so restrictive. We can generalize scientific notation to any base, $\beta$, where $ \beta > 1$. Let us define some notation:

$$ x \times \beta^e $$

In this expression, $x$ is the mantissa, $\beta$ is the base, and $e$ is the exponent.

If we were working in base 10, we could have:

$$ 0.5 \times 10^{0} = 0.5 = \frac{1}{2} $$

In base-2, this would become:

$$ 1_{2} \times 2^{-1} = 0.1_{2} $$

Note the subscript $2$. When working with bases other than $10$, such subscripts specify the base.
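To experiment with other bases, the digits of a fraction can be generated by repeated multiplication by $\beta$. The following is a minimal sketch. The helper name `fraction_digits` is hypothetical. Since it works on a double, the digits are only exact for inputs that the double stores exactly, such as 0.5.

```cpp
#include <string>

// Hypothetical helper: expand a fraction 0 < x < 1 in base beta
// (2 <= beta <= 10), producing at most max_digits digits after the point.
std::string fraction_digits(double x, int beta, int max_digits)
{
    std::string digits = "0.";
    for (int i = 0; i < max_digits && x > 0.0; ++i) {
        x *= beta;                              // shift the next digit left of the point
        const int digit = static_cast<int>(x);  // the integer part is that digit
        digits += static_cast<char>('0' + digit);
        x -= digit;                             // keep only the fractional part
    }
    return digits;
}
```

For example, `fraction_digits(0.5, 2, 8)` returns `0.1`, matching $0.1_{2}$ above, while `fraction_digits(0.5, 10, 8)` returns `0.5`.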