Reliability

Contents:

1 Who should be accountable for failures, and to what extent?

2 Why do we, as members of society, need to study failures of computer systems?

3 Key issues for the professional

As we have seen in previous sections, computers control or support many aspects of our lives. Fly-by-wire aircraft, patient monitoring and care administration, financial transactions, telephone networks, and military surveillance and response are just a few examples.

The use of computers in safety-critical applications continues to grow. In the last decade, designers have used computers extensively in applications where personal injury or environmental damage could result if the computer systems do not perform as intended. The failure of a critical computer system designed to meet safety requirements can lead to significant economic loss, injury, or death. For instance, the reliability of a system that protects a nuclear power plant, or that determines the orbit of a space shuttle, is much more critical than that of a system that, say, tracks warehouse inventory. Critical applications require highly reliable computer systems that have a known set of acceptable failure modes.

Our dependence on these computerized systems for more and more aspects of our lives is potentially damaging. For example, what could happen if the 911 system in New York City went down for even an hour? A broken ATM can have a serious impact on people who cannot reach their money when they need it. What about communication and navigation satellites? A breakdown in these systems would affect cell phones, GPS receivers and more. The impact on people could range from a minor inconvenience to loss of life.

1 Who should be accountable for failures, and to what extent?

The ubiquitous use of computers and computer-based systems means that modern society relies heavily on the safe, reliable and predictable operation of the underlying technology and of the software used to implement these systems. Computer systems are used in many real-time systems on which lives depend, where absolute confidence in the properties of the total system has to be guaranteed, for example in the automobile and airline industries. They are also used in applications that are not immediately life-threatening but could cause a catastrophe if they were to fail, for example in the financial and retail sectors. Finally, they are used in "business-critical" applications, where the company cannot function as a business if the system fails, for example at telecommunication providers. So, it is important to study what can go wrong!

There are three categories of failures:

Problems for individuals
System failures that affect large numbers of people or cause large financial losses
Problems in safety-critical applications

2 Why do we, as members of society, need to study failures of computer systems?

As users of these systems we should appreciate:

The limitations of computers and the associated hardware/software
The need for proper training of all levels of users
The need for responsible use
The difference between good products and poor ones

We hope that computers are always functioning correctly, especially when we need them! When they are down, or are not functioning correctly, there are a number of possible reasons for the failure. We will look at some of these reasons, and at why it is important to study them.

Possible states of a computer:

Functioning correctly
Functioning incorrectly
Down
Intentionally turned off
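
As a rough illustration, the four states listed above can be modeled as an enumeration. The sketch below is in Python; the names SystemState and classify, and the idea of a single health probe with a known expected answer, are assumptions made for this example and do not describe any particular real system.

```python
from enum import Enum
from typing import Optional


class SystemState(Enum):
    """The four possible states of a computer listed above."""
    FUNCTIONING_CORRECTLY = "functioning correctly"
    FUNCTIONING_INCORRECTLY = "functioning incorrectly"
    DOWN = "down"
    INTENTIONALLY_OFF = "intentionally turned off"


def classify(switched_off: bool, responded: bool,
             output: Optional[int], expected: int) -> SystemState:
    """Classify a system from one hypothetical health probe.

    switched_off: an operator deliberately turned the system off
    responded:    the system answered the probe at all
    output:       the answer it gave (None if it did not respond)
    expected:     the answer a correct system would give
    """
    if switched_off:
        return SystemState.INTENTIONALLY_OFF
    if not responded:
        return SystemState.DOWN
    if output != expected:
        return SystemState.FUNCTIONING_INCORRECTLY
    return SystemState.FUNCTIONING_CORRECTLY


if __name__ == "__main__":
    # A system that answers the probe, but with the wrong value.
    print(classify(False, True, 5, 4).value)
```

Only the third check compares the output against a known answer. A system that is functioning incorrectly still responds, which is why this state can be harder to detect than an outright outage.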

Categories of computer failure:

Faulty design of software or hardware
Sloppy implementation of the designed solution
Careless or insufficiently trained users
Poor user interfaces
Hardware/software malfunctions
Errors in the specification documents
Scope/application inconsistency

3 Key issues for the professional

The need for safe, reliable and predictable systems touches on many key issues in computer science. If the resulting application is to have the desired properties, the professional must address issues including:

How to construct hardware installations about which we can reason and prove operational properties.
How to construct software systems about which we can reason and prove operational properties.
How to demonstrate, beyond reasonable doubt, that the composition of several such hardware and software systems maintains these properties.
How to determine the precise requirements of an application, both at the time of its inception and during subsequent enhancement, in terms that both the user and implementer can understand.
How to construct systems that, even when operating outside their design envelope, still work in a way that does not surprise the user.
How to construct systems that are able to monitor themselves, log failures, and repair themselves automatically after a failure.
How to construct systems that are physically safe with respect to external users of that system.
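
Two of the issues above, self-monitoring with logging and predictable behavior after a failure, can be sketched in a few lines of code. The example below is a minimal Python sketch under stated assumptions, not a definitive design: the function names run_supervised and make_flaky_sensor, the retry count, and the notion of a single "safe default" value are all hypothetical choices made for illustration. What counts as a safe value depends entirely on the application.

```python
import logging
from typing import Callable

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("supervisor")


def run_supervised(task: Callable[[], float], safe_default: float,
                   max_attempts: int = 3) -> float:
    """Run task, logging every failure and retrying up to max_attempts times.

    If every attempt fails, return safe_default so that the caller
    always receives a predictable value instead of an unhandled error.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception as exc:
            log.warning("attempt %d of %d failed: %s",
                        attempt, max_attempts, exc)
    log.error("all attempts failed; returning safe default %s", safe_default)
    return safe_default


def make_flaky_sensor(failures: int) -> Callable[[], float]:
    """Hypothetical sensor that fails a fixed number of times, then works."""
    state = {"remaining": failures}

    def read() -> float:
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise TimeoutError("sensor timeout")
        return 21.5

    return read


if __name__ == "__main__":
    # Recovers on the third attempt.
    print(run_supervised(make_flaky_sensor(2), safe_default=0.0))
    # Never recovers within three attempts, so the safe default is used.
    print(run_supervised(make_flaky_sensor(5), safe_default=0.0))
```

The design choice worth noticing is that the failure path is as explicit as the success path: every failed attempt is logged, and the value returned after repeated failure is chosen in advance rather than left to chance.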