Reliability

As we have seen in previous sections, computers control or support many aspects of our lives. Fly-by-wire aircraft, patient monitoring and care administration, financial transactions, telephone networks, and military surveillance and response are just a few examples.

The use of computers in safety-critical applications continues to grow. In the last decade, designers have used computers extensively in applications where personal injury or environmental damage could result if the computer systems do not perform as intended. The failure of a critical computer system designed to meet safety requirements can lead to significant economic loss, injury, or death. For instance, the reliability of a system that controls the protection system of a nuclear power plant, or that determines the orbit of a space shuttle, is much more critical than that of a system that, say, tracks warehouse inventory. Critical applications require highly reliable computer systems that have a known set of acceptable failure modes.

Our dependence on these computerized systems for more and more aspects of our lives is potentially damaging. For example, what could happen if the 911 system in New York City were to go down for even an hour? When an ATM is out of service, the impact on the people who rely on it can be serious. What about communication satellites? A breakdown in that system would affect cell phones, GPS, and other services. The consequences for many people could range from a minor inconvenience to loss of life.

1 Who should be accountable for failures, and to what extent?

The ubiquitous use of computers and computer-based systems means that modern society relies heavily on the safe, reliable and predictable operation of the underlying technology and of the software used to implement these systems. Computers are used in many life-critical real-time systems, for example in the automobile and airline industries, where absolute confidence in the properties of the total system has to be guaranteed. They are also used in applications that are not immediately life-threatening but could cause a catastrophe if they were to fail, for example in the financial and retail sectors. Finally, they are used in "business-critical" applications, where a system failure means the company cannot function as a business, for example at telecommunication providers. So it is important to study what can go wrong!

There are three categories of failures:

2 Why do we, as members of society, need to study failures of computer systems?

As users of these systems we should appreciate:

We hope that computers always function correctly, especially when we need them! When they are down or not functioning correctly, there are a number of possible reasons for the failure. We will look at some of these reasons, and at why it is important to study them.

Possible States of a Computer:

Categories of computer failure:

3 Key issues for the professional

The need for safe, reliable and predictable systems touches many key issues of computer science that must be addressed if the resulting application is to have the desired properties, including: