Reliability
As we have studied in previous sections, computers now control or support many aspects of our lives. Fly-by-wire aircraft, patient monitoring and care administration, financial transactions, telephone networks, and military surveillance and response are just a few examples.
The use of computers in safety-critical applications continues to grow. In the last decade, designers have used computers extensively in applications where personal injury or environmental damage could result if the computer systems do not perform as intended. The failure of a critical computer system designed to meet safety requirements can lead to significant economic loss, injury, or death. For instance, the reliability of a nuclear power plant protection system, or of a system that determines the orbit of a space shuttle, is much more critical than that of a system that, say, tracks warehouse inventory. Critical applications require highly reliable computer systems that have a known set of acceptable failure modes.
Our dependence on these computerized systems for more and more aspects of our lives is potentially damaging. For example, what could happen if the 911 system in the city of New York were to go down for even an hour? When an ATM is out of service, people who depend on it may be left without access to cash. And what about communication satellites? A breakdown in these systems would affect cell phones, GPS, and other services. The impact on the people affected could range from a minor inconvenience to loss of life.
1 Who should be accountable for failures, and to what extent?
The ubiquitous use of computers and computer-based systems means that modern society is heavily reliant upon the safe, reliable and predictable operation of the underlying technology and the software used to implement these systems. Computer systems are used in many life-critical real-time systems, for example in the automobile and airline industries, where the properties of the total system must be established with very high confidence. Computer systems are also used in applications that are not immediately life-threatening but could cause a catastrophe if they were to fail, for example in the financial and retail sectors. Finally, computer systems are used in "business-critical" applications, where a failure means the company cannot function as a business, for example at telecommunication providers. So, it is important to study what can go wrong!
There are three categories of failures:
- Problems for individuals
- System failures that affect large numbers of people or cause large financial losses
- Problems in safety-critical applications
2 Why do we, as members of society, need to study failures of computer systems?
As users of these systems we should appreciate:
- The limitations of computers and the associated hardware/software
- The need for proper training of all levels of users
- The need for responsible use
- The difference between good products and poor ones
We hope that computers are always functioning correctly, especially when we need them! When they are down or not functioning correctly, there are a number of possible reasons for the failure. We will look at some of these reasons, and at why it is important to study them.
Possible States of a Computer:
- Functioning correctly
- Functioning incorrectly
- Down
- Intentionally turned off
Categories of computer failure:
- Faulty design of software or hardware
- Sloppy implementation of designed solution
- Careless or insufficiently trained users
- Poor user interfaces
- Hardware/software malfunctions
- Errors in the specification documents
- Scope/application inconsistency
3 Key issues for the professional
The need for safe, reliable and predictable systems touches many key issues of computer science. If the resulting application is to have the desired properties, the professional must consider:
- How to construct hardware installations about which we can reason and prove operational properties.
- How to construct software systems about which we can reason and prove operational properties.
- How to demonstrate, beyond reasonable doubt, that the composition of several such hardware and software systems maintains these properties.
- How to determine the precise requirements of an application, both at the time of its inception and during subsequent enhancement, in terms that both the user and implementer can understand.
- How to construct systems that, even when operating outside their design envelope, still behave in a way that does not surprise the user.
- How to construct systems that are able to monitor themselves, log failures, and repair themselves automatically after a failure.
- How to construct systems that are physically safe with respect to external users of that system.
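As a small illustration of the "repair, log and monitor" item above, the following Python sketch wraps an unreliable operation so that each failure is logged, the operation is retried, and a predefined safe value is returned if every attempt fails. The names `run_with_recovery` and `make_flaky_sensor`, the retry count, and the simulated sensor reading are all hypothetical and chosen only for illustration; a real safety-critical system would need far more than this (for example, independent monitoring and a verified safe state).

```python
import logging

logger = logging.getLogger("reliability_sketch")


def run_with_recovery(operation, safe_fallback, max_attempts=3):
    """Run `operation`; log each failure, retry, then fall back to a safe value."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as exc:
            # Record the failure so that it can be studied afterwards.
            logger.warning("attempt %d of %d failed: %s", attempt, max_attempts, exc)
    # Every attempt failed: return a known, predictable value instead of crashing.
    logger.error("all %d attempts failed; using safe fallback", max_attempts)
    return safe_fallback


def make_flaky_sensor(failures_before_success):
    """Hypothetical sensor that fails a fixed number of times, then returns 21.5."""
    state = {"calls": 0}

    def read():
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise IOError("sensor timeout")
        return 21.5

    return read


if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)
    print(run_with_recovery(make_flaky_sensor(2), safe_fallback=None))
    print(run_with_recovery(make_flaky_sensor(5), safe_fallback=None))
```

With the default of three attempts, the first call recovers on its third try, while the second exhausts its attempts and returns the fallback. Returning an explicit fallback rather than raising is one possible design choice; it keeps the caller's behaviour predictable, at the cost of requiring the caller to check for the fallback value.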