Fall 2000: CS 771/871 Operating Systems



Failure Recovery (Chapter 12)

An error is a manifestation of a fault that could lead to a system failure.

One strategy for recovery is to repair the erroneous state BEFORE it causes a system failure. (fault recovery)


Classification of Failures


Backward/Forward Error Recovery


Backward error recovery: Basic approaches


Recovery in concurrent systems

Failure of a process after an information exchange will require recovery actions by its communication partners.

SCENARIO: Process A sends a message to Process B, then fails and rolls back to a state before the message was sent.
INCONSISTENT global state: Process A does not know the message has been sent, but Process B has received the message.

The system must not get into an inconsistent global state, so non-failed processes may need to roll back as well.
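The scenario above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: each process state is modeled as the set of message ids it remembers sending and receiving, and the names ProcessState and orphan_messages are hypothetical, not from the text. A global state is treated as inconsistent if some message is recorded as received but not recorded as sent.

```python
from dataclasses import dataclass, field


@dataclass
class ProcessState:
    """What one process's current state records about message traffic."""
    name: str
    sent: set = field(default_factory=set)      # ids this state remembers sending
    received: set = field(default_factory=set)  # ids this state remembers receiving


def orphan_messages(states):
    """Return ids of messages recorded as received but not recorded as sent."""
    all_sent = set().union(*(s.sent for s in states))
    all_received = set().union(*(s.received for s in states))
    return all_received - all_sent


# Before the failure: A has sent m1 and B has received it.
a = ProcessState("A", sent={"m1"})
b = ProcessState("B", received={"m1"})
before = orphan_messages([a, b])

# A fails and rolls back to a state taken before the send.
a_rolled_back = ProcessState("A")
after = orphan_messages([a_rolled_back, b])

# Rolling B back to a state taken before the receive restores consistency.
b_rolled_back = ProcessState("B")
repaired = orphan_messages([a_rolled_back, b_rolled_back])
```

Here after contains m1 (B remembers a message nobody sent), while before and repaired are empty, which is why the non-failed process B may have to roll back too.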


Consistent Set of Checkpoints

Consistent global state problem.

Example (as powerpoint)


Consistent Set Algorithms


Atomic Actions

AKA transactions.
All or Nothing semantics

Stable storage: usually implemented as disk mirroring (RAID, Redundant Array of Independent Disks).
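A minimal sketch of the mirroring idea, assuming two in-memory dictionaries stand in for the two disks and a CRC32 checksum is used to detect a damaged copy. The class MirroredStore and its methods are hypothetical names chosen for illustration; a real RAID implementation works at the device level and is considerably more involved.

```python
import zlib


class MirroredStore:
    """Two in-memory 'disks'; each block is stored as (data, checksum) on both."""

    def __init__(self):
        self.disks = [{}, {}]

    def write(self, block_id, data):
        # Write the copies one after the other, never both at once,
        # so a crash in mid-write can damage at most one copy.
        for disk in self.disks:
            disk[block_id] = (data, zlib.crc32(data))

    def read(self, block_id):
        # Return the first copy whose checksum verifies.
        for disk in self.disks:
            entry = disk.get(block_id)
            if entry is not None and zlib.crc32(entry[0]) == entry[1]:
                return entry[0]
        raise IOError(f"no valid copy of block {block_id!r}")

    def recover(self, block_id):
        # After a crash, rewrite the block from a valid copy so both disks agree.
        self.write(block_id, self.read(block_id))


store = MirroredStore()
store.write("b0", b"balance=100")

# Simulate a damaged copy on the first disk: the stored checksum cannot match.
store.disks[0]["b0"] = (b"garbage", zlib.crc32(b"garbage") ^ 1)

value = store.read("b0")   # falls back to the copy on the second disk
store.recover("b0")        # both disks hold the good copy again
```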

AA Primitives

 


Properties of Transactions

Need to preserve even in face of nested transactions.
Permanence only applies to top level.


Private Workspace

Conceptually, each transaction has its own copy of the shared objects.
Does this accomplish the above requirements?

Making a real copy is expensive.
Question: Do we need to copy read-only objects?

A write copies only the blocks actually changed (called shadow blocks).

Can use Unix's i-node concept to mix private and public blocks.

Commit has to update the public space atomically.
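The private workspace idea can be sketched as follows. This is an illustration only: a Python dict stands in for the i-node's block index, and the names BlockStore and Transaction are hypothetical. The single assignment in commit stands in for the atomic update of the public space; a real file system would have to make that step atomic on disk.

```python
class BlockStore:
    """Public storage: an index (like an i-node) from block number to contents."""

    def __init__(self, blocks):
        self.index = dict(blocks)


class Transaction:
    def __init__(self, store):
        self.store = store
        self.shadow = {}  # private copies of the blocks this transaction wrote

    def read(self, n):
        # Reads use the shadow block if one exists, otherwise the public block,
        # so blocks that are only read are never copied.
        if n in self.shadow:
            return self.shadow[n]
        return self.store.index[n]

    def write(self, n, data):
        self.shadow[n] = data  # the public block is left untouched

    def commit(self):
        # Build the new index off to the side, then install it in one step.
        new_index = {**self.store.index, **self.shadow}
        self.store.index = new_index
        self.shadow = {}

    def abort(self):
        self.shadow = {}  # discarding the shadow blocks is the whole undo


store = BlockStore({0: "a", 1: "b"})
t = Transaction(store)
t.write(1, "B")
seen_inside = t.read(1)         # the transaction sees its own write
seen_outside = store.index[1]   # public storage is unchanged until commit
t.commit()
```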


Writeahead Log

Changes are made directly to public storage, but a log of the old values is kept. Each log record is written before the corresponding change is applied.

The log can be used to roll back or roll forward (after a crash).
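A minimal sketch of a write-ahead log, under several assumptions: the log is an in-memory list that is simply assumed to survive a crash, each record holds both the old and the new value (the new value is what a roll-forward would need), and a value of None is taken to mean "key did not exist". The class WalStore is a hypothetical name. Only rollback of uncommitted transactions is shown.

```python
class WalStore:
    def __init__(self, data):
        self.data = dict(data)  # public storage, updated in place
        self.log = []           # append-only log, assumed to survive a crash

    def write(self, txn, key, new):
        # Write-ahead rule: the log record goes in before the data is changed.
        self.log.append(("update", txn, key, self.data.get(key), new))
        self.data[key] = new

    def commit(self, txn):
        self.log.append(("commit", txn))

    def recover(self):
        """Roll back every update whose transaction has no commit record."""
        committed = {rec[1] for rec in self.log if rec[0] == "commit"}
        # Undo in reverse order so the oldest logged value is restored last.
        for rec in reversed(self.log):
            if rec[0] == "update" and rec[1] not in committed:
                _, _, key, old, _ = rec
                if old is None:
                    self.data.pop(key, None)
                else:
                    self.data[key] = old


db = WalStore({"x": 1, "y": 2})
db.write("t1", "x", 10)
db.commit("t1")
db.write("t2", "y", 20)  # crash happens here, before t2 commits
db.recover()             # t1's change survives, t2's change is undone
```

The sketch ignores concurrent access: if two transactions update the same key, undoing one can overwrite the other's value.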

Question: How to handle access by other processes?


Two Phase Commit

Atomic commit in a distributed system requires all parties to agree to commit at the same time (or not to commit).

The actual commit is the log entry written by the coordinator at the end.
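A minimal single-machine sketch of the protocol, assuming direct method calls stand in for messages and a Python list stands in for the coordinator's log. The names two_phase_commit and Participant are hypothetical, and failures, timeouts and participant logs are deliberately left out.

```python
class Participant:
    def __init__(self, can_commit):
        self.can_commit = can_commit
        self.state = "working"

    def prepare(self):
        # A "yes" vote is a promise: after it, the participant may not
        # abort on its own and must wait for the coordinator's decision.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def finish(self, decision):
        self.state = "committed" if decision == "commit" else "aborted"


def two_phase_commit(coordinator_log, participants):
    """Run one round of two-phase commit; return "commit" or "abort"."""
    # Phase 1: ask every participant to prepare and collect the votes.
    votes = [p.prepare() for p in participants]
    decision = "commit" if all(votes) else "abort"

    # The decision becomes final when the coordinator logs it.
    coordinator_log.append(decision)

    # Phase 2: tell every participant the outcome.
    for p in participants:
        p.finish(decision)
    return decision


log = []
ok = two_phase_commit(log, [Participant(True), Participant(True)])
bad = two_phase_commit(log, [Participant(True), Participant(False)])
```

A single "no" vote aborts the whole transaction, and no participant commits before the coordinator's log entry exists.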

Question: What if a non-coordinator process crashes and never commits?
Question: What if the coordinator process crashes?
Question: Is a transaction mutually exclusive?

PROBLEM: This is a blocking protocol. If the coordinator fails, participants that have already voted to commit must wait for it to recover.

Assignment 6: 

  • QUESTION:  how to do synchronous strong consistency checkpointing?

  • QUESTION: how to do asynchronous strong consistency checkpointing?

 

Guidelines for Assignments: Assignments must be posted on your web sites. Obviously you should not copy another student's assignment and treat it as your own. Assignments are an extension of the class dialog. Basically you get credit for doing an assignment and no credit for not doing it. Really exceptional answers get extra credit.
Most assignments require only short answers, but they should be well-thought-out answers showing insight into the problem.

 

 

 

 


Copyright Chris Wild 2000.
For problems or questions regarding this web site, contact Dr. Wild.
Last updated: 15 Oct 2000.