failure: system does not perform as specified
erroneous state: A state that could lead to failure by a sequence of valid state transitions
error: part of the system state that differs from its intended value
fault: physical or design problem that could lead to an erroneous state
An error is a manifestation of a fault that could lead to a system failure.
One strategy for recovery is to repair the erroneous state BEFORE it causes a system failure. (fault recovery)
Process failures: deadlocks, timeouts, protection violations, wrong input, consistency violations.
ACTION: abort or restart from previous state
System failures: processor fails => all processes fail.
Assume fail-stop
amnesia: start in predefined state (not depending on previous states before failure)
partial amnesia: partial restoration of previous state (checkpoint)
pause: restarts in the same state it was in when it failed
halting: system never restarts
Secondary storage failures: restore from archive plus log, OR use mirrored disks (RAID - Redundant Array of Independent Disks). An example of redundancy to mask failures.
Question: How to detect a failure?
Communications medium failure: common, somewhat alleviated by recovery protocols (TCP).
Failure of a physical link separating processors is alleviated by redundancy.
Buffer overruns are also an issue.
Forward error recovery: remove errors and proceed.
Assumes knowledge of semantics in order to establish an error-free state, and a "good" application in which such a state can be determined and is acceptable.
There are a lot of error-free states that do not make acceptable progress. (Presumably the initial state is error free, but this extreme mimics backward recovery.)
Redundancy and vote/mask of errors is one non-semantic way to proceed
- but only works with certain kinds of faults.
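The vote/mask idea can be sketched as a simple majority voter. This is only an illustration: the function name majority_vote and the replica values are made up for the example, and it assumes replicas fail independently and that a strict majority are correct.

```python
from collections import Counter

def majority_vote(replies):
    """Return the value reported by a strict majority of replicas, or None."""
    if not replies:
        return None
    value, count = Counter(replies).most_common(1)[0]
    # A strict majority is required; otherwise the error cannot be masked.
    return value if count > len(replies) // 2 else None

# Three replicas compute the same result; one of them suffers a fault.
masked = majority_vote([42, 42, 17])     # the single bad value is outvoted
unmasked = majority_vote([42, 17, 99])   # no majority, so nothing to trust
```

Note that no knowledge of the application's semantics is needed, but a design fault shared by every replica would be outvoted by nobody.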
Backward error recovery: restore process to a previous state. A more general technique, BUT:
Expensive
No guarantee of future success (the fault may be elsewhere and be repeated, or the backed-up state may itself contain an error)
May not be able to back up (firing a missile)
Recovery point: previous state of the system to which a process can return on error recovery
Stable storage: storage that will not be lost in the face of system crashes.
operation-based approach:
Record details of the system's operations in stable storage as an audit trail or log, so that any action can be done or undone
Restore a previous state by undoing operations on the current state
Can commit an action to secondary or stable storage
Updating-in-place:
Record name of object, old state and new state (for redo)
undo to recover, redo to restore
problem if crash occurs between operation and logging
Write-Ahead-Log:
Update after undo log is recorded
Before committing updates, record redo and undo logs
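A minimal sketch of updating-in-place with a write-ahead log, assuming a toy key-value store. The class WriteAheadStore and its method names are hypothetical, and a Python list stands in for stable storage; a real log would be forced to disk before the update is applied.

```python
_MISSING = object()   # marks "object did not exist before this update"

class WriteAheadStore:
    """Toy store; `log` stands in for stable storage."""

    def __init__(self):
        self.data = {}   # objects, updated in place
        self.log = []    # append-only records: (name, old_state, new_state)

    def update(self, name, new_state):
        old_state = self.data.get(name, _MISSING)
        # Write-ahead rule: record the undo/redo information BEFORE updating,
        # so a crash between the two steps never leaves an unlogged change.
        self.log.append((name, old_state, new_state))
        self.data[name] = new_state

    def undo_all(self):
        """Return to the recovery point by undoing records, newest first."""
        for name, old_state, _ in reversed(self.log):
            if old_state is _MISSING:
                self.data.pop(name, None)
            else:
                self.data[name] = old_state

    def redo_all(self):
        """Reapply logged updates, oldest first."""
        for name, _, new_state in self.log:
            self.data[name] = new_state

store = WriteAheadStore()
store.update("x", 1)
store.update("x", 2)
store.update("y", 5)
store.undo_all()   # store.data is empty again
store.redo_all()   # store.data holds the final values again
```

Logging the old state is what makes undo possible, and logging the new state is what makes redo possible.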
state-based approach:
Complete state is saved at recovery point (called checkpoint)
Can use shadow pages (original copies kept) of only those parts of the system state which are changed, rest need not be copied.
After committing, discard shadow pages.
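The shadow-page idea can be sketched as follows. The class ShadowPagedState and its methods are hypothetical names for illustration, and the "pages" are just list entries; the point is only that originals are saved for changed pages and nothing else is copied.

```python
class ShadowPagedState:
    """Toy paged state; an original copy is kept only for pages that change."""

    def __init__(self, pages):
        self.pages = list(pages)
        self.shadow = {}   # page number -> contents at the recovery point

    def write(self, page_no, contents):
        # Save the original only the first time a page changes.
        if page_no not in self.shadow:
            self.shadow[page_no] = self.pages[page_no]
        self.pages[page_no] = contents

    def commit(self):
        """The current state becomes the new recovery point."""
        self.shadow.clear()

    def rollback(self):
        """Restore the recovery point from the shadow pages."""
        for page_no, original in self.shadow.items():
            self.pages[page_no] = original
        self.shadow.clear()

state = ShadowPagedState(["a", "b", "c"])
state.write(1, "B")
state.write(1, "BB")   # second write to the same page saves nothing new
state.rollback()       # pages are ["a", "b", "c"] again
```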
Failure of a process after information exchange will require recovery actions by communications partners.
SCENARIO: Process A sends a message to Process B, then fails and rolls back to a state before the sending of the message.
INCONSISTENT global state: Process A does not know the message has been sent, but Process B has received it.
Must not get into an inconsistent global state, so may need to roll back non-failed processes as well.
Consistent global state problem.
Example (as powerpoint)
Naive: take checkpoint after sending every message
Guarantees that any message recorded as received has already been recorded as sent
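The consistency condition (every message recorded as received is also recorded as sent) can be checked mechanically. A sketch, assuming each checkpoint is represented as a dict holding sets of message ids; the function name is_consistent and this representation are illustrative choices, not part of the notes.

```python
def is_consistent(checkpoints):
    """checkpoints maps process name -> {"sent": set, "received": set}."""
    all_sent = set()
    for cp in checkpoints.values():
        all_sent |= cp["sent"]
    # Every message recorded as received must also be recorded as sent.
    return all(cp["received"] <= all_sent for cp in checkpoints.values())

# The scenario above: A rolled back to before sending m1, B remembers it.
after_rollback = {
    "A": {"sent": set(), "received": set()},
    "B": {"sent": set(), "received": {"m1"}},
}
# A checkpointed right after sending m1 (the naive scheme).
naive = {
    "A": {"sent": {"m1"}, "received": set()},
    "B": {"sent": set(), "received": {"m1"}},
}
inconsistent = not is_consistent(after_rollback)
consistent = is_consistent(naive)
```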
Synchronous checkpoint:
channels are FIFO with no communications failure
uses permanent and temporary checkpoints
Phase I:
Single process starts algorithm by making a temporary checkpoint
Requests all other processes to make temporary checkpoints
Others send back a message saying whether or not the checkpoint was successful
Starting process decides, based on replies, to commit or abort
Phase II:
Starting process informs others of its decision
Others follow decision
Cannot send messages while awaiting the decision of the starting process
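The two phases above can be sketched in a single address space. This is a simplification under stated assumptions: direct method calls stand in for messages, the class Process and its fields are hypothetical, and the rule that processes may not send while awaiting the decision is not modelled.

```python
class Process:
    def __init__(self, name, state, can_checkpoint=True):
        self.name = name
        self.state = state
        self.can_checkpoint = can_checkpoint   # simulates a failed attempt
        self.temporary = None
        self.permanent = None

    def take_temporary(self):
        if not self.can_checkpoint:
            return False
        self.temporary = self.state
        return True

    def finish(self, commit):
        # On commit the temporary checkpoint becomes permanent;
        # on abort it is simply discarded.
        if commit and self.temporary is not None:
            self.permanent = self.temporary
        self.temporary = None

def synchronous_checkpoint(starter, others):
    # Phase I: everyone takes a temporary checkpoint and reports success.
    replies = [starter.take_temporary()]
    replies += [p.take_temporary() for p in others]
    commit = all(replies)
    # Phase II: the starter's decision is sent out and followed by all.
    for p in [starter, *others]:
        p.finish(commit)
    return commit

a, b, c = Process("A", 5), Process("B", 7), Process("C", 9)
committed = synchronous_checkpoint(a, [b, c])
```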
Example (as powerpoint)
Synchronous Recovery:
Single process starts recovery by sending an "OK to recover" message to the others
Others reply (might be negative if already in recovery)
If all others reply OK, it decides to recover
Transmit decision to others
Cannot send messages while recovering
Optimization: if no message exchange, don't need to rollback
Disadvantages:
Additional messages
introduces synchronization delays
Additional processing overhead (paid even if failures are infrequent)
Asynchronous checkpointing:
Basic idea: each process takes its own checkpoints. Upon recovery, need to find a consistent set among these.
There is always at least one consistent set. WHY?
log incoming messages to stable storage
pessimistic: log before processing
optimistic: periodic logging - (need to checkpoint the message log)
Assume communications are reliable and ordered, with infinite buffers, finite delays, and monotonically numbered events
After each event, record {s, m, msgs_sent}: s is the state of the process, m is the most recently arrived message (the one that started this event), and msgs_sent is the set of messages sent as a result of this event
Use this information to identify orphan messages
See example (as powerpoint)
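One way to picture the search for a consistent set is an iterative rollback that removes orphan messages. The sketch below is an illustration under assumptions, not the algorithm from the slides: the names cp and recovery_line are made up, message ids are (sender, sequence) pairs, each checkpoint records the cumulative sent and received sets, and every process's first checkpoint is its initial state.

```python
def cp(sent=(), received=()):
    """Build a checkpoint record from iterables of (sender, seq) ids."""
    return {"sent": set(sent), "received": set(received)}

def recovery_line(history):
    """history maps process -> list of checkpoints, oldest first.

    Returns process -> index of the checkpoint to restart from.
    """
    line = {p: len(cps) - 1 for p, cps in history.items()}
    changed = True
    while changed:
        changed = False
        for p, cps in history.items():
            received = cps[line[p]]["received"]
            # An orphan is recorded as received here, but its sender's
            # chosen checkpoint has no record of sending it.
            orphans = [m for m in received
                       if m not in history[m[0]][line[m[0]]]["sent"]]
            if orphans and line[p] > 0:
                line[p] -= 1   # roll this process back one checkpoint
                changed = True
    return line

history = {
    # A failed after sending ("A", 2), which no checkpoint of A records.
    "A": [cp(), cp(sent=[("A", 1)])],
    "B": [cp(), cp(received=[("A", 1)]),
          cp(received=[("A", 1), ("A", 2)], sent=[("B", 1)])],
    "C": [cp(), cp(received=[("B", 1)])],
}
line = recovery_line(history)   # B rolls back once, which then forces C back
```

In the worst case every process ends up at its initial state, where nothing has been received, so no orphans can remain.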
Atomic actions (AA), AKA transactions.
All or Nothing semantics
Stable storage (usually implemented as disk mirroring: RAID - Redundant Array of Independent Disks)
AA Primitives
Need to preserve even in face of nested transactions.
Permanence only applies to top level.
Conceptually transaction has copy of shared objects
Does this accomplish the above requirements?
Real copy is expensive
Question: Do we need to copy read-only objects?
Write only copies blocks actually changed (called shadow blocks)
Can use Unix's i-node concept to mix private and public blocks
Commit has to atomically update public space.
Changes made to public storage but audit log is kept of old values
log can be used to rollback or rollforward (after a crash).
Question: How to handle access by other processes?
Atomic commit in a distributed system requires all parties to agree to commit at the same time (or not to commit).
It is the coordinator's log entry at the end which is the actual commit.
Question: What if a non-coordinator process crashes and never commits?
Question: What if coordinator process crashes?
Question: is a transaction mutually exclusive?
PROBLEM: this is a blocking protocol. If the coordinator fails, the other processes must wait for it to recover before they can learn the decision.
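A sketch of the commit protocol discussed above, with lists standing in for the logs on stable storage. The class Participant, the function two_phase_commit, and the log record strings are hypothetical; crashes, timeouts, and recovery from the logs are not modelled.

```python
class Participant:
    def __init__(self, name, vote_yes=True):
        self.name = name
        self.vote_yes = vote_yes
        self.log = []   # stands in for this process's stable log

    def prepare(self):
        # Record the vote before replying to the coordinator.
        self.log.append("ready" if self.vote_yes else "abort")
        return self.vote_yes

    def decide(self, decision):
        if self.log[-1] != decision:
            self.log.append(decision)

def two_phase_commit(coordinator_log, participants):
    coordinator_log.append("prepare")
    votes = [p.prepare() for p in participants]
    decision = "commit" if all(votes) else "abort"
    # This log entry by the coordinator is the actual commit point.
    coordinator_log.append(decision)
    for p in participants:
        p.decide(decision)
    return decision

coord_log = []
outcome = two_phase_commit(coord_log, [Participant("P1"), Participant("P2")])
```

A participant that has logged "ready" but has not yet heard the decision is exactly the one that blocks if the coordinator fails at that moment.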
Assignment 6:
Guidelines for Assignments:
Assignments must be posted on your web sites. Obviously you should not
copy another student's assignment and treat it as your own. Assignments
are an extension of the class dialog. Basically you get credit for doing
an assignment and no credit for not doing it. Really exceptional answers get
extra credit. Most assignments require only short answers - but they should be well thought out answers showing insight into the problem.