Amoeba Case Study

Fall 2000: CS 771/871 Operating Systems

Amoeba Distributed Operating System

System Architecture

Location Transparent Non-dedicated Processor Pool
(possibly different architectures)
X-terminals dedicated to individual users
Set of Services (some dedicated, some dynamic)
Underlying network (LAN or WAN)
Microkernel + CLIENT/SERVER = OS

Microkernel

Manages processes and threads (with synchronization)
Low-level memory management (Segments)
Communications (RPC and group)
Low-level I/O (But using client/server model)

Client Server Model

Objects managed by servers
Objects accessed by capabilities
General approach supporting
- files
- directories
- windows
- memory
- processors
- I/O devices
Files are immutable (bullet server)

Capabilities

CREATE-OBJECT RPC to Object server returns capability

Capability fields:
- Server Port: logical address of service (not a machine address)
- Object: like a UNIX i-node
- Rights: particular to an object type
- Check: validates capability

Object Protection

On CREATION, object is assigned a random Check kept at object server and put in capability.
Rights field is all bits set (OWNER CAPABILITY), returned to client
To RESTRICT capability, owner sends it back with bit mask
Bit mask is ANDed with RIGHTS field to give the new rights
Server XORs new rights with original check field and passes the result through a one-way function. This is new check field.
The server returns new capability with new rights and check field to client, which may be safely passed to another process.

QUESTION: How safe is this?
How do we know it is owner trying to restrict the capability?
Can another process make up a capability with owner rights?
Can rights be increased?
Can an impostor steal the owner's capability?
Can a process with a restricted capability increase the rights?
Is XOR reversible?
Give an example of a one way function.

Standard Operations on Objects

Call	Description

Age	Perform Garbage Collection Cycle
Copy	Duplicate and return capability to new object
Destroy	Destroy and reclaim storage
Getparams	Get server parameters
Info	Get ASCII string describing the object
Restrict	Produce restricted capability to this object
Setparams	Set server parameters
Status	Get status from server
Touch	Pretend object was just used (garbage collection)

NOTES:

Age/Touch used in garbage collection of orphaned objects (like a LEASE)
Copy is done at server, no object-related traffic to client needed,
can have remote machines as targets
Get/Setparams: used by administrator to control object manager

Process Management

Process is an object
Parents get capability to child process objects
- suspend
- restart
- signal
- destroy
Differs from UNIX clone method of fork and exec
Three Levels
- RPC to process server kernel thread on specific machine
- library functions which call RPC
- Run Server which finds a processor
Process Descriptor
- architecture
- owner's capability (for reporting)
- memory segments
- thread descriptors
  - PC
  - register save area
  - stack pointer
  - other state info

/*
** Process descriptor.
** This is followed by pd_nseg segment descriptors (segment_d),
** reachable through PD_SD(p)[i], for 0 <= i < p->pd_nseg.
** The index in the segment array is also the segment identifier.
** Following the segments are pd_nthread variable-length thread descriptors.
** Sample code to walk through the threads:
**	thread_d *t = PD_TD(p);
**	for (i = 0; i < p->pd_nthread; ++i, t = TD_NEXT(t))
**		<here *t points to thread number i>;
*/
typedef struct {
	char		pd_magic[ARCHSIZE];	/* Architecture */
	capability	pd_self;	/* Process capability (if running) */
	capability	pd_owner;	/* Default checkpoint recipient */
	uint16		pd_nseg;	/* Number of segments */
	uint16		pd_nthread;	/* Number of threads */
} process_d;

API
- EXEC (capability of process server, process descriptor)
- GETLOAD (of a processor)
- STUN
  - normal: terminate outstanding RPCs
  - emergency: stops immediately, RPCs are orphaned
- NEWPROC: high-level API builds process descriptor from binary file name, arguments, and environment

Thread Management

Initially one thread, but can start any number of additional ones
GLOCAL variables (locally global to that thread)
Synchronization by
- SIGNALS (asynchronous interrupts)
- MUTEX (binary semaphore), fair, with a time-out LOCK
- SEMAPHORES, counting with a time-out WAIT

Memory Management

entirely in physical memory
- no page faults
- can read/write directly into user space
- Assumes cheap main memory
segments contiguous in address space
segments are objects
any number, anywhere
Any process with capability to segment could read/write it (with proper permissions)
- shared memory communications (need not be on same machine)
- main memory file server

Communications in Amoeba

Address (port) is a 48-bit number chosen randomly by the server thread;
this is the first field of a capability.

RPC, point to point, request/block/reply

get_request(&header, buffer, bytes)
(server listens to port), header contains the PORT (6 bytes)
put_reply (&header, buffer, bytes)
server sends reply to client
trans (&header1, buffer1, bytes1, &header2, buffer2, bytes2)
send message from client to server

To prevent impersonating a server, ports are assigned in pairs

get-port (private) known only to the server
put-port (public) used by any client

These are related by a one-way function.

put-port = F(get-port)

Since get_request uses get-port, an impostor cannot issue one in place of a server.
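
A minimal C sketch of why the port pair blocks impersonation. The listener table, the names kernel_get_request and kernel_trans, and the mixer port_F are assumptions made for illustration; port_F is only a placeholder for the one-way function F.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PORT_MASK     0xFFFFFFFFFFFFULL   /* ports are 48 bits (6 bytes) */
#define MAX_LISTENERS 8

typedef struct { uint64_t put_port; int server_id; int used; } listener_t;

/* Placeholder for F: a bit mixer, NOT a real one-way function. */
uint64_t port_F(uint64_t get_port)
{
    uint64_t x = get_port;
    x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
    x ^= x >> 27; x *= 0x94d049bb133111ebULL;
    x ^= x >> 31;
    return x & PORT_MASK;
}

/* get_request: the kernel always listens on F(port presented by the caller). */
int kernel_get_request(listener_t *tab, uint64_t port, int server_id)
{
    assert(tab != NULL);
    for (size_t i = 0; i < MAX_LISTENERS; i++) {
        if (!tab[i].used) {
            tab[i].put_port = port_F(port);
            tab[i].server_id = server_id;
            tab[i].used = 1;
            return 0;
        }
    }
    return -1;  /* table full */
}

/* trans: the client names the public put-port; returns the listener or -1. */
int kernel_trans(const listener_t *tab, uint64_t put_port)
{
    assert(tab != NULL);
    for (size_t i = 0; i < MAX_LISTENERS; i++)
        if (tab[i].used && tab[i].put_port == put_port)
            return tab[i].server_id;
    return -1;
}
```

An impostor who knows only the put-port and passes it to kernel_get_request ends up listening on F(put-port), which is not the port clients send to.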

Group Communications

Closed groups, but anyone can send RPC message to any member for group broadcast.

CreateGroup: specify degree of fault tolerance
JoinGroup: includes greeting message to existing members
LeaveGroup: includes good bye message to existing members
SendToGroup: atomic reliable broadcast with total ordering
implements sequential consistency model
ReceiveFromGroup: blocks waiting for message
ResetGroup: reestablishes group with minimum number of members

Reliable Broadcast: Initiation

User process traps to kernel, passing message
kernel blocks user process
kernel sends point-to-point message to SEQUENCER
1. kernel message contains unique number to detect duplicates
2. also contains number of last broadcast message received by kernel
  (piggybacked acknowledgment)
3. starts timer
Sequencer allocates next message number and broadcasts message
Upon seeing broadcast, sending kernel
1. stops timer
2. unblocks user process

Failure Modes:

Sending kernel times out because sequencer did not receive
- Just resends
Sending kernel times out because kernel did not receive broadcast
- Resends as above
- Sequencer notices duplicate request from unique number
- Sequencer notifies sending kernel only that all is OK
Sending kernel sees wrong broadcast
- graciously accepts that it was beaten to the sequencer

Reliable Broadcast: Sequencer

Checks unique number to catch retransmission
If retransmission, just notifies sending kernel all is OK
If new,
- Updates sequence number
- assigns to this broadcast
- stores message in history buffer
- updates acknowledge state of sending kernel
- broadcasts message
- sends message to any processes on this processor in that group
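
The sequencer steps above, sketched in C under stated assumptions: each kernel has at most one outstanding request (so remembering its latest unique number is enough to spot a retransmission), sequence numbers start at 1, and all names and sizes (sequencer_t, SEQ_HISTORY, and so on) are hypothetical. The sketch reuses history slots blindly; a real sequencer must not overwrite a slot that is still unacknowledged.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SEQ_KERNELS 16
#define SEQ_HISTORY 32
#define SEQ_MSG_MAX 64

typedef enum { SEQ_REJECTED, SEQ_DUPLICATE, SEQ_BROADCAST } seq_result_t;

typedef struct { uint32_t seq; size_t len; char data[SEQ_MSG_MAX]; } seq_history_t;

typedef struct {
    uint32_t next_seq;               /* next sequence number to hand out */
    int      has_sent[SEQ_KERNELS];  /* has this kernel sent anything yet? */
    uint32_t last_uid[SEQ_KERNELS];  /* unique number of its latest request */
    uint32_t last_seq[SEQ_KERNELS];  /* sequence number given to that request */
    uint32_t acked[SEQ_KERNELS];     /* highest broadcast each kernel reports having */
    seq_history_t history[SEQ_HISTORY];
} sequencer_t;

void sequencer_init(sequencer_t *s)
{
    assert(s != NULL);
    memset(s, 0, sizeof *s);
    s->next_seq = 1;  /* 0 means "nothing received yet" */
}

seq_result_t sequencer_handle(sequencer_t *s, unsigned kernel, uint32_t uid,
                              uint32_t piggyback_ack, const char *data,
                              size_t len, uint32_t *seq_out)
{
    assert(s != NULL && data != NULL && seq_out != NULL);
    if (kernel >= SEQ_KERNELS || len > SEQ_MSG_MAX)
        return SEQ_REJECTED;
    if (piggyback_ack > s->acked[kernel])
        s->acked[kernel] = piggyback_ack;       /* update acknowledge state */
    if (s->has_sent[kernel] && s->last_uid[kernel] == uid) {
        *seq_out = s->last_seq[kernel];         /* retransmission: just say OK */
        return SEQ_DUPLICATE;
    }
    uint32_t seq = s->next_seq++;               /* assign next sequence number */
    seq_history_t *h = &s->history[seq % SEQ_HISTORY];
    h->seq = seq;                               /* store in history buffer */
    h->len = len;
    memcpy(h->data, data, len);
    s->has_sent[kernel] = 1;
    s->last_uid[kernel] = uid;
    s->last_seq[kernel] = seq;
    *seq_out = seq;
    return SEQ_BROADCAST;  /* caller now broadcasts (seq, data) to the group */
}
```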

Reliable Broadcast: Receiving Kernel

Compares sequence number to last one received
If exactly one higher, then accepts
- If process in group is waiting, then copies into user process address space and unblocks
- If process not yet waiting, buffers message
- NOTE: there may be several processes in that group on this machine
If message is out of sync (sequence number too high):
- sends point-to-point message to sequencer notifying of lost message(s)
- sequencer transmits lost messages from history buffer.

Reliable Broadcast: History Buffer

To prevent overflow, sequencer needs to delete old messages
If all kernels involved in this group have acknowledged message "k", then sequencer can discard all messages from 0 to k.
Normally piggybacked ACKs keep status reasonably up to date
If no traffic out, processor sends status periodically
RequestforStatus by sequencer can also be used in rare cases
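
The discard rule amounts to taking the minimum acknowledged sequence number over all kernels in the group. A minimal C sketch, with the hypothetical name history_discard_upto and an array holding one acknowledged number per kernel:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns k such that every message numbered 0..k has been acknowledged by
 * all kernels and may therefore be dropped from the history buffer. */
uint32_t history_discard_upto(const uint32_t *acked, size_t n_kernels)
{
    assert(acked != NULL && n_kernels > 0);
    uint32_t k = acked[0];
    for (size_t i = 1; i < n_kernels; i++)
        if (acked[i] < k)
            k = acked[i];
    return k;
}
```

One slow or silent kernel holds the minimum down, which is why the periodic status messages and RequestforStatus matter for keeping the buffer small.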

COMPLEXITY of GROUP COMMUNICATIONS: slightly more than 2 messages per broadcast (one point-to-point to sequencer, one broadcast), increasing slightly with group size N

Fault Tolerant Group Communications

Processor crash discovered by lack of ACKs by some processor
All subsequent group communications on that processor fail
User process getting error return, calls ResetGroup
Phase one, elects coordinator
- Upon ResetGroup, kernel sends message to all member kernels inviting participation in recovery.
- Upon receipt of recovery invitation, processor sends back highest sequence number seen
- If contention, choose one with highest sequence number seen
- If still contention, choose one arbitrarily (highest network address)
Phase two, coordinator rebuilds group
- Gets any message it may have missed into its history buffer.
- Sends Results message, announcing
  - it is coordinator (and hence new sequencer)
  - members of reformed group
  - highest sequence number seen
- Each member can request unseen messages from sequencer
- Once ACK received from all members of new group, sequencer can discard history buffer and resume.
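
The phase-one choice of coordinator can be sketched in C. The struct recovery_reply_t and the function name are hypothetical, and a 32-bit network address is assumed for simplicity: pick the highest sequence number seen, breaking ties by highest network address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t net_addr;     /* network address of the replying kernel */
    uint32_t highest_seq;  /* highest sequence number that kernel has seen */
} recovery_reply_t;

/* Returns the index of the elected coordinator, or -1 if there are no replies. */
int elect_coordinator(const recovery_reply_t *r, size_t n)
{
    assert(r != NULL || n == 0);
    int best = -1;
    for (size_t i = 0; i < n; i++) {
        if (best < 0 ||
            r[i].highest_seq > r[best].highest_seq ||
            (r[i].highest_seq == r[best].highest_seq &&
             r[i].net_addr > r[best].net_addr))
            best = (int)i;
    }
    return best;
}
```

Preferring the highest sequence number means the new sequencer is the member least likely to have to fetch messages it missed.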

Fault tolerance of history buffer is achieved when setting up group by specifying how many machines maintain a copy ("k" fault tolerance).

To sync "k" copies:

User process kernel broadcasts message directly
Sequencer waits for "k" lowest-numbered kernels to ACK broadcast
Then sequencer broadcasts "ACCEPT" message
message is "official" only upon receipt of ACCEPT from sequencer (which also includes the sequence number assigned).
INVARIANT: ACCEPT message implies "k+1" machines have a copy of message

Measurements on 68030 CPUs with 10 Mbps Ethernet: 800 reliable transmissions per second.

FLIP (Fast Local Internet Protocol)

Why another protocol at the network layer?

Need to support RPC
Need to support group communications
Process migration should be location transparent at the address level
Processes should not impersonate others
Support automatic network reconfig
Should work on WANs

Each Process has a unique randomly chosen 64 bit FLIP address
this address migrates with process

For security, consists of public and private parts

Public-address = DES(private-address)

Use private address as the DES key to encrypt a 64-bit block of 0 bits.

Servers listen on private addresses, but clients send on public ones.
(analogous to put/get ports but at lower level).

FLIP Functions

INIT: allocate slot in table with two call-back procedure addresses (interrupt handlers)
- Normal
- Abnormal
END: deallocates slot
REGISTER: sets FLIP address
UNREGISTER: unsets FLIP address
UNICAST, MULTICAST, BROADCAST: no guarantees on delivery
RECEIVE:
NOTDELIVER: messages sent back to this machine as undeliverable

FLIP Routing Table

FLIP Address	Network address	Hop count	Trusted bit	Age

Upon receipt of packet
- If new, generates new entry in routing table
- updates NETWORK ADDRESS and HOP COUNT
- TRUSTED BIT is managed by gateways as is HOP COUNT
- AGE is reset to 0 when packet from FLIP address is received
  periodically it is incremented (used by replacement algorithms)
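
A minimal C sketch of the routing table update. The type flip_route_t, the 64-bit network address, and the replace-the-oldest policy when the table is full are assumptions for illustration. The notes say only that AGE is used by replacement algorithms.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    int      used;
    uint64_t flip_addr;  /* FLIP address of the remote process */
    uint64_t net_addr;   /* network address the packet arrived from */
    unsigned hop_count;
    int      trusted;    /* managed by gateways; passed in here */
    unsigned age;        /* 0 = heard from just now */
} flip_route_t;

/* Called for each received packet: refresh the entry, or create one,
 * evicting the oldest entry if no slot is free (assumed policy). */
flip_route_t *route_on_packet(flip_route_t *tab, size_t n, uint64_t flip_addr,
                              uint64_t net_addr, unsigned hop_count, int trusted)
{
    assert(tab != NULL);
    flip_route_t *slot = NULL;
    for (size_t i = 0; i < n; i++)
        if (tab[i].used && tab[i].flip_addr == flip_addr) { slot = &tab[i]; break; }
    if (slot == NULL) {
        for (size_t i = 0; i < n; i++) {
            if (!tab[i].used) { slot = &tab[i]; break; }
            if (slot == NULL || tab[i].age > slot->age) slot = &tab[i];
        }
    }
    if (slot == NULL)
        return NULL;  /* empty table */
    slot->used = 1;
    slot->flip_addr = flip_addr;
    slot->net_addr = net_addr;
    slot->hop_count = hop_count;
    slot->trusted = trusted;
    slot->age = 0;
    return slot;
}

/* Called periodically: every entry gets older. */
void route_tick(flip_route_t *tab, size_t n)
{
    assert(tab != NULL);
    for (size_t i = 0; i < n; i++)
        if (tab[i].used)
            tab[i].age++;
}
```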

Locating Put-Ports in Amoeba

Let's look at how client A communicates to server B.

When B is created it is assigned a random FLIP address which is registered with the FLIP layer
B does a get_request on its get_port, traps to kernel
kernel gets or computes put_port and notes that this process is listening to the put_port, blocks B
A does a TRANS on the put_port, traps to kernel
kernel looks up FLIP address for that put_port
If not found, RPC layer broadcasts request to find put_port FLIP
- RPC layer sets timer
- To limit impact over WANs, sets maximum HOP COUNT to broadcast out to.
- Gateways discard broadcasts which have reached the maximum HOP COUNT,
  else increase HOP COUNT and send to next network
- If time-out, then rebroadcasts with higher HOP COUNT
At B's machine, RPC layer sends back FLIP address
Now A's machine knows network address at FLIP layer and RPC layer knows FLIP address
At A, RPC layer sends message to that FLIP

NOTE: for redundancy, there may be several processes listening to a put_port
if several respond to a broadcast, RPC layer chooses one.

Separating FLIP from put_ports:

Allows nonFLIP networks to be used
prevents impostor servers from listening on a public put_port
Allows restarts of servers (which will have a different FLIP but the same put_port)
Because of new FLIP, RPC can detect restarts of servers and can abort transaction
This gives AT-MOST-ONCE semantics
Of course client can just try again, using new server if it chooses

Amoeba's File System

Amoeba allows arbitrary file servers to coexist.
The standard file system consists of

BULLET server: handles file storage
Directory server: maps names to capabilities
Replication server: copies files

BULLET server

Designed for machines with large primary memories and huge disks
Files are IMMUTABLE
Files occupy a contiguous segment of memory
- Can be swapped to/from disk in one I/O transfer
- Can be sent to client in one RPC transfer
Conceptually, files are created fully loaded with information
Also allows UNCOMMITTED files to be created
- Allows changes until COMMITTED
- Cannot be seen by other processes until COMMITTED
- Size must still be known at CREATION
Files are accessed by capabilities not names

Implementation of BULLET Server

File Table is memory resident
- Pointer to file in main memory
- Pointer to file on disk
- Length of file
Object number in capability is used as index in file table
Randomly assigned check number kept in file table must match that in capability.
Files not in main memory are read in one access
Uncommitted files are deleted after 10 minutes of inactivity.
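
A minimal C sketch of the file table lookup. The entry layout and the name bullet_lookup are assumptions, and only the simple case is shown, where the capability carries the stored check unchanged (an owner capability).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    int       in_use;
    void     *in_memory;  /* pointer to file in main memory, NULL if not loaded */
    uint64_t  disk_addr;  /* where the contiguous file starts on disk */
    uint32_t  length;     /* length of file */
    uint64_t  check;      /* random check number assigned at creation */
} bullet_entry_t;

/* The object number indexes the table directly; the check must match.
 * Returns the entry, or NULL if the capability is not valid. */
const bullet_entry_t *bullet_lookup(const bullet_entry_t *table, size_t n,
                                    uint32_t object, uint64_t cap_check)
{
    assert(table != NULL);
    if (object >= n || !table[object].in_use)
        return NULL;
    if (table[object].check != cap_check)
        return NULL;
    return &table[object];
}
```

If in_memory is NULL in the returned entry, the server would read length bytes starting at disk_addr in a single access, since the file is contiguous.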

Garbage Collection

Every file has a counter, initialized to MAX_LIFETIME
Periodically, a daemon asks the bullet server to AGE the files (by decrementing the counter)
Any file whose counter goes to 0 is deleted.
However, another process periodically issues a TOUCH command for all files in the directories which resets the counter to MAX_LIFETIME
In this manner, orphan files are garbage collected.
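
The AGE/TOUCH cycle can be sketched in C. The value of MAX_LIFETIME and the names gc_file_t, gc_create, gc_touch and gc_age_all are illustrative assumptions.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LIFETIME 5u  /* illustrative value only */

typedef struct { int in_use; unsigned lifetime; } gc_file_t;

void gc_create(gc_file_t *f)
{
    assert(f != NULL);
    f->in_use = 1;
    f->lifetime = MAX_LIFETIME;
}

/* TOUCH: the file is still reachable from some directory, so reset its counter. */
void gc_touch(gc_file_t *f)
{
    assert(f != NULL);
    if (f->in_use)
        f->lifetime = MAX_LIFETIME;
}

/* AGE: decrement every counter and delete files that reach 0.
 * Returns the number of files deleted in this cycle. */
size_t gc_age_all(gc_file_t *files, size_t n)
{
    assert(files != NULL);
    size_t deleted = 0;
    for (size_t i = 0; i < n; i++) {
        if (!files[i].in_use)
            continue;
        if (files[i].lifetime > 0)
            files[i].lifetime--;
        if (files[i].lifetime == 0) {
            files[i].in_use = 0;   /* orphan: reclaim storage */
            deleted++;
        }
    }
    return deleted;
}
```

A file that is touched at least once every MAX_LIFETIME aging cycles never reaches 0, while a file that no directory refers to is reclaimed after MAX_LIFETIME cycles.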

Directory Server

Maps ASCII names to capabilities using directory tables
Can implement different flavors of directory management
Typically, a UNIX like directory service

ASCII String
Capability Set: one for each copy of that object
Set of columns, one for each protection domain (shares everything but RIGHTS field of capability)
Capability in the directory server is for one protection domain (column)
Of course capabilities can be to other directory objects
(even of a different type of directory service!)
Allows general graph structures (which may be better suited to distributed systems anyway).
Can access other objects as well as files
- processor pools
- hosts
- printers

Every user has their own root, so the system looks like a forest from the user's point of view.

Directory Server Calls

Create: returns capability to directory object
Delete: deleting a directory or entry does not delete the object
Append: adds a new directory entry in an existing directory object
Replace: replaces an existing entry
Lookup: given capability of directory and ASCII string, returns capability set of object
GetMasks: return RIGHTS mask for object
Chmod: change RIGHTS mask

Implementation of Directory Server

Each directory object is stored twice in two bullet files on different bullet servers
Changes to directories are stored in new bullet files.
Primary copy is made first, background thread makes secondary copy
After both copies made, old directory objects are destroyed
Directory servers themselves are duplicated on separate disks

Replication Server

Use LAZY REPLICATION to provide multiple copies of objects
Also runs garbage collection system, tracing through directories to TOUCH all objects found there

Run Server

Decides which architecture/which machine
Manages pools of processors, sorted by architecture
A program may be compiled for multiple architectures and so when it is looked up, finds a directory of executables
Run Server looks at appropriate pools
- Using GETLOAD calls, it knows approximate loads
- Each potential CPU estimates how much compute power it can spare to this process (using processor speed and number of threads running)
Server chooses processor with highest available processing bandwidth
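
A minimal C sketch of the Run Server's choice. The spare-power estimate (speed divided by running threads plus one) is an assumed formula, chosen only because the notes say the estimate uses processor speed and thread count. The names pool_cpu_t, cpu_spare_power and run_server_pick are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    const char *arch;     /* architecture of this pool processor */
    double      speed;    /* relative processor speed */
    unsigned    threads;  /* threads currently running (from GETLOAD) */
} pool_cpu_t;

/* Assumed estimate: the new process would get an equal share of the CPU. */
double cpu_spare_power(const pool_cpu_t *c)
{
    assert(c != NULL);
    return c->speed / (double)(c->threads + 1);
}

/* Returns the index of the best processor of the given architecture, or -1. */
int run_server_pick(const pool_cpu_t *pool, size_t n, const char *arch)
{
    assert(pool != NULL && arch != NULL);
    int best = -1;
    for (size_t i = 0; i < n; i++) {
        if (strcmp(pool[i].arch, arch) != 0)
            continue;  /* wrong pool */
        if (best < 0 || cpu_spare_power(&pool[i]) > cpu_spare_power(&pool[best]))
            best = (int)i;
    }
    return best;
}
```

For a program compiled for several architectures, the Run Server would call this once per available executable and keep the overall best.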

Boot Server

Servers interested in being automatically restarted
register with BOOT SERVER
BOOT SERVER periodically sends "are you alive" messages to server process
If no response after a certain time, tries to reboot on current processor
If fails to reboot, chooses new processor and tries to restart there.

Other Servers

TCP/IP Server
I/O Servers
Time Servers
Random Number Servers
Mail Servers