Lecture 1

CS 771/871 Operating Systems

Lecture 1

Who am I and what is my background in operating systems?

What are distributed systems?

A distributed system is a collection of independent computers that appear to the users of the system as a single computer (tanenbaum).

Distributed System should:

control network resource allocation to allow their use in the most effective way
Provide convenient virtual machine
hide distribution of resources
provide protection
provide secure communication (Goscinski)

This is in contrast to centralized computing (timesharing) or independent PC's.

Why study them? Because of perceived advantages:: better price/performance (PCs are cheap, supercomputers are not)
improved speed (Speed of light is limit, many operations in parallel)
naturally supports distributed application (banking); increased reliability (failures can be isolated); incremental growth (keep old machines, just add more); data sharing (computer supported cooperative work - or games); device sharing (printers, scanners, CD-Rom writers are expensive); communications (support groups of people working together); flexibility (match load to idle machines more easily).

But distributed systems have their disadvantages, chief among them is the complexity and relative unavailability of <<SOFTWARE>>. Most of the problems of non-distributed systems still exist, but we have added many new problems to solve in order to achieve some of the perceived advantages among them are network congestion and rerouting and security.

Also most distributed systems complicate the job of the user by forcing them to be aware of various aspects of the distributed system (Don't you love URL's).

Go over syllabus.

Taxonomy of Distributed Systems

Since a distributed systems is a collection of (usually geographically) distributed hardware which is made to look like a unified computing environment by clever software, we need to look at different hardware and software configurations.

Hardware Configurations

hardware configurations range from tightly coupled SIMD and shared memory parallel architectures to loosely coupled heterogeneous independent but cooperating machines (e.g. internet).

Flynn's classification is based on number of different instruction and data streams, ranging from SISD (Single Instruction Single Data stream - traditional single CPU machines) to MIMD (Multiple Instruction Multiple Data streams - most of what we are concerned with). Also includes the common parallel architecture SIMD and rather bizarre MISD possibilities.

MIMD machines can be further divided into those which share memory (multiprocessors) and those which do not (multicomputers). These can be further divided into bus or switched depending on the communications path from the CPUs to the memory. Common examples of bus communications is CABLE-TV (broadcast) , while telephone is a switched system (private).

Another distinction is between tightly coupled systems with short delays and high bandwidth and tight control between processors and loosely coupled with arbitrary delays, possibly low bandwidth and more autonomous control.

Bus architectures

With a shared memory accessible over the bus, it is possible to get memory coherence that property whereby all CPUs will access the same value at a particular memory location. Problem is contention for the bus and the shared memory. Common solution is cache memory kept local at the CPU, but can lead to incoherent memory. One solution is write-through cache in which all writes to the cache also write to memory. Any write to memory invalidate that cache location on other CPUs. Reads can be done local to the cache. Even so there is a practical limit to the number of CPUs which can share memory (32-64).

NOTE: this problem is not limited to multiprocessors shared main memory but is also a concern with workstations connected on a LAN. However there is common solution is to lock files, or records at the file server.

To get beyond the limitation imposed by contention on the shared bus, switches can be employed. Common switching networks are Crossbar and Omega. Crossbar switches achieve a high degree of parallel interconnectivity by having a path between every pair of communicating devices using a switch between every pair of devices (N*N cross points). Omega switches use fewer switches but with the possibility of contention. (Hybrid between bus and crosspoint). Figure 1-6 shows a 2x2 omega switch.

QUESTIONS
How many switches (sources of delays) between two devices in each network?
What about contention?
Is an n x n omega network the same as a crossbar switch?

Because multicomputers do not need to share (main) memory, the speed of access and contention problems are much alleviated (but not eliminated). In fact one can consider the main memory to be the cache of the disk memory at the file server. The common bus architecture is Ethernet LAN.

Tightly coupled multicomputers are frequently interconnected by a MESH or a HyperCube network (see figure 1-8).

QUESTIONS
What is max delay in each network?
How much contention (parallelism)?
How many I/O ports (interconnect busses) needed?

Software Concepts

The job of the operating system is to mold incalcitrant hardware into a beautiful virtual machine.

One distinction is based on the degree of autonomy between processors (tightly vs loosely coupled).

Combinations of hardware and software

Network OS: high degree of autonomy, possible different operating systems, few system wide resources (printers, network file system). Client Server protocols. User aware of distribution of resources.
RLOGIN, RSH, SETENV DISPLAY, FTP
are examples of explicit user actions which reflect lack of transparency of location of resources.
On the other hand, NFS makes the location of your files transparent within the network complex.
True Distributed OS: single system image presented to user: virtual uniprocessor. Note how this contrasts with traditional timesharing systems which make a single CPU look like many virtual CPUs. Requires:
- Global interprocess communications
- global protection
- global process management
- transparent distributed file access
Multiprocessor Timesharing OS: tightly coupled. common ready queue. Shared memory, file system is like single CPU version, possible specialization of processors.
QUESTION: why must scheduler run in a critical section?
QUESTION: What else must be run in a critical section?

See table in figure 1.12

Another way to look at differences is to consider the traditional hierarchical structure of a centralized OS:

File Management
I/O Device Management
Memory Management
Process Management

Now consider a network of resources consisting of

file servers
printers
plotters
scanners
name servers
personal computers or workstations
processors

Now consider different placements of the InterProcess Communications (IPC) module within the traditional hierarchy.
If between 1 and 2, then File Service can be provided remotely and transparently.
If between 2 and 3, then shared remote devices are supported transparently.
If between 3 and 4, then shared memory
If integrated into 4, true distributed OS.

NOTE: one can make access to remote resources appear transparent by adding software above the OS. This is particularly easy if the OS has a light weight kernel and exports many of the management to user level process (like file management, I/O management). then these modules can utilize the network to access remote resources.

One of the earliest attempts at a network OS was the National Software Works, undertaken in the middle 70's.
Consisted of heterogeneous computers connected by ARPANET. Implementation was entirely at the application level (reminds one of internet and web browsers, search engines, etc). But developed a IPC which provided common functionality on diverse systems. Dealt with addressing (naming) problems which is still a big issue in distributed systems.

This early attempt had performance problems.

Other early attempts were based on remote procedure calls (RPC) built on top of a centralized OS with network access (figure 2-6 Goscinski).
Consisted of the following steps:

User Process communicates using provided IPC to local Remote Access System (RAS) with a request.
Local RAS sends request to an appropriate Remote RAS.
Remote RAS acknowledges to local RAS
local RAS transmits acknowledgment to user process with information to set up direct communication path to remote RAS.
User process sends pertinent data to remote RAS.
which access appropriate resource on remote system
sends acknowledge to user process
remote RAS awaits completion of request
and send acknowledge back to user process.

Consider some of the design tradeoffs.
QUESTION: what other ways to solve?
QUESTION: what are the design issues?

The newcastle connection was an early attempt at developing a network OS based on the UNIX OS. Used an extension of the UNIX hierarchical file naming structure to tie different systems together (loosely coupled).
Draw out the naming structure.
Replaced library routines between user and kernel levels. this intermediate level communicated with other using RPC.
Because all processes are subjected to intermediate processing for kernel requests, slows down everybody.

There are no widely used commercially available distributed OS today, although there are many networked OS.

So why study? Because the design issues are important in general and the trend is towards more distributed systems (LAPLINK, mobile computing). It is the future and parts of it are already here.

Design Issues:

Transparency: Hide underlying distributed implementation. (easier to do at the user level than the programming level). Different levels of transparency:
1. Location: location of resources unknown
2. Migration: resources can move without changing their names
3. Replication: number of copies unknown (cache)
4. Concurrency: share resources automatically and unobtrusively
5. Parallelism: programmer unaware of parallel activity on his behalf
Flexibility: Unresolved issue: traditional vs micro kernel. Microkernel utilizes servers which provide higher level OS functions. Can customize a set of OS functions to application. For example, different file systems (DOS, UNIX, MAC) could be provided services.
Reliability: If probability of failure of a single CPU is p, then the probability of n CPUs failing at the same time is p**n. For example, if probability of failure is .1 for a single CPU, then the probability that 3 CPUs fail simultaneously is .001. Replicated hardware is the key to the reliability of mission critical systems.
In practice failures may not be independent and in fact if there are interdependencies between components, the reliability may even decrease. (in the worst case it is the probability that at least one component fails - consider a pipeline architecture which requires all CPUs to be working to get anything done. Then the probability that all CPUs are working is (1-p)**n which for our example is .729.
Availability: is the fraction of time the system is usable.
Reliability also includes data integrity and security concerns. (copies of key files increases availability but may compromise the integrity of the data).
Fault Tolerance: is the ability to provide service even in the face of system failures.
Performance: various metrics, some end user oriented (response time), others resource oriented (throughput, utilization). Raw numbers (speed of CPU or network) are often misleading. Benchmarks are frequently used to compare systems. But makeup of benchmark is application specific.
Communications is typically the bottleneck.
For parallelism to work must consider the appropriate grain size.
Scalability: From LANs to Internet. from home PCs to smart telephones. Scalability usually implies eliminating all centralized resource handling.
Distributed algorithms are distinguished by:
- No machine has complete information of system state
- machines make decisions based on local information
- failure of one machine will not cause system failure
- no global clock assumed

One of the fundamental problems in distributed OS is lack of global state and up-to-date information.

CS 771/871 Operating Systems

Lecture 1

Taxonomy of Distributed Systems

Hardware Configurations

Bus architectures

Software Concepts

Combinations of hardware and software

Design Issues:

Copyright chris wild 1996. For problems or questions regarding this web contact [Dr. Wild]. Last updated: August 29, 1996.

Copyright chris wild 1996.
For problems or questions regarding this web contact [Dr. Wild].
Last updated: August 29, 1996.