CS 771/871 Operating Systems
Distributed Memory
Multiprocessors and MultiComputers Revisited
- Multiprocessors (shared memory):
  - Complicated hardware
  - Does not scale well
  - Expensive
  - unified architectural design
  - memory access communications model
  - Easy to program (has process semantics)
  - well understood synchronization
  - can implement strict consistency
- Multicomputers (private memory):
  - easy to build (buy as many as you need off the shelf)
  - message passing communications model
  - programming is usually more difficult (except the RPC model)
- Hybrid Models
- distributed shared
memory
- shared variables
- shared objects
Virtual Distributed Memory
- Virtual Memory = not all of the memory space need be physically
present in main memory.
- Diskless workstations have a remote paging store
- Could share a file server's paging store among a set of processors
- Could distribute the file server's paging store among a set of
processors
- Thus distributed processes could share a virtual address space.
Tightly to Loosely Coupled Memory Architectures
- On-chip memory: direct connection between CPU and memory;
could have multiple CPUs on chip sharing memory
- Bus-accessed memory with no cache (no consistency problem, but the
bus limits capacity)
- Bus with caches (snoopy cache)
- Ring Based
- Switched
- NUMA
- Paged
- Shared Variable
- Object Based
Write Through Protocol
- Maintains cache consistency by performing all writes over the bus
to memory as well as to the cache.
- Other processors see write
request and invalidate or update their copy (if
cached)
- Caching does not speed up
writes.
QUESTION: how
does this fit with expected usage?
What do you need to know?
How could you measure?
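One way to see the behavior (and to start measuring it) is to model the protocol. The sketch below is a toy simulation of write-through with snooping invalidation; all class and method names are invented for illustration, not taken from any real machine.

```python
# Toy write-through snoopy protocol: every write goes over the bus to
# memory, and all other caches snoop the bus and invalidate their copy.
# Illustrative only -- names and structures are invented.

class Cache:
    def __init__(self):
        self.lines = {}                 # addr -> value (read cache)

    def snoop_write(self, addr):
        # Another CPU wrote this address: drop our copy if cached.
        self.lines.pop(addr, None)

class Bus:
    def __init__(self, n_cpus):
        self.memory = {}
        self.caches = [Cache() for _ in range(n_cpus)]

    def read(self, cpu, addr):
        c = self.caches[cpu]
        if addr not in c.lines:         # miss: fetch from memory
            c.lines[addr] = self.memory.get(addr, 0)
        return c.lines[addr]

    def write(self, cpu, addr, value):
        # Every write goes to memory over the bus (caching does not
        # speed up writes); other caches snoop and invalidate.
        self.memory[addr] = value
        self.caches[cpu].lines[addr] = value
        for i, c in enumerate(self.caches):
            if i != cpu:
                c.snoop_write(addr)

bus = Bus(2)
bus.write(0, 'x', 1)
print(bus.read(1, 'x'))   # CPU 1 misses, fetches 1 from memory
bus.write(1, 'x', 2)      # invalidates CPU 0's copy
print(bus.read(0, 'x'))   # CPU 0 misses again, sees 2
```

A simulation like this makes the expected-usage question concrete: counting bus transactions per read and per write under a given reference trace is one way to measure the protocol's fit.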
Write Once Protocol
- Multiple read caches allowed
(CLEAN)
- First write invalidates
(INVALID) read caches and makes writer owner (DIRTY)
- Subsequent writes are done to
cache only
- Subsequent read from other
processor
- must go to bus
(because cache is invalidated)
- owner intercepts
before memory can satisfy
- owner provides value
from its cache
- owner invalidates its
cache
- Reader's cache marked
DIRTY (it is owner)
QUESTION:
What if another processor writes without reading?
When is memory updated?
Does it ever need to be?
Does this suggest another shared memory protocol?
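The write-once state machine above can be sketched directly, assuming (as the notes state) that ownership transfers to the reader on a remote read. All names are invented for illustration.

```python
# Sketch of the write-once protocol as described above. Illustrative
# only; real write-once hardware has additional states.

INVALID, CLEAN, DIRTY = 'INVALID', 'CLEAN', 'DIRTY'

class CPU:
    def __init__(self):
        self.state = INVALID
        self.value = None

class Machine:
    def __init__(self, n):
        self.cpus = [CPU() for _ in range(n)]
        self.memory = 0

    def read(self, i):
        c = self.cpus[i]
        if c.state != INVALID:
            return c.value
        # Miss goes on the bus; a DIRTY owner intercepts before memory.
        for o in self.cpus:
            if o.state == DIRTY:
                c.value, c.state = o.value, DIRTY  # reader becomes owner
                o.state = INVALID                  # owner drops its copy
                return c.value
        c.value, c.state = self.memory, CLEAN      # memory satisfies it
        return c.value

    def write(self, i, v):
        c = self.cpus[i]
        if c.state != DIRTY:
            # First write: invalidate all other copies, become owner.
            for j, o in enumerate(self.cpus):
                if j != i:
                    o.state = INVALID
        c.value, c.state = v, DIRTY   # subsequent writes hit cache only

m = Machine(3)
m.read(0)          # CPU 0 gets a CLEAN copy from memory
m.write(1, 42)     # CPU 1 becomes DIRTY owner; CPU 0 invalidated
print(m.read(2))   # owner intercepts: prints 42, ownership moves to CPU 2
print(m.cpus[1].state, m.cpus[2].state)   # INVALID DIRTY
```

Note that memory itself is never updated here, which is exactly what the questions above are probing.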
Ring Based Multiprocessor (MEMNET)
- Address divided into private
and shared.
- Shared memory is distributed.
- Token ring runs at 160 Mbps
- memory is divided into 32-byte blocks
- Each block has a home processor but can be cached elsewhere;
it need not reside on the home processor.
- Multiple read copies but only
one write copy (INVARIANT)
- Each processor has a block
table with
- VALID: if cached on
this machine
- EXCLUSIVE: if write is
allowed
- HOME: if this machine is the block's home
Memnet Protocol
- Read
- If cache VALID, local
access possible
- If not, wait for token
and send request on ring
- Upon receiving token,
processor checks if block cached
- If cached,
puts block in token
- Marks request
as satisfied
- sends token
- clears
exclusive bit (if set)
- Eventually the requester sees the token again with the desired block
INVARIANT: every block exists in at least one
processor
- If space needed, sends
non-homed block home
QUESTION: When? Where? How?
- Write
  - If local exclusive copy, just write
  - If cached for read, send invalidate message with the token.
    Upon a complete circuit, set the exclusive bit and write.
  - If not cached, send request/invalidate message with the token
    - First machine with the block sends it and invalidates
    - Other machines with copies invalidate
    - When the requester receives the block, it copies it, marks it
      exclusive, and writes
QUESTION: how
would you evaluate the effectiveness of this architecture?
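One way to start evaluating it is to model the protocol and count token circuits per reference. Below is a minimal sketch of the Memnet read path, assuming block tables with VALID/EXCLUSIVE bits as described above; the structures are invented for illustration.

```python
# Sketch of the Memnet read path: a request circulates with the token
# until some processor holding a valid copy satisfies it. Illustrative.

class Node:
    def __init__(self, nid):
        self.id = nid
        self.table = {}   # block -> {'valid':, 'exclusive':, 'data':}

    def cache(self, block, data, valid=True, exclusive=False):
        self.table[block] = {'valid': valid, 'exclusive': exclusive,
                             'data': data}

def ring_read(nodes, requester, block):
    me = nodes[requester].table.get(block)
    if me and me['valid']:
        return me['data']                  # local access, no token needed
    n = len(nodes)
    for step in range(1, n):               # request rides the token around
        node = nodes[(requester + step) % n]
        entry = node.table.get(block)
        if entry and entry['valid']:
            entry['exclusive'] = False     # a second copy now exists
            nodes[requester].cache(block, entry['data'])
            return entry['data']           # satisfied; token returns home
    raise KeyError('block cached nowhere -- violates the invariant')

nodes = [Node(i) for i in range(4)]
nodes[2].cache('B7', 'payload', exclusive=True)
print(ring_read(nodes, 0, 'B7'))            # prints: payload
print(nodes[2].table['B7']['exclusive'])    # False: exclusive bit cleared
```

Driving a model like this with a reference trace and counting circuits, invalidations, and evictions is one plausible evaluation approach.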
Gupta/Wild Ring Architecture
- Implements the data flow computational model
- Each computation packet circulates around the ring looking for its
data
- Each data packet knows how many computation packets it must
populate
- When a computation packet has all its data, it is enabled
- Any idle processor can execute an enabled computation packet.
- Computed data stays on its processor, waiting for computation
packets which need that value to pass by.
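The data flow model above can be sketched as a toy, single-circuit simulation; all names are invented for illustration.

```python
# Toy data-flow sketch: a computation packet circulates, picking up
# operands as it passes waiting data; once enabled, any idle processor
# may fire it. Illustrative only.

class ComputePacket:
    def __init__(self, op, needed):
        self.op = op                 # function to run once enabled
        self.needed = set(needed)    # operand names still missing
        self.operands = {}

    def offer(self, name, value):
        if name in self.needed:      # pick up matching data in passing
            self.operands[name] = value
            self.needed.remove(name)

    @property
    def enabled(self):
        return not self.needed

# Data packets waiting on their processors for computations to pass by
data = {'a': 3, 'b': 4}

pkt = ComputePacket(lambda a, b: a + b, ['a', 'b'])
for name, value in data.items():     # one circuit around the ring
    pkt.offer(name, value)

if pkt.enabled:                      # any idle processor may execute it
    print(pkt.op(**pkt.operands))    # prints: 7
```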
Switched MultiProcessors
When communications channels
saturate, add more (in parallel, as a tree, in a hierarchy).
DASH
- Cluster = 4 CPUs with snoopy cache and memory
- Clusters connected by an intercluster bus (mesh)
- memory is distributed, with each cluster holding 16 MB
- A directory kept at each cluster records who else has a copy
- state of each block can be uncached, clean, or dirty
- uncached and clean blocks are owned by the home cluster
- dirty blocks are owned by the cluster holding the one and only copy
- See figure 6-8, number of bus
transfers
READ: r = local request, R = global request, d = local data, D =
global data, s = state change to local home directory, S = state
change globally

Block State | R's cache | Intracluster cache | Home memory | Intercluster cache
------------|-----------|--------------------|-------------|-------------------
UNCACHED    | NA        | NA                 | 1R+1D+2s    | NA
CLEAN       | 0         | 1r+1d              | 1r+1R+1D+1s | NA
DIRTY       | 0         | 1r+1d+1D+2s        | NA          | 1r+2R+2D+2s
WRITE: n = number of cached copies

Block State | R's cache | Intracluster cache | Home memory    | Intercluster cache
------------|-----------|--------------------|----------------|-------------------
UNCACHED    | NA        | NA                 | 1R+1D+2s       | NA
CLEAN       | 2R+1s+nR  | 1r+1d+2R+1s+nR     | 1r+1R+1D+1s+nR | NA
DIRTY       | 0         | 1r+1d              | NA             | 1r+2R+1D+2s
Considerable overhead in memory for directories and bus
traffic.
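The directory bookkeeping can be sketched as follows; this is an illustrative model of a single directory entry, not the DASH hardware.

```python
# Illustrative DASH-style directory entry: the home cluster tracks a
# block's state (UNCACHED / CLEAN / DIRTY) and which clusters hold
# copies. Invented structures, for illustration only.

class Directory:
    def __init__(self):
        self.state = 'UNCACHED'
        self.sharers = set()     # clusters currently holding a copy

    def read(self, cluster):
        # A dirty copy would be fetched from its owner and written back
        # to home; either way the block ends up CLEAN with one more sharer.
        self.state = 'CLEAN'
        self.sharers.add(cluster)

    def write(self, cluster):
        # Invalidate all other sharers; the writer holds the one and
        # only (dirty) copy afterwards.
        invalidations = self.sharers - {cluster}
        self.state = 'DIRTY'
        self.sharers = {cluster}
        return invalidations     # n invalidations: the nR terms above

d = Directory()
d.read(0); d.read(1); d.read(2)
print(sorted(d.sharers))        # [0, 1, 2]
print(sorted(d.write(0)))       # [1, 2]: two sharers invalidated
print(d.state)                  # DIRTY
```

The sharer set is exactly the per-block memory overhead the notes mention, and the returned invalidation set is the source of the n-proportional bus traffic.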
NonUniform Memory Access (NUMA)
Just eliminate caches. Easy but
order of magnitude overhead on remote accesses.
Examples Cm* (Bus), BBN butterfly
(Switched)
Initial location of memory is
critical to performance.
But could remap periodically using
various adaptive algorithms
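A toy cost model makes the placement argument concrete; the 10x local-to-remote ratio below is an assumed figure for illustration, matching the "order of magnitude" claim above.

```python
# Toy NUMA page-placement cost model. The cost ratio is assumed for
# illustration; real machines vary.

LOCAL_COST, REMOTE_COST = 1, 10

def access_cost(page_home, accesses):
    """Total cost of a page's accesses given its home node.

    accesses: list of (cpu, count) pairs.
    """
    return sum(count * (LOCAL_COST if cpu == page_home else REMOTE_COST)
               for cpu, count in accesses)

accesses = [(0, 90), (1, 10)]      # CPU 0 touches the page far more often
print(access_cost(0, accesses))    # 90*1 + 10*10 = 190
print(access_cost(1, accesses))    # 90*10 + 10*1 = 910
# An adaptive remapper would migrate this page to CPU 0's memory.
```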
Consistency Models
A contract which specifies the
permissible interactions between competing processors over
communications channels with delays.
Strict Consistency
Any Read to a
memory location x returns the value stored by the most recent
write to x.
Assumes instantaneous
communications
QUESTION:
is absolute global time order sufficient?
How about GPS clocks to 100 nanoseconds?
Why
is strict consistency important anyway?
Consider two processes on a single processor. Since the
order in which they run is arbitrary, we can make no
guarantees about relative ordering even though absolute total
ordering is possible.
- So if I care about the order of reads and writes, I should control
it explicitly. If not, it is not an issue.
- What is important, perhaps, is causality.
- Consider that Read and Write are the only two "events" on a memory.
- Any Read by process i can potentially affect the result of a
subsequent Write by process i.
- Any Read of a memory location x is affected by a previous Write of
x by any process.
- People live with out-of-date information all the time.
- What exactly is a Read or a Write? When the CPU issues the
instruction? When at least one memory is accessed? When all copies
are updated?
Sequential Consistency
The result of any execution is
the same as if the operations of all processors were executed
in some sequential order, and the operations of each
individual processor appear in this sequence in the order
specified by its program. (LAMPORT)
Note similarity to serializability.
Implies that all processes see same
SEQUENCE of memory references
While strict consistency is nearly impossible
on a distributed system, sequential consistency is possible, but with
a penalty: r + w >= t, where r and w are the read and write times and
t is the transfer time between nodes.
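For tiny programs, Lamport's definition can be checked by brute force: enumerate every interleaving that preserves each process's program order and collect the results. The two-process example below (each writes one variable, then reads the other's) is illustrative.

```python
# Brute-force check of sequential consistency for a tiny program:
# an outcome is sequentially consistent iff some interleaving that
# preserves each process's program order produces it.

from itertools import permutations

progs = [[('w', 'x', 1), ('r', 'y')],      # process 0: x = 1; read y
         [('w', 'y', 1), ('r', 'x')]]      # process 1: y = 1; read x

def sc_outcomes():
    ops = [(p, i) for p, prog in enumerate(progs)
                  for i in range(len(prog))]
    results = set()
    for order in permutations(ops):
        # program order within each process must be preserved
        if any(order.index((p, 0)) > order.index((p, 1)) for p in (0, 1)):
            continue
        mem, reads = {'x': 0, 'y': 0}, []
        for p, i in order:
            op = progs[p][i]
            if op[0] == 'w':
                mem[op[1]] = op[2]
            else:
                reads.append((p, op[1], mem[op[1]]))
        results.add(tuple(sorted(reads)))
    return results

outs = sc_outcomes()
# Under sequential consistency, the two reads can never both return 0:
print(any(all(v == 0 for (_, _, v) in o) for o in outs))   # False
```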
Causal Consistency
Not time ordering but potential
information transfer
- a write of any variable
following a read in one process is potentially linked
- a read of some variable
following a write from any process is linked
Writes that are potentially
causally related must be seen by all processes in the same
order. Concurrent writes may be seen in a different order on
different machines.
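The "potentially linked" relation can be captured with vector clocks, one counter per process; writes whose clocks are incomparable are concurrent and may be seen in different orders. This is an illustrative sketch, not part of any protocol described above.

```python
# Vector-clock sketch of causal ordering: write A happens-before
# write B iff A's clock is componentwise <= B's and they differ.
# Clock values here are invented for illustration.

def happens_before(vc_a, vc_b):
    return all(a <= b for a, b in zip(vc_a, vc_b)) and vc_a != vc_b

w_x = (1, 0)   # P0 writes x
w_y = (1, 1)   # P1 read w_x first, then wrote y: causally after w_x
w_z = (0, 1)   # P1 wrote without reading w_x: concurrent with it

print(happens_before(w_x, w_y))   # True: all processes must agree
print(happens_before(w_x, w_z) or happens_before(w_z, w_x))
# False: concurrent writes, order may differ per machine
```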
Other Consistency Models
- PRAM Consistency
  - Writes done by a single processor are received by all other
    processors in the order issued
Weak Consistency: Need not propagate changes made inside a
critical section or atomic action. Synchronization variable:
when synchronized, all writes are propagated out and writes from
others brought in.
- Accesses to sync variable are
sequentially consistent
- no access to sync variable
until previous writes propagated
- no data access until all
previous access to sync variables are performed
Release Consistency: Acquire and
Release actions.
- Before ordinary access to
shared variable, all previous acquires must be
complete
- Before release, all previous
reads and writes by this process must be done
- acquire/release must be
processor consistent
Can be eager or lazy.
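The eager variant can be sketched as follows, assuming ordinary writes are buffered locally and pushed out only at release; all names are invented for illustration.

```python
# Sketch of eager release consistency: writes inside the critical
# section stay in a local buffer and become globally visible only at
# release(). Illustrative structures, not a real DSM implementation.

import threading

class ReleaseConsistentStore:
    def __init__(self):
        self.shared = {}               # the "globally visible" copy
        self.lock = threading.Lock()
        self.local = {}                # per-critical-section buffer

    def acquire(self):
        self.lock.acquire()            # previous acquires complete first
        self.local = dict(self.shared) # bring others' writes in

    def write(self, key, value):
        self.local[key] = value        # invisible to others until release

    def read(self, key):
        return self.local.get(key)

    def release(self):
        self.shared.update(self.local) # propagate all writes out
        self.lock.release()

store = ReleaseConsistentStore()
store.acquire()
store.write('x', 10)
store.release()                 # x becomes visible here, not before
store.acquire()
print(store.read('x'))          # 10
store.release()
```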
Entry Consistency: Associate
a set of shared variables with a synchronization lock. These
locks are owned and carry exclusive write permission.
- At acquire, all guarded
variables are brought up to date for that process
- There can be only one process
owning with exclusive mode
- Acquiring non-exclusive mode
implies update by the owner of exclusive mode.
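The per-lock update rule can be sketched as follows; only the variables guarded by a given lock are synchronized at acquire time. Names are invented for illustration.

```python
# Sketch of entry consistency: each lock guards a set of variables,
# and an acquire brings only those variables up to date from the
# previous owner's values. Illustrative only.

class EntryConsistentLock:
    def __init__(self, guarded_keys, backing):
        self.keys = guarded_keys
        self.backing = backing    # up-to-date values for guarded vars
        self.owner = None

    def acquire(self, process, local_view):
        for k in self.keys:       # sync only the guarded variables
            local_view[k] = self.backing[k]
        self.owner = process

    def release(self, local_view):
        for k in self.keys:       # publish guarded updates
            self.backing[k] = local_view[k]
        self.owner = None

backing = {'x': 0, 'y': 0}
lock_x = EntryConsistentLock({'x'}, backing)

p1_view = {'x': None, 'y': None}
lock_x.acquire('P1', p1_view)
p1_view['x'] = 5
lock_x.release(p1_view)

p2_view = {'x': None, 'y': None}
lock_x.acquire('P2', p2_view)
print(p2_view['x'])   # 5: guarded variable updated at acquire
print(p2_view['y'])   # None: y is not guarded by this lock
```

The payoff is reduced traffic: an acquire moves only the lock's associated variables, not every shared write since the last synchronization.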
Copyright chris wild 1996.
For problems or questions regarding this web site contact [Dr. Wild].
Last updated: October 22, 1996.