=========================================================================
Improving Memory Performance on Multiprocessor Systems

Xiaodong Zhang
College of William and Mary

A major challenge in designing computer systems and using them effectively is the memory performance bottleneck. With the rapid development of VLSI technology, processor speed has increased dramatically in the past decade, while memory speed has increased at a much slower pace. We will first survey technical developments in the field and show the limits of connectivity in computer systems. To overcome these limits, we should make a strong effort to fully use the cache and reduce the number of memory accesses at both the system and application levels.

We present a runtime approach to exploiting cache locality for loop-oriented scientific application programs with dynamic memory access patterns. It is complementary to a compiler approach. Guided by some simple application- and architecture-dependent hints, the runtime system generates task partitions based on locality optimizations at runtime. The objectives are to maximize data reuse in each cache and to minimize data sharing among the processors.

The effectiveness of this approach is evaluated using both simulation and measurements on two commercial SMP systems: an HP/Convex Exemplar S-class and a SUN Ultra-SPARCstation-20. We show that, for regular programs, this runtime approach achieves performance comparable to that of compiler-optimized code. We also show that it can significantly improve the execution performance of programs with dynamic memory access patterns, which are usually hard for compilers to optimize.