Enhancing the Performance of Unstructured Mesh codes

Collaborators:

Jinghua Fu

Objective:

Irregular computations such as unstructured mesh algorithms or sparse matrix algorithms achieve only a small fraction of the available peak performance on modern microprocessors. Understanding the reasons for this low performance is difficult on modern microprocessors since they issue multiple instructions in each clock cycle, and permit them to execute out of issue-order by exploiting multiple functional units and multiple levels of caches.

Approach:

We have initiated a study measuring the performance of these codes using hardware performance counters and simulation tools to first understand and then improve the performance. We have studied the performance of the kernel of an unstructured mesh code for solving Euler's equations. By measuring various performance metrics, we find that on an SGI architecture the performance is limited primarily by the number of loads and stores that can be issued in a clock cycle. By scheduling the code such that it makes effective use of the registers and functional units, we have been able to improve the performance of the kernel from about $40$ Mflops/second to $270$ Mflops/second, half the peak performance of a $270$ MHz MIPS R12000 processor that can graduate two floating point instructions per clock cycle. One surprise so far has been that for this code, the level $1$ and level $2$ caches play a relatively minor role in determining performance on the SGI. We are continuing our work to find general principles for improving the performance of irregular computations.

back to highlights page