Enhancing the Performance of Unstructured Mesh codes
Collaborators:
Jinghua Fu
Objective:
Irregular computations such as
unstructured mesh algorithms or sparse matrix algorithms
achieve only a small fraction of the available peak performance
on modern microprocessors. Understanding the reasons for this
low performance is difficult on modern microprocessors since
they issue multiple instructions in each clock cycle,
and permit them to execute out of issue-order by exploiting
multiple functional units and multiple levels of caches.
Approach:
We have initiated a study
measuring the performance of these codes using hardware performance counters
and simulation tools to first understand and then improve the performance.
We have studied the performance of the kernel of an unstructured mesh
code for solving Euler's equations.
By measuring various performance metrics, we find that on an SGI architecture
the performance is limited primarily by the number of
loads and stores that can be issued in a clock cycle.
By scheduling the code such that it makes effective use of the
registers and functional units, we have been able to improve the performance
of the kernel from about $40$ Mflops/second to $270$ Mflops/second,
half the peak performance of a $270$ MHz MIPS R12000 processor
that can graduate two floating point instructions per clock cycle.
One surprise so far has been that for this code,
the level $1$ and level $2$ caches play a relatively minor role
in determining performance on the SGI.
We are continuing our work to find general principles
for improving the performance of irregular computations.
back to
highlights page