The BLAS Process





 

As mentioned before, an efficient implementation of the BLAS masks the effects of the processor memory hierarchy and frees the programmer from tuning these basic kernels locally. The performance of the BLAS depends heavily on the number of memory references per floating point operation. This ratio naturally sorts the BLAS into three levels, where routines belonging to the same level usually reach similar execution rates. Consequently, as far as performance analysis is concerned, a BLAS process is able to perform only three instructions, one for each of the three BLAS levels. The execution times per floating point operation of each of these instructions are then denoted by , with .
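To make the three-instruction view concrete, here is a minimal sketch in Python. It uses the usual leading-order operation counts for one representative routine per level (a vector update, a matrix-vector product, and a matrix-matrix product on square operands of order n) to compute the ratio of memory references to floating point operations, then estimates an execution time as the flop count times a per-level time per flop. The function names and the values in TIME_PER_FLOP are hypothetical placeholders chosen for illustration; they are not measurements and do not come from this report.

```python
# Hypothetical execution times per floating point operation (seconds),
# one per BLAS level. Placeholder values for illustration only.
TIME_PER_FLOP = {1: 2.0e-8, 2: 1.0e-8, 3: 2.5e-9}


def axpy_counts(n):
    """Level 1, y <- alpha*x + y: returns (flops, memory references)."""
    return 2 * n, 3 * n


def gemv_counts(n):
    """Level 2, y <- alpha*A*x + beta*y, A of order n (leading terms only)."""
    return 2 * n * n, n * n


def gemm_counts(n):
    """Level 3, C <- alpha*A*B + beta*C, order n (leading terms only)."""
    return 2 * n ** 3, 4 * n * n


def memory_refs_per_flop(counts):
    """Ratio that sorts the routines into the three BLAS levels."""
    flops, refs = counts
    return refs / flops


def modeled_time(level, flops):
    """Time of one 'instruction': flop count times the level's time per flop."""
    return flops * TIME_PER_FLOP[level]


if __name__ == "__main__":
    n = 1000
    for level, counts in ((1, axpy_counts(n)), (2, gemv_counts(n)), (3, gemm_counts(n))):
        flops, _ = counts
        print(level, memory_refs_per_flop(counts), modeled_time(level, flops))
```

For the Level 1 and Level 2 routines the ratio is a constant (3/2 and 1/2 here), while for the Level 3 routine it shrinks as 2/n. This is why routines within a level tend to reach similar execution rates, and why a single time per flop per level is a reasonable simplification for performance analysis.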



Antoine Petitet
Fri Mar 31 13:01:26 EST 1995