Designing vectorizable algorithms in linear algebra is usually straightforward. Indeed, for many computations there are several variants, all vectorizable, but with different performance characteristics (see, for example, ). Linear algebra algorithms can approach the peak performance of many machines -- principally because peak performance depends on some form of chaining of vector addition and multiplication operations, and this is exactly what the algorithms require.
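As an illustration of the multiply-then-add pattern referred to above, the following minimal C sketch shows two typical inner loops of linear algebra codes: a scaled vector update and an inner product. The function names vec_axpy and vec_dot are hypothetical, chosen for this example only. In both loops every multiplication feeds directly into an addition, which is the shape of computation that chained vector hardware is built to exploit.

```c
#include <stddef.h>

/* Hypothetical helper: y <- alpha*x + y.
 * Each element needs one multiply whose result feeds one add. */
static void vec_axpy(size_t n, double alpha, const double *x, double *y)
{
    for (size_t i = 0; i < n; ++i)
        y[i] += alpha * x[i];
}

/* Hypothetical helper: returns the inner product of x and y.
 * Again one multiply feeding one add per element. */
static double vec_dot(size_t n, const double *x, const double *y)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        sum += x[i] * y[i];
    return sum;
}
```

Calling vec_axpy(3, 2.0, x, y) with x = {1, 2, 3} and y = {10, 20, 30} leaves y = {12, 24, 36}. Whether a given compiler and machine actually chain the two operations is outside the scope of this sketch.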
However, when the algorithms are realized in straightforward Fortran 77 code, performance may fall well short of the expected level, usually because vectorizing Fortran compilers fail to minimize the number of memory references -- that is, the number of vector load and store operations. This brings us to the next factor.
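To make the count of vector loads and stores concrete, here is a hedged sketch in C (used in place of Fortran 77 for illustration) of two ways to form y <- y + A*x for a column-major matrix. The names mem_refs, gemv_by_columns and gemv_two_columns are hypothetical, and the counters are a simplified model that assumes each inner loop runs as one vector operation and that the compiler does not keep y in a vector register between loops. It is not a measurement of any real compiler.

```c
#include <stddef.h>

/* Hypothetical tally of modelled vector loads and stores. */
typedef struct { long loads; long stores; } mem_refs;

/* y <- y + A*x, one column at a time. A is m-by-n, column-major.
 * Modelled cost per column: load the column, load y, store y. */
static mem_refs gemv_by_columns(size_t m, size_t n, const double *a,
                                const double *x, double *y)
{
    mem_refs r = {0, 0};
    for (size_t j = 0; j < n; ++j) {
        const double *col = a + j * m;
        for (size_t i = 0; i < m; ++i)
            y[i] += x[j] * col[i];
        r.loads += 2;
        r.stores += 1;
    }
    return r;
}

/* Same result, two columns per pass, so y is loaded and stored
 * half as often. Scalar loads of x[j] are not counted. */
static mem_refs gemv_two_columns(size_t m, size_t n, const double *a,
                                 const double *x, double *y)
{
    mem_refs r = {0, 0};
    size_t j = 0;
    for (; j + 1 < n; j += 2) {
        const double *c0 = a + j * m;
        const double *c1 = a + (j + 1) * m;
        for (size_t i = 0; i < m; ++i)
            y[i] += x[j] * c0[i] + x[j + 1] * c1[i];
        r.loads += 3;   /* two columns of A, plus y */
        r.stores += 1;  /* y */
    }
    if (j < n) {        /* leftover column when n is odd */
        const double *col = a + j * m;
        for (size_t i = 0; i < m; ++i)
            y[i] += x[j] * col[i];
        r.loads += 2;
        r.stores += 1;
    }
    return r;
}
```

Under this model the column-at-a-time form performs 3n vector memory references for n columns, while the two-column form performs about 2n. The arithmetic is the same in both; only the traffic to and from memory changes.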