HPL Frequently Asked Questions

What problem size N should I run ?

To find the best performance of your system, aim for the largest problem size that fits in memory. The amount of memory used by HPL is essentially the size of the coefficient matrix. For example, if you have 4 nodes with 256 MB of memory on each, this corresponds to 1 GB total, i.e., about 134 M double precision (8-byte) elements. The square root of that number is 11585. One definitely needs to leave some memory for the OS as well as for other things, so a problem size of 10000 is likely to fit. As a rule of thumb, 80 % of the total amount of memory is a good guess. If the problem size you pick is too large, swapping will occur and performance will drop. If multiple processes are spawned on each node (say you have 2 processors per node), what counts is the amount of memory available to each process.
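The sizing arithmetic above can be sketched in a few lines of Python. This is only an illustration, not part of HPL: the function name max_problem_size and its parameters are hypothetical, and it assumes that memory is counted in binary units (256 MB = 256 * 2**20 bytes) and that the N x N matrix of 8-byte elements is the only significant consumer of memory.

```python
import math

def max_problem_size(nodes, bytes_per_node, fraction=0.8):
    """Estimate the largest N such that an N x N matrix of 8-byte
    doubles fits in `fraction` of the total memory.

    Hypothetical helper for illustration; HPL does not provide it.
    """
    total_bytes = nodes * bytes_per_node
    # Number of double precision elements that fit in the usable memory.
    elements = int(total_bytes * fraction) // 8
    # The matrix is square, so N is the integer square root.
    return math.isqrt(elements)

# 4 nodes with 256 MB each, using all of the memory (upper bound).
print(max_problem_size(4, 256 * 2**20, fraction=1.0))  # 11585
# Same system, keeping 20 % of the memory for the OS and other things.
print(max_problem_size(4, 256 * 2**20))                # 10362
```

With the 80 % rule of thumb the estimate comes out slightly above 10000, which is consistent with the problem size suggested above.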

What block size NB should I use ?

HPL uses the block size NB for the data distribution as well as for the computational granularity. From a data distribution point of view, the smaller NB, the better the load balance. You definitely want to stay away from very large values of NB. From a computation point of view, too small a value of NB may limit the computational performance by a large factor because almost no data reuse will occur in the highest level of the memory hierarchy. The number of messages will also increase. Efficient matrix-multiply routines are often internally blocked. Small multiples of this blocking factor are likely to be good block sizes for HPL. The bottom line is that "good" block sizes are almost always in the [32 .. 256] interval. The best values depend on the computation / communication performance ratio of your system. To a much lesser extent, the problem size matters as well. Say, for example, you empirically found that 44 was a good block size with respect to performance. 88 or 132 are likely to give slightly better results for large problem sizes because of a slightly higher flop rate.
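As a rough illustration of the "small multiples" advice, the following Python sketch lists candidate NB values to try. The function name candidate_block_sizes and the max_multiple cutoff are assumptions made for this example; the [32 .. 256] bounds come from the text above. The candidates still have to be timed on the actual system.

```python
def candidate_block_sizes(blocking_factor, low=32, high=256, max_multiple=3):
    """List small multiples of a matrix-multiply blocking factor that
    fall inside the [low .. high] interval.

    Hypothetical helper for illustration; `max_multiple` is an arbitrary
    choice for what counts as a "small" multiple.
    """
    candidates = []
    for k in range(1, max_multiple + 1):
        nb = k * blocking_factor
        if low <= nb <= high:
            candidates.append(nb)
    return candidates

# Starting from an empirically good block size of 44.
print(candidate_block_sizes(44))  # [44, 88, 132]
```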

What process grid ratio P x Q should I use ?

This depends on the physical interconnection network you have. Assuming a mesh or a switch, HPL "likes" a 1:k ratio with k in [1..3]. In other words, P and Q should be approximately equal, with Q slightly larger than P. Examples: 2 x 2, 2 x 4, 2 x 5, 3 x 4, 4 x 4, 4 x 6, 5 x 6, 4 x 8 ... If you are running on a simple Ethernet network, there is only one wire through which all the messages are exchanged. On such a network, the performance and scalability of HPL are strongly limited, and very flat process grids are likely to be the best choices: 1 x 4, 1 x 8, 2 x 4 ...
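The 1:k rule for a mesh or a switch can be sketched as a small enumeration in Python. The function name candidate_grids is hypothetical and not part of HPL; the sketch simply lists the factorizations P x Q of a given process count with P <= Q and Q at most 3 times P, most square first. It does not cover the simple Ethernet case, where very flat grids are preferred.

```python
import math

def candidate_grids(nprocs, max_ratio=3):
    """List process grids (P, Q) with P * Q == nprocs, P <= Q and
    Q <= max_ratio * P, ordered from most square to flattest.

    Hypothetical helper for illustration of the 1:k rule, k in [1..3].
    """
    grids = []
    # P never needs to exceed sqrt(nprocs) because P <= Q.
    for p in range(1, math.isqrt(nprocs) + 1):
        if nprocs % p == 0:
            q = nprocs // p
            if q <= max_ratio * p:
                grids.append((p, q))
    # Larger P means a more square grid, so reverse the order.
    return grids[::-1]

print(candidate_grids(24))  # [(4, 6), (3, 8)]
print(candidate_grids(10))  # [(2, 5)]
```

For a prime number of processes other than 2 or 3 the list is empty, since the only factorization is 1 x nprocs; in that case using fewer processes may be the more practical choice.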

What about the one processor case ?

HPL has been designed to perform well for large problem sizes on hundreds of nodes and more. The software also works on one node, and for large problem sizes one can usually achieve pretty good performance on a single processor as well. For small problem sizes, however, the overhead due to message passing, local indexing and so on can be significant.

Why so many options in HPL.dat ?

There are quite a few reasons. First off, these options are useful to determine what matters and what does not on your system. Second, HPL is often used in the context of early evaluation of new systems. In such a case, usually not everything is working quite right, and it is convenient to be able to vary these parameters without recompiling. Finally, every system has its own peculiarities, and one will likely want to determine the best set of parameters empirically. In any case, one can always follow the advice provided in the tuning section of this document and not worry about the complexity of the input file.

Can HPL be outperformed ?

Certainly. There is always room for performance improvements. Specific knowledge about a particular system is always a source of performance gains. Even from a generic point of view, better algorithms or more efficient formulations of the classic ones are potential winners.

Execution dies with "Floating point exception" on Compaq Alpha 21264 processor-based nodes

Some older versions of the cxml BLAS library seem to require operand addresses to be aligned on 64- or 96-byte boundaries. This error was tracked down to the Level 2 BLAS DGER routine. It is rather difficult to reproduce in a small example program since the operands A, X and Y need to be stored at specific places in memory for the problem to occur. There are at least 4 possible independent fixes (applying one of them should be enough):

Get the latest cxml library.
Do not use the right-looking variant of the panel factorization: Line 14 of the input data file HPL.dat should not contain a 2. Use only 0s or 1s on that line.
Link in the Fortran 77 reference implementation of that routine. Download the source from netlib and compile it. Move the object file "dger.o" into the hpl/testing/ptest/<arch> directory. Edit the Makefile there and add "dger.o" to the list of object files. In that directory, run "make clean; make". This should build a patched executable (hpl/bin/<arch>/xhpl).
Another possible fix, which does not always seem to work, is to set the value of the ALIGN parameter in the input data file hpl/bin/<arch>/HPL.dat to a multiple of 8 (or 12) larger than or equal to 8 (or 12).
