We first describe the meaning of each line of this input file below. Finally, a few useful experimental guidelines for setting up the file are given at the end of this page.

HPL Linpack benchmark input file

Innovative Computing Laboratory, University of Tennessee

HPL.out output file name (if any)

6 device out (6=stdout,7=stderr,file)

3 # of problems sizes (N)

3000 6000 10000 Ns

5 # of NBs

80 100 120 140 160 NBs

2 # of process grids (P x Q)

1 2 Ps
6 8 Qs

4 2 Ps
13 8 Qs
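
As a minimal sketch of how these two lines can be read, the Python snippet below pairs the Ps and Qs entries position by position, so that the first example describes a 1 x 6 and a 2 x 8 grid. The positional pairing is stated here as an assumption about the file layout, and the helper name `process_grids` is hypothetical (it is not part of HPL). The snippet also reports the largest P*Q product, that is, the number of processes occupied by the largest grid.

```python
def process_grids(ps, qs):
    """Pair the Ps and Qs entries positionally into (P, Q) grids."""
    if len(ps) != len(qs):
        raise ValueError("Ps and Qs must list the same number of values")
    return list(zip(ps, qs))


# Values taken from the two example input fragments above.
first = process_grids([1, 2], [6, 8])
second = process_grids([4, 2], [13, 8])

for name, grids in (("first", first), ("second", second)):
    # The largest grid determines how many processes are occupied.
    largest = max(p * q for p, q in grids)
    print(name, grids, "largest grid uses", largest, "processes")
```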

16.0 threshold

-16.0 threshold

The remaining lines specify algorithmic features. xhpl runs every possible combination of them for each combination of problem size, block size, and process grid. This is handy when looking for an "optimal" set of parameters. To understand these parameters a little better, let us first say a few words about the algorithm implemented in HPL. Basically, it is a right-looking version with row-partial pivoting. The panel factorization is recursive and based on matrix-matrix operations, dividing the panel into NDIV subpanels at each step. This part of the panel factorization is denoted below by "recursive panel fact. (RFACT)". The recursion stops when the current panel has NBMIN columns or fewer. At that point, xhpl uses a factorization based on matrix-vector operations, denoted below by "PFACTs". Classic recursion would then use NDIV=2, NBMIN=1. There are essentially three numerically equivalent variants of the LU factorization algorithm (left-looking, Crout, and right-looking). In HPL, any of them can be chosen for the RFACT as well as for the PFACT. The following lines of HPL.dat set these parameters.

3 # of panel fact
0 1 2 PFACTs (0=left, 1=Crout, 2=Right)
4 # of recursive stopping criterium
1 2 4 8 NBMINs (>= 1)
3 # of panels in recursion
2 3 4 NDIVs
3 # of recursive panel fact.
0 1 2 RFACTs (0=left, 1=Crout, 2=Right)

2 # of panel fact
2 0 PFACTs (0=left, 1=Crout, 2=Right)
2 # of recursive stopping criterium
4 8 NBMINs (>= 1)
1 # of panels in recursion
2 NDIVs
1 # of recursive panel fact.
2 RFACTs (0=left, 1=Crout, 2=Right)
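
Because xhpl runs all combinations of these settings, the number of runs grows quickly. The following Python sketch counts the panel factorization combinations for the two examples above. The helper name `panel_combinations` is hypothetical, and the order in which the tuples are produced is only illustrative; it is not claimed to match the order in which xhpl iterates.

```python
from itertools import product


def panel_combinations(pfacts, nbmins, ndivs, rfacts):
    """Return every (PFACT, NBMIN, NDIV, RFACT) tuple for the given lists."""
    return list(product(pfacts, nbmins, ndivs, rfacts))


# First example: 3 PFACTs, 4 NBMINs, 3 NDIVs, 3 RFACTs.
first = panel_combinations([0, 1, 2], [1, 2, 4, 8], [2, 3, 4], [0, 1, 2])
# Second example: 2 PFACTs, 2 NBMINs, 1 NDIV, 1 RFACT.
second = panel_combinations([2, 0], [4, 8], [2], [2])

print(len(first))   # 3 * 4 * 3 * 3 = 108
print(len(second))  # 2 * 2 * 1 * 1 = 4
```

These counts apply to each problem size, block size, and process grid combination, so the first example multiplies every such combination by 108.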

In the main loop of the algorithm, the current panel of columns is broadcast in process rows using a virtual ring topology. HPL offers various choices, and one most likely wants to use the increasing-ring modified variant, encoded as 1. Variants 3 and 4 are also good choices.

1 # of broadcast
1 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)

2 # of broadcast
0 4 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)

1 # of lookahead depth
1 DEPTHs (>=0)

2 # of lookahead depth
0 1 DEPTHs (>=0)

1 SWAP (0=bin-exch,1=long,2=mix)
60 swapping threshold

2 SWAP (0=bin-exch,1=long,2=mix)
60 swapping threshold

0 L1 in (0=transposed,1=no-transposed) form

0 U in (0=transposed,1=no-transposed) form

1 Equilibration (0=no,1=yes)

8 memory alignment in double (> 0)

- Figure out a good block size for the matrix multiply
routine. The best method is to try a few out. If you happen
to know the block size used by the matrix-matrix multiply
routine, a small multiple of that block size will do fine.
This particular topic is discussed in the
FAQs section.

- The process mapping should not matter if the nodes of
your platform are single processor computers. If these nodes
are multi-processors, a row-major mapping is recommended.

- HPL likes "square" or slightly flat process grids. Unless
you are using a very small process grid, stay away from the
1-by-Q and P-by-1 process grids. This particular topic is also
discussed in the FAQs section.

- Panel factorization parameters: a good starting point for
lines 14-21 is the following:
1 # of panel fact
1 PFACTs (0=left, 1=Crout, 2=Right)
2 # of recursive stopping criterium
4 8 NBMINs (>= 1)
1 # of panels in recursion
2 NDIVs
1 # of recursive panel fact.
2 RFACTs (0=left, 1=Crout, 2=Right)

- Broadcast parameters: at this time it is far from obvious
to me what the best setting is, so I would probably try them
all. If I had to guess, I would probably start with the
following for lines 22-23:
2 # of broadcast
1 3 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
The best broadcast depends on your problem size and hardware
performance. My take is that 4 or 5 may be competitive for
machines featuring very fast nodes compared to the network.

- Look-ahead depth: as mentioned above 0 or 1 are likely to
be the best choices. This also depends on the problem size
and machine configuration, so I would try "no look-ahead (0)"
and "look-ahead of depth 1 (1)". That is for lines 24-25:
2 # of lookahead depth
0 1 DEPTHs (>=0)

- Swapping: only one of the three algorithms can be selected
in the input file. Theoretically, mix (2) should win; however,
long (1) might just be good enough. The difference should be
small between those two assuming a swapping threshold of the
order of the block size (NB) selected. If this threshold is
very large, HPL will use bin_exch (0) most of the time and if
it is very small (< NB) long (1) will always be used. In
short, and assuming the block size (NB) used is say 60, I
would choose for lines 26-27:
2 SWAP (0=bin-exch,1=long,2=mix)
60 swapping threshold
I would also try the long variant. For a very small number of
processes in every column of the process grid (say < 4), very
little performance difference should be observable.

- Local storage: I do not think line 28 matters. When in
doubt, pick 0. Line 29 is more important. It controls how the
panel of rows should be stored. No doubt 0 is better. The
caveat is that in that case the matrix-multiply function is
called with ( Notrans, Trans, ... ), that is C := C - A B^T.
Unless the computational kernel you are using has a very poor
(with respect to performance) implementation of that case, and
is much more efficient with ( Notrans, Notrans, ... ), just
pick 0 as well. So, my choice:
0 L1 in (0=transposed,1=no-transposed) form
0 U in (0=transposed,1=no-transposed) form

- Equilibration: It is hard to tell whether equilibration
should always be performed or not. Not knowing much about the
random matrix generated and because the overhead is so small
compared to the possible gain, I turn it on all the time.
1 Equilibration (0=no,1=yes)

- For alignment, 4 should be plenty, but just to be safe,
one may want to pick 8 instead.
8 memory alignment in double (> 0)
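
The suggestions above can be collected into a single fragment. The Python sketch below renders them as the tuning lines of an input file, numbered 14 through 31 to match the line references used in the guidelines. The helper name `render_tuning_lines` and the column width are hypothetical choices for illustration; the values themselves are the starting points suggested above, not measured optima.

```python
# (value, label) pairs in file order, taken from the guidelines above.
SUGGESTED = [
    ("1", "# of panel fact"),
    ("1", "PFACTs (0=left, 1=Crout, 2=Right)"),
    ("2", "# of recursive stopping criterium"),
    ("4 8", "NBMINs (>= 1)"),
    ("1", "# of panels in recursion"),
    ("2", "NDIVs"),
    ("1", "# of recursive panel fact."),
    ("2", "RFACTs (0=left, 1=Crout, 2=Right)"),
    ("2", "# of broadcast"),
    ("1 3", "BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)"),
    ("2", "# of lookahead depth"),
    ("0 1", "DEPTHs (>=0)"),
    ("2", "SWAP (0=bin-exch,1=long,2=mix)"),
    ("60", "swapping threshold"),
    ("0", "L1 in (0=transposed,1=no-transposed) form"),
    ("0", "U in (0=transposed,1=no-transposed) form"),
    ("1", "Equilibration (0=no,1=yes)"),
    ("8", "memory alignment in double (> 0)"),
]


def render_tuning_lines(entries, width=12):
    """Format each (value, label) pair as one input-file line.

    The value is left-justified in a fixed-width column; the width is an
    arbitrary choice made here for readability.
    """
    return "\n".join(f"{value:<{width}} {label}" for value, label in entries)


fragment = render_tuning_lines(SUGGESTED)

# Print with the file line numbers used in the guidelines (14 to 31).
for number, line in enumerate(fragment.splitlines(), start=14):
    print(f"{number:>2}: {line}")
```

The swapping threshold of 60 follows the assumption made above that the block size (NB) is about 60; adjust it together with NB.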