XBLAS - Extra Precise Basic Linear Algebra Subroutines

1. Introduction

This library of routines is part of a reference implementation for the Dense and Banded BLAS routines, along with their Extended and Mixed Precision versions, as documented in Chapters 2 and 4 of the new BLAS Standard, which is available from: http://www.netlib.org/blas/blast-forum/.

EXTENDED PRECISION is only used internally; the input and output arguments remain the same as in the existing BLAS. At present, we only allow Single, Double, or Extra internal precision. Extra precision is implemented as double-double precision (128-bit total, 106-bit significand). The routines for the double-double precision basic arithmetic operations +, -, *, / were developed by David Bailey.

We have designed all our routines assuming that single precision arithmetic is actually done in IEEE single precision (32 bits) and that double precision arithmetic is actually done in IEEE double precision (64 bits). The routines also pass our tests on an Intel machine with 80-bit floating point registers.

MIXED PRECISION permits some input/output arguments to be of different types (mixing real and complex) or precisions (mixing single and double).

The purpose of this implementation is to do a proof of concept implementation, showing that the considerable complexity of the specification is actually implementable and testable with a manageable amount of software. We have not attempted to optimize performance, but our code should be as good as straightforward but careful code written by hand.

2. Download and Copyright

Copyright

XBLAS is a freely-available software package. It is available from netlib via anonymous ftp and the World Wide Web at http://www.netlib.org/xblas .
Thus, it can be included in commercial software packages. We only ask that proper credit be given to the authors.
Like all software, it is copyrighted ( LICENSE ) . It is not trademarked, but we do ask the following:
If you modify the source for these routines we ask that you change the name of the routine and comment the changes made to the original.
We will gladly answer any questions regarding the software. If a modification is done, however, it is the responsibility of the person who modified the routine to provide support.

3. Content

The BLAS Standard defines language bindings for Fortran 95, Fortran 77, and C. Here, we have only implemented the C version and provided a method for binding to one Fortran 77 ABI.

In this initial release, we provide the following 11 routines:

Level 1

DOT (Inner product)
SUM (Sum)
AXPBY (Scaled vector accumulation)
WAXPBY (Scaled vector addition)

Level 2

GEMV (General matrix vector product)
GBMV (Banded matrix vector product)
SYMV (Symmetric matrix vector product)
SBMV (Symmetric banded matrix vector product)
SPMV (Symmetric matrix vector product, packed format)
HEMV (Hermitian matrix vector product)
HBMV (Hermitian banded matrix vector product)
HPMV (Hermitian matrix vector product, packed format)
GE_SUM_MV (Summed matrix-vector product)
TRSV (Triangular solve)

Level 3

GEMM (General matrix matrix product)
SYMM (Symmetric matrix matrix product)
HEMM (Hermitian matrix matrix product)

All have passed our systematic testing of all possible combinations of mixed and extended precision. We will eventually include everything in the intersection of Chapter 2 and Chapter 4 with systematic testing.

3.1 Directory Structure

This release contains the following directories:

doc/          - technical report

m4/           - Directory where m4 macro files and support
                routines are stored
m4/dot       \
m4/sum        } Directories for each function
m4/,..       /
m4/test-dot  \
m4/test-sum   } Directories for each test function
m4/...       /

src/          - Directory where C code is stored
src/dot      \
src/sum       } Target directories for C code
src/...      /

testing/      - Directory where C code for testing is stored
testing/test-dot  - DOT test code and results
testing/test-sum  - SUM test code and results
testing/...

4. Installation

The reference XBLAS are built similarly to the reference BLAS and LAPACK. The current build system produces a static libxblas.a.

You need to provide a make.inc file in the source directory that defines the compiler, optimization flags, and options for building a Fortran→C bridge. Some examples are provided.

Alternatively, you can use the configure script to attempt to produce a make.inc that is appropriate for your system with your C (and optionally, Fortran) compiler. You need to issue ./configure in the top-most directory. This will produce a make.inc file that you should check before proceeding. To specify a specific compiler(s) to use, you will need to do something like:

     CC=my_c_compiler FC=my_fortran_compiler ./configure

M4 is not necessary for the distributed archive; it is only necessary if you modify the generator sources under m4/. See README.devel for more information.

The Fortran→C bridge uses details of a specific toolchain's binary interface, in particular how the Fortran compiler mangles names. See src/f2c-bridge.h for the available options. Most compilers can support different name mangling schemes; be sure to use the SAME naming options for all your Fortran code.

The Fortran→C bridge is included in libxblas.a. Each of the bridge's object files matches *-f2c.o, so you can extract them with ar if you need to share one libxblas.a between multiple Fortran compilers. Example steps to strip the Fortran→C routines from libxblas.a and place them in a separate libxblas-myfortran.a are as follows:

     ar t libxblas.a |fgrep -- -f2c.o | xargs ar x libxblas.a
     ar ru libxblas-myfortran.a *-f2c.o
     ar x libxblas.a *-f2c.o
     rm *-f2c.o

5. Code Generation with M4 macro processor

In the existing BLAS, there are usually 4 routines associated with each operation. All input, output, and internal variables are single or double precision and real or complex. But under the new extended and mixed precision rules (see Chapter 4 for details), the input, output and internal variables may have different precisions and types. Therefore, the combination of all these types results in many more routines associated with each operation. For example, DOT will have 32 routines altogether, 4 “standard” versions (from Chapter 2) and 28 mixed and extended precision versions (from Chapter 4). In addition, the 16 versions with extended precision support up to three internal precisions that can be chosen at runtime. We have automated the code and test code generation as much as possible. We use the M4 macro processor to facilitate this task.

The idea is to define a macro for each fundamental operation. The macro's argument list contains the variables, accompanied by their types and precisions. For example, for the operation c ← a + b, we define the following macro:

ADD(c, c_type, a, a_type, b, b_type)

where, x_type can be one of:

real_single
real_double
real_extra
complex_single
complex_double
complex_extra

Inside the macro body, we use an “if-test” on c_type, a_type and b_type, to generate the appropriate code segment for “plus”. This is similar to operator overloading in C++; but we do it manually. All these if-tests are evaluated at macro-evaluation time, and do not appear in the executable code. Indeed, our goal was to produce efficient C code, which means minimizing branches in inner loops.

Other macros include SUB, MUL, DIV, DECLARE (variable declaration), ASSIGN, etc.

Since these macros are shared among all the BLAS routines, we put them in a common header file, named cblas.m4.h. Each BLAS routine also has its own macro file, such as dot.m4, spmv.m4 and gbmv.m4, to generate the specific functions. All the macro files are located in the m4/ subdirectory.

For example, the inner loop of the M4 macro for the dot product is simply as follows (the M4 parameters $2, $3, and $4 are types):

        for (i = 0; i < n; ++i) {
            GET_VECTOR_ELEMENT(x_ii, x_i, ix, $2)
                  /* put ix-th element of vector x into x_ii */
            GET_VECTOR_ELEMENT(y_ii, y_i, iy, $3)
                  /* put iy-th element of vector y into y_ii */
            MUL(prod, $4, x_ii, $2, y_ii, $3) /* prod = x[i]*y[i] */
            ADD(sum, $4, sum, $4, prod, $4) /* sum = sum+prod */
            ix += incx;
            iy += incy;
        } /* endfor */

The motivation for this macro-based approach is simplifying software engineering. For example, the file dot.m4 of M4 macros for the dot product is 401 lines long (245 non-comment lines) but expands into 11145 lines in 32 C subroutines implementing different versions of DOT. Similarly the macros for TRSV expand from 732 lines (454 non-comment lines) to 37099 lines in 32 C subroutines. (This does not count the shared M4 macros in the file cblas.m4.h.)

6. Testing

The goal of the testing code is to validate the underlying implementation. The challenges are twofold: First, we must thoroughly test routines claiming to use extra precision internally, where the test code is not allowed to declare any extra precision variables or use any other extra precision facilities not available to the code being tested. This requires great care in generating test data. Second, we must use M4 to automatically generate the many versions of test code needed for the many versions of the code being tested.

For each BLAS routine, we perform the following steps in the test code:

Generate input scalars, vectors and arrays, according to the routine's specification, so that the result exposes the internal precision actually used.
Call the BLAS routine
For each output, compute a “test ratio” of the computed error to the theoretical error bound, i.e.,
```
| Computed_value - "True_value" | / Error_Bound
```

By design, the test ratio should be at most 1. A larger ratio indicates that the computed result is either completely wrong, or not as accurate as claimed in the specification.

The following section will discuss how we generate “good” inputs in order to reveal the internal precisions actually used. For details, see the paper in file doc/report.ps.

6.1 Testing DOT

DOT performs the following function:

r <- beta * r_in + alpha * (SUM_{i=1,n} x_i*y_i)

Assume that the result r_computed is computed as follows

precision eps_int internally,
precision eps_out when the final result is rounded on output,
underflow threshold UN_int internally,
underflow threshold UN_out on output

and that additionally we compute a very accurate approximation r_acc with

precision eps_acc = 2^-106 (double-double format)
underflow threshold UN_acc = 2^-968

Then the error bound satisfies

|r_computed-r_acc| <= (n+2)(eps_int + eps_acc)*S + U + eps_out*|r_acc| (*)
                    = Scale

where

S = |alpha| * (SUM_{i=1,n} |x_i|*|y_i|) + |beta|*|r_in|
U = (|alpha|*n+2)*(UN_int + UN_acc) + UN_out

Thus, we can confirm that r_computed has been computed as accurately as claimed (i.e. with internal precision defined by eps_int and UN_int) by testing whether the

test ratio = |r_computed - r_acc| / Scale

is at most 1. Suppose that no underflow occurs, and that

eps_intX >> eps_int >= eps_acc.

where eps_intX is the internal precision actually used in some buggy version of DOT that we want to test. Then we can expect the test ratio to be as large as

              (n+2)*(eps_intX + eps_acc)*S + eps_out*|r_acc|
test ratio ~  ----------------------------------------------
              (n+2)*(eps_int  + eps_acc)*S + eps_out*|r_acc|

If we can make r_acc very small, then this ratio will be roughly

test ratio =  eps_intX / eps_int >> 1

which means the bug in DOT will be detected, and in fact the test ratio will actually tell us how much internal precision we effectively used.

Thus our goal is to pick test data alpha, x(1:n), y(1:n), beta and r_in to make |r_acc| as small as possible, MUCH smaller than S, so that eps_int term dominates on the right of the inequality (*). Otherwise eps_int will be invisible in the error bound, then we cannot tell what precision is used internally.

In our test generator, we choose input data alpha, beta, r_in, x(1:n) and y(1:n) judiciously in order to cause as much cancellation in r as possible.

6.2 Choosing input data and computing r_acc

The general approach is to choose some of the input values of x(i) and y(i) so that the exact (partial) dot product of these values has a lot of nonzero fraction bits, preferably at least 106. Then the remaining values of x(i) and y(i) are chosen to cancel the previous bits as much as possible. This latter computation seems to require high precision.

One possibility to use an arbitrary precision package, such as MPFUN, but our goal is to make this test code self contained, and use no extra precision facilities not available to the code being tested. Since most of our routines can be reduced to a series of dot products, testing can be based on DOT. We only need a TRUSTED dot routine. Currently, we are using our own dot routine with double-double internal precision to compute r_truth, which is accurate to 106 bits. This means that any internal precision higher than double-double cannot be detected, and may result in a tiny test ratio. A very tiny test ratio (such as zero) may also occur if the result happens to be computed exactly.

This raises the possibility that there are “matching” bugs in our trusted DOT and the DOT under test, so that bugs are missed. To avoid this possibility we also generate some test examples where the cancellation is done mathematically (and so exactly) rather than depending on a computation. The idea is simple: For example, choose x(1:3) and y(1:3) so that

x(1)*y(1) = -x(3)*y(3) >> x(2)*y(2)

so that SUM_1,3 x(i)*y(i) = x(2)*y(2) exactly.

6.3 Testing SPMV and GBMV

SPMV, GBMV, and many other Level 2 BLAS routines perform the following function:

y <- beta * y + alpha * A * x

Testing it is no more difficult than testing DOT, because each component of the computed y vector is a dot product, and satisfies the error bound (*). So we simply use the same test generator as DOT, and compute a test ratio for each component of the y vector. The only tricky part is that some entries in each dot product may be fixed. For example, the first row of A and the vector x can be chosen freely, but after that x is fixed and, if A is symmetric, the first entry of each subsequent row is fixed. Our dot-product test generator handles all these cases.

This approach can be generalized to most other Level 2 and 3 BLAS.

7. Feedback

Please send any comments or bug reports to extended_blas@cs.berkeley.edu. This code was developed by

Xiaoye Li,
Jim Demmel,
David Bailey,
Yozo Hida,
Jimmy Iskandar,
Anil Kapur,
Michael Martin,
Brandon Thompson,
Teresa Tung,
Daniel Yoo

with help from Ben Wanzo, Berkat Tung, Weihua Shen, Jason Riedy, and Deaglan Halligan.