The references have been sorted into five categories and listed chronologically within each category. The five categories are:

- Linpack Benchmark
- Parallel LU Factorization
- Recursive LU Factorization
- Parallel Matrix Multiply
- Parallel Triangular Solve

Linpack Benchmark

- *LINPACK Users Guide*, J. Dongarra, J. Bunch, C. Moler and G. W. Stewart, SIAM, Philadelphia, PA, 1979.
- *Performance of Various Computers Using Standard Linear Equations Software*, J. Dongarra, Technical Report CS-89-85, University of Tennessee, 1989. (An updated version of this report can be found at http://www.netlib.org/benchmark/performance.ps).
- *Towards Peak Parallel LINPACK Performance on 400*, R. Bisseling and L. Loyens, Supercomputer, Vol. 45, pp. 20-27, 1991.
- *Massively Parallel LINPACK Benchmark on the Intel Touchstone DELTA and iPSC/860 Systems*, R. van de Geijn, 1991 Annual Users Conference Proceedings. Intel Supercomputer Users Group, Dallas, TX, 1991.
- *The LINPACK Benchmark on the AP 1000*, R. Brent, Frontiers, 1992, pp. 128-135, McLean, VA, 1992.
- *Implementation of BLAS Level 3 and LINPACK Benchmark on the AP1000*, R. Brent and P. Strazdins, Fujitsu Scientific and Technical Journal, Vol. 5, No. 1, pp. 61-70, 1993.
- *LU Factorization and the LINPACK Benchmark on the Intel Paragon*, D. Womble, D. Greenberg, D. Wheat and S. Riesen, Sandia Technical Report, 1994.
- *Massively Parallel Distributed Computing: Worlds First 281 Gigaflop Supercomputer*, J. Bolen, A. Davis, B. Dazey, S. Gupta, G. Henry, D. Robboy, G. Schiffler, D. Scott, M. Stallcup, A. Taraghi, S. Wheat from Intel SSD, L. Fisk, G. Istrail, C. Jong, R. Riesen, L. Shuler, from Sandia National Laboratories, Proceedings of the Intel Supercomputer Users Group 1995.
- *High Performance Software on Intel Pentium Pro Processors or Micro-Ops to TeraFLOPS*, B. Greer and G. Henry, Proceedings of the SuperComputing 1997 Conference, ACM SIGARCH - IEEE Computer Society Press - ISBN: 0-89791-985-8, San Jose, CA, 1997.

Parallel LU Factorization

- *Communication Complexity of the Gaussian Elimination Algorithm on Multiprocessors*, Y. Saad, Linear Algebra and Its Applications, Vol. 77, pp. 315-340, 1986.
- *LU Factorization Algorithms on Distributed-Memory Multiprocessor Architectures*, G. Geist and C. Romine, SIAM Journal on Scientific and Statistical Computing, Vol. 9, pp. 639-649, 1988.
- *Parallel LU Decomposition on a Transputer Network*, R. Bisseling and J. van der Vorst, Lecture Notes in Computer Science, Springer-Verlag, Eds. G. van Zee and J. van der Vorst, Vol. 384, pp. 61-77, 1989.
- *The Distributed Solution of Linear Systems Using the Torus-Wrap Data Mapping*, C. Ashcraft, ECA-TR-147, Boeing Computer Services, Seattle, WA, 1990.
- *Experiments with Multicomputer LU-Decomposition*, E. van de Velde, Concurrency: Practice and Experience, Vol. 2, pp. 1-26, 1990.
- *A Taxonomy of Distributed Dense LU Factorization Methods*, C. Ashcraft, ECA-TR-161, Boeing Computer Services, Seattle, WA, 1991.
- *The Torus-Wrap Mapping for Dense Matrix Calculations on Massively Parallel Computers*, B. Hendrickson and D. Womble, SIAM Journal on Scientific and Statistical Computing, Vol. 15, pp. 1201-1226, 1994.
- *Scalability Issues in the Design of a Library for Dense Linear Algebra*, J. Dongarra, R. van de Geijn and D. Walker, Journal of Parallel and Distributed Computing, Vol. 22, No. 3, pp. 523-537, 1994.
- *Matrix Factorization using Distributed Panels on the Fujitsu AP1000*, P. Strazdins, Proceedings of the IEEE First International Conference on Algorithms And Architectures for Parallel Processing ICA3PP-95, Brisbane, 1995.
- *The Design and Implementation of the ScaLAPACK LU, QR, and Cholesky Factorization Routines*, J. Choi, J. Dongarra, S. Ostrouchov, A. Petitet, D. Walker and R. C. Whaley, Scientific Programming, Vol. 5, pp. 173-184, 1996.

Recursive LU Factorization

- *Locality of Reference in LU Decomposition with partial pivoting*, S. Toledo, SIAM Journal on Matrix Analysis and Applications, Vol. 18, No. 4, 1997.
- *Recursion Leads to Automatic Variable Blocking for Dense Linear-Algebra Algorithms*, F. Gustavson, IBM Journal of Research and Development, Vol. 41, No. 6, pp. 737-755, 1997.

Parallel Matrix Multiply

- *Matrix Algorithms on a Hypercube I: Matrix Multiplication*, G. Fox, S. Otto and A. Hey, Parallel Computing, Vol. 3, pp. 17-31, 1987.
- *Basic Matrix Subprograms for Distributed-Memory Systems*, A. Elster, Proceedings of the Fifth Distributed-Memory Computing Conference, Eds. D. Walker and Q. Stout, IEEE Press, pp. 311-316, 1990.
- *The Parallelization of Level 2 and 3 BLAS Operations on Distributed-Memory Machines*, M. Aboelaze, N. Chrisochoides and E. Houstis, CSD-TR-91-007, Purdue University, West Lafayette, IN, 1991.
- *The Multicomputer Toolbox Approach to Concurrent BLAS and LACS*, R. Falgout, A. Skjellum, S. Smith and C. Still, Proceedings of the Scalable High Performance Computing Conference SHPCC-92, IEEE Computer Society Press, 1992.
- *A High Performance Matrix Multiplication Algorithm on a Distributed-Memory Parallel Computer, Using Overlapped Communication*, R. Agarwal, F. Gustavson and M. Zubair, IBM Journal of Research and Development, Vol. 38, No. 6, pp. 673-681, 1994.
- *PUMMA: Parallel Universal Matrix Multiplication Algorithms on Distributed-Memory Concurrent Computers*, J. Choi, J. Dongarra and D. Walker, Concurrency: Practice and Experience, Vol. 6, No. 7, pp. 543-570, 1994.
- *Matrix Multiplication on the Intel Touchstone DELTA*, S. Huss-Lederman, E. Jacobson, A. Tsao and G. Zhang, Concurrency: Practice and Experience, Vol. 6, No. 7, pp. 571-594, 1994.
- *A Three-Dimensional Approach to Parallel Matrix Multiplication*, R. Agarwal, S. Balle, F. Gustavson, M. Joshi and P. Palkar, IBM Journal of Research and Development, Vol. 39, No. 5, pp. 575-582, 1995.
- *A High Performance Parallel Strassen Implementation*, B. Grayson and R. van de Geijn, Parallel Processing Letters, Vol. 6, No. 1, pp. 3-12, 1996.
- *Parallel Implementation of BLAS: General Techniques for Level 3 BLAS*, A. Chtchelkanova, J. Gunnels, G. Morrow, J. Overfelt and R. van de Geijn, Concurrency: Practice and Experience, Vol. 9, No. 9, pp. 837-857, 1997.
- *A Poly-Algorithm for Parallel Dense Matrix Multiplication on Two-Dimensional Process Grid Topologies*, J. Li, R. Falgout and A. Skjellum, Concurrency: Practice and Experience, Vol. 9, No. 5, pp. 345-389, 1997.
- *SUMMA: Scalable Universal Matrix Multiplication Algorithm*, R. van de Geijn and J. Watts, Concurrency: Practice and Experience, Vol. 9, No. 4, pp. 255-274, 1997.

Parallel Triangular Solve

- *Parallel Solution of Triangular Systems on Distributed-Memory Multiprocessors*, M. Heath and C. Romine, SIAM Journal on Scientific and Statistical Computing, Vol. 9, pp. 558-588, 1988.
- *A Parallel Triangular Solver for a Distributed-Memory Multiprocessor*, G. Li and T. Coleman, SIAM Journal on Scientific and Statistical Computing, Vol. 9, No. 3, pp. 485-502, 1988.
- *A New Method for Solving Triangular Systems on Distributed-Memory Message-Passing Multiprocessors*, G. Li and T. Coleman, SIAM Journal on Scientific and Statistical Computing, Vol. 10, No. 2, pp. 382-396, 1989.
- *Parallel Triangular System Solving on a Mesh Network of Transputers*, R. Bisseling and J. van der Vorst, SIAM Journal on Scientific and Statistical Computing, Vol. 12, pp. 787-799, 1991.