EuroBen Experiences with the SGI Origin 2000 and the Cray T3E

 
A.J. van der Steen
Computational Physics
Utrecht University
Holland

 
June 1998


ABSTRACT

We describe our first experiences with the Silicon Graphics Origin2000, obtained with the EuroBen Benchmark, and compare them with results measured on a Cray T3E. Single-processor results, shared-memory parallel results, and distributed-memory message-passing results are discussed in relation to the architecture of the two systems.


1. INTRODUCTION

This paper describes our first experiences with the Silicon Graphics Origin2000 system, obtained through experiments with the EuroBen Benchmark [12]. These experiments were conducted between June and September 1997. We think that they have given us a fairly intimate knowledge of the performance behaviour of the Origin2000 system. Knowledge of the system architecture, together with the benchmark results, also gives some confidence in predicting the performance of programs that to a fair extent employ algorithms present in the benchmark.

The Origin2000 is a so-called Non-Uniform Memory Access (NUMA) parallel system, which means that the time to fetch or store data from/to the memory may vary with the particular CPU and the storage location of the data that are required or produced by that CPU. The machine employs MIPS R10000 processors with 32 KB primary instruction and data caches. On this machine all tests were conducted with the MIPSpro f77 compiler, version 7.20, with options -O3 -64 -OPT:IEEE_arithmetic=3:roundoff=3. The secondary cache was 4 MB on our test system. The system will usually be used as a shared-memory system in which all CPUs can access all of the memory as a single logical entity. However, this is not necessary: the machine can also be viewed as a distributed-memory system, and message-passing libraries like MPI are available (the Cray shmem library shortly will be). So, the system can be used with various programming models, each of which may have its own advantages in various situations.

The Cray T3E on which we generated our results contains DEC Alpha 21164 processors with a clock cycle of 3.3 ns. The primary data and instruction caches are both 8 KB, while the secondary cache is 96 KB. On the T3E we used the f90 compiler, version 3.0.0.1, with options -O3 -dp. On the T3E stream buffers are available that assist in prefetching fixed-stride data and that may speed up both computation and communication. The results in this paper were all generated with the stream buffers on, except for the measurement of the point-to-point communication speed, where this speed was measured both with streams on and off to show their influence. The computational nodes are connected by a fast 3-D torus network. The T3E was only used as a distributed-memory message-passing machine. Both the Cray shmem library and MPI can be used; we only quote MPI results, however. For a more detailed description of the Origin2000 and the Cray T3E systems the reader is referred to [9, 13].

This paper has the following structure: Section 2 contains a short description of the EuroBen Benchmark. We present the structure of the benchmark, give the reasons for this structure, and state what we hope to learn from its results.

In Section 3 a selection of the results of the benchmark is presented. We discuss these results in the light of our knowledge of the architecture. Both single-CPU results and parallel results are given for a selection of programs. The EuroBen benchmark was originally written for shared-memory systems; however, a few programs are also available as distributed-memory message-passing codes. Comparing these different programming models is quite interesting. Where relevant we compare the outcomes of the benchmark on the Origin2000 with those of the Cray T3E.

The last section summarises these results and draws conclusions about the performance of the Origin2000.

The results in this paper have been generated on Origin2000 systems with a clock cycle of 5.13 ns (195 MHz). There are also models of the Origin2000 with a clock cycle of 5.56 ns (180 MHz). The results from the faster machine will carry over directly to the slower model.

2. The EuroBen Benchmark

The programs in the EuroBen Benchmark range from very simple kernels to complete algorithms that are important in certain application areas. The simple programs give basic information on the performance of operations, intrinsic functions, etc., in order to identify the strong and weak points of a machine. When one wants to extract more information one should conduct the next level of tests, which contains simple but frequently used algorithms like FFTs, random number generation, etc. So, a graded approach is used to collect the necessary information: when one wants more information one has to run more modules. The results generally become less easily interpretable as the level of complication rises, but they remain comparable for different machines of similar architecture. At this stage the degree of similarity is already known from the hardware and software characteristics described above. As yet all modules are written in Fortran 77 and at least 64-bit floating-point arithmetic is required.

The structure of the benchmark is inherited from an earlier one [10], but it has been modified considerably since its first version and improved by programs contributed by some of the EuroBen founders: Armin Friedli (measuring the speed and accuracy of intrinsic functions) and Roger Hockney (measurement of the parameter f_1/2; for an explanation of this parameter see [4]).

2.1 Contents of module 1

The contents of the first module as distributed in the current version (version 3.2) are:

mod1ac) A test program for basic operations.

mod1d) A test program for memory bank conflicts.

mod1e) A test program for the correctness and precision of the important intrinsic functions.

mod1f) A test program for the speed of the important intrinsic functions like cos, sin, arctg, log, exp, sqrt, etc.

mod1g) A test program to measure the memory bottleneck.

The decision which kernels should be represented in mod1ac above was primarily made on the grounds of the programs in the algorithmic part of the benchmark, module 2. So, with the data obtained from the first module we are prepared to interpret the results of the more substantial programs.

2.2 Contents of module 2

In the second module simple basic algorithms are brought together. These algorithms are frequently employed in typical numerical applications like optimisation, digital signal analysis, etc. For many algorithms in this module vendors make library routines available because of their importance. These library routines should be compared with the algorithms given here. The second module contains the following programs:

mod2a) A full matrix-vector product for n = 25, 50, 100, 200, 300, 500.

mod2b) Solution of full linear systems of size n = 100, 300, 500, and 1,000.

mod2c) Solution of sparse linear systems of sizes 1,000, ..., 1,000,000.

mod2d) A program for the solution of a general real symmetric eigenvalue problem.

mod2e) A program for the computation of a sparse real symmetric eigenvalue problem.

mod2f) A 1-D complex FFT program. Sizes range from n = 2^8, ..., 2^16.

mod2h) A program for the generation of uniformly distributed random numbers.

As already explicitly mentioned for mod2a and mod2b, there may be corresponding library routines available from the vendor. It is valuable to run the programs in mod2c--mod2f with these library routines for several reasons. Firstly, it gives an impression of how far a vendor has proceeded in optimising his library routines for his own hardware. For some machines the Fortran source of the programs will not be able to fully employ the hardware, which may be overcome by using assembler-coded routines. Furthermore, especially for parallel architectures, the choice of synchronisation primitives, etc., can have a large effect on the performance, and machine coding can be more flexible. Also, the optimal amount of work to be done in a parallel process may not be straightforward. One may hope that such optimal choices have already been made in the provided library routines. Program mod2g is presently lacking. It is a wavelet transform that is still under development.

2.3 Contents of module 3

In the third module more elaborate algorithms have been collected. For the greater part they contain algorithms already present in some form in modules 1 and 2, but they are more representative of classes of computational problems encountered in scientific applications. This includes explicit I/O schemes for very large problems as opposed to automatic handling by virtual-memory operating systems. The contents of the third module are:

mod3a) Very large, very sparse matrix-vector multiplication.

mod3b) A program to perform a two-dimensional 8K x 8K complex FFT.

mod3c) A zebra-line multigrid code.

mod3d) A program for solving a linear least squares problem.

mod3e) A program to solve a non-linear optimisation problem.

mod3f) A program which solves a one-dimensional diffusion problem.

mod3g) A Fast Elliptic Solver program.

mod3h) A Block Relaxation program.

The programs in module 3 represent important aspects of numerical computation in computational science and hence their performance should be known. Again, as in modules 1 and 2, the variety in architectures and parallelisation mechanisms makes it difficult to find optimal workloads for parallel processors beforehand. So, it is advisable to run on different numbers of processors to see whether the scaling properties turn out as expected.

To obtain results that are comparable with those from other machines, one should refrain from tuning the code beyond the changes necessary to employ the available parallelism. We believe that most of the results found with module 3 can be interpreted with reasonable completeness with the help of the necessary machine information and the knowledge provided by the former two modules.

2.4 Time dependency of the benchmark

The contents of the benchmark set are not static, so new versions are developed when required. For instance, various programs in modules 2 and 3 relied on BLAS, LINPACK, and EISPACK as widely distributed and accepted standard software. However, these packages have now largely been replaced by more modern software like the level 2 and level 3 BLAS and LAPACK. These developments have had their impact on the contents of the EuroBen benchmark. Also, new algorithms have to be introduced because of their growing importance. For instance, a Discrete Wavelet Transform will be introduced as program mod2g in the next release of the benchmark.

2.5 Distributed memory programs

The EuroBen benchmark was originally conceived for shared-memory systems. However, the proportion of DM-MIMD systems has grown steadily since the time of its conception. Although these machines nowadays often offer a shared-memory programming model as well (Cray T3D, HP/Convex Exemplar, SGI Origin2000), the importance of benchmarking DM-MIMD systems with a message-passing programming model is quite high.

Presently, message-passing implementations are available for only a few programs in the EuroBen suite: gmxv and disfft, which correspond to the shared-memory programs mod2a and mod2f, respectively. Program disfft is implemented in a 2-D product form using the same radix-4 algorithm as program mod2f. In addition, the programs mod1h and dddot are provided. The first program measures basic point-to-point communication speed (bandwidth and latency); the second one tests three implementations of a distributed dot product. It is important in that it gives insight into the quality of the available reduction and broadcast operations and also tests the influence of concentrating many send and receive buffers on one processor.

3. Benchmark results

3.1 Results of module 1

We first present the results for basic operations as measured with program mod1ac. In Table 1 the n_1/2 and r_inf values on one processor of the Origin2000 and the Cray T3E are given for operations for which the operands are loaded from the primary cache.
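
As a reminder, these two parameters stem from the usual linear timing model for a vector loop of length n (see, e.g., [6]):

    t(n) = (n + n_1/2) / r_inf,    i.e.    r(n) = r_inf / (1 + n_1/2/n),

so r_inf is the asymptotic rate attained for long loops and n_1/2 is the loop length at which half of that rate is reached.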

We only show the first 14 of the 31 kernels. Kernels 15--31 perform operations that are largely identical to kernels 1--14, except that odd and even strides and index arrays are employed. The results found with these latter kernels were so similar to the first ones that we do not give them here. This behaviour is, however, markedly different from that of vector-processor-based systems, where usually a significant difference in speed is found.

Table 2 lists the results for the same kernels, but now with the operands fetched from memory. In this table we left the n_1/2 values out because they all were negligible. The first thing to note is that, although the clock frequency of the R10000 processor is lower than that of the Alpha 21164 (195 vs. 300 MHz), the Origin2000 processor is always faster, except for the in-cache division. This exception is caused by the fact that division in the R10000 is performed by a so-called secondary floating-point unit that is not pipelined at all. For the dyadic operations +, -, and x, the Origin2000 approaches a speed of 65 Mflop/s, which is consistent with loading/storing one operand per cycle in the in-cache situation. The R10000 should be able to load/store two 64-bit operands per cycle, so only half of the maximum bandwidth is used. The 21164 processor uses less than 50% of a one-operand-per-cycle bandwidth for these operations. The situation for out-of-cache operations is worse: for most operations the performance roughly drops by a factor of 2 for the Origin2000, while it decreases mostly by a factor of 2--3 on the T3E. The dyadic out-of-cache operations suggest processor-to-memory bandwidths of 828 and 416 MB/s for the Origin2000 and T3E, respectively. The speeds given for kernels 7--10 in Table 2 are consistent with these bandwidths.
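
To indicate how such bandwidth figures follow from the measured rates: a dyadic kernel of the form c(i) = a(i) op b(i) performs one flop per two loads and one store, so with 8-byte operands the simplest accounting gives

    bandwidth ≈ 24 bytes/flop x r_dyadic,

where r_dyadic is the sustained rate with operands from memory (counting the write-allocate traffic of the store as well would give 32 bytes/flop instead). This is the conversion that links the dyadic rates of Table 2 to the 828 and 416 MB/s quoted above.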

Clearly, no optimisation is done for the dyadic operations on the T3E. A hand-coded single unrolling of kernels 3--5 yielded r_inf values of about 70 Mflop/s for in-cache operation (while the n_1/2 values remained approximately the same). The speeds for out-of-cache operations are not notably affected, as these are completely determined by the low processor-to-memory bandwidth.
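
Such a single unrolling, i.e., a two-way unroll of the loop body, can look as follows (a sketch assuming, for illustration, that kernel 3 is a dyadic vector addition; the actual EuroBen kernels may differ in detail):

c     Two-way unrolled dyadic vector add, in the style of the hand
c     optimisation applied to kernels 3--5 (illustrative only).
      subroutine vadd2(n, a, b, c)
      integer n, i
      real*8 a(n), b(n), c(n)
      if (mod(n,2) .ne. 0) a(n) = b(n) + c(n)
      do 10 i = 1, n - 1, 2
         a(i)   = b(i)   + c(i)
         a(i+1) = b(i+1) + c(i+1)
   10 continue
      end

The unrolling exposes two independent additions per iteration, which helps the compiler to fill the floating-point pipeline of the 21164.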

Kernel 14, the evaluation of a 9th-degree polynomial, has a very high reuse of operands in the registers and should consequently nearly attain the Theoretical Peak Performance on the respective systems (these values are 390 and 600 Mflop/s for the Origin2000 and the T3E, respectively). This turns out to be true for the Origin2000 but hardly so for the T3E: both with in-cache and out-of-cache operation slightly more than 10% of the Theoretical Peak Performance is observed. As yet, we have no explanation for this behaviour.
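
To illustrate why the register reuse is so high, a kernel of this type can be written with Horner's scheme (a sketch in the spirit of kernel 14; the array and coefficient names are illustrative, not the exact EuroBen source):

c     Evaluation of a 9th-degree polynomial by Horner's scheme.
c     Per element: 18 flops but only one load (x) and one store (y),
c     since the coefficients c(0..9) stay in registers.
      subroutine poly9(n, x, y, c)
      integer n, i, j
      real*8 x(n), y(n), c(0:9), s
      do 20 i = 1, n
         s = c(9)
         do 10 j = 8, 0, -1
            s = s*x(i) + c(j)
   10    continue
         y(i) = s
   20 continue
      end

With 18 flops per 16 bytes of memory traffic the kernel is compute bound, so a well-scheduled processor should indeed run it close to peak.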

3.1.1 mod1ac in parallel

As the Origin2000 can be used as an SM-MIMD system, it is simple to run program mod1ac in parallel. In Figure 1 the r_inf values for 1, 4, and 10 processors are displayed for the first 14 kernels.

Kernels 7, 11, and 12 do not benefit much from parallelisation. In fact, the compiler uses heuristics to determine whether parallelisation is worthwhile. The choices made in these estimations seem to work: either a kernel is parallelised and benefits from it, or the loop is not parallelised. We did not observe kernels that decreased in performance due to parallelisation overhead. However, the overhead is clearly not negligible, as can be seen from the values in Figure 1. Another indicator of the parallelisation overhead is the corresponding n_1/2 values: these ranged up to lengths of 200--500 on 4 processors and 800--1400 on 10 processors.

Note that for kernel 7, the dot product, there is a slight improvement from 1 to 4 processors, but the kernel is not parallelised on 10 processors because there is too little work to be done in the loop. Kernels 11 and 12 are first- and second-order recurrences. Naturally, these are not parallelised.

3.1.2 The distributed dot product

On both machines we executed the program dddot, which contains three different implementations of a distributed dot product of length 1,000,000: a "naive" implementation, a Fortran-coded tree implementation, and one using MPI_Reduce/MPI_Bcast. In the naive implementation all partial sums are sent to one processor, which assembles the global sum; the global sum is then broadcast from this processor to all others. The results of this program are given in Figures 2 and 3, respectively. These figures show that on both machines there is not much difference between the tree implementation and the use of the MPI_Reduce/MPI_Bcast routines. As may be expected, the naive implementation performs worse on both systems. In absolute speed the Origin2000 is about 2 times faster than the T3E. This is not in line with the out-of-cache results found earlier with program mod1ac, where the dot product is more than 4 times faster. The relatively better performance of the T3E must be ascribed to its communication, which seems more efficient, as is apparent from the per-node performance: on the Origin2000 it decreases from 138 to 99 Mflop/s on 32 processors, while on the T3E this decrease is only from 44 to 41 Mflop/s on 32 processors.
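
For reference, a minimal sketch of the MPI_Reduce/MPI_Bcast variant is given below (illustrative only; the array names, the local length nloc, and the choice of root are assumptions, not the actual dddot source):

c     Distributed dot product: local partial sum, global reduction on
c     processor 0, followed by a broadcast of the result.
      subroutine ddotmp(nloc, x, y, global)
      include 'mpif.h'
      integer nloc, i, ierr
      real*8 x(nloc), y(nloc), partial, global
      partial = 0.0d0
      do 10 i = 1, nloc
         partial = partial + x(i)*y(i)
   10 continue
      call MPI_REDUCE(partial, global, 1, MPI_DOUBLE_PRECISION,
     &                MPI_SUM, 0, MPI_COMM_WORLD, ierr)
      call MPI_BCAST(global, 1, MPI_DOUBLE_PRECISION, 0,
     &               MPI_COMM_WORLD, ierr)
      end

In the naive variant the MPI_Reduce/MPI_Bcast pair is replaced by explicit sends from all processors to processor 0, which forms the sum and sends it back; the tree variant performs the same reduction in roughly log2(p) Fortran-coded stages.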

3.1.3 Results for mod1e/f

Program mod1f measures the speed of the most important intrinsic functions for all argument ranges of interest, while mod1e tests their correctness and precision. For both machines the precision as tested with mod1e was acceptable. The results of program mod1f are given in Table 3. We only show the results for the argument intervals that are evaluated fastest. However, there is not much difference in speed between the various argument intervals. Also, the length of the argument vector does not have much influence: in-cache and out-of-cache results hardly differ in speed.

As can be seen from Table 3, in almost all cases the Origin2000 is faster by a factor of 1.5--4. There are some notable exceptions: (x^2)^1.5 is evaluated almost 15 times faster on the Origin2000, and both the Sqrt and Exp functions are significantly faster on the T3E. For the Sqrt this can be explained by the fact that square roots on the Origin2000 are computed by the non-pipelined secondary floating-point unit, which is relatively slow. For the Exp function there is no such explanation.

3.1.4 Results for mod1g

Program mod1g computes the cost of memory references through the f_1/2 parameter (see [4]). For the T3E, f_1/2 is 0.4452 flop/mref with an asymptotic rate r_hat of 77.08 Mflop/s. For the Origin2000 these figures are 0.7323 flop/mref and 413.99 Mflop/s, respectively. The values given here are measured for in-cache operation. Though the f_1/2 value of the Origin2000 is almost two times higher, this is more than compensated for by the higher r_hat.
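
In this model (following [4]) the rate achieved by a loop that performs f flops per memory reference is

    r(f) = r_hat / (1 + f_1/2 / f),

so f_1/2 is the computational intensity at which half of the asymptotic rate r_hat is reached. As an illustration, at f = 1 flop/mref this gives roughly 77/1.45 ≈ 53 Mflop/s for the T3E and 414/1.73 ≈ 239 Mflop/s for the Origin2000, which quantifies the statement that the higher r_hat more than compensates for the higher f_1/2.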

3.1.5 Results for mod1h

mod1h is a simple point-to-point communication measurement program that measures the bandwidth and latency for a 1-hop communication. This is done by a ping-pong procedure using the MPI_Send and MPI_Recv routines, much like that in the COMMS1 program from the Parkbench Benchmark [5]. However, for the computation of the latencies only short messages were used. Firstly, because for short messages the latency has a large effect on the effective bandwidth with which such a message is sent. Secondly, because the buffering strategy changes with the message length in most MPI implementations; this is also the case on the Origin2000 and the T3E. Therefore, only messages of up to 400 bytes were used to determine the latencies. The latencies found were 22.3 and 22.2 microseconds for the T3E with the stream buffers off and on, respectively. The standard errors in these values were 1.5% and 2.9%. On the Origin2000 the latency for small messages was only 14.6 microseconds, with a standard error of 3.3%.
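
The structure of such a ping-pong measurement is sketched below (a minimal illustration; the message size, repetition count, and timing details are assumptions and do not reproduce the exact mod1h source):

c     Ping-pong between processors 0 and 1; the one-way time per
c     message is (t1 - t0) / (2*nrep).
      program pingpg
      include 'mpif.h'
      integer nbytes, nrep
      parameter (nbytes = 400, nrep = 1000)
      integer rank, ierr, k, status(MPI_STATUS_SIZE)
      character buf(nbytes)
      double precision t0, t1
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      t0 = MPI_WTIME()
      do 10 k = 1, nrep
         if (rank .eq. 0) then
            call MPI_SEND(buf, nbytes, MPI_BYTE, 1, 0,
     &                    MPI_COMM_WORLD, ierr)
            call MPI_RECV(buf, nbytes, MPI_BYTE, 1, 0,
     &                    MPI_COMM_WORLD, status, ierr)
         else if (rank .eq. 1) then
            call MPI_RECV(buf, nbytes, MPI_BYTE, 0, 0,
     &                    MPI_COMM_WORLD, status, ierr)
            call MPI_SEND(buf, nbytes, MPI_BYTE, 0, 0,
     &                    MPI_COMM_WORLD, ierr)
         end if
   10 continue
      t1 = MPI_WTIME()
      call MPI_FINALIZE(ierr)
      end

The latency quoted above is then essentially the one-way time in the limit of very short messages.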

On both the Origin2000 and the T3E, MPI was used for passing the messages. We do not yet have shmem library results because this library was not yet available on the Origin2000. In Figure 4 the results for both machines are shown. For the T3E the bandwidths were measured both with the stream buffers off and on. As already seen with the distributed dot product, and as will also be apparent from program mod2a and the distributed FFT program, the values given in Figure 4 do not reflect the actual situation in algorithms: although the r_inf^c bandwidth of the Origin2000 is about the same and its latency is about 1.5 times lower, the T3E mostly scales better than the Origin2000. This is due to the MPI implementation on the Origin2000: as yet at least one extra buffer copy is required in passing a message. Currently SGI is working on a better implementation.

The two-parameter model for the communication, represented by r_inf^c and n_1/2^c, does not fit the observed results well. This is due to the different buffer strategies that are employed in the MPI_Send/MPI_Recv routines depending on the message length, as mentioned earlier.
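
For reference, this model describes the time to transfer a message of n bytes as

    t(n) = (n + n_1/2^c) / r_inf^c,

so r_inf^c is the asymptotic bandwidth and the start-up time (latency) equals n_1/2^c / r_inf^c. A single (r_inf^c, n_1/2^c) pair assumes one straight line through the measured times, which cannot represent a bandwidth curve with breaks caused by changing buffer strategies.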

3.2 Results of module 2

For reasons of brevity we present only a selection of the results of module 2. We mainly concentrate on the programs for which we have both shared-memory and distributed-memory parallel results. For the Origin2000 this allows us to assess the overheads associated with the different programming models and, in addition, we can compare the distributed-memory results with those from the T3E.

3.2.1 Results of mod2a

Program mod2a performs a full matrix-vector multiplication (transposed and not transposed). Four different implementations are executed. Moreover, on both the Origin2000 and the T3E the library routines as provided by SGI were also run. In Figure 5 the best single-processor results for the not-transposed case are given for the Origin2000 and the T3E (a 4 times unrolled column-wise version). In addition, the results for SGI's DGEMV library version are given.

Figure 5 shows that, except for the smallest problem order, the library version on the Origin2000 is significantly faster than the Fortran 77 coded version (as one might hope): about 220 vs. 180 Mflop/s. For the T3E the primary cache size (8 KB) is already exceeded at a problem size of n = 50. At n = 200 the speed is completely determined by the processor-to-memory bandwidth. This results in a speed of roughly 45 Mflop/s. The library version also experiences a performance decrease beyond n = 50 but always performs much better than the Fortran version. At n = 500 its performance is still about 120 Mflop/s.
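
For clarity, the 4 times unrolled column-wise variant mentioned above has the following general structure (a sketch that assumes n is a multiple of 4; the actual benchmark code need not have this restriction):

c     Column-wise matrix-vector product y = y + A*x, unrolled over
c     four columns so that y(i) is reused from registers.
      subroutine mxv4(n, a, lda, x, y)
      integer n, lda, i, j
      real*8 a(lda,n), x(n), y(n)
      do 20 j = 1, n, 4
         do 10 i = 1, n
            y(i) = y(i) + a(i,j)*x(j)     + a(i,j+1)*x(j+1)
     &                  + a(i,j+2)*x(j+2) + a(i,j+3)*x(j+3)
   10    continue
   20 continue
      end

The inner loop runs down the columns with stride 1, and the unrolling reduces the number of loads and stores of y per floating-point operation.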

On the Origin2000 the same program has been executed as a shared-memory parallel program. The results for matrix order n = 500 are displayed for one of the Fortran 77 versions and the SGI library version in Figure 6.

Figure 6 shows that the Fortran 77 version is faster up to 8 processors. Above this number of processors the library routine clearly is more efficient. Another point to note is that the plain column-wise version is now significantly faster than the 4 times unrolled version, sometimes by a factor of 4. The parallelising software does not seem to cope well with the unrolling of the loops.

For the Fortran version the speedup levels off at 16 processors: there it is still 7.1, but above this number of processors the synchronisation overhead, together with the decreased amount of work per processor, causes a performance drop: the speedup at 32 processors is only 4.6. By contrast, the library version still scales reasonably well at 32 processors: the speedup is 13.8 in this case.

3.2.2 Distributed memory results for mod2a

The message passing counterpart of mod2a is the program gmxv. This program was run on the Origin2000  and the T3E. The results of the fastest implementations for both machines are shown for a matrix order n = 512 in Figure 7.

It is instructive to compare the speedup of the shared-memory programming model with that of the distributed-memory implementation on the Origin2000. One should realise that, although the system image of an Origin2000 is that of a shared-memory system, the data are physically distributed and non-uniform memory transactions occur. In addition, a certain overhead is incurred in keeping the caches of the processors involved in the computation coherent. The net result is that on 32 processors the speedup in the shared-memory model is 13.8 (using the library routine), while in the message-passing model the speedup is 31.6 compared with the best single-processor result. So, one has to pay a price for using the shared-memory model. In many applications, however, this price may not be too high.


The scaling on the T3E is excellent: a speedup of 6.36 on 8 processors and of 21.6 on 32 processors is found. One has to take into account that even on 32 processors the primary cache size is still exceeded by more than a factor of 8 and, consequently, no speedup due to in-cache operation will occur. As in the non-transposed case no data have to be exchanged between the processors, the less than perfect speedup is to be ascribed to the setup of the communication network and the synchronisations at the beginning and end of the computations.

For the transposed matrix-vector multiplication data have to be exchanged between processors, as the matrix and the result vector are distributed row-wise over the processors. The communication consists of an all-to-all exchange of partial result vectors between the processors: p(p-1) messages of length n/p are sent around. In Figure 8 the results for the fastest implementations are displayed.

The behaviour of the algorithm on the Origin2000 is not very good. The communication already takes as much as 81% of the total execution time on 8 processors. This is apparent from the decreasing Mflop rate from 8 processors onwards. The problem lies in a less than optimal MPI implementation, which does a fair amount of buffer copying that is not strictly necessary. Silicon Graphics is presently working on the problem. On the T3E the communication time grows as expected. The total amount of data to be communicated is n(p-1) elements. In addition, the number of messages grows as O(p^2), while the message length decreases as O(1/p). The net result is that on two processors 18% of the time is spent in communication, while on 32 processors this fraction has increased to 62%. This is clearly observable in Figure 8. Above 64 processors no speedup can be expected.
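
Combining these counts with the point-to-point parameters of Section 3.1.5 gives a rough estimate of the communication time per processor (an idealised model that ignores contention and overlap of messages; with 8-byte elements):

    T_comm ≈ (p-1) * ( t_lat + 8(n/p) / r_inf^c ),

where t_lat is the latency. Since the computational work per processor decreases as O(1/p) while the per-processor communication volume stays roughly constant and the latency term even grows with p, the fraction of time spent in communication necessarily rises with p, consistent with the growth from 18% on 2 processors to 62% on 32 processors on the T3E.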

3.2.3 Results of mod2f

Program mod2f performs a 1-D FFT of various sizes. As a radix-4 algorithm is employed, only transform lengths that are powers of 2 are considered. In Figure 9 the results of the single-processor versions on the Origin2000 and the T3E are given.

The performance as a function of the transform length n is rather capricious on both systems. The access/storage pattern of the operands is fairly complicated: they are either read or written with strides that are a power of 2. It is clear that beyond a transform length of 8192 both systems degrade in performance because of cache problems (three arrays of the transform length are involved in the computation). Both have their optimal region between lengths of 256--4096. However, the performance of the R10000 is significantly higher than that of the Alpha 21164 over the whole problem range, being 3--7 times faster except for n = 4. For n > 8192 the speed on the T3E is reduced to roughly 8 Mflop/s, the same speed as was found for out-of-cache dyadic operations. This suggests that in this range the speed is completely determined by the available processor-to-memory bandwidth.

3.2.4 Distributed memory results of mod2f

On both systems we also executed a distributed-memory implementation of the FFT program: disfft. The actual implementation is slightly different from that of mod2f: although the same radix-4 FFT is employed, in disfft the 1-D FFT is executed as a product of small 1-D FFTs arranged in a 2-D array.

The advantages are that the small FFTs fit more easily into the cache and that they are completely independent of each other. The disadvantages are that n additional multiplications have to be done and that one global transposition of the 2-D array is required, leading to p(p-1) messages of length n/p (an all-to-all communication). The results for a transform length of 65,536 are shown in Figure 10 for both systems.

It is clear from Figure 10 that the communication on the Origin2000 is not optimal. This is ascribed to the MPI implementation problems mentioned earlier. The performance drops by more than 10% from 1 to 2 processors. After this initial drop the scaling is more or less linear up to 16 processors, but the performance is always at least 60--70% lower than that of the T3E. The scaling on the T3E is slightly sublinear: the speedup on 2 processors is 1.85, while on 32 processors it is 22.49. This is explained by the increasing amount of communication: the number of messages grows as O(p^2), although the message length decreases as O(1/p). The communication takes 17.6% of the total execution time on 2 processors, while this percentage has grown to 38.4% on 32 processors. By contrast, the communication times on the Origin2000 are highly erratic: on 2 processors the communication fraction is about 50%, on 8 processors it is as much as 84%, while on 32 processors it is 52% again. It is clear that with such communication behaviour no conclusions about the scaling of the algorithm can be drawn for this system.

4. Concluding remarks

In the previous sections we have not discussed all results of the benchmark. In particular, the results of programs from module 3 are not shown. The analysis of these programs and their relation to the performance of the lower modules would warrant a paper in its own right. For those who are interested in doing their own analysis, the full results are available via:

http://www.fys.ruu.nl/~steen/euroben/results/

One should be careful, however, because the experiments are not completely finished and updates may occur in the near future.

The experiments so far have shown that as a single-processor system the Origin2000 is generally faster than the T3E, often by a factor of 2 or more, notwithstanding the higher clock speed of the T3E processor. We surmise that this is not only an effect of the lower processor-to-cache and processor-to-memory bandwidth of the T3E, but also of a better utilisation of the Origin2000 processor.

The results so far suggest that one has to pay a performance penalty for using the Origin2000 as a shared-memory system. The overheads involved in cache coherency, synchronisation, and the non-uniform access time for physically distributed data result in a lower speedup than that of a corresponding distributed-memory code.

Although the lower communication latency of the Origin2000 would suggest otherwise, in practical situations the communication on the T3E is more efficient. The MPI implementation on the Origin2000 is presently not optimal, while it is very good on the T3E.

A further analysis of the Origin2000 results is required to assess the influence of the secondary cache and the possible influence of data placement and process migration in the shared-memory parallel programming model.

Acknowledgements

I would like to thank Silicon Graphics for providing access to their central testing facilities in Cortaillod,  Switzerland and in particular Ruud van der Pas of Silicon Graphics' European HPC Team for helping me with the experiments on the Origin2000. Also I thank the HPaC Centre at the Delft University of Technology  for the use of their T3E system.

REFERENCES

[1] D. Fraser, Algorithm 545, An optimized mass storage FFT, in CACM, Vol. III, ACM, New York, 1983.
[2] A. Friedli, W. Gentzsch, R.W. Hockney, A.J. van der Steen, A European supercomputer benchmark  effort, Supercomputer 6, 6, (1989) 14--17.
[3] R.W. Hockney, A framework for benchmark performance analysis, Supercomputer, 9, 2, (1992) 9--22.
[4] R.W. Hockney, f_1/2: A parameter to characterize memory and communication bottlenecks, Parallel Computing, 10 (1989) 277--286.
[5] R.W. Hockney, M. Berry, Public International Benchmarks for Parallel Computers, Parkbench Committee, Report-1, Febr. 1994 (available through www.netlib.org/parkbench)
[6] R.W. Hockney, C.R. Jesshope, Parallel Computers 2, Adam Hilger Ltd, Bristol, 1988.
[7] N. Itoh, Y. Kanada, Monte Carlo Simulation of the Ising model and random number generation on vector processors, in: Proceedings of Supercomputing '90, IEEE Computer Society, 1990, 753--762.
[8] Knuth, D.E., The Art of Computer Programming 2, Semi-numerical algorithms, Addison-Wesley, New York, 1981.
[9] Silicon Graphics Inc., Origin Servers, Technical Report, April 1997.
[10] A.J. van der Steen, Is it really possible to benchmark a supercomputer? A graded approach to performance measurement, in Evaluating Supercomputers, UNICOM Proceedings, London, Ed.: A.J. van der Steen, Chapman & Hall, London, 1990.
[11] A.J. van der Steen, Portable parallel generation of random numbers, Supercomputer, 7, 1, (1990) 18--20.
[12] A.J. van der Steen, The benchmark of the EuroBen Group, Parallel Computing, 17, (1991) 1211-1221.
[13] A.J. van der Steen, Overview of recent supercomputers, 7th Edition, NCF, The Hague, 1997.