
The NEC SX-4.

Machine type            Distributed-memory multi-vector processor
Models                  SX-4C, SX-4
Operating system        EWS-UX/V (Unix variant based on Unix System V.4)
Connection structure    Multi-stage crossbar (see Remarks)
Compilers               Fortran 77, Fortran 90, HPF, ANSI C, C++
Vendors information     Web page http://www.nec.co.jp/english/product/computer/sx/
Year of introduction    1995

System parameters:

Model                      SX-4B/eA      SX-4B/A       SX-4A         SX-4
Clock cycle                8.8 ns        8.8 ns        8 ns          8 ns
Theor. peak performance
  Per proc. (64 bits)      0.9 Gflop/s   1.8 Gflop/s   2 Gflop/s     2 Gflop/s
  Maximal, single frame    0.9 Gflop/s   7.2 Gflop/s   32 Gflop/s    64 Gflop/s
  Maximal, multi frame     ---           ---           0.5 Tflop/s   1 Tflop/s
Main memory (per frame)    1--4 GB       2--8 GB       1--32 GB      1--16 GB
Ext. memory (per frame)    <= 8 GB       <= 16 GB      ---           <= 32 GB
No. of processors          1             1--4          4--256        4--512

Remarks:

The SX-4 series comprises a large range of machine sizes. The smallest of these is the SX-4B/eA, which has one CPU housing 4 vector pipe sets. As the clock cycle is 8.8 ns and each pipe set is able to deliver 2 floating-point results per cycle, the theoretical peak performance of this system is 0.9 Gflop/s. In all other systems the replication factor of the pipe sets is 8, which doubles the speed per CPU to a maximum of 1.8 Gflop/s in the SX-4B/A and, with the shorter clock cycle of 8 ns, to 2 Gflop/s in the SX-4A and SX-4 series. The bandwidth from memory to the CPUs is 16 64-bit words per cycle per CPU. With a replication factor of 8 this is enough to provide two operands per pipe set, but it is not sufficient to transport the results back to memory at the same time. So, operands have to be re-used to some extent to attain the peak performance.
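The per-CPU peak figures and the memory bandwidth balance quoted above can be checked with a few lines of arithmetic. The following is only a sketch: the function names are hypothetical, and the model simply multiplies pipe sets by results per cycle and divides by the cycle time, as the text describes.

```python
def peak_gflops(pipe_sets, results_per_cycle, clock_ns):
    """Theoretical peak in Gflop/s: results per cycle divided by the cycle time in ns."""
    return pipe_sets * results_per_cycle / clock_ns


def spare_words_per_cycle(pipe_sets, operands_per_pipe_set=2, words_per_cycle=16):
    """64-bit words per cycle left for storing results once all operands are loaded."""
    return words_per_cycle - pipe_sets * operands_per_pipe_set


# SX-4B/eA: 4 pipe sets at 8.8 ns
print(round(peak_gflops(4, 2, 8.8), 2))
# SX-4B/A: 8 pipe sets at 8.8 ns
print(round(peak_gflops(8, 2, 8.8), 2))
# SX-4A and SX-4: 8 pipe sets at 8 ns
print(round(peak_gflops(8, 2, 8.0), 2))
# With 8 pipe sets the operand loads use the full 16 words per cycle
print(spare_words_per_cycle(8))
```

The last value is zero, which is why results cannot be written back at full speed unless some operands are re-used from the vector registers.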

The distinction between the SX-4A and the SX-4 lies in the type of memory used: the SX-4A uses Synchronous DRAM (SDRAM), while the SX-4 employs Synchronous SRAM (SSRAM). SSRAM is faster but bulkier than SDRAM, so the maximum amount of memory per frame is 16 GB in an SX-4, against 32 GB per frame in the SX-4A.

The technology used is CMOS. This lowers the fabrication costs and the power consumption appreciably (the same approach is being used in the Fujitsu VPP700), and all models are air cooled. This enables the placement of up to 32 CPUs in one frame for the SX-4 model and 16 in the SX-4A. The placement of fewer CPUs in the SX-4A frame is a consequence of the slower memory, not of power consumption. Beyond this maximum single-frame system, it is possible to couple up to 16 frames together to form a distributed-memory system. This is comparable to the AlphaServer cluster idea. There are two ways to couple the SX-4 frames. NEC provides a full crossbar, the so-called IXS crossbar, to connect the various frames together at a speed of 16 GB/s for point-to-point out-of-frame communication (128 GB/s bisection bandwidth for a maximum configuration). In addition, a HiPPI interface is available for interframe communication at lower cost and speed.
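The multi-frame figures follow from the per-frame numbers. The sketch below recomputes them; the function names are hypothetical, and the bisection model is an assumption (cutting a full crossbar of n frames in half is taken to sever n/2 point-to-point paths, each running at the IXS link speed), not a description of NEC's own accounting.

```python
PEAK_PER_CPU_GFLOPS = 2.0  # SX-4A and SX-4, 8 ns clock cycle
MAX_FRAMES = 16
IXS_LINK_GBS = 16          # point-to-point out-of-frame speed in GB/s


def multi_frame_peak_tflops(cpus_per_frame, frames=MAX_FRAMES):
    """Aggregate theoretical peak in Tflop/s for a multi-frame system."""
    return cpus_per_frame * frames * PEAK_PER_CPU_GFLOPS / 1000


def bisection_gbs(frames=MAX_FRAMES, link_gbs=IXS_LINK_GBS):
    """Assumed model: frames/2 crossbar paths cross the cut, each at link speed."""
    return frames // 2 * link_gbs


print(multi_frame_peak_tflops(16))  # SX-4A, 16 CPUs per frame
print(multi_frame_peak_tflops(32))  # SX-4, 32 CPUs per frame
print(bisection_gbs())              # maximum configuration of 16 frames
```

Under these assumptions the results agree with the roughly 0.5 Tflop/s and 1 Tflop/s multi-frame peaks and with the 128 GB/s bisection bandwidth quoted for the maximum configuration.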

For distributed computing there is an HPF compiler and for message passing an optimised MPI (MPI/SX) is available. The SX-4 is the only system that supports three floating-point number systems: IBM-compatible, Cray-compatible, and the IEEE 754 standard.

Measured Performances: In [2] a speed of 122.2 Gflop/s was reported for the solution of a full linear system of order 30080 on a 64-processor multiframe configuration. This amounts to a very high efficiency of 95% of the theoretical peak. The author measured a speed of 1.83 Gflop/s for a matrix-vector multiplication on a single CPU (not yet published), which also shows an efficiency of over 90%.
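The efficiencies above are the measured speed divided by the theoretical peak of the processors used. A minimal sketch, assuming the 2 Gflop/s per-CPU peak of the SX-4A and SX-4 models (the function name is hypothetical):

```python
def efficiency(measured_gflops, cpus, peak_per_cpu_gflops=2.0):
    """Fraction of the theoretical peak that was actually attained."""
    return measured_gflops / (cpus * peak_per_cpu_gflops)


# Linear system of order 30080 on 64 processors
print(f"{efficiency(122.2, 64):.1%}")
# Matrix-vector multiplication on a single CPU
print(f"{efficiency(1.83, 1):.1%}")
```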






Aad van der Steen
Thu Feb 12 10:14:58 MET 1998