The NEC SX-6.


Machine type	Distributed-memory multi-vector processor
Models	SX-6i, SX-6A, SX-6xMy
Operating system	Super-UX (Unix variant based on BSD V.4.3 Unix).
Connection structure	Multi-stage crossbar (see Remarks)
Compilers	Fortran 90, HPF, ANSI C, C++.
Vendor's information Web page	http://www.hpce.nec.com/468.0.html
Year of introduction	2002.

System parameters:

Model	SX-6i	SX-6A	SX-6xMy
Clock cycle	500 MHz	562.5 MHz	562.5 MHz
Theor. peak performance
Per Proc. (64 bits)	8 Gflop/s	9 Gflop/s	9 Gflop/s
Maximal
Single frame:	9 Gflop/s	72 Gflop/s	---
Multi frame:	---	---	9.2 Tflop/s
Main memory	4–8 GB	32–64 GB	≤ 16 TB
No. of processors	1	4–8	8–1024

Remarks:

The SX-6 series is offered in numerous models, but most of these are just smaller frames that house fewer of the same processors. We discuss only the essentially different models here. All models are based on the same processor, an 8-way replicated vector processor in which each set of vector pipes contains a logical, mask, add/shift, multiply, and division pipe (see section SM-SIMD systems for an explanation of these components). As multiplication and addition can be chained (but not division), a pipe set can deliver two floating-point results per cycle, so its peak performance at 500 MHz is 1 Gflop/s. Because of the 8-way replication a single CPU can deliver a peak performance of 8 Gflop/s at this clock frequency (9 Gflop/s at 562.5 MHz). The vector units are complemented by a scalar processor that is 4-way superscalar and at 562.5 MHz has a theoretical peak of 1.125 Gflop/s. The peak bandwidth per CPU is 64 B/cycle. This is sufficient to ship eight 8-byte operands back or forth per cycle, just enough to feed one operand to each of the replicated pipe sets.
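The peak figures quoted above follow from simple arithmetic. The short Python sketch below reproduces them, assuming that a chained multiply and add yield two floating-point results per pipe set per cycle; the function and constant names are illustrative only and do not come from NEC documentation.

```python
# Peak-rate arithmetic for one SX-6 CPU, using only figures quoted in the text.
# Assumption: a chained multiply + add gives 2 flops per pipe set per cycle.

FLOPS_PER_PIPE_SET_PER_CYCLE = 2   # one multiply chained with one add
PIPE_SETS_PER_CPU = 8              # 8-way replication of the vector pipes
BYTES_PER_CYCLE = 64               # peak bandwidth per CPU
OPERAND_BYTES = 8                  # one 64-bit operand


def pipe_set_peak_gflops(clock_mhz):
    """Peak of a single pipe set in Gflop/s at the given clock (MHz)."""
    return FLOPS_PER_PIPE_SET_PER_CYCLE * clock_mhz / 1000.0


def cpu_peak_gflops(clock_mhz):
    """Peak of one CPU (all replicated pipe sets) in Gflop/s."""
    return PIPE_SETS_PER_CPU * pipe_set_peak_gflops(clock_mhz)


def operands_per_cycle():
    """Number of 64-bit operands the peak bandwidth can move per cycle."""
    return BYTES_PER_CYCLE // OPERAND_BYTES


if __name__ == "__main__":
    print(pipe_set_peak_gflops(500.0))   # one pipe set at 500 MHz
    print(cpu_peak_gflops(500.0))        # SX-6i CPU
    print(cpu_peak_gflops(562.5))        # SX-6A / SX-6xMy CPU
    print(operands_per_cycle())          # operands moved per cycle
```

With eight operands per cycle and eight pipe sets, each pipe set can be fed only one operand per cycle, which is why the text calls the bandwidth "just enough".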

It is interesting to note that the peak performance of a single processor has actually dropped from 10 Gflop/s in the SX-5, the predecessor of the SX-6, to 8 Gflop/s. The reason is that the SX-6 CPU is now housed on a single chip, an impressive feat, whereas former versions of the CPU always required multiple chips. The replication factor, which was 16 in the SX-5, therefore had to be halved to 8.

The SX-6i is the single-CPU system which, because of the single-chip implementation, is offered as a deskside model. A rack model is also available that houses two systems in one rack, but there is no connection between the two systems.

A single frame of the SX-6A models holds up to 8 CPUs at the same clock frequency as the SX-6i. Internally the CPUs in the frame are connected by a 1-stage crossbar with the same bandwidth as that of a single-CPU system: 36 GB/s per port. With 8 CPUs of 9 Gflop/s each, a fully configured frame can attain a peak speed of 72 Gflop/s.
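The single-frame figures can be checked the same way. The sketch below assumes that the 36 GB/s port bandwidth is the 64 B/cycle per-CPU bandwidth taken at 562.5 MHz; that reading reproduces the quoted number but is an inference, and all names are illustrative.

```python
# Single-frame (SX-6A) arithmetic based on figures quoted in the text.
# Assumption: port bandwidth = 64 B/cycle evaluated at a 562.5 MHz clock.

CPU_PEAK_GFLOPS = 9.0      # per-CPU peak from the system parameter table
MAX_CPUS_PER_FRAME = 8     # a frame holds up to 8 CPUs


def frame_peak_gflops(n_cpus, cpu_peak=CPU_PEAK_GFLOPS):
    """Peak speed of one frame holding n_cpus processors, in Gflop/s."""
    if not 1 <= n_cpus <= MAX_CPUS_PER_FRAME:
        raise ValueError("a frame holds between 1 and 8 CPUs")
    return n_cpus * cpu_peak


def port_bandwidth_gb_per_s(bytes_per_cycle=64, clock_mhz=562.5):
    """Crossbar port bandwidth in GB/s under the stated assumption."""
    return bytes_per_cycle * clock_mhz / 1000.0


if __name__ == "__main__":
    print(frame_peak_gflops(8))        # fully configured frame
    print(port_bandwidth_gb_per_s())   # bandwidth per crossbar port
```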

In addition, there are multi-frame models (SX-6xMy), where x = 8,...,1024 is the total number of CPUs and y = 2,...,128 is the number of single-frame systems coupled into a larger system. There are two ways to couple the SX-6 frames in a multi-frame configuration. NEC provides a full crossbar, the so-called IXS crossbar, which connects the frames at a speed of 8 GB/s for point-to-point unidirectional out-of-frame communication (1024 GB/s bisection bandwidth for a maximum configuration). A HiPPI interface is also available for inter-frame communication at lower cost and speed. When the IXS crossbar is chosen, the total multi-frame system is globally addressable, turning the system into a NUMA system. However, for performance reasons it is advised to use the system in distributed-memory mode with MPI.
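To make the SX-6xMy naming scheme concrete, the following sketch decodes a model name into its CPU and frame counts and derives the theoretical peak. The parser, its accepted spelling (with or without a slash after "SX-6") and the consistency check against 8 CPUs per frame are assumptions made for illustration, not an NEC convention.

```python
import re

# Hypothetical helper for the SX-6xMy naming scheme described in the text:
# x = total number of CPUs (8..1024), y = number of frames (2..128).

CPU_PEAK_GFLOPS = 9.0     # per-CPU peak at 562.5 MHz
MAX_CPUS_PER_FRAME = 8    # a frame holds up to 8 CPUs

_MODEL = re.compile(r"^SX-6/?(\d+)M(\d+)$")


def parse_model(name):
    """Return (cpus, frames) for a multi-frame model name such as 'SX-6/248M31'."""
    match = _MODEL.match(name)
    if match is None:
        raise ValueError("not a multi-frame SX-6 model name: %r" % name)
    cpus, frames = int(match.group(1)), int(match.group(2))
    if not 8 <= cpus <= 1024:
        raise ValueError("CPU count must lie in 8..1024")
    if not 2 <= frames <= 128:
        raise ValueError("frame count must lie in 2..128")
    if cpus > frames * MAX_CPUS_PER_FRAME:
        raise ValueError("more CPUs than the frames can hold")
    return cpus, frames


def system_peak_tflops(cpus, cpu_peak=CPU_PEAK_GFLOPS):
    """Theoretical peak of a multi-frame system in Tflop/s."""
    return cpus * cpu_peak / 1000.0


if __name__ == "__main__":
    print(parse_model("SX-6/1024M128"))
    print(system_peak_tflops(1024))   # maximum configuration
```

For the maximum configuration of 1024 CPUs this gives 9.216 Tflop/s, which the system parameter table rounds to 9.2 Tflop/s.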

The technology used is CMOS. This lowers the fabrication costs and the power consumption appreciably (the same approach was already used in the late Fujitsu VPP5000 and the Cray SV1, and is now used in the Cray X1), and all models are air-cooled.

For distributed computing there is an HPF compiler, and for message passing an optimised MPI (MPI/SX) is available. In addition, OpenMP is available for shared-memory parallelism.

Measured Performances:
Results for a 31-frame SX-6/248M31 with 248 processors are available from [42]. In solving a linear system of order 220,224, the system attained a speed of 2155 Gflop/s, an efficiency of 97%.
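The quoted efficiency is the measured speed divided by the theoretical peak. The sketch below reproduces it, assuming that each of the 248 processors has the 9 Gflop/s peak listed for the multi-frame models; the function name is illustrative.

```python
# Efficiency = measured speed / theoretical peak.
# Assumption: every CPU contributes the 9 Gflop/s peak of the SX-6xMy models.

def efficiency(measured_gflops, n_cpus, cpu_peak_gflops=9.0):
    """Fraction of the theoretical peak that was actually attained."""
    return measured_gflops / (n_cpus * cpu_peak_gflops)


if __name__ == "__main__":
    frac = efficiency(2155.0, 248)
    print("peak: %.0f Gflop/s" % (248 * 9.0))
    print("efficiency: %.1f%%" % (100.0 * frac))
```

The theoretical peak of the 248-processor system is 2232 Gflop/s, so 2155 Gflop/s corresponds to roughly 96.6%, which rounds to the 97% stated above.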


Aad van der Steen
Mon Oct 11 15:27:34 CEST 2004