Next: The Fujitsu AP3000. Up: Distributed-memory MIMD systems Previous: The C-DAC PARAM 9000/SS.

The Cray Research Inc. T3E.

Machine type            RISC-based distributed-memory multi-processor
Models                  T3E
Operating system        UNICOS MAX (micro-kernel Unix)
Connection structure    3-D Torus
Compilers               CFT77_M (Fortran 77 with extensions), C
Vendor's information    Web page http://www.cray.com/PUBLIC/product-info/T3E/

System parameters:

Model                      T3E             T3E-900
Clock cycle                3.3 ns          2.2 ns
Theor. peak performance
  Per proc. (64-bit)       600 Mflop/s     900 Mflop/s
  Maximal (64-bit)         1229 Gflop/s    1843 Gflop/s
Main memory                <= 4096 GB      <= 4096 GB
Memory/node                <= 2 GB         <= 2 GB
Communication bandwidth    300 MB/s        300 MB/s
No. of processors          16-2048         6-2048
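
The peak figures in the table can be checked from the clock cycle and the maximum processor count. The following is a minimal sketch, assuming two floating-point results per clock cycle (a factor inferred from the table values, not stated in it); the function names are illustrative only.

```python
MAX_PROCS = 2048          # largest configuration in the table
MAX_MEM_PER_NODE_GB = 2   # "Memory/node" upper limit in the table


def peak_per_proc_mflops(clock_ns, flops_per_cycle=2):
    """Theoretical peak per processor in Mflop/s.

    flops_per_cycle=2 is an assumption inferred from the table
    (600 Mflop/s at a clock cycle of about 3.3 ns).
    """
    clock_mhz = 1000.0 / clock_ns
    return clock_mhz * flops_per_cycle


def peak_system_gflops(per_proc_mflops, n_procs=MAX_PROCS):
    """Theoretical peak of a full system in Gflop/s."""
    return per_proc_mflops * n_procs / 1000.0


for model, clock_ns, nominal_mflops in (("T3E", 3.3, 600), ("T3E-900", 2.2, 900)):
    from_clock = peak_per_proc_mflops(clock_ns)
    system = peak_system_gflops(nominal_mflops)
    print(f"{model}: {from_clock:.0f} Mflop/s from clock, "
          f"{system:.1f} Gflop/s for {MAX_PROCS} processors")

print("Maximum main memory:", MAX_PROCS * MAX_MEM_PER_NODE_GB, "GB")
```

Computed from the listed clock cycles, the per-processor values come out slightly above 600 and 900 Mflop/s, which suggests that the cycle times in the table are rounded. The maximal figures and the main memory limit follow directly from 2048 processors.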

Remarks:

The T3E is the second generation of DM-MIMD systems from CRI. Its name follows that of its predecessor, the T3D, whose name referred to its connection structure: a 3-D torus. The T3E retains this interconnection structure, but in many other respects the two systems differ considerably. A first and important difference is that a front-end system is no longer required (although it is still possible to connect to a Cray T90). Systems of up to 128 processors are air-cooled; the larger ones, from 256 to 2,048 processors, are liquid-cooled.

For its computational tasks the T3E uses the DEC Alpha 21164 RISC processor and the T3E-900 the 21164A processor, just like the Avalon A12. Cray stresses, however, that the processors are encapsulated in such a way that they can easily be exchanged for any other (faster) processor as soon as one becomes available, without affecting the macro-architecture of the system.

Each node in the system contains one processing element (PE), which in turn contains a CPU, memory, and a communication engine that takes care of communication between PEs. The bandwidth between nodes is quite high: 300 MB/s. Like the T3D, the T3E has hardware support for fast synchronisation; for example, barrier synchronisation takes only one cycle per check.
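
To illustrate what a 3-D torus means for communication between PEs, the sketch below computes the six direct neighbours of a node and the minimum number of hops between two nodes. The 8 x 8 x 8 dimensions and the (x, y, z) numbering are hypothetical, chosen only for illustration; the actual PE placement and routing of the T3E are not described here.

```python
def torus_neighbours(coord, dims):
    """Return the six direct neighbours of a node in a 3-D torus.

    coord and dims are (x, y, z) tuples; the coordinates wrap around
    in every dimension, which is what distinguishes a torus from a mesh.
    """
    x, y, z = coord
    nx, ny, nz = dims
    return [
        ((x - 1) % nx, y, z), ((x + 1) % nx, y, z),
        (x, (y - 1) % ny, z), (x, (y + 1) % ny, z),
        (x, y, (z - 1) % nz), (x, y, (z + 1) % nz),
    ]


def torus_hops(a, b, dims):
    """Minimum hop count between nodes a and b, taking the shorter
    way around the ring in each dimension."""
    return sum(min((p - q) % n, (q - p) % n) for p, q, n in zip(a, b, dims))


dims = (8, 8, 8)  # hypothetical 512-node configuration
print(torus_neighbours((0, 0, 0), dims))
print(torus_hops((0, 0, 0), (7, 7, 7), dims))
print(torus_hops((0, 0, 0), (4, 4, 4), dims))
```

Because of the wrap-around links, the corner node (7, 7, 7) is only three hops from (0, 0, 0), while the most distant node in this hypothetical configuration, (4, 4, 4), is twelve hops away.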

Most of the changes in the transition from the T3D to the T3E have taken place in the microarchitecture. First, there is only one CPU per node instead of two, which removes a source of asymmetry between processors. Second, the new node processor has a 96 KB 3-way set-associative secondary cache, which may relieve some of the data-fetching problems of the T3D, where only a primary cache was present. Third, the Block Transfer Engine has been replaced by a set of E-registers that are believed to be much more flexible and that at least remove some odd restrictions on the size of shared arrays and the number of processes when using Cray-specific PVM. An interesting additional feature is the availability of 32 contexts per processor, which opens the door to multiprocessing.

In the T3D all I/O had to be handled by the front-end, a system of at least the Cray Y-MP/E class. In the T3E distributed I/O is present. One I/O channel can be configured for every 8 PEs in the air-cooled systems and for every 16 nodes in the liquid-cooled systems. The maximum bandwidth of a channel is about 1 GB/s; the actual speed will be on the order of 700 MB/s.
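
As a rough illustration of these ratios, the sketch below estimates the number of I/O channels and an upper bound on the aggregate I/O bandwidth for a given configuration. It assumes that every possible channel is actually configured and that all channels sustain about 700 MB/s at the same time, which is an optimistic simplification; the function names are illustrative only.

```python
def io_channels(n_pes, liquid_cooled):
    """Maximum number of I/O channels: one per 8 PEs in air-cooled
    systems, one per 16 in liquid-cooled systems."""
    pes_per_channel = 16 if liquid_cooled else 8
    return n_pes // pes_per_channel


def aggregate_io_gb_per_s(n_channels, per_channel_mb_per_s=700.0):
    """Upper bound on aggregate I/O bandwidth in GB/s, assuming every
    channel sustains the quoted practical speed simultaneously."""
    return n_channels * per_channel_mb_per_s / 1000.0


for n_pes, liquid in ((128, False), (2048, True)):
    channels = io_channels(n_pes, liquid)
    cooling = "liquid-cooled" if liquid else "air-cooled"
    print(f"{n_pes} PEs ({cooling}): {channels} channels, "
          f"up to {aggregate_io_gb_per_s(channels):.1f} GB/s")
```

Under these assumptions, the largest air-cooled system (128 PEs) could have 16 channels and the largest liquid-cooled system (2,048 PEs) 128 channels.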

The T3E supports various programming models. Apart from PVM 3.x and MPI for message passing and HPF for data distribution, a Cray-proprietary work-sharing model called CRAFT can be employed. Cray views HPF and Fortran 90 array syntax as subsets of the CRAFT model. Within this model data can be exchanged implicitly, so that the machine effectively looks like a shared-memory system to the user. Like several other vendors, Cray has extended and altered its implementation of PVM to enhance communication performance. For small messages this can give an improvement of a factor of 3 (20-25 µs instead of 70-80 µs). For SPMD programs, channel send/receive functions can be used, which reduce the communication time to 4-5 µs.
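
To see why these start-up times matter for small messages, the sketch below uses a simple linear cost model: transfer time = start-up time + message size / bandwidth. The model itself, the use of the midpoints of the quoted ranges, and the assumption that each software layer can reach the full 300 MB/s link bandwidth are simplifications made here for illustration, not measurements.

```python
LINK_BANDWIDTH_MB_S = 300.0  # hardware figure from the system parameters

# Midpoints of the quoted start-up times in microseconds (an assumption).
STARTUP_US = {
    "standard PVM": 75.0,       # 70-80 us
    "Cray PVM": 22.5,           # 20-25 us
    "channel send/recv": 4.5,   # 4-5 us
}


def transfer_time_us(n_bytes, startup_us, bandwidth_mb_s=LINK_BANDWIDTH_MB_S):
    """Linear cost model; 1 MB/s equals 1 byte per microsecond."""
    return startup_us + n_bytes / bandwidth_mb_s


for n_bytes in (8, 1024, 65536):
    times = ", ".join(
        f"{name}: {transfer_time_us(n_bytes, t0):.1f} us"
        for name, t0 in STARTUP_US.items()
    )
    print(f"{n_bytes:6d} bytes -> {times}")
```

In this model the start-up time dominates completely for messages of a few bytes, so the choice of communication layer determines the cost; for messages of tens of kilobytes the bandwidth term takes over and the difference between the layers shrinks.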

Measured Performances: In [4] a speed of 93.2 Gflop/s is quoted for solving a dense linear system of size 53,644 on 256 processors.
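
To put this figure in perspective, the sketch below relates it to the theoretical peak of 256 processors and estimates the solution time. The text does not say which model was used, so both per-processor peaks from the system parameters are tried; the operation count 2n^3/3 + 2n^2 is the one conventionally used for dense linear system benchmarks and is assumed here, since the counting convention of [4] is not restated in this section.

```python
def efficiency(measured_gflops, n_procs, per_proc_mflops):
    """Fraction of theoretical peak that was attained."""
    peak_gflops = n_procs * per_proc_mflops / 1000.0
    return measured_gflops / peak_gflops


def lu_flops(n):
    """Conventional operation count for solving a dense n x n system
    (assumed convention: 2/3 n^3 + 2 n^2)."""
    return 2.0 * n**3 / 3.0 + 2.0 * n**2


MEASURED_GFLOPS = 93.2
N_PROCS = 256
ORDER = 53644

for model, per_proc in (("T3E", 600), ("T3E-900", 900)):
    eff = efficiency(MEASURED_GFLOPS, N_PROCS, per_proc)
    print(f"If measured on a {model}: {eff:.1%} of peak")

seconds = lu_flops(ORDER) / (MEASURED_GFLOPS * 1e9)
print(f"Estimated solution time: {seconds:.0f} s")
```

The quoted speed corresponds to about 61% of peak if the run was made on 600 Mflop/s processors and about 40% if it was made on 900 Mflop/s processors; under the assumed operation count the solve would have taken roughly 1,100 seconds.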






Aad van der Steen
Mon Mar 3 13:14:53 MET 1997