|Machine type|RISC-based distributed-memory multi-processor|
|Connection structure|3-D torus, tree network|
|Compilers|XL Fortran (Fortran 90), XL C, C++|
|Vendor's information Web page|www-1.ibm.com/servers/deepcomputing/|
|Year of introduction|2004|
|Clock cycle|700 MHz|
|Theor. peak performance, per proc. (64-bit)|2.8 Gflop/s|
|Memory/card|<= 512 MB|
|Memory, maximal|<= 16 TB|
|No. of processors|2×65,536|
|Point-to-point bandwidth (3-D torus)|175 MB/s|
|Point-to-point bandwidth (tree network)|350 MB/s|
The BlueGene/L is the first in a new generation of systems built by IBM for very massively parallel computing. Individual processor speed has therefore been traded in favour of very dense packaging and a low power consumption per processor. The basic processor in the system is a modified PowerPC 440 running at 700 MHz. Two of these processors reside on a chip, together with 4 MB of shared L3 cache and a 2 KB L2 cache for each of the processors. Each processor has two load ports and one store port from/to its L2 cache, each moving 8 bytes/cycle. This is half the bandwidth required to keep the two floating-point units (FPUs) fully fed, and as such quite high. Each CPU has 32 KB of instruction cache and 32 KB of data cache on board. In favourable circumstances a CPU can deliver a peak speed of 2.8 Gflop/s, because the two FPUs can each perform a fused multiply-add operation per cycle. Note that the L2 cache is smaller than the L1 cache, which is quite unusual but allows it to be fast.
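The per-CPU peak figure follows directly from the clock rate and the FPU configuration described above; a minimal check of that arithmetic:

```python
# Back-of-the-envelope peak performance for one BlueGene/L CPU,
# using the figures given in the text.
clock_hz = 700e6        # 700 MHz clock
fpus = 2                # two floating-point units per CPU
flops_per_fma = 2       # a fused multiply-add counts as two flops

peak = clock_hz * fpus * flops_per_fma
print(f"{peak / 1e9:.1f} Gflop/s")  # 2.8 Gflop/s
```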
The packaging in the system is as follows: two chips fit on a compute card with 512 MB of memory. Sixteen of these compute cards are placed on a node board, of which in turn 32 go into one cabinet. So, one cabinet contains 1024 chips, i.e., 2048 CPUs. For a maximal configuration, 64 cabinets are coupled to form one system with 65,536 chips/131,072 CPUs. In normal operation mode one of the CPUs on a chip is used for computation while the other takes care of communication tasks. In this mode the theoretical peak performance of the system is 183.5 Tflop/s. It is however possible, when the communication requirements are very low, to use both CPUs for computation, doubling the peak speed; hence the double entries in the System Parameters table above. The figure of 360 Tflop/s is also the speed that IBM uses in its marketing material.
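The packaging hierarchy and the two operation modes can be checked with a few lines of arithmetic, using only the counts given above:

```python
# Packaging arithmetic for the maximal BlueGene/L configuration,
# following the hierarchy described in the text.
chips_per_card = 2
cards_per_board = 16
boards_per_cabinet = 32
cabinets = 64
cpus_per_chip = 2
peak_per_cpu = 2.8e9     # peak flop/s per CPU at 700 MHz

chips = chips_per_card * cards_per_board * boards_per_cabinet * cabinets
cpus = chips * cpus_per_chip
print(chips, cpus)       # 65536 131072

# Co-processor mode: one CPU per chip computes, the other communicates.
print(f"{chips * peak_per_cpu / 1e12:.1f} Tflop/s")   # 183.5
# Virtual-node mode: both CPUs compute, doubling the peak.
print(f"{cpus * peak_per_cpu / 1e12:.1f} Tflop/s")    # 367.0
```

Note that IBM's marketing figure of 360 Tflop/s rounds this doubled peak downward.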
The BlueGene/L possesses no fewer than five networks, two of which are of interest for inter-processor communication: a 3-D torus network and a tree network. The torus network is used for most general communication patterns. The tree network is used for frequently occurring collective communication patterns like broadcasting, reduction operations, etc. The hardware bandwidth of the tree network is twice that of the torus: 350 MB/s against 175 MB/s per link.

At the time of writing this report no fully configured system exists yet. One such system should be delivered to Lawrence Livermore Lab by the end of this year. A smaller system of around 34 Tflop/s peak will be delivered to ASTRON, an astronomical research organisation in the Netherlands, for the synthesis of radio-astronomical images.
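To give a feel for the torus numbers, a rough sketch of worst-case hop counts and per-link transfer time. The full-system torus dimensions used here (64 × 32 × 32) are an assumption for illustration; only the 175 MB/s link bandwidth comes from the text.

```python
# Rough communication estimates for the 3-D torus network.
# ASSUMPTION: 64 x 32 x 32 torus dimensions for the maximal system;
# the per-link bandwidth of 175 MB/s is taken from the text.
dims = (64, 32, 32)

# Wrap-around links halve the worst-case distance in each dimension.
max_hops = sum(d // 2 for d in dims)
print(f"worst-case hops: {max_hops}")  # 32 + 16 + 16 = 64

# Bandwidth-only time to push a 1 MB message over one link,
# ignoring latency and per-hop routing overhead.
link_bw = 175e6          # bytes/s
msg_bytes = 1_000_000
print(f"one-link transfer: {msg_bytes / link_bw * 1e3:.2f} ms")
```

Even this crude estimate shows why collectives are routed over the tree network: a broadcast over the torus would traverse many 175 MB/s hops, while the tree offers 350 MB/s links and dedicated reduction hardware.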
Recently IBM has reported attaining a speed of 36.01 Tflop/s on the HPC Linpack benchmark. Neither the order of the linear system nor the size of the BlueGene system used was disclosed in the press release.