Machine type: Distributed-memory vector multi-processor.
Operating system: UXP/VPP (a Unix System V.4-based variant).
Connection structure: Multi-stage crossbar.
Compilers: Fortran 77+ (Fortran 77 with data decomposition extensions), C.
Note: the figures above are given for a 128-processor configuration.
The VPP-500 is front-ended by a vector processor of the VPX200 or VP2000 series. The system itself can be used as a batch processor from the front-end. Each node, called a Processing Element (PE), is a powerful vector processor in its own right with a peak speed of 1.6 Gflop/s. The vector processor is complemented by a RISC scalar processor with a peak speed of 200 Mflop/s. The scalar instruction format is 64 bits wide and may cause the execution of three operations in parallel. Each PE has a memory of 128--256 MB, and a PE communicates with its fellow PEs at a point-to-point speed of 400 MB/s. This communication is handled by separate Data Transfer Units (DTUs). To enhance communication efficiency, the DTU supports various transfer modes: contiguous, stride, sub-array, and indirect access. The DTUs also handle the translation of logical to physical PE ids and of logical in-PE addresses to real addresses.
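The DTU addressing modes can be illustrated with a small sketch. The function names and signatures below are hypothetical, chosen only to show what each mode selects from a (conceptually remote) source buffer; they are not Fujitsu's actual interface:

```c
#include <stddef.h>

/* Contiguous mode: copy n consecutive elements starting at offset. */
void dtu_contiguous(double *dst, const double *src, size_t offset, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = src[offset + i];
}

/* Stride mode: pick every stride-th element from offset. */
void dtu_stride(double *dst, const double *src, size_t offset,
                size_t stride, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = src[offset + i * stride];
}

/* Indirect mode: gather through an index vector. */
void dtu_indirect(double *dst, const double *src,
                  const size_t *idx, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = src[idx[i]];
}
```

The sub-array mode would correspond to a nested loop over a multi-dimensional section; the stride and indirect variants above cover the two basic non-contiguous cases.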
Because the network is a multi-stage crossbar, its complexity grows only logarithmically with the number of PEs. The network has some intelligence built in, in the form of a Synchronisation Register (SR). When synchronisation is required, each PE can set its corresponding bit in the SR. The value of the SR is broadcast to all PEs, and synchronisation has occurred once all bits for the relevant PEs are set. This method is comparable to the use of synchronisation registers in shared-memory vector processors and is much faster than synchronising via memory.
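The SR mechanism amounts to a hardware barrier over a bit mask. A minimal sketch of the idea (illustrative only; the names and the software-visible interface are assumptions, not the real hardware API):

```c
#include <stdint.h>

/* One bit per PE: a barrier is complete when every participating
 * PE's bit is set in the synchronisation register. */
typedef struct {
    uint32_t sr;        /* the synchronisation register            */
    uint32_t expected;  /* bit mask of PEs taking part in the sync */
} sync_reg;

/* A PE announces arrival at the barrier by setting its own bit. */
void sr_arrive(sync_reg *s, unsigned pe_id) {
    s->sr |= (uint32_t)1 << pe_id;
}

/* The broadcast SR value tells every PE whether the barrier is done. */
int sr_done(const sync_reg *s) {
    return (s->sr & s->expected) == s->expected;
}
```

Testing a single register value against a mask is why this is so much cheaper than a memory-based barrier, where each PE would have to read and update shared counters through the memory system.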
Communication to the outside world is realised via the SSU of the front-end VP2000 or VPX200 system, which also handles all I/O. On the VPP side, traffic is controlled by one or two Control Processors (CPs). Other tasks of the CPs are to keep track of the availability and state of the PEs and to allocate work to them. The allocation is dynamic, and three access modes for the PEs are possible: in ``simplex'' mode, a request for N processors grants N PEs exclusively as soon as they become available, with one process per PE. ``Exclusive'' mode also claims N PEs, but in this case 1 or 2 processes may run on each PE in the complex. In ``shared'' mode, many processes may share the requested PEs, but each is guaranteed at least some capacity during processing.
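The distinction between the three access modes can be summarised as a bound on how many processes may run on each granted PE. The sketch below is an illustrative model of that policy only (the enum and function are hypothetical, not part of the UXP/VPP interface):

```c
/* The three PE access modes described in the text. */
typedef enum { SIMPLEX, EXCLUSIVE, SHARED } pe_mode;

/* Maximum number of processes allowed on one granted PE;
 * -1 means effectively unbounded (shared mode). */
int max_processes_per_pe(pe_mode m) {
    switch (m) {
    case SIMPLEX:   return 1;   /* one process per exclusively granted PE */
    case EXCLUSIVE: return 2;   /* 1 or 2 processes per PE in the complex */
    case SHARED:    return -1;  /* many processes may share the PEs       */
    }
    return 0;
}
```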