Comparing The Performance of MPI on
The Cray T3E-900, The Cray Origin 2000 and
The IBM P2SC

 
Glenn R. Luecke and James J. Coyle
Iowa State University
Ames, Iowa 50011-2251, USA

June, 1998
 
 

 Abstract

INTRODUCTION

The performance of the communication network of a parallel computer plays a critical role in its overall performance, see [4,6]. Writing scientific programs with calls to MPI [5,9] routines is rapidly becoming the standard for writing programs with explicit message passing. Thus, to evaluate the performance of the communication network of a parallel computer for scientific computing, a collection of communication tests that use MPI for the message passing has been written. These communication tests are a significant enhancement of those used in [7] and have been designed to test those communication patterns that we feel are likely to occur in scientific programs. This paper reports the results of these tests on the Cray T3E-900, the Cray Origin 2000 and the IBM P2SC ("Power 2 Super Chip").

DESCRIPTION OF THE COMMUNICATION TESTS AND RESULTS

All tests were written in Fortran with calls to MPI routines for the message passing. The communication tests were run with message sizes ranging from 8 bytes to 10 MBytes and with the number of processors ranging from 2 to 64. (Throughout this paper, MByte means 1,000,000 bytes, not 1024*1024 bytes, and KByte means 1,000 bytes, not 1024 bytes.) Because of memory limitations, for some of the tests the 10 MByte message size was replaced by a message of size 1 MByte. Message sizes were not chosen based on any knowledge of the computers being used but were chosen to represent a small message (8 bytes) and a ‘large’ message (10 MBytes), with two additional message sizes between these values (1 KByte and 100 KBytes). Some of these communication patterns took very little time to execute, so they were repeated in a loop until the total wall-clock time was at least one second in order to obtain more accurate timings. The time to execute a particular communication pattern was then obtained by dividing the total time by the number of loops. All timings were done using the MPI wall-clock timer, mpi_wtime(). A call to mpi_barrier was made just prior to the first call to mpi_wtime and again just prior to the second call to mpi_wtime to ensure processor synchronization for timing. Sixty-four bit real precision was used on all machines. Default MPI environment settings were used for all tests. Tests run on the Cray T3E-900 and Cray Origin 2000 were executed on machines dedicated to running only our tests. For these machines, five runs were made and the best performance results are reported. Tests run on the IBM P2SC were executed using LoadLeveler so that only one job at a time would be executing on the 32 nodes used. However, jobs running on other nodes would sometimes cause variability in the data, so the tests were run at least ten times and the best performance numbers are reported. Default environment settings were used on all three machines.
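
The timing procedure just described can be summarized by the following Fortran sketch. It is a minimal illustration, not the authors' actual test harness; the subroutine name, the argument names and the placeholder comment marking the communication pattern under test are ours.

    subroutine time_pattern(nloops, t_per_rep)
      implicit none
      include 'mpif.h'
      integer, intent(in)       :: nloops      ! chosen so the total wall-clock time is at least one second
      real(kind=8), intent(out) :: t_per_rep   ! average time for one repetition of the pattern
      integer      :: k, ierr
      real(kind=8) :: t0, t1
      call mpi_barrier(MPI_COMM_WORLD, ierr)   ! synchronize just before the first mpi_wtime
      t0 = mpi_wtime()
      do k = 1, nloops
         ! ... the communication pattern being measured goes here ...
      end do
      call mpi_barrier(MPI_COMM_WORLD, ierr)   ! synchronize just before the second mpi_wtime
      t1 = mpi_wtime()
      t_per_rep = (t1 - t0) / nloops
    end subroutine time_pattern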

Tests for the Cray T3E-900 were run on a 64-processor machine located in Chippewa Falls, Wisconsin. Each processor is a DEC EV5 microprocessor running at 450 MHz with peak theoretical performance of 900 Mflop/s. The three-dimensional, bi-directional torus communication network of the T3E-900 has a bandwidth of 350 MBytes/second and latency of 1 microsecond. For more information on the T3E-900 see [9]. The UNICOS/mk version 1.3.1 operating system, the cf90 version 3.0 Fortran compiler with the -O2 -dp compiler options and MPI version 3.0 were used for these tests. The MPI implementation used had the hardware data streams work-around enabled even though this is not needed for the T3E-900.

Tests for the Cray Origin 2000 were run on a 128-processor machine located in Eagan, Minnesota. Each processor is a MIPS R10000, 195 MHz microprocessor with a peak theoretical performance of 390 Mflop/s and a 4 MByte secondary cache. Each node consists of two processors sharing a common memory. The communication network is a hypercube for up to 32 processors and is called a "fat bristled hypercube" for more than 32 processors since multiple hypercubes are interconnected via the CrayRouter. The node-to-node bandwidth is 150 MBytes per second and the maximum remote latency in a 128-processor system is about 1 microsecond. For more information see [9]. A pre-release version of the Irix 6.5 operating system, MPI from version 1.1 of the Message Passing Toolkit, and the MipsPro 7.20 Fortran compiler with -O2 -64 compiler options were used for these tests.

Tests for the IBM P2SC were run at the Maui High Performance Computing Center. The peak theoretical performance of each P2SC processor is 480 Mflop/s for the 120 MHz thin nodes and 540 Mflop/s for the 135 MHz wide nodes. The communication network has a peak bi-directional bandwidth of 110 MBytes/second with a latency of 40.0 microseconds for thin nodes and 39.2 microseconds for wide nodes. Performance tests were run on thin nodes, each with 128 MBytes of memory. Only 48 thin nodes were available for running these tests, so there is no data for 64 processors. For more information about the P2SC see [10]. The AIX 4.1.5.0 operating system, the xlf version 4.1.0.0 Fortran compiler with the -O3 -qarch=pwr2 compiler options, and MPI version 2.2.0.2 were used for these tests.

Communication Test 1 (point-to-point, see table 1):

The first communication test measures the time required to send a real array A(1:n) from one processor to another by sending A from one processor to another and back to the original processor and dividing this round-trip time by two, where n is chosen to obtain a message of the desired size. Thus, to obtain a message of size 1 KByte, n = 1,000/8 = 125. Since each A(i) is 8 bytes, the communication rate for sending a message from one processor to another is calculated by 2*8*n/(wall-clock time), where the wall-clock time is the time to send A from one processor to another and then back to the original processor. This test is the same as the COMMS1 test described in section 3.3.1 of [4]. This test uses mpi_send and mpi_recv. Table 1 gives performance rates in KBytes per second. As is done in all the tables, the last column gives the ratios of the performance results of the T3E-900 to the IBM P2SC and of the T3E-900 to the Origin 2000. Recall that for small messages network latency, not bandwidth, is the dominant factor determining the communication rate. Notice that the achieved bandwidth on this test is significantly less than the bandwidth rates provided by the vendors: 350,000 KBytes/second for the T3E-900, 110,000 KBytes/second for the P2SC, and 150,000 KBytes/second for the Origin. All data in this paper have been rounded to three significant digits.
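
A self-contained sketch of such a ping-pong measurement is given below, assuming exactly two MPI processes and 8-byte reals (MPI_REAL8); the program name, message size and fixed loop count are illustrative and stand in for the loop-until-one-second logic of the actual tests.

    program pingpong
      implicit none
      include 'mpif.h'
      integer, parameter :: n = 125             ! 125 * 8 bytes = 1 KByte message
      integer, parameter :: nloops = 1000       ! stands in for looping until at least one second
      real(kind=8) :: a(n), t0, t1, rate
      integer :: ierr, rank, k, status(MPI_STATUS_SIZE)
      call mpi_init(ierr)
      call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
      a = 1.0d0
      call mpi_barrier(MPI_COMM_WORLD, ierr)
      t0 = mpi_wtime()
      do k = 1, nloops
         if (rank == 0) then                    ! send A to processor 1 and receive it back
            call mpi_send(a, n, MPI_REAL8, 1, 0, MPI_COMM_WORLD, ierr)
            call mpi_recv(a, n, MPI_REAL8, 1, 0, MPI_COMM_WORLD, status, ierr)
         else if (rank == 1) then               ! echo A back to processor 0
            call mpi_recv(a, n, MPI_REAL8, 0, 0, MPI_COMM_WORLD, status, ierr)
            call mpi_send(a, n, MPI_REAL8, 0, 0, MPI_COMM_WORLD, ierr)
         end if
      end do
      call mpi_barrier(MPI_COMM_WORLD, ierr)
      t1 = mpi_wtime()
      rate = 2.0d0*8.0d0*n*nloops / (t1 - t0) / 1000.0d0   ! 2*8*n bytes per round trip, in KBytes/second
      if (rank == 0) print *, 'point-to-point rate (KBytes/second):', rate
      call mpi_finalize(ierr)
    end program pingpong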
 

Message Size (Bytes) | T3E-900 | IBM P2SC | Origin 2000 | T3E/IBM, T3E/Origin
8 | 371 | 215 | 303 | 1.7, 1.2
1,000 | 30100 | 14500 | 18600 | 2.1, 1.6
100,000 | 144000 | 78700 | 86500 | 1.8, 1.7
10,000,000 | 151000 | 103000 | 90900 | 1.5, 1.7
Peak Rates | 350000 | 110000 | 150000 |
Table 1: Point-to-point communication rates plus advertised peak rates in KBytes/second.

Communication Test 2 (broadcast, see tables 2-a, 2-b and 2-c and figures 1 and 2):

This test measures communication rates for sending a message from one processor to all other processors and uses the mpi_bcast routine. This test is the COMMS3 test described in [4]. To better evaluate the performance of this broadcast operation, define a normalized broadcast rate as

(total data rate)/(p-1)

where p is the number of processors involved in the communication and the total data rate is the total amount of data sent on the communication network per unit time, measured in KBytes per second. Let R be the data rate when sending a message from one processor to another and let D be the total data rate for broadcasting the same message to the p-1 other processors. If the broadcast operation and communication network were able to transmit the messages concurrently, then D = R*(p-1) and the normalized broadcast rate would remain constant as p varied for a given message size. Therefore, for a fixed message size, the rate at which the normalized broadcast rate decreases as p increases indicates how far the broadcast operation is from being ideal. Assume the real array A(1:n), where each A(i) is 8 bytes, is broadcast from the root processor; the communication rate is then calculated by 8*n*(p-1)/(wall-clock time) and normalized by dividing by p-1 to obtain the normalized broadcast rate.
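
A minimal sketch of how the normalized broadcast rate might be measured is shown below; it assumes 8-byte reals and at least two processes, times a single mpi_bcast rather than the repeated measurement described earlier, and uses an illustrative program name and message size.

    program bcast_rate
      implicit none
      include 'mpif.h'
      integer, parameter :: n = 12500           ! 12500 * 8 bytes = 100 KByte message
      real(kind=8) :: a(n), t0, t1, total_rate, norm_rate
      integer :: ierr, p
      call mpi_init(ierr)
      call mpi_comm_size(MPI_COMM_WORLD, p, ierr)
      a = 1.0d0
      call mpi_barrier(MPI_COMM_WORLD, ierr)
      t0 = mpi_wtime()
      call mpi_bcast(a, n, MPI_REAL8, 0, MPI_COMM_WORLD, ierr)   ! root processor 0 broadcasts A
      call mpi_barrier(MPI_COMM_WORLD, ierr)
      t1 = mpi_wtime()
      total_rate = 8.0d0*n*(p-1) / (t1 - t0) / 1000.0d0          ! total data rate in KBytes/second
      norm_rate  = total_rate / (p - 1)                          ! normalized broadcast rate
      call mpi_finalize(ierr)
    end program bcast_rate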

Table 2-a gives the normalized broadcast rates obtained by keeping the root processor fixed for all repetitions of the broadcast. Figure 1 shows the graph of these results for a message of size 100 KBytes. Observe that for all machines, for a fixed message size, the normalized broadcast rate decreases as the number of processors increases (instead of being constant). Notice that the P2SC and Origin machines perform roughly the same and that the T3E-900 ranges from 1.1 to 4.7 times faster than the P2SC and Origin. Observe that the Origin does not scale as well as the T3E-900 as the number of processors increases. One might expect that the communication rate for a broadcast with 2 processors would be the same as the rate for communication test 1. However, the rates measured for the broadcast in tables 2-a and 2-b are higher than those measured in communication test 1 for all machines. It is not clear why this is so.

Figure 1: Normalized broadcast rates for a 100 KByte message (from table 2-a).

 

Message Size (Bytes) | Number of Processors | T3E-900 | IBM P2SC | Origin 2000 | T3E/IBM, T3E/Origin
8 | 2 | 545 | 350 | 518 | 1.6, 1.1
8 | 4 | 388 | 194 | 245 | 2.0, 1.6
8 | 8 | 290 | 132 | 159 | 2.2, 1.8
8 | 16 | 264 | 103 | 110 | 2.6, 2.4
8 | 32 | 223 | 83 | 60 | 2.7, 3.7
8 | 64 | 196 | NA | 42 | ---, 4.7
1,000 | 2 | 43600 | 32800 | 26700 | 1.3, 1.6
1,000 | 4 | 24400 | 16200 | 13900 | 1.5, 1.8
1,000 | 8 | 19100 | 11100 | 9500 | 1.7, 2.0
1,000 | 16 | 15900 | 8700 | 6650 | 1.8, 2.4
1,000 | 32 | 13300 | 7090 | 4960 | 1.9, 2.7
1,000 | 64 | 11800 | NA | 3820 | ---, 3.1
100,000 | 2 | 157000 | 81600 | 81000 | 1.9, 1.9
100,000 | 4 | 104000 | 41000 | 46200 | 2.5, 2.3
100,000 | 8 | 72500 | 27300 | 30900 | 2.7, 2.3
100,000 | 16 | 52300 | 20100 | 22100 | 2.6, 2.4
100,000 | 32 | 34900 | 13100 | 13300 | 2.7, 2.6
100,000 | 64 | 23800 | NA | 7900 | ---, 3.0
10,000,000 | 2 | 157000 | 103000 | 92800 | 1.5, 1.7
10,000,000 | 4 | 80400 | 51400 | 45100 | 1.6, 1.8
10,000,000 | 8 | 50800 | 34200 | 28900 | 1.5, 1.8
10,000,000 | 16 | 37600 | 25600 | 19500 | 1.5, 1.9
10,000,000 | 32 | 29200 | 15400 | 13300 | 1.9, 2.2
10,000,000 | 64 | 24500 | NA | 8080 | ---, 3.0
Table 2-a: Normalized broadcast rates in KBytes/second with a fixed root processor.

Table 2-b gives the normalized broadcast rates where the root processor is cycled through all p processors as the broadcast operation is repeated. Notice that the rates do change from those in table 2-a and the maximum percent change depends on the machine. The maximum percent change is about 14% for the T3E-900, 150% for the P2SC, and about 50% for the Origin. 
 

Message Size (Bytes) | Number of Processors | T3E-900 | IBM P2SC | Origin 2000 | T3E/IBM, T3E/Origin
8 | 2 | 637 | 178 | 415 | 3.6, 1.5
8 | 4 | 387 | 126 | 213 | 3.1, 1.8
8 | 8 | 287 | 93 | 178 | 3.1, 1.6
8 | 16 | 247 | 77 | 113 | 3.2, 2.2
8 | 32 | 211 | 62 | 58 | 3.4, 3.6
8 | 64 | 185 | NA | 37 | ---, 5.0
1,000 | 2 | 40300 | 13000 | 20200 | 3.1, 2.0
1,000 | 4 | 27000 | 9670 | 12000 | 2.8, 2.3
1,000 | 8 | 18400 | 7500 | 9340 | 2.5, 2.0
1,000 | 16 | 15300 | 6220 | 6710 | 2.5, 2.3
1,000 | 32 | 12900 | 4940 | 4920 | 2.6, 2.6
1,000 | 64 | 11400 | NA | 3570 | ---, 3.2
100,000 | 2 | 160000 | 77800 | 87000 | 2.1, 1.8
100,000 | 4 | 95800 | 40100 | 40800 | 2.4, 2.4
100,000 | 8 | 70000 | 26900 | 28500 | 2.6, 2.5
100,000 | 16 | 48700 | 20000 | 20400 | 2.4, 2.4
100,000 | 32 | 33900 | 12700 | 11800 | 2.7, 2.9
100,000 | 64 | 23700 | NA | 6020 | ---, 3.9
10,000,000 | 2 | 162000 | 103000 | 87600 | 1.6, 1.9
10,000,000 | 4 | 73300 | 51400 | 45000 | 1.4, 1.6
10,000,000 | 8 | 51000 | 34300 | 28700 | 1.5, 1.8
10,000,000 | 16 | 36300 | 25700 | 19500 | 1.4, 1.9
10,000,000 | 32 | 29300 | 15200 | 12200 | 1.9, 2.4
10,000,000 | 64 | 24600 | NA | 8270 | ---, 3.0
Table 2-b: Normalized broadcast rates in KBytes/second with the root processor cycled.

To better understand the amount of concurrency occurring in the broadcast operation, define the log normalized broadcast rate as

(total data rate)/log(p)

where p is the number of processors involved in the communication and log(p) is the log base 2 of p. Thus, if binary tree parallelism were being utilized, the log normalized data rate would be constant for a given message size as p varies. Table 2-c gives the log normalized data rates with the root processor cycled and shows that concurrency is in fact being utilized in the broadcast operation on these machines. Figure 2 shows these results for a message of size 100 KBytes. Notice that the performance of the T3E-900 is significantly better than binary tree parallelism for all message sizes tested. For messages of size 8 bytes and 1 KByte, the P2SC performs better than binary tree parallelism, and it achieves binary tree parallelism for the other two message sizes. The Origin gives better than binary tree parallelism for a 1 KByte message and binary tree parallelism for the other three message sizes.
 

Message Size (Bytes) | Number of Processors | T3E-900 | IBM P2SC | Origin 2000 | T3E/IBM, T3E/Origin
8 | 2 | 637 | 178 | 415 | 3.6, 1.5
8 | 4 | 580 | 189 | 319 | 3.1, 1.8
8 | 8 | 669 | 217 | 415 | 3.1, 1.6
8 | 16 | 926 | 288 | 423 | 3.2, 2.2
8 | 32 | 1310 | 384 | 353 | 3.4, 3.7
8 | 64 | 1940 | NA | 388 | ---, 5.0
1,000 | 2 | 40300 | 13000 | 20200 | 3.1, 2.0
1,000 | 4 | 40500 | 14500 | 17900 | 2.8, 2.3
1,000 | 8 | 43000 | 17500 | 21800 | 2.5, 2.0
1,000 | 16 | 57300 | 23300 | 25200 | 2.5, 2.3
1,000 | 32 | 79800 | 30600 | 30500 | 2.6, 2.6
1,000 | 64 | 120000 | NA | 37400 | ---, 3.2
100,000 | 2 | 160000 | 77800 | 87000 | 2.1, 1.8
100,000 | 4 | 144000 | 60200 | 61100 | 2.4, 2.4
100,000 | 8 | 163000 | 62700 | 66600 | 2.6, 2.5
100,000 | 16 | 183000 | 74900 | 76300 | 2.4, 2.4
100,000 | 32 | 210000 | 78900 | 73000 | 2.7, 2.9
100,000 | 64 | 249000 | NA | 63200 | ---, 3.9
10,000,000 | 2 | 162000 | 103000 | 87600 | 1.6, 1.9
10,000,000 | 4 | 110000 | 77100 | 67500 | 1.4, 1.6
10,000,000 | 8 | 119000 | 80000 | 67100 | 1.5, 1.8
10,000,000 | 16 | 136000 | 96300 | 73200 | 1.4, 1.9
10,000,000 | 32 | 184000 | 94400 | 75800 | 1.9, 2.4
10,000,000 | 64 | 258000 | NA | 86800 | ---, 3.0
Table 2-c: Log normalized broadcast rates in KBytes/second with the root processor cycled.
Figure 2: Log normalized broadcast rates for a 100 KByte message (from table 2-c).

Communication Test 3 (reduce, see table 3):

Assume that there are p processors and that processor i has a message, Ai(1:n), for i = 0, ..., p-1, where n is chosen to obtain the message of the desired size. Test 3 measures communication rates for varying sizes of the Ai when calculating the element-wise sum A = A0 + A1 + ... + A(p-1) and placing A on the root processor. Thus, this test uses mpi_reduce with the mpi_sum option. Since each element of Ai is 8 bytes, the communication rate can be calculated by 8*n*(p-1)/(wall-clock time) and then normalized by dividing by p-1. As was done with mpi_bcast, one could also calculate a log normalized data rate. Table 3 contains log normalized data rates since, as was true for mpi_bcast, more insight into the level of concurrency is obtained. Table 3 shows that the Origin exhibits binary tree parallelism and the other two machines exhibit better than binary tree parallelism. Notice that the T3E-900 performs well compared with the two other machines for messages of size 8 bytes and 1 KByte. However, for messages of size 100 KBytes (with 8 or more processors) and 10 MBytes (with 4 or more processors), the IBM machine gives superior performance. These results suggest that, for the larger message sizes, IBM may be using a better algorithm for mpi_reduce than the algorithm used on the T3E.
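
The following sketch shows how this reduction rate might be measured, again assuming 8-byte reals; the program name, message size and single (unrepeated) timing are illustrative.

    program reduce_rate
      implicit none
      include 'mpif.h'
      integer, parameter :: n = 125                  ! 1 KByte message per processor
      real(kind=8) :: ai(n), a(n), t0, t1, log_norm_rate
      integer :: ierr, p
      call mpi_init(ierr)
      call mpi_comm_size(MPI_COMM_WORLD, p, ierr)
      ai = 1.0d0
      call mpi_barrier(MPI_COMM_WORLD, ierr)
      t0 = mpi_wtime()
      ! element-wise sum of the Ai placed in A on root processor 0
      call mpi_reduce(ai, a, n, MPI_REAL8, MPI_SUM, 0, MPI_COMM_WORLD, ierr)
      call mpi_barrier(MPI_COMM_WORLD, ierr)
      t1 = mpi_wtime()
      ! log normalized data rate in KBytes/second
      log_norm_rate = 8.0d0*n*(p-1) / (t1 - t0) / 1000.0d0 / (log(real(p,8))/log(2.0d0))
      call mpi_finalize(ierr)
    end program reduce_rate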
 

Message Size (Bytes) | Number of Processors | T3E-900 | IBM P2SC | Origin 2000 | T3E/IBM, T3E/Origin
8 | 2 | 439 | 187 | 430 | 2.3, 1.0
8 | 4 | 336 | 185 | 372 | 1.8, .90
8 | 8 | 385 | 145 | 345 | 2.7, 1.1
8 | 16 | 491 | 120 | 368 | 4.1, 1.3
8 | 32 | 670 | 74 | 360 | 9.0, 1.9
8 | 64 | 977 | NA | 431 | ---, 2.2
1,000 | 2 | 29000 | 19400 | 26200 | 1.5, 1.1
1,000 | 4 | 22600 | 14800 | 21900 | 1.5, 1.0
1,000 | 8 | 25000 | 11900 | 21100 | 2.1, 1.2
1,000 | 16 | 30800 | 15800 | 25100 | 2.0, 1.2
1,000 | 32 | 42300 | 20400 | 29000 | 2.1, 1.5
1,000 | 64 | 59200 | NA | 38300 | ---, 1.5
100,000 | 2 | 49000 | 38300 | 63100 | 1.3, .78
100,000 | 4 | 40800 | 39600 | 47400 | 1.0, .86
100,000 | 8 | 43500 | 52600 | 47100 | .83, .92
100,000 | 16 | 53500 | 75900 | 53100 | .70, 1.0
100,000 | 32 | 70600 | 88700 | 59500 | .80, 1.2
100,000 | 64 | 96300 | NA | 67900 | ---, 1.4
10,000,000 | 2 | 49600 | 39400 | 41900 | 1.3, 1.2
10,000,000 | 4 | 40100 | 56600 | 29100 | .71, 1.4
10,000,000 | 8 | 41700 | 78600 | 26800 | .53, 1.6
10,000,000 | 16 | 52000 | 118000 | 28900 | .44, 1.8
10,000,000 | 32 | 70500 | 162000 | 35100 | .43, 2.0
10,000,000 | 64 | 94100 | NA | 32400 | ---, 2.9
Table 3: Log normalized data rates in KBytes/second for mpi_reduce with the mpi_sum option.

Communication Test 4 (all reduce, see table 4 and figure 3):

This communication test is the same as communication test 3 except that A is placed on all processors instead of only on the root processor. This test uses the mpi_allreduce routine and is functionally equivalent to a reduce followed by a broadcast. Thus, the communication rate for this test is calculated by 2*[8*n*(p-1)]/(wall-clock time) and then divided by p-1 to get a normalized data rate. Since normalized data rates drop sharply for fixed message sizes as the number of processors increases, more insight into the level of concurrency is obtained by displaying log normalized data rates, see table 4 and figure 3. Notice that the P2SC and Origin exhibit binary tree parallelism and the T3E does much better. Also notice that for most of the cases in test 4, the T3E-900 significantly outperforms the other two machines. The P2SC does not scale nearly as well as the T3E-900 for messages of sizes 8 bytes and 1 KByte. The Origin does not scale nearly as well as the T3E-900 for all message sizes.

Figure 3: Log normalized data rates for mpi_allreduce for a 1 KByte message (from table 4).

 

Message Size (Bytes) | Number of Processors | T3E-900 | IBM P2SC | Origin 2000 | T3E/IBM, T3E/Origin
8 | 2 | 671 | 225 | 363 | 3.0, 1.8
8 | 4 | 671 | 191 | 291 | 3.5, 2.3
8 | 8 | 775 | 198 | 299 | 3.9, 2.6
8 | 16 | 994 | 233 | 319 | 4.3, 3.1
8 | 32 | 1360 | 273 | 291 | 5.0, 4.7
8 | 64 | 2000 | NA | 357 | ---, 5.6
1,000 | 2 | 46000 | 17700 | 20300 | 2.6, 2.3
1,000 | 4 | 39900 | 14000 | 13700 | 2.9, 2.9
1,000 | 8 | 44700 | 14200 | 14400 | 3.2, 3.1
1,000 | 16 | 55000 | 16700 | 17600 | 3.3, 3.1
1,000 | 32 | 73400 | 19000 | 20200 | 3.9, 3.6
1,000 | 64 | 106000 | NA | 26300 | ---, 4.0
100,000 | 2 | 77900 | 63400 | 74400 | 1.2, 1.0
100,000 | 4 | 68300 | 49900 | 51500 | 1.4, 1.3
100,000 | 8 | 75100 | 51700 | 52300 | 1.5, 1.4
100,000 | 16 | 92300 | 61400 | 61000 | 1.5, 1.5
100,000 | 32 | 118000 | 56100 | 67700 | 2.1, 1.7
100,000 | 64 | 168000 | NA | 74700 | ---, 2.2
10,000,000 | 2 | 76500 | 69800 | 48000 | 1.1, 1.6
10,000,000 | 4 | 67700 | 55800 | 28100 | 1.2, 2.4
10,000,000 | 8 | 73900 | 59700 | 27400 | 1.2, 2.7
10,000,000 | 16 | 92300 | 71300 | 32800 | 1.3, 2.8
10,000,000 | 32 | 114000 | 70200 | 38800 | 1.6, 2.9
10,000,000 | 64 | 166000 | NA | 33300 | ---, 5.0
Table 4: Log normalized data rates in KBytes/second for mpi_allreduce with the mpi_sum option.

Communication Test 5 (gather, see table 5):

Assume that there are p processors and that processor i has a message, Ai(1:n), for i = 0, ..., p-1. This test uses the mpi_gather routine and measures the communication rate for gathering the Ai's into an array B located on the root processor, where B(1:n,i) = Ai(1:n) for i = 0, ..., p-1. Since the normalized data rates drop sharply as the number of processors increases for a fixed message size, the log normalized data rate is used for reporting performance results for this test. Thus, the communication rate is calculated by 8*n*(p-1)/(wall-clock time) and then normalized by dividing by log(p). Because of the large amount of memory required to store B when a large number of processors is used, the largest message size used for this test was 1 MByte instead of 10 MBytes. Relative performance results are quite mixed, but the T3E-900 outperformed the other machines on nearly all of these tests. Notice the large drop in performance on the Origin for 8 byte and 1 KByte messages as the number of processors increases.
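
A sketch of the gather measurement follows, with an illustrative program name and message size and assuming 8-byte reals; for simplicity the receive buffer B is allocated on every processor although mpi_gather only uses it on the root.

    program gather_rate
      implicit none
      include 'mpif.h'
      integer, parameter :: n = 125                  ! 1 KByte contribution per processor
      real(kind=8), allocatable :: b(:,:)
      real(kind=8) :: ai(n), t0, t1, log_norm_rate
      integer :: ierr, p, rank
      call mpi_init(ierr)
      call mpi_comm_size(MPI_COMM_WORLD, p, ierr)
      call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
      allocate(b(n, 0:p-1))
      ai = real(rank, 8)
      call mpi_barrier(MPI_COMM_WORLD, ierr)
      t0 = mpi_wtime()
      ! B(1:n,i) = Ai(1:n) on root processor 0
      call mpi_gather(ai, n, MPI_REAL8, b, n, MPI_REAL8, 0, MPI_COMM_WORLD, ierr)
      call mpi_barrier(MPI_COMM_WORLD, ierr)
      t1 = mpi_wtime()
      log_norm_rate = 8.0d0*n*(p-1) / (t1 - t0) / 1000.0d0 / (log(real(p,8))/log(2.0d0))
      call mpi_finalize(ierr)
    end program gather_rate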

Message Size (Bytes) | Number of Processors | T3E-900 | IBM P2SC | Origin 2000 | T3E/IBM, T3E/Origin
8 | 2 | 191 | 175 | 222 | 1.1, 0.9
8 | 4 | 266 | 167 | 233 | 1.6, 1.1
8 | 8 | 268 | 107 | 163 | 2.5, 1.6
8 | 16 | 251 | 83 | 83 | 3.0, 3.0
8 | 32 | 223 | 62 | 37 | 3.6, 6.0
8 | 64 | 189 | NA | 11 | ---, 18.
1,000 | 2 | 16200 | 14300 | 13700 | 1.1, 1.2
1,000 | 4 | 22700 | 8990 | 12100 | 2.5, 1.9
1,000 | 8 | 19900 | 3610 | 10400 | 5.5, 1.9
1,000 | 16 | 18900 | 4220 | 7190 | 4.5, 2.6
1,000 | 32 | 14700 | 5090 | 4280 | 2.9, 3.4
1,000 | 64 | 12300 | NA | 1450 | ---, 8.5
100,000 | 2 | 43300 | 30700 | 41100 | 1.4, 1.1
100,000 | 4 | 48400 | 34400 | 32700 | 1.4, 1.5
100,000 | 8 | 31900 | 21600 | 27800 | 1.5, 1.1
100,000 | 16 | 34400 | 23600 | 22300 | 1.5, 1.5
100,000 | 32 | 28600 | 19600 | 16300 | 1.5, 1.8
100,000 | 64 | 26400 | NA | 10600 | ---, 2.5
1,000,000 | 2 | 43600 | 37100 | 39100 | 1.2, 1.1
1,000,000 | 4 | 49100 | 38500 | 28200 | 1.3, 1.7
1,000,000 | 8 | 42200 | 32900 | 20100 | 1.3, 2.1
1,000,000 | 16 | 35700 | 27100 | 16100 | 1.3, 2.2
1,000,000 | 32 | 24100 | 23300 | 12000 | 1.0, 2.0
1,000,000 | 64 | NA | NA | 8440 |
Table 5: Log normalized data rates in KBytes/second for mpi_gather.

 Communication Test 6 (all gather, see table 6 and figure 4):

This test is the same as test 5 except that the gathered message is placed on all processors instead of only on the root processor. Test 6 is functionally equivalent to a gather followed by a broadcast and uses mpi_allgather. The communication rate is calculated by 2*[8*n*(p-1)]/(wall-clock time) and is divided by log(p) to obtain a log normalized data rate. Because of the large amount of memory required to store B when a large number of processors is used, the largest message size used for this test was 1 MByte. Notice the large drop in relative performance of the Origin as the number of processors increases. Also notice that none of these machines were able to achieve binary tree parallelism on this test.
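
A corresponding sketch for the all-gather test is shown below (illustrative names and sizes, 8-byte reals assumed); note that the data moved is counted twice, as in the rate formula above.

    program allgather_rate
      implicit none
      include 'mpif.h'
      integer, parameter :: n = 125                  ! 1 KByte contribution per processor
      real(kind=8), allocatable :: b(:,:)
      real(kind=8) :: ai(n), t0, t1, log_norm_rate
      integer :: ierr, p, rank
      call mpi_init(ierr)
      call mpi_comm_size(MPI_COMM_WORLD, p, ierr)
      call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
      allocate(b(n, 0:p-1))                          ! the gathered result ends up on every processor
      ai = real(rank, 8)
      call mpi_barrier(MPI_COMM_WORLD, ierr)
      t0 = mpi_wtime()
      call mpi_allgather(ai, n, MPI_REAL8, b, n, MPI_REAL8, MPI_COMM_WORLD, ierr)
      call mpi_barrier(MPI_COMM_WORLD, ierr)
      t1 = mpi_wtime()
      ! data counted twice, 2*[8*n*(p-1)], then log normalized; KBytes/second
      log_norm_rate = 2.0d0*8.0d0*n*(p-1) / (t1 - t0) / 1000.0d0 / (log(real(p,8))/log(2.0d0))
      call mpi_finalize(ierr)
    end program allgather_rate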

Figure 4: Log normalized data rates for mpi_allgather for a 100 KByte message (from table 6).
 
Message Size (Bytes) | Number of Processors | T3E-900 | IBM P2SC | Origin 2000 | T3E/IBM, T3E/Origin
8 | 2 | 658 | 270 | 426 | 2.4, 1.5
8 | 4 | 498 | 203 | 323 | 2.5, 1.5
8 | 8 | 397 | 208 | 252 | 1.9, 1.6
8 | 16 | 315 | 244 | 176 | 1.3, 1.8
8 | 32 | 260 | 273 | 112 | 1.0, 2.3
8 | 64 | 221 | NA | 32 | ---, 7.0
1,000 | 2 | 50600 | 20700 | 20700 | 2.4, 2.4
1,000 | 4 | 36600 | 13100 | 14000 | 2.8, 2.6
1,000 | 8 | 27200 | 8800 | 9800 | 3.1, 2.8
1,000 | 16 | 18600 | 7730 | 5730 | 2.4, 3.2
1,000 | 32 | 12200 | 5720 | 3810 | 2.1, 3.2
1,000 | 64 | 9250 | NA | 2590 | ---, 3.6
100,000 | 2 | 137000 | 80200 | 65600 | 1.7, 2.1
100,000 | 4 | 87800 | 35100 | 26500 | 2.5, 3.3
100,000 | 8 | 63300 | 25600 | 14000 | 2.5, 4.5
100,000 | 16 | 36100 | 20000 | 8310 | 1.8, 4.3
100,000 | 32 | 22200 | 9230 | 3780 | 2.4, 5.9
100,000 | 64 | 14200 | NA | 1020 | ---, 14.
1,000,000 | 2 | 137000 | 93300 | 60300 | 1.5, 2.3
1,000,000 | 4 | 91600 | 38700 | 20700 | 2.4, 4.4
1,000,000 | 8 | 69100 | 26800 | 9010 | 2.6, 7.7
1,000,000 | 16 | 36700 | 19800 | 3880 | 1.9, 9.5
1,000,000 | 32 | 21100 | 9390 | 2030 | 2.2, 10.
1,000,000 | 64 | NA | NA | 725 |
Table 6: Log normalized data rates in KBytes/second for mpi_allgather.
Communication Test 7 (scatter, see table 7):

Assume that B is a two-dimensional array, B(1:n,0:p-1), where p is the number of processors used. This test uses mpi_scatter and measures communication rates for scattering B from the root processor to all other processors so that processor j receives B(1:n,j), for j = 0, ..., p-1. The communication rate for this test is calculated by 8*n*(p-1)/(wall-clock time) and then divided by log(p) to obtain the log normalized data rate. Because of the large memory requirements when a large number of processors is used for this test, the largest message used for this test was 1 MByte.
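
The scatter measurement might look like the following sketch (illustrative names and sizes, 8-byte reals assumed); B is allocated everywhere for simplicity although mpi_scatter only reads it on the root.

    program scatter_rate
      implicit none
      include 'mpif.h'
      integer, parameter :: n = 125                  ! 1 KByte piece per processor
      real(kind=8), allocatable :: b(:,:)
      real(kind=8) :: aj(n), t0, t1, log_norm_rate
      integer :: ierr, p, rank
      call mpi_init(ierr)
      call mpi_comm_size(MPI_COMM_WORLD, p, ierr)
      call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
      allocate(b(n, 0:p-1))
      if (rank == 0) b = 1.0d0                       ! root processor 0 holds B(1:n,0:p-1)
      call mpi_barrier(MPI_COMM_WORLD, ierr)
      t0 = mpi_wtime()
      ! processor j receives B(1:n,j)
      call mpi_scatter(b, n, MPI_REAL8, aj, n, MPI_REAL8, 0, MPI_COMM_WORLD, ierr)
      call mpi_barrier(MPI_COMM_WORLD, ierr)
      t1 = mpi_wtime()
      log_norm_rate = 8.0d0*n*(p-1) / (t1 - t0) / 1000.0d0 / (log(real(p,8))/log(2.0d0))
      call mpi_finalize(ierr)
    end program scatter_rate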

Notice that, relative to the T3E-900, the Origin performance results decrease as the number of processors increases for each message size. This also happens for the P2SC for all message sizes other than 8 bytes. Observe that the Origin and IBM P2SC perform roughly the same for most cases and that the T3E-900 is 2 to 3 times faster than both of these machines for most tests. Also notice that none of these machines is able to achieve binary tree parallelism except for the P2SC on the 8 byte message.
 

Message Size (Bytes) | Number of Processors | T3E-900 | IBM P2SC | Origin 2000 | T3E/IBM, T3E/Origin
8 | 2 | 417 | 314 | 265 | 1.3, 1.6
8 | 4 | 423 | 293 | 224 | 1.4, 1.9
8 | 8 | 418 | 345 | 196 | 1.2, 2.1
8 | 16 | 356 | 510 | 165 | 0.7, 2.2
8 | 32 | 304 | 614 | 136 | 0.5, 2.2
8 | 64 | 263 | NA | 105 | ---, 2.5
1,000 | 2 | 33600 | 28200 | 19700 | 1.2, 1.7
1,000 | 4 | 30500 | 20100 | 12100 | 1.5, 2.5
1,000 | 8 | 25500 | 11100 | 9510 | 2.3, 2.7
1,000 | 16 | 21300 | 9640 | 7610 | 2.2, 2.8
1,000 | 32 | 18000 | 7680 | 6030 | 2.3, 3.0
1,000 | 64 | 15700 | NA | 4600 | ---, 3.4
100,000 | 2 | 91900 | 65400 | 67400 | 1.4, 1.4
100,000 | 4 | 83200 | 43800 | 46000 | 1.9, 1.8
100,000 | 8 | 77700 | 30300 | 33100 | 2.6, 2.3
100,000 | 16 | 66600 | 22700 | 26300 | 2.9, 2.5
100,000 | 32 | 55800 | 17900 | 21000 | 3.1, 2.7
100,000 | 64 | 46100 | NA | 16000 | ---, 2.9
1,000,000 | 2 | 93500 | 73800 | 66500 | 1.3, 1.4
1,000,000 | 4 | 84500 | 46100 | 43100 | 1.8, 2.0
1,000,000 | 8 | 78700 | 32600 | 30900 | 2.4, 2.6
1,000,000 | 16 | 67400 | 25200 | 23600 | 2.7, 2.9
1,000,000 | 32 | 56900 | 20400 | 19400 | 2.8, 2.9
1,000,000 | 64 | 46900 | NA | 15300 | ---, 3.1
Table 7: Log normalized data rates in KBytes/second for mpi_scatter.
Communication Test 8 (all-to-all, see table 8, figure 5):

Assume C is a three-dimensional array, C(1:n,0:p-1,0:p-1), with C(1:n,j,0:p-1) on processor j. Also assume that C(1:n,j,k) is sent to processor k, where j and k both range from 0 to p-1. This test uses mpi_alltoall, and the communication rate is calculated by 8*n*(p-1)*p/(wall-clock time) and then normalized by dividing by p and not by log(p). As the number of processors increases, this test provides a good stress test for the communication network. Because of the large memory requirements when a large number of processors is used for this test, the largest message used for this test was 1 MByte.
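
The all-to-all exchange can be sketched as follows (illustrative names and sizes, 8-byte reals assumed); note the normalization by p rather than log(p).

    program alltoall_rate
      implicit none
      include 'mpif.h'
      integer, parameter :: n = 125                  ! 1 KByte sent to each destination
      real(kind=8), allocatable :: csend(:,:), crecv(:,:)
      real(kind=8) :: t0, t1, norm_rate
      integer :: ierr, p
      call mpi_init(ierr)
      call mpi_comm_size(MPI_COMM_WORLD, p, ierr)
      allocate(csend(n, 0:p-1), crecv(n, 0:p-1))
      csend = 1.0d0
      call mpi_barrier(MPI_COMM_WORLD, ierr)
      t0 = mpi_wtime()
      ! column k of csend on processor j is delivered to processor k
      call mpi_alltoall(csend, n, MPI_REAL8, crecv, n, MPI_REAL8, MPI_COMM_WORLD, ierr)
      call mpi_barrier(MPI_COMM_WORLD, ierr)
      t1 = mpi_wtime()
      norm_rate = 8.0d0*n*(p-1)*p / (t1 - t0) / 1000.0d0 / p   ! normalized by p, in KBytes/second
      call mpi_finalize(ierr)
    end program alltoall_rate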

Notice that table 8 and figure 5 use normalized data rates and not log normalized data rates. Thus, table 8 and figure 5 show the high level of parallelism achieved for mpi_alltoall for these machines, especially for the T3E-900. Also notice that relative to one another, the performance of the T3E-900 and P2SC remained nearly constant for all these tests with the T3E-900 giving roughly twice the performance of the P2SC. However, the performance of the Origin relative to the other two machines dropped significantly as the number of processors increases. There was insufficient memory on the T3E-900 to run this test for a 1 MByte message with 64 processors.

Figure 5: Normalized data rates for mpi_alltoall for a 1 KByte message (from table 8).

 

Message Size (Bytes) | Number of Processors | T3E-900 | IBM P2SC | Origin 2000 | T3E/IBM, T3E/Origin
8 | 2 | 305 | 151 | 154 | 2.0, 2.0
8 | 4 | 480 | 186 | 210 | 2.6, 2.3
8 | 8 | 574 | 273 | 203 | 2.1, 2.8
8 | 16 | 615 | 405 | 135 | 1.5, 4.6
8 | 32 | 620 | 527 | 62 | 1.2, 10.
8 | 64 | 630 | NA | 25 | ---, 25.
1,000 | 2 | 24100 | 11400 | 10800 | 2.1, 2.2
1,000 | 4 | 35400 | 16500 | 12900 | 2.1, 2.7
1,000 | 8 | 36900 | 16900 | 13300 | 2.2, 2.8
1,000 | 16 | 35800 | 16600 | 12300 | 2.2, 2.9
1,000 | 32 | 29400 | 12500 | 7840 | 2.3, 3.8
1,000 | 64 | 27300 | NA | 882 | ---, 31.
100,000 | 2 | 63900 | 40800 | 35000 | 1.6, 1.8
100,000 | 4 | 75300 | 44500 | 42900 | 1.7, 1.8
100,000 | 8 | 88100 | 44900 | 38000 | 2.0, 2.3
100,000 | 16 | 70400 | 44800 | 27100 | 1.6, 2.6
100,000 | 32 | 52400 | 33800 | 14400 | 1.6, 3.6
100,000 | 64 | 42700 | NA | 315 | ---, 135
1,000,000 | 2 | 62000 | 46500 | 38600 | 1.3, 1.6
1,000,000 | 4 | 92400 | 51500 | 37000 | 1.8, 2.5
1,000,000 | 8 | 103000 | 52900 | 29500 | 2.0, 3.5
1,000,000 | 16 | 73100 | 52700 | 8660 | 1.4, 8.4
1,000,000 | 32 | 53300 | 43400 | 2640 | 1.2, 20.
1,000,000 | 64 | NA | NA | 945 |
Table 8: Normalized data rates in KBytes/second for mpi_alltoall.

Communication Test 9 (broadcast-gather, see table 9):

This test uses mpi_bcast and mpi_gather and measures communication rates for broadcasting a message from the root processor to all other processors and then having the root processor gather these messages back from all processors. This test is included since there may be situations where the root processor will broadcast a message to the other processors, the other processors use this message to perform some calculations, and then the newly computed data is gathered back to the root processor. The communication rate is calculated by 2*[8*n*(p-1)]/(wall-clock time) and then divided by log(p) to obtain the log normalized data rate. Because of the large memory requirements when a large number of processors is used, the largest message used for this test was 1 MByte.

Notice that the T3E-900 significantly outperforms the other machines. Also observe that there seems to be a problem on the Origin for 8 byte messages with 64 processors. No machine achieved binary tree parallelism on this test. There was insufficient memory on the T3E-900 to run this test for a 1 MByte message with 64 processors.
 

Message Size (Bytes) | Number of Processors | T3E-900 | IBM P2SC | Origin 2000 | T3E/IBM, T3E/Origin
8 | 2 | 501 | 164 | 409 | 3.1, 1.2
8 | 4 | 473 | 131 | 317 | 3.6, 1.5
8 | 8 | 460 | 140 | 252 | 3.3, 1.8
8 | 16 | 446 | 169 | 195 | 2.6, 2.3
8 | 32 | 403 | 198 | 124 | 2.0, 3.3
8 | 64 | 357 | NA | 11 | ---, 34.
1,000 | 2 | 45300 | 12400 | 21300 | 3.7, 2.1
1,000 | 4 | 37300 | 8720 | 16400 | 4.3, 2.3
1,000 | 8 | 32100 | 7800 | 14500 | 4.2, 2.2
1,000 | 16 | 30300 | 7850 | 12900 | 3.9, 2.4
1,000 | 32 | 25500 | 7600 | 9870 | 3.4, 2.6
1,000 | 64 | 22700 | NA | 7200 | ---, 3.2
100,000 | 2 | 124000 | 71400 | 88400 | 1.7, 1.4
100,000 | 4 | 95200 | 53700 | 62200 | 1.8, 1.5
100,000 | 8 | 80500 | 44100 | 46000 | 1.8, 1.7
100,000 | 16 | 66700 | 38400 | 38600 | 1.7, 1.7
100,000 | 32 | 58000 | 33300 | 29600 | 1.7, 2.0
100,000 | 64 | 53100 | NA | 21900 | ---, 2.4
1,000,000 | 2 | 124000 | 85000 | 89200 | 1.5, 1.4
1,000,000 | 4 | 94700 | 60700 | 52200 | 1.6, 1.8
1,000,000 | 8 | 79000 | 49900 | 35400 | 1.6, 2.2
1,000,000 | 16 | 67400 | 42800 | 27700 | 1.6, 2.4
1,000,000 | 32 | 47100 | 38800 | 19600 | 1.2, 2.4
1,000,000 | 64 | NA | NA | 12800 |
Table 9: Log normalized data rates in KBytes/second for mpi_bcast followed by mpi_gather.

Communication Test 10 (scatter-gather, see table 10):

This test uses mpi_scatter followed by mpi_gather and measures communication rates for scattering a message from the root processor and then gathering these messages back to the root processor. The communication rate is calculated by 2*[8*n*(p-1)]/(wall-clock time) and then divided by log(p) to obtain the log normalized data rate. The largest message size used for this test was 1 MByte because of the large memory requirements of this test when 64 processors are used. There was insufficient memory on the T3E-900 to run this test for a 1 MByte message with 64 processors.

Message Size (Bytes) | Number of Processors | T3E-900 | IBM P2SC | Origin 2000 | T3E/IBM, T3E/Origin
8 | 2 | 386 | 157 | 408 | 2.5, 0.9
8 | 4 | 363 | 132 | 242 | 2.8, 1.5
8 | 8 | 338 | 147 | 187 | 2.3, 1.8
8 | 16 | 296 | 176 | 154 | 1.7, 1.9
8 | 32 | 254 | 211 | 118 | 1.2, 2.2
8 | 64 | 210 | NA | 84 | ---, 2.5
1,000 | 2 | 30900 | 11800 | 20400 | 2.6, 1.5
1,000 | 4 | 25700 | 8230 | 14500 | 3.1, 1.8
1,000 | 8 | 21900 | 6420 | 10600 | 3.4, 2.1
1,000 | 16 | 19900 | 5910 | 8060 | 3.4, 2.5
1,000 | 32 | 15900 | 5170 | 6080 | 3.1, 2.6
1,000 | 64 | 13600 | NA | 4870 | ---, 2.8
100,000 | 2 | 89500 | 62400 | 71800 | 1.4, 1.2
100,000 | 4 | 72900 | 44600 | 48300 | 1.6, 1.5
100,000 | 8 | 59700 | 31500 | 32300 | 1.9, 1.8
100,000 | 16 | 47100 | 23900 | 24600 | 2.0, 1.9
100,000 | 32 | 39400 | 19100 | 16500 | 2.1, 2.4
100,000 | 64 | 34000 | NA | 11000 | ---, 3.1
1,000,000 | 2 | 90800 | 74600 | 65200 | 1.2, 1.4
1,000,000 | 4 | 73800 | 48000 | 35100 | 1.5, 2.1
1,000,000 | 8 | 59400 | 34300 | 25900 | 1.7, 2.3
1,000,000 | 16 | 48300 | 25900 | 18500 | 1.9, 2.6
1,000,000 | 32 | 34500 | 22000 | 10900 | 1.6, 3.2
1,000,000 | 64 | NA | NA | 9030 |
Table 10: Log normalized data rates in KBytes/second for mpi_scatter followed by mpi_gather.

Communication Test 11 (reduce-scatter, see table 11):

The mpi_reduce_scatter routine with the mpi_sum option is functionally equivalent to first reducing messages from all processors onto a root processor and then scattering this reduced message to all processors. This MPI routine could be implemented by a reduce followed by a scatter. However, it can be implemented more efficiently by not reducing the arrays on a single processor and then scattering the reduced array, but rather by having each processor reduce only those elements required for its portion of the final scattered array. Our communication rate is based on this more efficient method. Thus, it is calculated by 8*n*(p-1)/(wall-clock time) and then divided by log(p) to obtain the log normalized data rate.
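
A sketch of the reduce-scatter measurement is given below, assuming 8-byte reals and an array length n that divides evenly among the p processors; the program name, sizes and single timing are illustrative.

    program reduce_scatter_rate
      implicit none
      include 'mpif.h'
      integer, parameter :: nloc = 125               ! elements of the reduced array kept per processor
      real(kind=8), allocatable :: a(:), aloc(:)
      real(kind=8) :: t0, t1, log_norm_rate
      integer, allocatable :: counts(:)
      integer :: ierr, p, n
      call mpi_init(ierr)
      call mpi_comm_size(MPI_COMM_WORLD, p, ierr)
      n = nloc * p                                   ! total length of the array being reduced
      allocate(a(n), aloc(nloc), counts(p))
      a = 1.0d0
      counts = nloc                                  ! each processor receives nloc summed elements
      call mpi_barrier(MPI_COMM_WORLD, ierr)
      t0 = mpi_wtime()
      call mpi_reduce_scatter(a, aloc, counts, MPI_REAL8, MPI_SUM, MPI_COMM_WORLD, ierr)
      call mpi_barrier(MPI_COMM_WORLD, ierr)
      t1 = mpi_wtime()
      log_norm_rate = 8.0d0*n*(p-1) / (t1 - t0) / 1000.0d0 / (log(real(p,8))/log(2.0d0))
      call mpi_finalize(ierr)
    end program reduce_scatter_rate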

From table 11, notice that the T3E-900 achieves better than binary tree parallelism for all the message sizes tested. The P2SC and Origin achieve better than binary tree parallelism for 100 KBytes and 10 MBytes messages. Notice that the performance of the P2SC significantly improves for messages of size 100 KBytes and 10 MBytes.

The next two communication tests are designed to measure communication between "neighboring" processors for a ring of processors using mpi_cart_create (with reorder = .true.), mpi_cart_shift, and mpi_sendrecv.

Communication Test 12 (right shift, see table 12):

This communication test sends a message from processor i to processor (i+1) mod p, for i = 0, 1, …, p-1. Observe that the data rates for this test will increase proportionally with p in an ideal parallel machine. Thus, for communication tests 12 and 13, we define the normalized data rate to be (total data rate)/p. In an ideal parallel computer, the normalized data rate for the above communication would be constant since all communication would be done concurrently. For this test the total data rate is calculated by 8*n*p/(wall-clock time).
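
The right-shift pattern can be sketched as follows, using mpi_cart_create, mpi_cart_shift and mpi_sendrecv as described above; the program name, message size and single timing are illustrative, and 8-byte reals are assumed.

    program right_shift_rate
      implicit none
      include 'mpif.h'
      integer, parameter :: n = 125                  ! 1 KByte message
      real(kind=8) :: asend(n), arecv(n), t0, t1, norm_rate
      integer :: ierr, p, ring, left, right, dims(1), status(MPI_STATUS_SIZE)
      logical :: periodic(1)
      call mpi_init(ierr)
      call mpi_comm_size(MPI_COMM_WORLD, p, ierr)
      dims(1) = p
      periodic(1) = .true.
      ! one-dimensional periodic ring; reorder = .true. lets MPI renumber the processors
      call mpi_cart_create(MPI_COMM_WORLD, 1, dims, periodic, .true., ring, ierr)
      call mpi_cart_shift(ring, 0, 1, left, right, ierr)   ! left = source, right = destination for a +1 shift
      asend = 1.0d0
      call mpi_barrier(ring, ierr)
      t0 = mpi_wtime()
      ! every processor sends to its right neighbor and receives from its left neighbor
      call mpi_sendrecv(asend, n, MPI_REAL8, right, 0, &
                        arecv, n, MPI_REAL8, left,  0, ring, status, ierr)
      call mpi_barrier(ring, ierr)
      t1 = mpi_wtime()
      norm_rate = 8.0d0*n*p / (t1 - t0) / 1000.0d0 / p     ! total data rate divided by p, KBytes/second
      call mpi_finalize(ierr)
    end program right_shift_rate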

Table 12 gives the normalized data rates for the above communication in KBytes/second. Notice that both the T3E-900 and P2SC scale well as the number of processors increases (although there is only data for the P2SC up to 32 processors), since the normalized data rates are roughly constant as the number of processors increases. Observe that table 12 shows normalized data rates and not log normalized data rates, and hence it exhibits the high degree of parallelism achieved on this test for all three machines, especially the T3E-900. Notice that the performance of the Origin relative to the T3E-900 becomes much worse as the message size increases. Also observe that the T3E-900 is significantly faster than both of the other machines.
 

Message Size (Bytes) | Number of Processors | T3E-900 | IBM P2SC | Origin 2000 | T3E/IBM, T3E/Origin
8 | 2 | 249 | 126 | 174 | 2.0, 1.4
8 | 4 | 224 | 65 | 129 | 3.5, 1.7
8 | 8 | 245 | 63 | 121 | 3.9, 2.0
8 | 16 | 266 | 60 | 113 | 4.4, 2.4
8 | 32 | 291 | 56 | 99 | 5.2, 2.9
8 | 64 | 305 | NA | 84 | ---, 3.6
1,000 | 2 | 16800 | 11800 | 11300 | 1.4, 1.5
1,000 | 4 | 14400 | 9580 | 8640 | 1.5, 1.7
1,000 | 8 | 15600 | 5490 | 9060 | 2.8, 1.7
1,000 | 16 | 15300 | 5510 | 10300 | 2.8, 1.5
1,000 | 32 | 17700 | 5510 | 10100 | 3.2, 1.8
1,000 | 64 | 19500 | NA | 9650 | ---, 2.0
100,000 | 2 | 44600 | 53900 | 43500 | .82, 1.0
100,000 | 4 | 38500 | 58000 | 35000 | .66, 1.1
100,000 | 8 | 42900 | 76100 | 40600 | .56, 1.1
100,000 | 16 | 51300 | 105000 | 45800 | .49, 1.1
100,000 | 32 | 66600 | 102000 | 53700 | .65, 1.2
100,000 | 64 | 91900 | NA | 55100 | ---, 1.7
10,000,000 | 2 | 45500 | 62900 | 28600 | .72, 1.6
10,000,000 | 4 | 38900 | 73900 | 23900 | .52, 1.6
10,000,000 | 8 | 41600 | 93400 | 22900 | .45, 1.8
10,000,000 | 16 | 51200 | 149000 | 27700 | .34, 1.9
10,000,000 | 32 | 69700 | 186000 | 35200 | .37, 2.0
10,000,000 | 64 | 94000 | NA | 31600 | ---, 3.0
Table 11: Log normalized data rates in KBytes/second for mpi_reduce_scatter.
 
Message Size (Bytes) | Number of Processors | T3E-900 | IBM P2SC | Origin 2000 | T3E/IBM, T3E/Origin
8 | 2 | 511 | 153 | 208 | 3.3, 2.5
8 | 4 | 495 | 151 | 218 | 3.3, 2.3
8 | 8 | 456 | 145 | 186 | 3.1, 2.5
8 | 16 | 390 | 139 | 115 | 2.8, 3.4
8 | 32 | 371 | 127 | 62 | 2.9, 6.0
8 | 64 | 367 | NA | 44 | ---, 8.3
1,000 | 2 | 30200 | 11600 | 13200 | 2.6, 2.3
1,000 | 4 | 28700 | 11200 | 12700 | 2.6, 2.3
1,000 | 8 | 28100 | 10900 | 11300 | 2.6, 2.5
1,000 | 16 | 25300 | 10400 | 10300 | 2.4, 2.5
1,000 | 32 | 23600 | 9200 | 7330 | 2.6, 3.2
1,000 | 64 | 22600 | NA | 5440 | ---, 4.2
100,000 | 2 | 134000 | 43300 | 38900 | 3.1, 3.4
100,000 | 4 | 109000 | 43800 | 37700 | 2.5, 2.9
100,000 | 8 | 129000 | 43100 | 33900 | 3.0, 3.8
100,000 | 16 | 126000 | 41100 | 28000 | 3.1, 4.5
100,000 | 32 | 121000 | 29100 | 15800 | 4.2, 7.7
100,000 | 64 | 110000 | NA | 8100 | ---, 14.
10,000,000 | 2 | 137000 | 56900 | 39000 | 2.4, 3.5
10,000,000 | 4 | 106000 | 54500 | 38200 | 1.9, 2.8
10,000,000 | 8 | 138000 | 54400 | 28500 | 2.5, 4.8
10,000,000 | 16 | 137000 | 54200 | 24300 | 2.5, 5.7
10,000,000 | 32 | 137000 | 47400 | 13800 | 2.9, 9.9
10,000,000 | 64 | 134000 | NA | 2680 | ---, 50.
Table 12: Normalized data rates for right shift in KBytes/second.

Communication Test 13 (left & right shift, see table 13 and figure 6):

This test is the same as the above test except that here a message is sent from processor i to each of its neighbors, (i-1) mod p and (i+1) mod p, for i = 0, 1, ..., p-1. Thus, the amount of data being moved on the network will be twice that of the previous test, so the normalized data rate is calculated by 2*8*n*p/(wall-clock time). Notice that the normalized data rates for communication test 13 are about the same as those for communication test 12. The Origin communication network allows for the concurrent sending of incoming and outgoing streams of data from one node to another. Because of this, one might expect the normalized data rates for the Origin for this test to be twice those of the previous test. However, this doubling of the data rate did not occur. Figure 6 shows the normalized data rate for this test for a message of size 100 KBytes.
 

Message Size (Bytes) | Number of Processors | T3E-900 | IBM P2SC | Origin 2000 | T3E/IBM, T3E/Origin
8 | 2 | 501 | 154 | 202 | 3.3, 2.5
8 | 4 | 484 | 146 | 226 | 3.3, 2.1
8 | 8 | 480 | 143 | 194 | 3.4, 2.5
8 | 16 | 419 | 138 | 117 | 3.0, 3.6
8 | 32 | 393 | 126 | 62 | 3.1, 6.3
8 | 64 | 370 | NA | 44 | ---, 8.4
1,000 | 2 | 29500 | 11500 | 13200 | 2.6, 2.2
1,000 | 4 | 27800 | 10900 | 12700 | 2.6, 2.2
1,000 | 8 | 28500 | 10500 | 12000 | 2.7, 2.4
1,000 | 16 | 27000 | 10200 | 10600 | 2.6, 2.5
1,000 | 32 | 23800 | 8940 | 7590 | 2.7, 3.1
1,000 | 64 | 23300 | NA | 5450 | ---, 4.3
100,000 | 2 | 132000 | 42800 | 43300 | 3.1, 3.1
100,000 | 4 | 116000 | 43600 | 42900 | 2.7, 2.7
100,000 | 8 | 129000 | 43100 | 38000 | 3.0, 3.4
100,000 | 16 | 126000 | 41700 | 31200 | 3.0, 4.0
100,000 | 32 | 124000 | 30300 | 21200 | 4.1, 5.8
100,000 | 64 | 109000 | NA | 8080 | ---, 14.
10,000,000 | 2 | 142000 | 54200 | 29800 | 2.6, 4.8
10,000,000 | 4 | 122000 | 54300 | 30100 | 2.3, 4.1
10,000,000 | 8 | 139000 | 54200 | 19700 | 2.6, 7.1
10,000,000 | 16 | 139000 | 54000 | 16600 | 2.6, 8.4
10,000,000 | 32 | 139000 | 48200 | 11900 | 2.9, 12.
10,000,000 | 64 | 123000 | NA | 2910 | ---, 42.
Table 13: Normalized data rates for the left and right shift in KBytes/second.
Figure 6: Normalized data rates for left & right shifts for a 100 KByte message (from table 13).

CONCLUSIONS

This study was conducted to evaluate the relative communication performance of the Cray T3E-900, the Cray Origin 2000 and the IBM P2SC on a collection of 13 communication tests that call MPI routines. The communication tests were designed to include communication patterns that we feel are likely to occur in scientific programs. Tests were run for messages of size 8 bytes, 1 KByte, 100 KBytes and 10 MBytes using 2, 4, 8, 16, 32 and 64 processors (although 64 processors were not available on the P2SC). Because of memory limitations, for some of the tests the 10 MByte message size was replaced by messages of size 1 MByte. The relative performance of these machines varied depending on the communication test, but overall the T3E-900 was often 2 to 4 times faster than the Origin and P2SC. The Origin and P2SC performed about the same for most of the tests. For a fixed message size, the performance of the Origin relative to the T3E-900 would often drop significantly as the number of processors increased. For a fixed message size, the performance of the P2SC relative to the T3E-900 would typically drop as the number of processors increased, but this drop was not nearly as large as the drop that occurred on the Origin.

ACKNOWLEDGMENTS

Computer time on the Maui High Performance Computer Center’s P2SC was sponsored by the Phillips Laboratory, Air Force Material Command, USAF, under cooperative agreement number F29601-93-2-0001. The views and conclusions contained in this document are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of Phillips Laboratory or the U.S. Government.

We would like to thank Cray Research Inc. for allowing us to use their T3E-900 and Origin 2000 located in Chippewa Falls, Wisconsin and Eagan, Minnesota, USA, respectively.

REFERENCES  

  1. Cray MPP Fortran Reference Manual, SR 2504 6.2.2, Cray Research, Inc., June 1995.
  2. J. Dongarra, R. Whaley, A User’s Guide to the BLACS v1.0, Computer Science Department Technical Report CS-95-281, University of Tennessee, 1995. (Available as LAPACK Working Note 94 at: http://www.netlib.org/lapack/lawns/lawn94.ps)
  3. A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, V. Sunderam, PVM: Parallel Virtual Machine A Users’ Guide and Tutorial for Networked Parallel Computing, The MIT Press, 1994. 
  4. R. Hockney, M. Berry, Public International Benchmarks for Parallel Computers: PARKBENCH Committee, Report-1, February 7, 1994.
  5. W. Gropp, E. Lusk, A. Skjellum, USING MPI, The MIT Press 1994.
  6. G. Luecke, J. Coyle, W. Haque, J. Hoekstra, H. Jespersen, Performance Comparison of Workstation Clusters for Scientific Computing, SUPERCOMPUTER, vol XII, no. 2, pp 4-20, March 1996.
  7. G. Luecke, J. Coyle, Comparing the Performance of MPI on the Cray Research T3E and IBM SP-2, January, 1997 (preprint), see http://www.public.iastate.edu/~grl/homepage.html.
  8. Optimization and Tuning Guide for Fortran, C, and C++ for AIX version 4, second edition, IBM, June 1996.
  9. M. Snir, S. Otto, S. Huss-Lederman, D. Walker, J. Dongarra, MPI: The Complete Reference, The MIT Press, 1996.
  10. http://www.cray.co
  11. http://www.austin.ibm.com/hardware/largescale/index.html