The estimated runtimes are shown in Figures 4 and 5. The 2D partitioning allows us to study a variety of process structures. The computational work is drawn gridded and the Hippi part dashed. Since we assume a series-parallel model, these parts do not overlap. Intra-system communication does occur and is included in the model, but it accounts for less than 1% and is therefore not visible in the figures.
The computational work is always split into a small gridded part and a largely gridded part. The ideal minimum runtime on a single C90 system (i.e. sequential_runtime/16) is normalised to 1 and is represented by the small gridded bar. For this large problem, the different partitionings provided by the IFS have only a marginal influence. With 2 and 4 systems per cluster, the estimated runtime drops to nearly 1/2 and 1/4 respectively; the corresponding ideal runtimes are sequential_runtime/32 and sequential_runtime/64. In all cases, the overall runtime of our large example is dominated by this ideal-runtime part.
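The normalisation described above can be made concrete with a short sketch. It assumes 16 processors per C90 system, as implied by the term sequential_runtime/16; the function name is illustrative and not taken from the paper.

```python
from fractions import Fraction

# Assumption: 16 processors per C90 system, implied by sequential_runtime/16.
PROCESSORS_PER_SYSTEM = 16


def normalised_ideal_runtime(systems: int) -> Fraction:
    """Ideal runtime on `systems` C90 systems, relative to one system (= 1)."""
    # Ideal runtime on one system, as a fraction of the sequential runtime.
    ideal_single = Fraction(1, PROCESSORS_PER_SYSTEM)
    # Ideal runtime on the whole cluster: sequential_runtime/32, /64, ...
    ideal_cluster = Fraction(1, PROCESSORS_PER_SYSTEM * systems)
    return ideal_cluster / ideal_single


for n in (1, 2, 4):
    print(n, normalised_ideal_runtime(n))
```

For 1, 2 and 4 systems this gives 1, 1/2 and 1/4, which are the lower bounds that the estimated runtimes in the figures approach.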
The remaining computational work (the largely gridded part) is caused mainly by less efficient vectorisation and by load imbalance in the parallel case. We considered only those examples where load imbalance is of minor importance. Load imbalance would occur for a process structure if s is not a good divisor of the number of levels z; in that case it would arise at least within the Fourier space (cf. Figure 1 and ).
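A minimal sketch of why divisibility matters, assuming the z levels are distributed as evenly as possible over s processes. This distribution rule is an assumption made for illustration and is not taken from the IFS. Under it, the busiest process holds ceil(z/s) levels and determines the runtime of the step.

```python
def level_balance_efficiency(z: int, s: int) -> float:
    """Fraction of useful work when z levels are spread over s processes.

    Assumption: even block distribution, so the busiest process
    receives ceil(z / s) levels and all others wait for it.
    """
    busiest = -(-z // s)  # integer ceiling of z / s
    return z / (s * busiest)


# Hypothetical level count, chosen only for illustration.
z = 31
for s in (1, 2, 3, 5, 31):
    print(s, round(level_balance_efficiency(z, s), 3))
```

An efficiency of 1.0 means s divides z exactly; values below 1.0 quantify the idle time caused by the remainder.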
The Hippi time costs are split into a densely dashed part for start-up time and a sparsely dashed part for transmission time. Start-up time plays no role for large problems. Therefore, the best partitioning here is . For the -case, however, start-up time takes a considerable amount of time, in particular on a 4-system cluster and with high values of r. Since vectorisation is more efficient with high r-values, the best partitioning here is a square partitioning.
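The split into start-up and transmission time corresponds to the usual linear message cost model. The sketch below assumes that model; the function name, the message counts, the start-up latency and the bandwidth are placeholders chosen for illustration, not measured Hippi figures.

```python
def hippi_time(messages: int, total_bytes: float,
               startup_s: float, bandwidth_bytes_per_s: float):
    """Return (start-up time, transmission time) in seconds.

    Assumption: linear cost model, i.e. a fixed start-up cost per
    message plus a transfer cost proportional to the data volume.
    """
    startup = messages * startup_s
    transmission = total_bytes / bandwidth_bytes_per_s
    return startup, transmission


# Hypothetical parameters: 1 ms start-up per message, 100 MB/s bandwidth.
small = hippi_time(messages=1000, total_bytes=1e6,
                   startup_s=1e-3, bandwidth_bytes_per_s=1e8)
large = hippi_time(messages=1000, total_bytes=1e10,
                   startup_s=1e-3, bandwidth_bytes_per_s=1e8)
print("small problem:", small)
print("large problem:", large)
```

With the same number of messages, the start-up share dominates when little data is moved and becomes negligible when the volume is large, which is the qualitative behaviour described above.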
Figure 3 shows better efficiency on a 4-system cluster than on a 2-system cluster for our largest example. To explain this effect, recall that a 4-system cluster has 6 Hippi channels, whereas a 2-system cluster has only one. This matters in cases with high transmission time via Hippi.
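The channel counts quoted above are consistent with one channel per pair of systems, i.e. n(n-1)/2 channels for n systems. The sketch assumes such a fully connected cluster, which the text does not state explicitly; the function names are illustrative.

```python
from itertools import combinations


def hippi_channels(systems: int) -> int:
    """Number of channels if every pair of systems has its own channel."""
    return systems * (systems - 1) // 2


def channel_pairs(systems: int):
    """Enumerate the system pairs that would each own one channel."""
    return list(combinations(range(systems), 2))


for n in (2, 4):
    print(n, hippi_channels(n), channel_pairs(n))
```

Going from 2 to 4 systems doubles the number of systems but multiplies the number of channels by six, so more transfers can proceed concurrently.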
We also considered other mappings (cf. ), but the column mapping used here showed the best results as long as .