How data are distributed over the memory hierarchy of a computer is of fundamental importance to load balancing and software reuse. The block cyclic data layout reduces the overhead due to load imbalance and data movement. Block-partitioned algorithms are used to maximize local processor performance.
Since the data decomposition largely determines the performance and scalability of a concurrent algorithm, a great deal of research [10, 21, 23, 25] has focused on different data decompositions [4, 6, 26]. In particular, the two-dimensional block cyclic distribution has been suggested as a possible general-purpose basic decomposition for parallel dense linear algebra libraries [13, 24, 30], such as ScaLAPACK.
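To make the two-dimensional block cyclic distribution concrete, the following minimal sketch maps a global matrix entry to the process that owns it and to its local index on that process. It assumes zero-based indices, a `p_rows` x `p_cols` process grid, blocks of size `mb` x `nb`, and a distribution that starts at process (0, 0). The function names are illustrative only and are not ScaLAPACK routines.

```python
def block_cyclic_owner(i, j, mb, nb, p_rows, p_cols):
    """Return the (row, col) grid coordinates of the process owning entry (i, j).

    Assumes zero-based global indices and that block (0, 0) lives on
    process (0, 0); both are assumptions made for this sketch.
    """
    return (i // mb) % p_rows, (j // nb) % p_cols


def global_to_local(g, block, nprocs):
    """Map a zero-based global index to the local index on its owning process.

    Applies to one dimension at a time: use (mb, p_rows) for rows and
    (nb, p_cols) for columns.
    """
    # Number of complete local blocks stored before the block containing g,
    # times the block size, plus the offset of g inside its block.
    return (g // (block * nprocs)) * block + g % block


# Ownership map of an 8 x 8 matrix with 2 x 2 blocks on a 2 x 2 process grid.
for i in range(8):
    print(" ".join(
        "{}{}".format(*block_cyclic_owner(i, j, 2, 2, 2, 2)) for j in range(8)
    ))
```

Each printed pair is the grid coordinate of the owning process. The pattern repeats every `mb * p_rows` rows and every `nb * p_cols` columns, which is the cyclic part of the layout.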
Block cyclic distribution is beneficial because of its scalability, load balance, and communication properties. The block-partitioned computation then proceeds in consecutive order just like a conventional serial algorithm. This essential property of the block cyclic data layout explains why the ScaLAPACK design has been able to reuse the numerical and software expertise of the sequential LAPACK library.
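The load balance property can be illustrated with a small one-dimensional sketch. It assumes 16 columns distributed over 4 processes, and a computation that has already finished with the first 8 columns, as happens when a factorization proceeds in consecutive order and the active part of the matrix shrinks. The helper `columns_per_process` is hypothetical and only counts how many of the remaining columns each process owns.

```python
def columns_per_process(first_col, n_cols, block, nprocs):
    """Count the columns in [first_col, n_cols) owned by each process.

    Columns are dealt out in blocks of `block` columns, cyclically over
    `nprocs` processes (zero-based, starting at process 0). These
    conventions are assumptions made for this sketch.
    """
    counts = [0] * nprocs
    for j in range(first_col, n_cols):
        counts[(j // block) % nprocs] += 1
    return counts


n_cols, nprocs, first_active = 16, 4, 8

# A pure block layout is the special case block = n_cols / nprocs.
print(columns_per_process(first_active, n_cols, n_cols // nprocs, nprocs))
# prints [0, 0, 4, 4]

# A block cyclic layout with a block size of 2 columns.
print(columns_per_process(first_active, n_cols, 2, nprocs))
# prints [2, 2, 2, 2]
```

With the pure block layout, half of the processes have no work left once the first 8 columns are done, whereas the block cyclic layout keeps the remaining columns spread evenly over all 4 processes.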