Hardware based on parallel computing architectures has become increasingly popular in high-performance computing.
The efficiency of parallel hardware in engineering problems, such as the computer simulation of physical processes, does not scale directly with the number of processors: four CPU cores do not in fact solve a complex engineering problem four times faster than one CPU core. Similarly, transferring computation to graphics cards with hundreds of cores cannot provide a hundredfold increase in speed.
First of all, the acceleration gained from parallel computation is limited by the computational algorithms themselves; running algorithms with a low degree of parallelization on supercomputers and high-performance workstations wastes the hardware. This limit on the "efficiency of parallelization" is described by Amdahl's law, according to which, if at least 1/10 of a program is executed sequentially, the speedup cannot exceed a factor of 10 regardless of the number of cores employed.
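The bound stated by Amdahl's law can be written as S(N) = 1 / (s + (1 - s) / N), where s is the serial fraction of the program and N is the number of cores. The following is a minimal sketch of that formula; the function name `amdahl_speedup` and the core counts in the loop are chosen for illustration only and do not come from any of the software packages discussed here.

```python
def amdahl_speedup(serial_fraction: float, cores: int) -> float:
    """Upper bound on speedup under Amdahl's law.

    S(N) = 1 / (s + (1 - s) / N), where s is the serial fraction
    of the program and N is the number of cores.
    """
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be in [0, 1]")
    if cores < 1:
        raise ValueError("cores must be >= 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)


if __name__ == "__main__":
    # Illustrative core counts; with s = 0.1 the bound approaches,
    # but never reaches, a factor of 10.
    for n in (1, 4, 16, 1024):
        print(n, amdahl_speedup(0.1, n))
```

With a serial fraction of 0.1, the value returned stays below 10 however large the core count becomes, which is the limit described above.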
Telling examples of the limited effectiveness of algorithm parallelization in engineering software are the relatively modest results reported for two worldwide leaders in computer-aided engineering (CAE) software: Abaqus and Ansys.
When computations in SIMULIA's Abaqus were moved from 2 CPU cores to 4 CPU cores, the speedup factor was 1.7. Moving these algorithms to the CUDA architecture, with the 448 cores of an Nvidia Tesla C2075 working alongside 4 CPU cores, resulted in a speedup of only 3.5 times [Source].
SIMULIA’s Abaqus performance acceleration when transferring from 2 to 4 CPU cores
SIMULIA’s Abaqus performance acceleration when using 4 CPU cores and 448 GPU cores
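As a rough illustration, the CPU figure quoted for Abaqus can be turned into an implied serial fraction by inverting Amdahl's law. This is only a sketch: it assumes that Amdahl's model is the sole effect at work (no memory-bandwidth limits, communication overhead, or clock-speed changes), and the function name `implied_serial_fraction` is hypothetical.

```python
def implied_serial_fraction(cores_a: int, cores_b: int,
                            relative_speedup: float) -> float:
    """Serial fraction s implied by a measured speedup between two core counts.

    Under Amdahl's law the normalized run time on N cores is
    T(N) = s + (1 - s) / N. Setting T(cores_a) / T(cores_b) equal to the
    measured relative speedup r and solving for s gives the expression below.
    """
    if not 1 <= cores_a < cores_b:
        raise ValueError("expected 1 <= cores_a < cores_b")
    ideal = cores_b / cores_a
    if not 1.0 <= relative_speedup <= ideal:
        raise ValueError("relative_speedup must lie between 1 and cores_b / cores_a")
    r = relative_speedup
    numerator = r / cores_b - 1.0 / cores_a
    denominator = (1.0 - 1.0 / cores_a) - r * (1.0 - 1.0 / cores_b)
    return numerator / denominator


if __name__ == "__main__":
    # Figure quoted in the text: 2 -> 4 CPU cores gave a speedup of 1.7.
    print(implied_serial_fraction(2, 4, 1.7))
```

For the quoted 1.7 speedup from 2 to 4 cores, this simplified model gives a serial fraction of roughly 0.097, that is, close to the 1/10 case discussed under Amdahl's law.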
Ansys achieved a parallelization efficiency comparable to that of Abaqus. When the number of CPU cores was increased from two to eight, the processing speed of the Ansys Mechanical 15.0 package tripled. Combining 2 CPU cores with the 2880 cores of the Nvidia Tesla K40 accelerator made the computation 3.5 times faster than on the 2 CPU cores alone [Source].
Ansys Mechanical 15.0 performance acceleration with parallel processing
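One common way to compare such results is parallel efficiency: the measured speedup divided by the ideal speedup, which equals the ratio of core counts. The sketch below applies that definition to the CPU-only figures quoted above. The function name is hypothetical, and the GPU figures are left out because GPU cores are not directly comparable to CPU cores.

```python
def parallel_efficiency(cores_before: int, cores_after: int,
                        relative_speedup: float) -> float:
    """Fraction of ideal (linear) scaling actually achieved.

    Ideal scaling would speed the computation up by
    cores_after / cores_before; efficiency is measured / ideal.
    """
    if not 1 <= cores_before < cores_after:
        raise ValueError("expected 1 <= cores_before < cores_after")
    ideal = cores_after / cores_before
    return relative_speedup / ideal


if __name__ == "__main__":
    # CPU-only figures quoted in the text.
    print("Abaqus, 2 -> 4 cores:", parallel_efficiency(2, 4, 1.7))
    print("Ansys,  2 -> 8 cores:", parallel_efficiency(2, 8, 3.0))
```

By this measure, the Abaqus result corresponds to about 85% of ideal scaling over a doubling of cores, and the Ansys result to 75% over a fourfold increase.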
The mathematical solvers embedded in the «Frost 3D Universal» software parallelize their computational algorithms more effectively and make more efficient use of parallel architectures.
A computer model of production wells was used to compare the parallel computing speed on CPUs and GPUs.
The hardware was selected from widely available consumer components: an Intel Core i7 CPU and an Nvidia Titan graphics card.
|Intel Core i7-3770||Nvidia GeForce GTX Titan|
|Cores: 4||Cores: 2688|
|Base Clock: 3.4 GHz||Base Clock: 836 MHz|
|Boost Clock: 3.9 GHz||Boost Clock: 876 MHz|
|Thermal Design Power: 77 W||Graphics Card Power: 250 W|
|Recommended price: $305||Recommended price: $1080|
The three-dimensional model was discretized with different spatial steps, producing meshes of approximately 2 million, 4 million, 8 million and 16 million nodes. Each computational mesh was computed on 1 core of the Intel Core i7, on 4 cores of the Intel Core i7, and on the GeForce GTX Titan video card. The computational results for a two-year simulation forecast are given below.
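To give a sense of how the spatial step relates to these node counts, the sketch below assumes a uniform mesh, in which the number of nodes in a d-dimensional model scales as 1 / h^d for a step h. The source does not state that the meshes are uniform, so this is an idealization, and the function name is hypothetical.

```python
def step_scale_for_node_ratio(node_ratio: float, dimensions: int = 3) -> float:
    """Factor by which the spatial step changes when the node count
    is multiplied by node_ratio.

    Assumes a uniform mesh, where nodes ~ 1 / h**dimensions.
    """
    if node_ratio <= 0:
        raise ValueError("node_ratio must be positive")
    if dimensions < 1:
        raise ValueError("dimensions must be >= 1")
    return node_ratio ** (-1.0 / dimensions)


if __name__ == "__main__":
    # Each mesh in the text has about twice the nodes of the previous one.
    print("step factor per doubling:", step_scale_for_node_ratio(2.0))
    # From ~2 million to ~16 million nodes is a factor of 8 overall.
    print("step factor, smallest to largest mesh:", step_scale_for_node_ratio(8.0))
```

Under this assumption, each doubling of the node count shrinks the step by a factor of about 0.79, and the step for the 16 million node mesh is about half that of the 2 million node mesh.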
|Number of nodes||Processing time, s: 1 core of Intel Core i7||Processing time, s: 4 cores of Intel Core i7||Processing time, s: GeForce GTX Titan||Speedup factor: 4 cores of Intel Core i7 to 1 core||Speedup factor: GeForce GTX Titan to 4 cores of Intel Core i7||Speedup factor: GeForce GTX Titan to 1 core of Intel Core i7|
The performance of 1 core of the Intel Core i7 corresponds to a speedup factor of 1x.
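The three speedup columns are ratios of the processing times measured on the three configurations. A minimal sketch of that calculation follows; the class and function names are hypothetical, and the timings used in the example are placeholders chosen for easy arithmetic, not measurements from this comparison.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Timing:
    """Processing times in seconds for one mesh on the three configurations."""
    nodes: int
    cpu_1_core_s: float
    cpu_4_cores_s: float
    gpu_s: float


def speedup_factors(t: Timing) -> Dict[str, float]:
    """Speedup of the faster configuration = slower time / faster time."""
    return {
        "4 cores to 1 core": t.cpu_1_core_s / t.cpu_4_cores_s,
        "GPU to 4 cores": t.cpu_4_cores_s / t.gpu_s,
        "GPU to 1 core": t.cpu_1_core_s / t.gpu_s,
    }


if __name__ == "__main__":
    # PLACEHOLDER timings, not taken from the article's measurements.
    example = Timing(nodes=1_000_000, cpu_1_core_s=100.0,
                     cpu_4_cores_s=40.0, gpu_s=10.0)
    for label, factor in speedup_factors(example).items():
        print(f"{label}: {factor:.1f}x")
```

Note that the "GPU to 1 core" factor is always the product of the other two, so the three columns are not independent.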
When comparing computational speed on multi-core architectures, the following model parameters have a significant impact on the acceleration:
- the number of materials;
- the number of boundary conditions;
- mesh uniformity;
- whether the number of mesh nodes is a multiple of the number of computational cores;
- conformity of the thermophysical properties of the materials.
This means that the maximum acceleration on parallel architectures is achieved on the simplest models, with a uniform computational mesh and a minimum number of materials and boundary conditions. In practice, however, computational models are more complicated, so our speed analysis was based on the production wells simulation model in order to obtain more objective results.
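One of the listed factors, the relationship between the number of mesh nodes and the number of computational cores, can be illustrated with a simple static partition. The sketch below assumes that nodes are split into contiguous blocks of near-equal size and that every node costs the same to process. This is an assumption made for illustration, not a description of how the «Frost 3D Universal» solvers distribute work, and both function names are hypothetical.

```python
from typing import List


def block_sizes(nodes: int, cores: int) -> List[int]:
    """Split nodes into `cores` contiguous blocks of near-equal size."""
    if nodes < 0 or cores < 1:
        raise ValueError("expected nodes >= 0 and cores >= 1")
    base, extra = divmod(nodes, cores)
    # The first `extra` cores receive one additional node each.
    return [base + 1 if i < extra else base for i in range(cores)]


def load_imbalance(nodes: int, cores: int) -> float:
    """Largest block divided by the average block (1.0 = perfectly balanced).

    The most heavily loaded core sets the run time, so values above 1.0
    mean that some cores sit idle while waiting for it.
    """
    if nodes < 1:
        raise ValueError("expected nodes >= 1")
    sizes = block_sizes(nodes, cores)
    return max(sizes) / (nodes / cores)


if __name__ == "__main__":
    print(block_sizes(10, 4), load_imbalance(10, 4))  # uneven split
    print(block_sizes(8, 4), load_imbalance(8, 4))    # node count divisible by cores
```

When the node count is a multiple of the core count, every core receives the same amount of work; otherwise the most heavily loaded core determines the run time. Under this simple model the effect becomes very small for meshes with millions of nodes, so on real hardware the other listed factors are likely to matter more.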
- The use of computational algorithms with a low degree of parallelization is inefficient on multi-core processors and video accelerators.
- The major engineering analysis software packages on the market contain a large proportion of serial code, which significantly limits the acceleration achievable through parallel computing. This is largely because their mathematical solvers rely on now dated algorithms that were developed before technologies such as CUDA existed and were therefore not designed to take advantage of them.
- Mathematical algorithms in the latest generation of CAE software are designed around parallel processing technology. This makes it possible to achieve a tenfold speedup by transferring computation from one CPU core to multi-core graphics accelerators.