Parallel Computing GPU vs. CPU. Technical Details, Comments

Following the article “Nvidia Tesla K20 vs Intel Core i7: Speed Comparison”, published on the company's website, and at the request of many readers, we have placed additional technical information on the computations performed on the NVIDIA Tesla K20 and NVIDIA GeForce GTX 560 Ti (see the original post on the Nvidia web site) here on the corporate blog.

The first benchmark tests solved the heat problem with phase transition (Stefan problem) for a sphere with a radius of 10 meters over a simulated period of 364 days. The summary table below gives the computational time taken by the corresponding graphics accelerator.

Cells per axis (X, Y, Z) | Nodes (millions) | Tesla K20 time (s) | GeForce GTX 560 Ti time (s)
100                      | 1.03             | 7                  | 10
150                      | 3.44             | 52                 | 75
200                      | 8.12             | 212                | 311
250                      | 15.81            | 637                | 923
300                      | 27.27            | 1625               | -
350                      | 43.24            | 3394               | -
(- = no result reported)
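
The node counts in the table are consistent with a regular grid that has one more node than cells along each axis, i.e. (n + 1)^3 nodes for n cells per axis. This relation is inferred from the published figures and is not stated by the authors; the function name below is illustrative only.

```python
def nodes_in_millions(cells_per_axis: int) -> float:
    """Node count of a cubic grid, in millions.

    Assumption (inferred from the table, not stated in the post):
    nodes = (cells_per_axis + 1) ** 3.
    """
    return (cells_per_axis + 1) ** 3 / 1e6


for n in (100, 150, 200, 250, 300, 350):
    print(f"{n} cells per axis -> {nodes_in_millions(n):.2f} million nodes")
```

Under this assumption the computed values reproduce the "Nodes (millions)" figures listed for each mesh.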

We did not increase the number of spatial nodes any further, because generating the test input files for the solver would have taken more than a day.

Subsequent tests of the heat problem with phase transition were performed for a rectangular parallelepiped measuring 20x20x20 meters over a simulated period of 189 days. The summary table below shows the respective computational times.

Cells per axis (X, Y, Z) | Nodes (millions) | Tesla K20 time (s) | GeForce GTX 560 Ti time (s)
100                      | 1.03             | 0.5049             | 0.8829
150                      | 3.44             | 3.375              | 6.318
200                      | 8.12             | 16.848             | 27.81
250                      | 15.81            | 50.544             | 79.623
300                      | 27.27            | 125.118            | 197.532
350                      | 43.24            | 272.943            | -
400                      | 64.48            | 526.527            | -
450                      | 91.73            | 956.556            | -
470                      | 104.487          | 1175.553           | -
475                      | 107.85           | 1233.711           | -
477                      | 109.215          | 1263.627           | -
478                      | 109.902          | 1277.937           | -
(- = no result reported)
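
To compare the two cards directly, the ratio of the published times can be computed for the meshes that both cards completed. This is a small sketch using only the figures from the cuboid table; the variable names are illustrative.

```python
# Times in seconds from the cuboid table (meshes completed by both cards).
cells = [100, 150, 200, 250, 300]
k20 = [0.5049, 3.375, 16.848, 50.544, 125.118]
gtx560ti = [0.8829, 6.318, 27.81, 79.623, 197.532]

# Ratio > 1 means the Tesla K20 finished faster than the GTX 560 Ti.
speedups = [slow / fast for slow, fast in zip(gtx560ti, k20)]

for n, s in zip(cells, speedups):
    print(f"{n} cells per axis: Tesla K20 is {s:.2f}x faster than GTX 560 Ti")
```

For these five meshes the ratio stays between roughly 1.5 and 1.9.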

For n=479 cells along each axis, the computation did not start because the GPU did not have enough memory.
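
The point at which the mesh stopped fitting gives a rough bound on the solver's memory footprint per node. The sketch below assumes that the card offers 5 GB (the figure quoted for the Tesla K20c later in the comments), that all of it is usable, that memory use grows linearly with the node count, and that nodes = (cells + 1)^3 as in the tables. None of these assumptions is confirmed by the authors.

```python
GPU_MEMORY_BYTES = 5 * 1024**3  # assumption: 5 GB card, all of it usable

largest_ok = (478 + 1) ** 3     # nodes in the last mesh that ran
first_fail = (479 + 1) ** 3     # nodes in the first mesh that did not start

# The mesh that fit bounds the footprint from above,
# the mesh that did not fit bounds it from below.
upper = GPU_MEMORY_BYTES / largest_ok
lower = GPU_MEMORY_BYTES / first_fail

print(f"estimated footprint: between {lower:.1f} and {upper:.1f} bytes per node")
```

Under these assumptions the footprint comes out at a little under 49 bytes per node, i.e. only a handful of floating-point values per node.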

The computation of the previously published Fukushima model on the NVIDIA Tesla K20 GPU for 730 days took 5 minutes 43 seconds. The computational mesh on which the calculation was performed contained 17,828,087 spatial nodes. Preparing the input data for the iterations took 6 minutes 21 seconds on one core of the Intel Core i7 CPU.

Comments

Nathan Campbell
Senior Research Analyst at Southwest Research Institute

Interesting comparison. I looked up the specs of the K20, wow! I would have liked a little more detail on the i7 (number of cores, processor speed, CPU utilization during the test, etc.). Adding parallelization directives to a single-threaded implementation will frequently not result in an optimal multi-threaded implementation. To highlight that point, the end of the article notes that the single-core i7 implementation took 192 minutes and the multi-threaded implementation took 58 minutes. Depending on the i7 used, that may or may not be a very good speed increase.

 

Valery I. Kovalenko
IT Director

Thanks for your comment, Nathan. Concerning the configuration: i7 3770 (4 cores). For the Fukushima calculation, the OpenMP solver used 3 cores, and 1 core was used by the Frost3D Universal GUI and the operating system (Win7). The Tesla K20 was installed in the same system, so the HDD, bus, etc. were the same for each test. The Fukushima calculation was executed 6 times (the time given is the average). It is just a demonstration of a complete job from beginning to end, with and without the Tesla. The number of test runs for the sphere and cuboid was increased to 8 (the time given is the average). The first result of each test was not taken into account.
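
The measurement procedure described here (repeat the run, discard the first result, average the rest) can be sketched as a small timing harness. This is an illustration, not the authors' tooling: the `benchmark` function and the dummy workload are hypothetical, and it assumes that the 8 counted runs come in addition to the discarded first run.

```python
import statistics
import time


def benchmark(task, runs=8):
    """Average wall-clock time of `task` over `runs` runs.

    One extra run is executed first and discarded as a warm-up
    (assumption: the discarded run is not one of the `runs` counted).
    """
    timings = []
    for i in range(runs + 1):
        start = time.perf_counter()
        task()
        elapsed = time.perf_counter() - start
        if i > 0:  # the first result is not taken into account
            timings.append(elapsed)
    return statistics.mean(timings), timings


# Dummy workload standing in for a solver run.
mean_s, timings = benchmark(lambda: sum(x * x for x in range(100_000)))
print(f"mean of {len(timings)} runs: {mean_s:.4f} s")
```

Discarding the first run keeps one-off costs such as driver initialisation and cold caches out of the average.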

 

Nathan Campbell
Senior Research Analyst at Southwest Research Institute

Valery, thank you for the details. I appreciate you taking the time to answer my questions. The decrease in execution time is reasonable (quite good, in fact) when you moved from 1 to 4 cores. I look forward to the final report!
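
The speed increase discussed here can be quantified with the two figures quoted in this thread: 192 minutes on a single core and 58 minutes multi-threaded. The sketch below computes the speed-up and the parallel efficiency for 3 cores (the number used by the OpenMP solver, according to Valery) and for 4 cores (the number Nathan mentions). The function names are illustrative.

```python
def speedup(t_serial: float, t_parallel: float) -> float:
    """How many times faster the parallel run is than the serial run."""
    return t_serial / t_parallel


def efficiency(t_serial: float, t_parallel: float, cores: int) -> float:
    """Speed-up divided by the number of cores (1.0 means ideal scaling)."""
    return speedup(t_serial, t_parallel) / cores


s = speedup(192, 58)  # minutes, as quoted in the thread
print(f"speed-up: {s:.2f}x")
for cores in (3, 4):
    print(f"efficiency on {cores} cores: {efficiency(192, 58, cores):.0%}")
```

An efficiency above 100% on 3 cores is not necessarily a measurement error: later in this thread Valery notes that the sequential and multi-core solvers use different algorithms, so the two times are not a pure scaling test.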

 

Petrica Barbieru
General Manager at PRO SYS SRL

Sorry, but what do you want to compare? CPU with GPU? If your software is CUDA-optimised, the K20 will "pump" at full capacity with very good results, but I think it can't be compared with a CPU, which has a totally different architecture and no CUDA support.

 

Eugene Miya
Retired at Retired

LLNL (LLL) once compared their most important program as requiring Cray-Years of execution. They had more than one machine dedicated to that program (a transport code). They do equivalents now (which are still super codes).

 

Valery I. Kovalenko
IT Director

Petrica,
we are comparing the speed of thermal simulations on the GPU and CPU. For this purpose, we use three (3) different algorithms, one for each architecture: (1) parallel for the GPU; (2) semi-parallel for the CPU (2-10 cores); (3) sequential (1 core of the CPU). Each algorithm is heavily optimized. For example, about 9 months ago we proposed a new sequential algorithm for thermal simulations and compared its accuracy and speed with the well-known Ansys solvers. At that time we also published a video in which we demonstrated that we really used only one CPU core for the thermal calculation (cooling gear).

 

Petrica Barbieru
General Manager at PRO SYS SRL

Valery, I don't know the details of your algorithm, but as far as I know such thermal calculations use finite element analysis (FEA), which leads to huge matrices that are very good candidates for parallelization, hence many cores. Many cores means GPU (in a single system, excluding HPC). That is why it sounds strange to me to use a (single) CPU in this area. If you have more details about your sequential algorithm for CPUs, maybe it will become clearer to me. I know the very strong Russian tradition in mathematics - did you make a revolution in such

 

Valery I. Kovalenko
IT Director

Petrica, yes, we successfully revolutionized thermal analysis a year ago. And you are right, you can use the FEA method, where you create a huge matrix and then perform matrix operations. But you have missed one thing: the limitation of GPU memory. In practice, the GPU did not make sense for that approach because of the mesh-size limit. Even using a Tesla K40 with 12 GB (!) of onboard memory, the mesh would be on the order of 1-10 million nodes. Thus, for a mesh of 50-500 million nodes, the only solution was to use a CPU or supercomputing technology. We took another route: our mathematicians have discovered another fundamental method for numerically calculating thermal fields that significantly reduces the memory that has to be allocated. The algorithm was sequential, but very fast. Only after a year of research did we obtain the parallel version based on it. As a result, when it comes to CPU vs. GPU for thermal calculations, I can definitely say that the GPU is the clear winner. As an aside, I didn't say Parallel vs Sequential; that would be absurd.

 

Ed Trice
Executive Director at Lightning Cloud Computing

I have collected a fair number of single-core benchmark results using an app that takes about 2 minutes to run. I would be interested in seeing how your i7-4930K does on this test.

 

Valery I. Kovalenko
IT Director

You can ask the Forsite company about this. They have a deal with NVIDIA to promote NVIDIA’s equipment in Russia. We had access to their equipment for several weeks following a request from NVIDIA. I think you can ask Forsite directly for access. Their contacts. We communicated with Roman.
The reference to their publication about the considered GPU test.

 

Valery I. Kovalenko
IT Director

Concerning AMD Radeon HD 7990, we are just planning to use OpenCL...

 

Ryan Taylor
Software Design Engineer Sr. at Alion Science and Technology

I thought the days of publications comparing GPU to old style CPUs were dead... and with good reason.

 

Valery I. Kovalenko
IT Director

Ryan, can you suggest the new "style" CPUs for comparisons?

 

Ryan Taylor
Software Design Engineer Sr. at Alion Science and Technology

Did you do any comparisons to heterogeneous implementations? Obviously the actual application here is very interesting and worthwhile; however, when looking at the work strictly from an "is the GPU faster than the CPU for this application" standpoint, it's not, just my opinion.

 

Valery I. Kovalenko
IT Director

No, we didn’t compare because it doesn’t make sense. The current approach that was used in the thermal solver is more efficient in the parallel implementation and the referred comparison shows that. However, I suppose for thermal + stress-strain calculations (conjugate problem) heterogeneous computing would be the best.

 

Ryan Taylor
Software Design Engineer Sr. at Alion Science and Technology

A heterogeneous solution would also be a parallel implementation. Not knowing the application, I suppose it would be interesting to see how it responded to balancing on-chip GPU cores with the CPU and shared memories/caches. Also, what was the SIMD usage on the i7, if any?

 

Andrey Vladimirov
Head of HPC Research at Colfax International

Valery, have you benchmarked this application on Xeon processors? A dual-socket Xeon E5-2687W v2 has 16 cores at 3.4 GHz and 50 MB of cache, as opposed to 6 cores at 3.4 GHz and 12 MB of cache in the i7-4930K. If the calculation is compute-bound or has a latency-bound component, then Xeon may accelerate the simulation 3x or more compared to the 4930K. Comparing a dual Xeon-based system to i7+K40 would be more meaningful, because both solutions have approximately the same cost and power consumption.

 

Ryan Taylor
Software Design Engineer Sr. at Alion Science and Technology

So I was also curious about this. I know that some people were getting good FFT numbers on many-core machines compared with GPGPU. It would be interesting to see what other algorithms/applications fit that mold.

 

Valery I. Kovalenko
IT Director

Ryan, I don't foresee tangible results from GPU-CPU balancing, because the algorithm that we use in the thermal solver is already very highly parallel.
With respect to SIMD, we have tried to use AVX and SSE (we had some discussions in another group several weeks ago), but didn't obtain a significant performance increase.

 

Valery I. Kovalenko
IT Director

Andrey, if we obtain 2 Xeon E5-2687W v2 processors, we will compare the performance. This is an interesting configuration, and I hope we can test it in the future. But right now I would say that the Tesla K40 "would be faster" (in terms of calculations), given that 15 Xeon cores would calculate the thermal task approximately 3 times faster than the i7’s 5 cores, and we know that the Tesla K40 is 5.35 times faster than the i7 for the current type of calculations. Therefore, the Tesla K40 would be approx. 2 times faster than 2 Xeons.
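
The estimate in this comment is a ratio of two ratios, and it can be reproduced in a few lines. The sketch uses only the figures quoted above and inherits their assumption that the CPU solver scales linearly with the number of cores, which has not been measured on the Xeon system.

```python
k40_vs_i7 = 5.35           # stated: Tesla K40 relative to the i7
xeon_cores, i7_cores = 15, 5

# Assumption from the comment: CPU performance scales linearly with cores.
xeon_vs_i7 = xeon_cores / i7_cores

k40_vs_xeon = k40_vs_i7 / xeon_vs_i7
print(f"estimated Tesla K40 advantage over 2 Xeons: {k40_vs_xeon:.2f}x")
```

The result is about 1.8, which is what the comment rounds to "approx. 2 times".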

 

Ryan Taylor
Software Design Engineer Sr. at Alion Science and Technology

I was talking about on-chip GPU cores with a 'shared unified' memory model, so that shouldn't be much of an issue, i.e. Kaveri?

 

Valery I. Kovalenko
IT Director

Ryan, Kaveri is an interesting technology. I have found the AMD A10-7850K as a workable sample, but I suppose such systems are more suited to "hybrid algorithms", where one algorithm has some sequential and some parallel parts. In that case, the sequential part will use the CPU and the parallel part will use the GPU. Through shared memory they can exchange data quickly. But our parallel thermal calculation algorithm prepares the data just once, and then all the calculation steps are performed in parallel. It makes no sense to use a hybrid architecture for it. Currently, we have developed another solver: “Darcy filtration” + “Thermal” (alternating direction implicit method approach). The “Darcy filtration” solver has both parallel (SLAE calculation) and sequential (matrix preparation) components. If we get something like an AMD A10-7850K, we will test that approach on it and compare it with the GPU-only and CPU-only implementations. So, thank you for the suggestion.
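
The pattern described here, one data preparation followed by steps in which every node can be updated independently, can be illustrated with a textbook explicit finite-difference (FTCS) step for 3D heat conduction. This is a generic scheme chosen for illustration, without phase transition; it is not the Frost3D algorithm, which is not described in this thread. All names and parameter values are hypothetical.

```python
def explicit_heat_step(u, n, alpha, dx, dt):
    """One explicit (FTCS) step of 3D heat conduction on an n*n*n grid.

    `u` is a flat list indexed as i + n * (j + n * k). Boundary nodes keep
    their values. Each interior node reads only the previous field, so all
    updates within a step are independent of one another.
    """
    r = alpha * dt / dx**2  # the scheme is stable for r <= 1/6 in 3D
    new = u[:]
    for k in range(1, n - 1):
        for j in range(1, n - 1):
            for i in range(1, n - 1):
                c = i + n * (j + n * k)
                neighbours = (u[c - 1] + u[c + 1] + u[c - n] + u[c + n]
                              + u[c - n * n] + u[c + n * n])
                new[c] = u[c] + r * (neighbours - 6.0 * u[c])
    return new


n = 9
u = [0.0] * n**3
centre = (n // 2) * (1 + n + n * n)
u[centre] = 100.0        # data preparation happens once
for _ in range(10):      # afterwards only independent node updates remain
    u = explicit_heat_step(u, n, alpha=1.0, dx=1.0, dt=0.1)
print(f"centre temperature after 10 steps: {u[centre]:.3f}")
```

Because the loop body never reads a value written in the same step, the three nested loops could be replaced by one GPU thread per node, which leaves little sequential work for a CPU to take over.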

 

Nigel Goodwin
at Essence Products and Services

I think the real issue is whether explicit solvers are general purpose enough and have enough stability. They may be good for test problems, but....how well do they model moving phase boundaries, for example? Maybe not a problem in your application area?
[ps. what does SLAE stand for in the paper you reference?]

 

Valery I. Kovalenko
IT Director

Actually, moving boundaries are used in 1D and (only in some cases) 2D simulations. In practice, the method is not used in 3D because it is very slow. There are a lot of other accurate methods that work much faster. For example, we use shock capturing methods. Concerning SLAE, in the previous comment I meant that we solve an SLAE in one of our company's solvers for water filtration calculation. We call it the "Darcy filtration solver" because it is based on Darcy's equation and calculates water filtration in soil. As an aside, we have several other solvers for the calculation of filtration. For example, one of them is based on the Richards equation and doesn’t require solving an SLAE. But “Darcy” is more user-friendly due to the small number of parameters that the user needs to input.

 

Nigel Goodwin
at Essence Products and Services

I'm still trying to guess what SLAE stands for. In my industry, reservoir engineering, Darcy's law rules.

 

Valery I. Kovalenko
IT Director

SLAE = System of Linear Algebraic Equations. Reservoir engineering ... three-phase filtration equations. The Richards equation is a two-phase filtration equation, but very similar to the three-phase.

 

Peter Bonsma
Technical Director at RDF

Hello Valery, very interesting. Do you have any idea how non-/semi-professional cards like the 780 Ti and especially the TITAN will perform? I assume the speed of double-precision calculations is important, which would give the TITAN a better position than the 780 Ti.

 

Valery I. Kovalenko
IT Director

Hello Peter,
Regarding GPU cards, when we execute thermal calculations on a GPU we usually have 2 main criteria: 1) performance, and 2) amount of memory. In most cases, the performance we measure in our tests is approximately what NVIDIA states in its specifications. So, let me list the performance:
GTX 780 Ti – 5,040 GFLOPS, Tesla K20c – 3,524 GFLOPS, 660 Ti – 2,460 GFLOPS.
As you can see, the GTX 780 Ti will deliver a much faster result than the Tesla K20.
But let me also list the amount of memory: Tesla K20c – 5 GB, GTX 780 Ti – 3 GB, 660 Ti – 2 GB.
This means that the GTX 780 Ti will allow calculation of a simple temperature field on a mesh containing approx. 50 million nodes, as opposed to the 91 million that the Tesla K20 can handle.
If we use such a device for the calculation of ground freezing around the Fukushima nuclear plant, we should divide 50 million by 2 because of the additional data that we need to store in the GPU memory (boundary conditions such as cooling devices, the range of materials, etc.).
As for the Titan with 6 GB, I suppose it would be an optimal solution for thermal calculations. Concerning double vs. float precision: for our thermal solver implementation, the choice doesn’t matter thanks to some techniques that our mathematicians have developed. The results differ only by the difference between double and float values, which is too small to be of any significance.
Best regards, Valery.
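
The mesh sizes quoted in the comment above follow roughly from scaling the node count with the amount of GPU memory. The sketch below takes the comment's reference point (91 million nodes on the 5 GB Tesla K20c) and assumes strictly linear scaling; the function name and the halving flag are illustrative, and real limits also depend on how much memory the driver and boundary-condition data take.

```python
K20_MEMORY_GB = 5
K20_NODES_MILLIONS = 91  # figure quoted in the comment above


def estimated_nodes_millions(memory_gb, extra_model_data=False):
    """Rough mesh capacity in millions of nodes, assuming linear scaling.

    `extra_model_data=True` halves the estimate, as the comment suggests
    for models that also store boundary conditions and material data.
    """
    nodes = K20_NODES_MILLIONS * memory_gb / K20_MEMORY_GB
    return nodes / 2 if extra_model_data else nodes


for card, mem in (("GTX 660 Ti", 2), ("GTX 780 Ti", 3), ("Titan", 6)):
    simple = estimated_nodes_millions(mem)
    full = estimated_nodes_millions(mem, extra_model_data=True)
    print(f"{card} ({mem} GB): ~{simple:.0f} M nodes, ~{full:.0f} M with extra data")
```

For the 3 GB GTX 780 Ti this gives about 55 million nodes, in line with the "approx. 50 million" stated in the comment.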

 

Luigi Morelli
PM at TE4I

Did you do any tests on the GTX 580? I know it's an obsolete, non-professional card with little onboard memory, but I noticed that floating-point operations perform a lot faster on it than on the 3.0 series due to a different ALU access.

 

Valery I. Kovalenko
IT Director

Yes, you are right! We observed the same when comparing the GTX 660 Ti and GTX 560 Ti. Sometimes the 560 was 2 times faster than the 660! Unfortunately, we didn't test the 580. Perhaps later we’ll try to test more semi-professional cards.
Thanks for your comment.
Regards, Valery

 
