# Implementation of the Douglas - Rachford Scheme with Cuda Technology

1. INTRODUCTION

This article describes the performance of calculations on video cards (using CUDA) for modeling physical processes and phenomena based on the solution of the three-dimensional heat equation via the Douglas-Rachford scheme (ADI method). A comparative analysis of the calculation speeds of the central (CPU) and graphics (GPU) processors was conducted.

2. DOUGLAS-RACHFORD SCHEME DESCRIPTION

For the mathematical simulation of heat distribution, accounting for filtration and phase transition, the following heat equation is applied:

(1) A description of the coefficients is given in Table 1.

Table 1: The heat equation coefficients

 Coefficients Description effective heat capacity frozen water content fitted coefficient temperature of phase transition temperature heat capacity of the ground heat capacity of frozen ground heat capacity of thawed ground time effective conductivity of ground thermal conductivity of frozen ground thermal conductivity of thawed ground water heat capacity filtration velocity vector heat source

Equation (1) is a partial differential equation of the parabolic type. Since the computational domain of equation (1) is arbitrary, numerical methods are used to solve it.
Alternating direction schemes form a class of numerical methods for parabolic partial differential equation solving . The Douglas-Rachford scheme is a widely used method [1, 2], briefly described below.

Let’s consider a homogeneous heat equation in three dimensions, which is a simplified version of equation (1):

(2) The computational domain is represented by a parallelepiped with the spatial mesh , and the time mesh . , , – spatial mesh steps, and – time steps.

That temperature field value at time is known, and at node the temperature is equal to . To determine the temperature field at the next time layer, the approximation equation (2) is used with finite-difference equations (3)-(5) due to the Douglas-Rachford scheme, as follows:

(3) (4) (5) where , , – are the 2nd order finite difference derivatives along axes , и respectively. At every fixed pair of indices along the and axes (the number of sets is ), of equations (3) are gathered together and form a tridiagonal system of linear equations. To solve a system of linear equations with a tridiagonal matrix algorithm requires operations.

Since for each fixed pair of indices (j,k) ((i,k), (i,j)) of equation (3) ((4) and (5), respectively) are solved independently of each other, then we can say that the Douglas–Rachford scheme has natural parallelism, hence allowing for the effective parallelization of the scheme with CUDA.

3. BENCKMARKS

Calculations have been performed to demonstrate the performance acceleration for a software-implemented Douglas-Rachford scheme using CUDA technology, compared with the implementation of the scheme on a CPU.

Consider computational domain P = {(x,y,z): } (see Figure 1), which has the shape of a cube with a side of 10 meters. SCD (seasonal cooling device) is placed in domain parallel to axis . The lowest point of SCD S is located at the coordinates (5.0, 5.0, 1.0), and the upper point of SCD is at (5.0, 5.0, 9.0).

SCD has the following design parameters: the radius of the tube evaporator is 16,85 mm, the area of the evaporator is 1.0 m2, the area of the capacitor is equal to 2.0 m2.

The computational domain is filled with material , whose volumetric heat capacity in the thawed and frozen states is equal to .

The thermal conductivity of material in the thawed and frozen states is equal to . Fig. 1: Computational domain

A boundary condition of the second type (zero flow) is specified at the boundary of the computational domain , a boundary condition of the third type is specified for SCD , where the ambient temperature is -20.0 °C, and the heat transfer coefficient is equal to . The initial temperature of the material (which occupies the computational domain) is equal to -1.0 °C.

In the test, the thermal field distribution is calculated every 30 days across 360 days.

There were four pairs of calculations with different levels of computational domain discretization: the first calculation of each pair was performed running on a one CPU core, and the second on a video card using CUDA technology.

For the first pair of calculations, a spatial step equal to 100 mm in all three directions was used. For second pair of calculations, the spatial step along the axes and is 100 mm, and 50 mm along the axis. For the third pair of calculations, the spatial step along the axis is equal to 100 mm, and 50 mm along and . Finally, for the last pair of calculations a spatial step of 50 mm in all three directions was used. Thus, in the first case, the domain was meshed into nodes. In the second, third and fourth cases, the domain was divided into 2,000,000, 4,000,000 and 8,000,000 nodes, respectively.

Note: that this article does not analyze the accuracy of the calculations and the dependence of accuracy of the obtained solution on the level of computational domain discretization. In all the calculations performed, identical heat distributions were obtained; the results are shown in Figures 2 (a, b).  Fig. 2a: heat distribution in the form of color filling Fig. 2b: heat distribution in the form of isolines

The calculations were performed on the processors listed in Table 2.

Table 2. Computational units

 Processor (CPU) Inlel Core i7 4 cores Video card (GPU) GeForce GTX TITAN 2688 cores

The computational time depending on the number of nodes in the computational domain is given in Table 3 and Figure 3.

Table 3. Computational time

 Number of nodes Computational time on Intel Core i7 (to the nearest second) Computational time on GeForce GTX TITAN (to the nearest second) GPU acceleration factor (rounded to 2 decimal places) 1,000,000 1 962 63 31.14 2,000,000 16 296 524 31.10 4,000,000 30 654 1013 30.26 8,000,000 63 858 2079 30.72 Fig. 3: Computation speedup

4. CONCLUSIONS

On average, calculations on the GPU version of the ADI solver using CUDA technology on the Nvidia GTX Titan video card ran 30 times faster than the calculations on a one core of Intel Core i7 CPU. Note that the tests involved resolving for a relatively simple task (a small number of fixed boundary conditions for a small amount of material). When solving more complex computing tasks, the acceleration factor is expected to decrease when implementing parallel computing.

As shown in the article, CUDA technology can be successfully used on personal computers to solve tasks that require large computational resources, such as modeling physical processes and phenomena.

5. REFERENCES

1. N.N. Yanenko, The method of fractional steps: The solution of problems of mathematical physics in several variables, 1st ed. Springer – Verlag, (1971).

2. J.W. Thomas, Numerical partial differential equations: finite difference methods / J.W. Thomas. – USA: Springer 1995. – 437 p.