Implementation of the Douglas-Rachford Scheme with CUDA Technology

1. INTRODUCTION

This article describes the performance of calculations on video cards (using CUDA) for modeling physical processes and phenomena based on the solution of the three-dimensional heat equation via the Douglas-Rachford scheme (ADI method). A comparative analysis of the calculation speeds of the central (CPU) and graphics (GPU) processors was conducted.

2. DOUGLAS-RACHFORD SCHEME DESCRIPTION

For the mathematical simulation of heat distribution, accounting for filtration and phase transition, the following heat equation is applied:

(1)   \begin{equation*}  C_{ev}\frac{\partial T}{\partial t}+\nabla\cdot(-k\nabla T)+C_w\vec{v}\cdot\nabla T-Q=0 \end{equation*}

A description of the coefficients is given in Table 1.

Table 1: The heat equation coefficients

Coefficient – Description
C_{ev}=C-\rho_{ice}L\,{\partial w_{ice}}/{\partial T} – effective heat capacity
w_{ice}=1-1/(\exp(2A_{w}(T_{p}-T))+1) – frozen water content
A_w – fitted coefficient
T_p – temperature of phase transition
T – temperature
C=C_{m}w_{ice}+C_{t}(1-w_{ice}) – heat capacity of the ground
C_m – heat capacity of frozen ground
C_t – heat capacity of thawed ground
t – time
k=k_{m}w_{ice}+k_{t}(1-w_{ice}) – effective thermal conductivity of the ground
k_m – thermal conductivity of frozen ground
k_t – thermal conductivity of thawed ground
C_w – heat capacity of water
\vec{v} – filtration velocity vector
Q – heat source
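The smoothed phase-transition model in Table 1 can be sketched in a few lines of C++ (an illustrative sketch, not the authors' code; `frozenContent` and `blend` are hypothetical helper names, and A_w, T_p are problem-specific fitted parameters):

```cpp
#include <cmath>

// Smoothed frozen-water content w_ice = 1 - 1/(exp(2*A_w*(T_p - T)) + 1)
// from Table 1. Tends to 1 well below T_p (fully frozen) and to 0 well
// above it (fully thawed).
double frozenContent(double T, double Aw, double Tp) {
    return 1.0 - 1.0 / (std::exp(2.0 * Aw * (Tp - T)) + 1.0);
}

// Effective coefficient blended between its frozen and thawed values,
// as in C = C_m*w_ice + C_t*(1 - w_ice) and k = k_m*w_ice + k_t*(1 - w_ice).
double blend(double frozenVal, double thawedVal, double wIce) {
    return frozenVal * wIce + thawedVal * (1.0 - wIce);
}
```

The same `blend` form serves for both the heat capacity C and the conductivity k, since Table 1 uses the identical mixing rule for both.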

Equation (1) is a partial differential equation of the parabolic type. Since the computational domain of equation (1) is arbitrary, numerical methods are used to solve it.
Alternating direction implicit (ADI) schemes form a class of numerical methods for solving parabolic partial differential equations [1]. The Douglas-Rachford scheme is a widely used member of this class [1, 2]; it is briefly described below.

Let’s consider a homogeneous heat equation in three dimensions, which is a simplified version of equation (1):

(2)   \begin{equation*}  \frac{\partial u}{\partial t}=\frac{\partial^{2} u}{\partial x^{2}}+\frac{\partial^{2} u}{\partial y^{2}}+\frac{\partial^{2} u}{\partial z^{2}} \end{equation*}

The computational domain is represented by a parallelepiped with the spatial mesh \omega_{l}=\{(x_{i},y_{j},z_{k}):i=1,...,N_{x};j=1,...,N_{y};k=1,...,N_{z}\}, and the time mesh \omega_{\tau}=\{t_{i}:i=1,...,n\}.

H^{x}=\{h^{x}_{i}=x_{i+1}-x_{i}:i=1,...,N_{x}-1\},H^{y}=\{h^{y}_{j}=y_{j+1}-y_{j}:j=1,...,N_{y}-1\}, H^{z}=\{h^{z}_{k}=z_{k+1}-z_{k}:k=1,...,N_{z}-1\}

– spatial mesh steps, and

H^{t}=\{\tau_{i}=t_{i+1}-t_{i}:i=1,...,n-1\}

– time steps.

Suppose the temperature field at time t_{s}\in{\omega_{\tau}} is known, and at node (x_{i},y_{j},z_{k}) the temperature equals y^{s}_{i,j,k}. To determine the temperature field at the next time layer, equation (2) is approximated by the finite-difference equations (3)-(5) of the Douglas-Rachford scheme, as follows:

(3)   \begin{equation*}  \frac{y^{s+1/3}_{ijk}-y^{s}_{ijk}}{\tau}=\Delta_{x}y^{s+1/3}_{ijk}+\Delta_{y}y^{s}_{ijk}+\Delta_{z}y^{s}_{ijk}, \quad i=1,...,N_{x};\ (j,k)\ \text{fixed}; \end{equation*}

(4)   \begin{equation*}  \frac{y^{s+2/3}_{ijk}-y^{s+1/3}_{ijk}}{\tau}=\Delta_{y}y^{s+2/3}_{ijk}-\Delta_{y}y^{s}_{ijk}, \quad j=1,...,N_{y};\ (i,k)\ \text{fixed}; \end{equation*}

(5)   \begin{equation*}  \frac{y^{s+1}_{ijk}-y^{s+2/3}_{ijk}}{\tau}=\Delta_{z}y^{s+1}_{ijk}-\Delta_{z}y^{s}_{ijk}, \quad k=1,...,N_{z};\ (i,j)\ \text{fixed}; \end{equation*}

where

\Delta_{x}u_{ijk}=\frac{1}{h_{i}^{2}}u_{i+1jk}-\left(\frac{1}{h_{i}^{2}}+\frac{1}{h_{i-1}^{2}}\right)u_{ijk}+\frac{1}{h_{i-1}^{2}}u_{i-1jk},
\Delta_{y}u_{ijk}=\frac{1}{h_{j}^{2}}u_{ij+1k}-\left(\frac{1}{h_{j}^{2}}+\frac{1}{h_{j-1}^{2}}\right)u_{ijk}+\frac{1}{h_{j-1}^{2}}u_{ij-1k},
\Delta_{z}u_{ijk}=\frac{1}{h_{k}^{2}}u_{ijk+1}-\left(\frac{1}{h_{k}^{2}}+\frac{1}{h_{k-1}^{2}}\right)u_{ijk}+\frac{1}{h_{k-1}^{2}}u_{ijk-1}

– are the second-order finite-difference approximations of the derivatives along the Ox, Oy and Oz axes, respectively. For every fixed pair of indices along the Oy and Oz axes (there are N_{y}\times{N_{z}} such pairs), the N_{x} equations (3) are gathered together and form a tridiagonal system of linear equations. Solving such a system with the tridiagonal matrix algorithm (the Thomas algorithm) requires O(N_x) operations.

Since, for each fixed pair of indices (j,k) ((i,k) and (i,j), respectively), the systems arising from equation (3) ((4) and (5), respectively) are solved independently of each other, the Douglas-Rachford scheme possesses natural parallelism, which allows it to be parallelized effectively with CUDA.

3. BENCHMARKS

Calculations have been performed to demonstrate the performance acceleration for a software-implemented Douglas-Rachford scheme using CUDA technology, compared with the implementation of the scheme on a CPU.

Consider the computational domain P = \{(x,y,z):0\leqslant{x}\leqslant{10}, 0\leqslant{y}\leqslant{10}, 0\leqslant{z}\leqslant{10}\} (see Figure 1), which is a cube with a side of 10 meters. An SCD (seasonal cooling device) S is placed in domain P parallel to the Oz axis. The lowest point A of SCD S is located at coordinates (5.0, 5.0, 1.0), and the upper point B is at (5.0, 5.0, 9.0).

SCD S has the following design parameters: the radius of the evaporator tube is 16.85 mm, the area of the evaporator is 1.0 m^2, and the area of the condenser is 2.0 m^2.

The computational domain P is filled with material M, whose volumetric heat capacity in both the thawed and frozen states is equal to 2.14\times{10^6}\frac{J}{m^{3}\cdot K}.

The thermal conductivity of material M in both the thawed and frozen states is equal to 1.8\frac{W}{m\cdot K}.


Fig. 1: Computational domain

A boundary condition of the second kind (zero heat flux) is specified at the boundary of the computational domain P. A boundary condition of the third kind is specified on SCD S, with an ambient temperature of -20.0 °C and a heat transfer coefficient of 10.0\frac{W}{m^{2}\cdot K}. The initial temperature of the material occupying the computational domain is -1.0 °C.

In the test, the thermal field distribution is calculated at 30-day intervals over a 360-day period.

Four pairs of calculations were performed with different levels of computational domain discretization: the first calculation of each pair ran on a single CPU core, and the second on a video card using CUDA technology.

For the first pair of calculations, a spatial step of 100 mm in all three directions was used. For the second pair, the spatial step along the Ox and Oy axes is 100 mm, and 50 mm along the Oz axis. For the third pair, the spatial step along the Ox axis is 100 mm, and 50 mm along Oy and Oz. Finally, for the last pair a spatial step of 50 mm in all three directions was used. Thus, in the first case, the domain P was meshed into 100\times{100}\times{100}=1,000,000 nodes. In the second, third and fourth cases, the domain P was divided into 2,000,000, 4,000,000 and 8,000,000 nodes, respectively.
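The node counts quoted above follow directly from the mesh steps: 10 m at a 100 mm step gives 100 nodes per axis, and halving the step along an axis doubles the count there. A trivial check (`nodes` is a hypothetical helper, not the authors' code):

```cpp
// Total number of mesh nodes for the given per-axis resolutions.
long long nodes(long long nx, long long ny, long long nz) {
    return nx * ny * nz;
}
```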

Note that this article does not analyze the accuracy of the calculations or the dependence of the accuracy of the obtained solution on the level of computational domain discretization. All the calculations produced identical heat distributions; the results are shown in Figures 2 (a, b).


Fig. 2a: heat distribution in the form of color filling

Fig. 2b: heat distribution in the form of isolines

The calculations were performed on the processors listed in Table 2.

Table 2. Computational units

Processor (CPU): Intel Core i7, 4 cores
Video card (GPU): GeForce GTX TITAN, 2688 cores

The computational time depending on the number of nodes in the computational domain is given in Table 3 and Figure 3.

Table 3. Computational time

Number of nodes | CPU time, Intel Core i7 (s) | GPU time, GeForce GTX TITAN (s) | GPU acceleration factor
1,000,000 | 1,962 | 63 | 31.14
2,000,000 | 16,296 | 524 | 31.10
4,000,000 | 30,654 | 1,013 | 30.26
8,000,000 | 63,858 | 2,079 | 30.72


Fig. 3: Computation speedup

4. CONCLUSIONS

On average, the GPU version of the ADI solver using CUDA technology on the Nvidia GTX Titan video card ran 30 times faster than the same calculations on one core of the Intel Core i7 CPU. Note that the tests involved a relatively simple problem (a small number of fixed boundary conditions and a single material). When solving more complex computational problems, the acceleration factor of the parallel implementation is expected to decrease.

As shown in the article, CUDA technology can be successfully used on personal computers to solve tasks that require large computational resources, such as modeling physical processes and phenomena.

5. REFERENCES

1. N.N. Yanenko, The Method of Fractional Steps: The Solution of Problems of Mathematical Physics in Several Variables, 1st ed. Springer-Verlag (1971).

2. J.W. Thomas, Numerical Partial Differential Equations: Finite Difference Methods. Springer (1995), 437 p.
