## Massively Parallel Monte Carlo Simulation using GPU

## Introduction

In my previous blog posts I’ve explained how you can utilize the GPU on your computer to perform massively parallel computation with the help of NVIDIA CUDA and Thrust technologies. On this blog post I’m diving deeper into Thrust usage scenarios with a simple implementation of Monte Carlo simulation.

My influence was the PI prediction sample on Thrust web site. The sample is running Monte Carlo simulation with 10K samples on a unit circle to estimate the PI number. You can visit this Wikipedia page if you are interested into how Monte Carlo Simulation can be used to approximate PI. Actually it is a solution to the famous Buffon’s Needle problem.

I’m taking the original example one step further to show you how to send device variables to functors in Thrust methods, and also using a slightly different problem. Perhaps there are many other methods to do the same logic, but on this blog post I’m just concentrating on this specific implementation.

## Background

### About Monte Carlo Simulation

Monte Carlo simulation is an approach to solve deterministic problems with probabilistic analog. That is exactly what we are accomplishing in our example: estimating the area of intersecting disks. Monte Carlo methods are especially useful for simulating systems with many coupled degrees of freedom, such as fluids, disordered materials, strongly coupled solids, and cellular structures.

Our simulation is about predicting the intersect area of four overlapping unit disks as seen on the image below. (intersection of A,B,C and D disks) Actually, the problem can also be solved easily with the help of Geometry as explained here. I’ve calculated the area as 0.31515. On the other hand the simulation estimated 0.3149.

### About Thrust

Writing code using CUDA API is very powerful in terms of controlling the hardware, but there are high level libraries like Thrust C++ template library, which provides many fundamental programming logic like sorting, prefix-sums, reductions, transformations etc.. The best part is that Thrust consists only of header files and is distributed with CUDA 4.0 installation.

If you are not familiar with terms like GPGPU and Thrust, I’m suggesting you to check out the background information on my previous posts.

## Setup

The example is a console application written in C++. But you can easily transform it to a DLL to use it from your C# application. (previous posts)

I’ve used Visual Studio 2010 create the C++ console application. If you already did not, you need to install the NVIDIA CUDA Toolkit 4.0 and a supported graphics device driver from the same link. The new CUDA Toolkit 4.1 RC1 is also available at CUDA zone, but the project files are built on 4.0. Also do not forget to install the Build Customization BUG FIX Update from the same link for CUDA Toolkit 4.0.

Once the CUDA Toolkit is installed, creating CUDA enabled projects is really simple. For those who are not familiar using native C++ CUDA enabled projects, please follow the steps below to create one:

- Create a Visual C++ console project in Visual Studio 2010 by selecting Empty project on the wizard,
- Open Build Customization from the C++ project context menu, and check the CUDA 4.0(.targets, .props) checkbox,
- Open the project properties, expand the Configuration Properties, Linker and select Input. Edit the additional dependencies and add cudart.lib.
- Add a new empty source file ending with .cu.

You can also skip the steps above and download the example solution and project files directly from here.

## Implementation

The main application consists of calling `thrust::transform_reduce`

50 times to run the intersection estimation simulation. transform_reduce performs a reduction on the transformation of the sequence [first, last) according to unary_op. The unary_op is applied to each element of the sequence and then the result is reduced to a single value with binary_op.

The main code is as follows:

int main(void) { // use 50 independent seeds int M = 50; //Create some circles in the device thrust::host_vector dCircles; dCircles.push_back(CIRCLE(0.0f, 0.0f)); dCircles.push_back(CIRCLE(1.0f, 0.0f)); dCircles.push_back(CIRCLE(1.0f, 1.0f)); dCircles.push_back(CIRCLE(0.0f, 1.0f)); //The kernel can not access host or device vector directly, //therefore get the device pointer to the circles to pass to the kernel thrust::device_vector circles = dCircles; CIRCLE * circleArray = thrust::raw_pointer_cast( &circles[0] ); float estimate = thrust::transform_reduce(thrust::counting_iterator(0), thrust::counting_iterator(M), estimate_intersection(circleArray, circles.size()), 0.0f, thrust::plus()); estimate /= M; std::cout << std::setprecision(6); //calculate area with gemometry : (pi + 3 - 3*sqrt(3)) / 3 = 0.31515s std::cout << "the area is estimated as " << estimate << std::endl << ". It should be 0.31515." ; return 0; }

The unary_op has the Monte Carlo simulation logic implemented in the `estimate_intersection`

functor. The `estimate_intersection`

is a method derived from the thrust::unary_function class and returning the estimated intersect area as float. Using `estimate_intersection`

in `tranform_reduce`

means estimating the intersect area for every data element provided to `tranform_reduce`

. For the data elements we are using two thrust::counting_iterators. This creates a range filled with a sequence from 1 to 50, without explicitly storing anything in the memory. Using a sequence of numbers helps us to assign different thread id for every `estimate_intersection`

call. This is important to generate distinct seed for the random number generator of the simulation. (I’ve mentioned about random number generator seeds in my previous posts.)

For the reduction part of the `tranform_reduce`

we are using the thrust::plus() binary functor, which sums all results into one number. At last we divide the result into 50 to find the average intersect area value.

Our goal with this code is to run the simulation on the device (GPU) and retrieve the result back to the host. Therefore any data we are going to use on the simulation must be placed into the device memory. That is exactly what is happening before we call `thrust::transform_reduce`

. We are preparing properties of all circles we will try to intersect using the `CIRCLE`

object defined below.

struct CIRCLE{ float x,y; CIRCLE(float _x, float _y) : x(_x), y(_y){} } ;

With `thrust::host_vector dCircles;`

in the main code, we are defining a vector object in the host memory. Using a Thrust host vector object over a custom memory simplifies transferring data directly to device with the `thrust::device_vector circles = dCircles;`

call. As you may know, transferring data between device and host memory in CUDA C is handled with `cudaMemcpy`

. But Thrust has the equal operator overload, which allows you to copy memory easily.

On the next line we access the raw pointer of the `circles`

object with the help of the `thrust::raw_pointer_cast`

method. We do this because the `estimate_intersection`

method can only accept a device pointer to the CIRCLE object array.

### Simulation Method

The `estimate_intersection`

unary function implements the simulation logic. A unary function is a function which takes one argument, has a () operator overload and returns one value. In our case the function takes the `thrust::counting_iterator`

generated unique index number and returns the area of the intersection as float. Another important part of the method is the constructor (seen below) which takes in the device pointer to the CIRCLE array and the length of the allocated memory.

struct estimate_intersection : public thrust::unary_function { CIRCLE * Circles; int CircleCount; estimate_intersection(CIRCLE * circles, int circleCount) : Circles(circles), CircleCount(circleCount) { } __host__ __device__ float operator()(unsigned int thread_id) { float sum = 0; unsigned int N = 30000; // samples per thread unsigned int seed = hash(thread_id); // seed a random number generator thrust::default_random_engine rng(seed); // create a mapping from random numbers to [0,1) thrust::uniform_real_distribution u01(0,1); // take N samples for(unsigned int i = 0; i < N; ++i) { // draw a sample from the unit square double x = u01(rng); double y = u01(rng); bool inside = false; //check if the point is inside all circles for(unsigned int k = 0; k < CircleCount; ++k) { double dy,dx; //check if the point is further from //the center of the circle than the radius dx = Circles[k].x - x; dy = Circles[k].y - y; if ((dx*dx + dy*dy) <= 1) { inside = true; } else { inside = false; break; } } if (inside) sum += 1.0f; } // divide by N return sum / N; } };

In order to run the code on the device and call it from the host, the () operator overload has to be defined as `__host__ __device__`

. The rest of the code is the Monte Carlo simulation logic as follows:

1) Initiate the thrust default random number generator

2) Generate 30K random x and y values

3) Loop through all circles and check if the x and y value is inside the circle by calculating the hypotenuse

4) If all circles are inside the x and y coordinates then increase the found points count

5) return the average found points count

That’s it! I hope you enjoy it.

In addition to the code I included here, there are header includes and a hashing algorithm. You can download the code from here.

## About the Implementations

The Monte Carlo simulation I provided on this post is an example and therefore I’m not guaranteeing that it will perform good enough in your particular solution. Also, for clarity there is almost no exception handling and logging implemented. This is not an API; my goal is to give you a high level idea how you can use utilize the GPU for simulations. Therefore, it is important that you re-factor the code for your own use.

Some of the code is taken from original sample and is under Apache License V 2, the rest is my code which is free to use without any restriction or obligation.

## Conclusion

Thrust is a powerful library providing you with simple ways to accomplish complicated parallel computation tasks. There are many libraries like Thrust which are built on CUDA C. These libraries will save you many engineering hours on parallel algorithm implementation and allow you to concentrate on your real business problem. You can check out GPU Computing Webinars for presentations on this area.