## Large Scale Machine Learning using NVIDIA CUDA

## Introduction

You may have heard about Stanford University’s machine learning online course given by Prof. Andrew Ng in 2011; it was a great course with lots of real-world examples. During the course I realized that GPUs are the perfect solution for large scale machine learning problems. In fact, there are many examples of supervised and unsupervised learning all around the internet. Being a fan of both GPGPU and machine learning technologies, I came up with my own perspective on running machine learning algorithms with huge amounts of data on GPUs.

I recently presented this solution at the South Florida Code Camp 2012. There was a lot of interest in these two subjects; therefore I decided to share it on my blog. The example in this post is neither the only solution nor the best one. I hope it will help you one day solve your own machine learning problem.

There are many concepts in machine learning, but in this post I’m only scratching the surface. If you already know about GPGPU and machine learning, you can go straight to the source code at this link, download the Visual Studio 2010 projects and try it out.

I’ve also prepared the same example using CUBLAS with a vectorized implementation of the polynomial regression algorithm, but the CUBLAS example would require a more in-depth explanation. Therefore I’m posting this simplified implementation first. If you are interested in the CUBLAS implementation, please let me know and I can send you that copy.

## Background

### Machine Learning

If you are already familiar with machine learning, you can skip this brief introduction and jump directly to the Large Scale Machine Learning section. If you want to learn more about machine learning, please follow the links or check out the Stanford course I mentioned at the beginning.

Machine learning algorithms allow computers to recognize complex patterns. They focus on prediction, based on known properties learned from training data. We use machine learning algorithms dozens of times a day, perhaps without knowing it: every time we get a book or movie recommendation, or every time we do a web search. In 1959 Arthur Samuel described machine learning as a “field of study that gives computers the ability to learn without being explicitly programmed.” It has been a while since machine learning was first introduced, and it is gaining popularity again with the rise of Big Data.

Figure 1 shows how some machine learning processes work. In phase 1, given a data set, a machine learning algorithm can recognize complex patterns and come up with a model. In most cases this phase is the bulk of the computation. In the second phase, any given data can be run through the model to make a prediction. For example, if you have a data set of house prices by size, you could let the machine learn from the data set and have it predict the house price for any given size.

It does this by recognizing the function that defines the relation between the different features of the problem. A linear problem with two dimensions, like house price by size (the house size is the feature and the house price is the label), can be expressed with the model f(x) = ax + b. Figure 2 shows how one feature can be used in a linear regression problem to predict new house prices. The term “hypothesis” was used in the Stanford course to describe the model.

Depending on the data set, more complex functions can be used. In Figure 3 you can see how the complexity can easily grow from two-dimensional linear to hundreds of dimensions polynomial. In a spam filtering problem the features could be the words in the email, or in a face recognition problem the features could be the pixels of the image. In the house price prediction example, the features are the properties of the house which affect the price, e.g. size, room count, floors, neighborhood, crime rate, etc.

There are many machine learning algorithms for different problem types. The most common groups are supervised learning and unsupervised learning. Supervised learning is used on problems where we can provide the output values for the learning algorithm. For example, house prices for a set of house features are output values; therefore house price prediction is a supervised learning problem. Data with these output values is called “labeled data”. Unsupervised learning, on the other hand, does not require output values; patterns or hidden structures can be recognized from feature data alone. For example, clustering social data to determine groups of people by interest would not require defining any output value; therefore it is an unsupervised learning problem.

#### Gradient Descent

In supervised learning problems, the machine can learn the model and come up with a hypothesis by running the hypothesis with different variables and testing whether the result is close to the provided labels (calculating the error). Figure 4 shows how training data is plotted and the error is calculated. An optimization algorithm named Gradient Descent (Figure 5) can be used to find the optimum hypothesis. In this simple two-dimensional problem, the algorithm iterates over the values of “a” and “b”, trying to find the minimum total error.

The pseudo code below shows how the gradient descent algorithm in Figure 5 works:

```
loop until converged
    errors = 0
    for i = 1 to data.length
        fx = a * data[i] + b
        errors += (fx - labelData[i]) * data[i]
    end for
    a = a - learningRate * (1 / data.length) * errors
    // the update for b is analogous, with the error term not multiplied by data[i]
end loop
```
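The post only gives the pseudo code; as a minimal runnable C++ sketch of the same loop (with a made-up four-point data set and a fixed iteration count instead of a convergence test), it could look like this:

```cpp
#include <cstdio>
#include <vector>

int main() {
    // Hypothetical training set: house sizes and prices (scaled down).
    std::vector<float> data      = { 1.0f, 2.0f, 3.0f, 4.0f };
    std::vector<float> labelData = { 2.1f, 3.9f, 6.2f, 7.8f };

    float a = 0.0f, b = 0.0f;          // model f(x) = a*x + b
    const float learningRate = 0.01f;
    const int n = (int)data.size();

    for (int iter = 0; iter < 10000; ++iter) {
        float errA = 0.0f, errB = 0.0f;
        for (int i = 0; i < n; ++i) {
            float fx = a * data[i] + b;
            errA += (fx - labelData[i]) * data[i]; // gradient term for a
            errB += (fx - labelData[i]);           // gradient term for b
        }
        a -= learningRate * errA / n;
        b -= learningRate * errB / n;
    }
    std::printf("f(x) = %f * x + %f\n", a, b);
    return 0;
}
```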

#### Large Scale Machine Learning

Machine learning problems become computationally expensive when the complexity (dimensions and polynomial degree) increases and/or when the amount of data increases. Especially on big data sources with hundreds of millions of samples, the time to run the optimization algorithm increases dramatically. That is why we look for parallelization opportunities in the algorithms. The error summation of the gradient descent algorithm is a perfect candidate for parallelization. We could split the data into multiple parts and run gradient descent on these parts in parallel. In Figure 6 you can see how the data is split into four parts and fed into four different processors. In the next step the results are gathered together to run the rest of the algorithm.
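This split-and-gather step is easy to sketch on the CPU. The following C++11 illustration of the idea is mine, not from the sample project; the helper names and the fixed four-way split are assumptions:

```cpp
#include <future>
#include <vector>

// Partial error sum over one slice of the training data (hypothetical helper).
float partialError(const std::vector<float>& data, const std::vector<float>& labels,
                   float a, float b, size_t begin, size_t end) {
    float sum = 0.0f;
    for (size_t i = begin; i < end; ++i)
        sum += ((a * data[i] + b) - labels[i]) * data[i];
    return sum;
}

float totalError(const std::vector<float>& data, const std::vector<float>& labels,
                 float a, float b) {
    const size_t parts = 4, chunk = data.size() / parts;
    std::vector<std::future<float>> futures;
    for (size_t p = 0; p < parts; ++p) {
        size_t begin = p * chunk;
        size_t end = (p == parts - 1) ? data.size() : begin + chunk;
        futures.push_back(std::async(std::launch::async, partialError,
                                     std::cref(data), std::cref(labels),
                                     a, b, begin, end));
    }
    float total = 0.0f; // gather step: combine the four partial sums
    for (auto& f : futures) total += f.get();
    return total;
}
```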

Clearly, this approach would speed up the machine learning computation by almost four times. But what if we had more cores and could split the data further? That is where GPUs step in. With GPUs we can parallelize in two layers: multiple GPUs, and multiple cores in every GPU. Assuming a configuration with 4 GPUs of 512 cores each, we could split each of the four parts into 512 more pieces. Figure 7 shows this configuration along with the part parallelized on the GPU cores.

### GPGPU

Utilizing GPUs to enable dramatic increases in the performance of general-purpose scientific and engineering computing is called GPGPU. NVIDIA provides a parallel computing platform and programming model named CUDA for developing GPGPU software in C, C++ or Fortran that can run on any NVIDIA GPU. NVIDIA CUDA comes with many high-level APIs and libraries, like basic linear algebra, FFT, imaging etc., to let you concentrate on your business logic rather than re-writing well known algorithms.

You can visit my previous blog posts where I’ve explained how to use NVIDIA CUDA capable GPUs to perform massively parallel computations. The examples include Monte Carlo simulation, random number generators and sorting algorithms.

## House Price Prediction Example

In this post I’ll show you how you can implement house price prediction with NVIDIA CUDA. Given a house price data set based on bedrooms, square feet and year built, it is possible to let the machine learn from this data set and provide us with a model for future predictions. Because the error calculation part of the gradient descent algorithm is highly parallelizable, we can offload it to the GPU.

The machine learning algorithm in this example is polynomial regression, a form of the well-known linear regression algorithm. In polynomial regression the model is fit to a higher-order polynomial function. In our case we will be using bedrooms, square feet, year built, square root of bedrooms, square root of square feet, square root of year built, and the product of bedrooms and square feet. The reason we add the four polynomial terms to the function is the nature of our data. Fitting the curve correctly is the main idea behind building a model for our machine learning problem. Logically, house prices increase with these features, but not linearly or exponentially, and they don’t drop after a certain peak. Therefore the graph looks more like a square root function, where house prices increase less and less as a feature grows.
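To make the feature mapping concrete, here is a small sketch of the expansion described above. The helper name `expandFeatures` is mine; the actual project builds this matrix inside the DLL, and the bias term is added separately:

```cpp
#include <cmath>
#include <vector>

// Hypothetical helper: expand the three raw features into the seven
// polynomial features x1..x7 used by the hypothesis.
std::vector<float> expandFeatures(float bedrooms, float squareFeet, float yearBuilt) {
    return {
        bedrooms,               // x1
        squareFeet,             // x2
        yearBuilt,              // x3
        std::sqrt(bedrooms),    // x4
        std::sqrt(squareFeet),  // x5
        std::sqrt(yearBuilt),   // x6
        bedrooms * squareFeet   // x7
    };
}
```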

Finding the right polynomial terms is very important for the success of the machine learning algorithm: a very complex, tightly fitting function would generate too specific a model and end up overfitting, while a very simple function, like a straight line, would generate too general a model and end up underfitting. Therefore we use additional methods, like adding a regularization term, to provide a better fit to the data. Figure 8 shows the gradient descent algorithm including the regularization term lambda.
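Figure 8 is not reproduced here, but the regularized update typically adds a lambda-weighted shrinkage term to each constant. A rough sketch of one such step, assuming the common ridge-style form with the bias term left unregularized (this is my reconstruction, not code from the project):

```cpp
#include <vector>

// Sketch of one regularized gradient descent step (assumed ridge-style form;
// theta[0] is the bias term and is conventionally not regularized).
void updateWithRegularization(std::vector<float>& theta,
                              const std::vector<float>& errorSum,
                              float learningRate, float lambda, float m) {
    for (size_t f = 0; f < theta.size(); ++f) {
        float reg = (f == 0) ? 0.0f : (lambda / m) * theta[f];
        theta[f] -= learningRate * (errorSum[f] / m + reg);
    }
}
```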

### Application Architecture

The sample application consists of a native C++ DLL named LR_GPULib for the machine learning implementation on the GPU, and a C# Windows application named TestLRApp for the user interface. The DLL implements data normalization and polynomial regression using Thrust, the high-level parallel algorithms library for NVIDIA CUDA. I’ve covered Thrust in more detail in my previous blog posts, therefore I’m not going into much detail here. Figure 9 shows the application architecture and also the program flow from loading the training data all the way down to making a prediction.

The application provides the UI shown below in Figure 10 to load the data, train, and make a prediction with a new data set. The UI also shows the hypothesis at the bottom of the dialog with all constants and features.

## Implementation

The LR_GPU_Functors.cu file in the DLL contains the functors used as kernels in Thrust methods. The LR_GPU.cu file in the DLL contains the normalization, learning and prediction methods. The `Learn` method accepts the training data and the label data, which are all the features and all prices in two float arrays. The first thing the `Learn` method does is allocate memory, add the bias term and normalize the features. The reason we add the bias term is to simplify the gradient loop, and the reason we normalize the features is that the data ranges are too different; e.g. square feet is four digits while bedrooms is a single digit. By normalizing the features we bring them into the same range, between zero and one. Normalization is also executed on the GPU, using the `NormalizeFeatures` method. But normalization requires the mean and standard deviation (std), so these are calculated first and provided to the `NormalizeFeaturesByMeanAndStd` method, which performs the mean normalization.

```cpp
void NormalizeFeaturesByMeanAndStd(unsigned int trainingDataCount, float * d_trainingData,
                                   thrust::device_vector<float> dv_mean,
                                   thrust::device_vector<float> dv_std)
{
    //Calculate mean norm: (x - mean) / std
    unsigned int featureCount = dv_mean.size();
    float * dvp_Mean = thrust::raw_pointer_cast( &dv_mean[0] );
    float * dvp_Std = thrust::raw_pointer_cast( &dv_std[0] );
    FeatureNormalizationgFunctor featureNormalizationgFunctor(dvp_Mean, dvp_Std, featureCount);
    thrust::device_ptr<float> dvp_trainingData(d_trainingData);
    thrust::transform(thrust::counting_iterator<int>(0),
                      thrust::counting_iterator<int>(trainingDataCount * featureCount),
                      dvp_trainingData, dvp_trainingData, featureNormalizationgFunctor);
}
```
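The mean and std computation itself is not shown in the post. One way to compute a per-feature mean on the device with `thrust::transform_reduce` is sketched below; the `ColumnPicker` functor is my illustration, not part of the library or the project:

```cpp
#include <thrust/transform_reduce.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/functional.h>

// Hypothetical functor: picks one feature column out of the row-major
// training data so that column can be reduced on the GPU.
struct ColumnPicker {
    const float *data;
    unsigned int featureCount, column;
    ColumnPicker(const float *d, unsigned int fc, unsigned int c)
        : data(d), featureCount(fc), column(c) {}
    __host__ __device__ float operator()(int row) const {
        return data[row * featureCount + column];
    }
};

// One way to compute the mean of feature column f on the device:
float columnMean(const float *d_trainingData, unsigned int trainingDataCount,
                 unsigned int featureCount, unsigned int f) {
    float sum = thrust::transform_reduce(
        thrust::counting_iterator<int>(0),
        thrust::counting_iterator<int>(trainingDataCount),
        ColumnPicker(d_trainingData, featureCount, f),
        0.0f, thrust::plus<float>());
    return sum / trainingDataCount;
}
```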

The normalization code running on the GPU is implemented in the `FeatureNormalizationgFunctor` functor, which simply calculates `(data - mean) / std` in parallel for every element of the data, as seen below:

```cpp
...
__host__ __device__
float operator()(int tid, float trainingData)
{
    int columnIdx = tid % featureCount;
    float fnorm = trainingData - meanValue[columnIdx];
    if (stdValue[columnIdx] > 0.0)
        fnorm /= stdValue[columnIdx];
    return fnorm;
}
...
```

In the next step of the `Learn` method, gradient descent is calculated with the `for(int i = 0; i < gdIterationCount; i++)` loop. As I mentioned before, the error calculation part of gradient descent is executed in parallel, but the rest is calculated sequentially. `thrust::transform` is used with the `TrainFunctor` to calculate f(x) - y in parallel for every sample. f(x) is simply the hypothesis A*x1 + B*x2 + C*x3 + D*x4 + E*x5 + F*x6 + G*x7 + H, where x1 through x7 are the features (x1 = bedrooms, x2 = square feet, x3 = year built, x4 = square root of bedrooms, x5 = square root of square feet, x6 = square root of year built, x7 = the product of bedrooms and square feet) and A through H are the constants which gradient descent will find. This is shown with the green square on Figure 11. The `TrainFunctor` code snippet and its usage are shown below:

```cpp
...
__host__ __device__
float operator()(int tid, float labelData)
{
    float h = 0;
    for (int f = 0; f < featureCount; f++)
        h += hypothesis[f] * trainingData[tid * featureCount + f];
    return h - labelData;
}
...
```

```cpp
...
thrust::transform(thrust::counting_iterator<int>(0),
                  thrust::counting_iterator<int>(trainingDataCount),
                  dv_labelData.begin(), dv_costData.begin(), tf);
...
```

`thrust::transform_reduce` is used with the `TrainFunctor2` to apply the features to the error result and sum it all up. This is shown with the code snippet below and the red square on Figure 11. The rest of the `Learn` method calculates the gradient descent part marked with the blue square on Figure 11.

```cpp
float totalCost = thrust::transform_reduce(thrust::counting_iterator<int>(0),
                                           thrust::counting_iterator<int>(trainingDataCount),
                                           tf2, 0.0f, thrust::plus<float>());
```
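The sequential blue-square part itself is not listed in the post. Assuming the reduced sums are stored per constant, it could look roughly like this sketch of mine:

```cpp
// Hypothetical sketch of the sequential part of the gradient descent loop:
// each constant is moved against its averaged gradient. gradientSum[f] is
// assumed to hold the reduced error-times-feature sum for constant f.
void updateHypothesis(float *hypothesis, const float *gradientSum,
                      int featureCount, float learningRate, int trainingDataCount) {
    for (int f = 0; f < featureCount; ++f)
        hypothesis[f] -= learningRate * gradientSum[f] / trainingDataCount;
}
```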

Once gradient descent converges, the constants A through H of the hypothesis are returned to the TestLRApp with the `result` array.

As you may guess, prediction works by plugging the constants and new sample data into the hypothesis. This is done using the `Predict` method in the LR_GPULib library. As seen below, the `Predict` method normalizes the given feature set and then calculates the hypothesis from the constants and the normalized data with the help of the `PredictFunctor`. The result is the predicted house price for the given features.

```cpp
...
NormalizeFeaturesByMeanAndStd(testDataCount, pdv_testData, dv_mean, dv_std);

//Predict
PredictFunctor predictFunctor(pdv_testData, pdv_hypothesis, featureCount);
thrust::transform(thrust::counting_iterator<int>(0),
                  thrust::counting_iterator<int>(testDataCount),
                  dv_result.begin(), predictFunctor);
...
```

```cpp
struct PredictFunctor : public thrust::unary_function<int, float>
{
    float * testData;
    float * hypothesis;
    unsigned int featureCount;

    PredictFunctor(float * _testData, float * _hypothesis, unsigned int _featureCount)
        : testData(_testData), hypothesis(_hypothesis), featureCount(_featureCount) {}

    __host__ __device__
    float operator()(int tid)
    {
        float sum = 0;
        for(int i = 0; i < featureCount; i++)
            sum += testData[tid * featureCount + i] * hypothesis[i];
        return sum;
    }
};
```

## Conclusion

GPGPU, machine learning and Big Data are three rising fields in the IT industry. There is much more to these fields than what I’m covering in this post. The deeper I get into these fields, the more I see how well they fit together. I hope this sample gave you a basic idea and one perspective on how you can easily use NVIDIA CUDA for machine learning problems. As with any other software solution, this example is not the only way to do polynomial regression for house price prediction on GPUs. In fact, one enhancement would be to support multiple GPUs and split the data set into more parts.

## Massively Parallel Monte Carlo Simulation using GPU

## Introduction

In my previous blog posts I explained how you can utilize the GPU on your computer to perform massively parallel computation with the help of NVIDIA CUDA and Thrust. In this post I’m diving deeper into Thrust usage scenarios with a simple implementation of Monte Carlo simulation.

My inspiration was the PI estimation sample on the Thrust web site. The sample runs a Monte Carlo simulation with 10K samples on a unit circle to estimate the number PI. You can visit this Wikipedia page if you are interested in how Monte Carlo simulation can be used to approximate PI; it is a solution to the famous Buffon’s Needle problem.

I’m taking the original example one step further to show you how to pass device variables to functors in Thrust methods, using a slightly different problem. There are surely many other ways to implement the same logic, but in this post I’m concentrating on this specific implementation.

## Background

### About Monte Carlo Simulation

Monte Carlo simulation is an approach for solving deterministic problems with a probabilistic analog. That is exactly what we are doing in our example: estimating the area of intersecting disks. Monte Carlo methods are especially useful for simulating systems with many coupled degrees of freedom, such as fluids, disordered materials, strongly coupled solids, and cellular structures.

Our simulation predicts the intersection area of four overlapping unit disks, as seen in the image below (the intersection of disks A, B, C and D). The problem can also be solved easily with the help of geometry, as explained here. I calculated the area as 0.31515; the simulation estimated 0.3149.

### About Thrust

Writing code against the CUDA API is very powerful in terms of controlling the hardware, but there are high-level libraries like the Thrust C++ template library which provide many fundamental parallel primitives, like sorting, prefix-sums, reductions, transformations etc. The best part is that Thrust consists only of header files and is distributed with the CUDA 4.0 installation.

If you are not familiar with terms like GPGPU and Thrust, I suggest you check out the background information in my previous posts.

## Setup

The example is a console application written in C++, but you can easily turn it into a DLL to use from your C# application (see my previous posts).

I used Visual Studio 2010 to create the C++ console application. If you have not already, you need to install the NVIDIA CUDA Toolkit 4.0 and a supported graphics device driver from the same link. The new CUDA Toolkit 4.1 RC1 is also available at CUDA Zone, but the project files are built on 4.0. Also do not forget to install the Build Customization BUG FIX Update from the same link for CUDA Toolkit 4.0.

Once the CUDA Toolkit is installed, creating CUDA enabled projects is really simple. For those who are not familiar with native C++ CUDA enabled projects, please follow the steps below to create one:

- Create a Visual C++ console project in Visual Studio 2010 by selecting Empty project on the wizard,
- Open Build Customization from the C++ project context menu, and check the CUDA 4.0(.targets, .props) checkbox,
- Open the project properties, expand the Configuration Properties, Linker and select Input. Edit the additional dependencies and add cudart.lib.
- Add a new empty source file ending with .cu.

You can also skip the steps above and download the example solution and project files directly from here.

## Implementation

The main application consists of calling `thrust::transform_reduce` 50 times to run the intersection estimation simulation. `transform_reduce` performs a reduction on the transformation of the sequence [first, last) according to unary_op: the unary_op is applied to each element of the sequence, and the results are then reduced to a single value with binary_op.
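As a self-contained illustration of this pattern (my example, not from the sample project), the snippet below computes a sum of squares, where the `square` functor plays the role of unary_op and `thrust::plus<float>` the role of binary_op:

```cpp
#include <thrust/transform_reduce.h>
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <iostream>

// Unary op: squares one element; runs on the device during the reduction.
struct square {
    __host__ __device__ float operator()(float x) const { return x * x; }
};

int main() {
    thrust::device_vector<float> v(4, 2.0f); // [2, 2, 2, 2]
    float sumOfSquares = thrust::transform_reduce(
        v.begin(), v.end(), square(), 0.0f, thrust::plus<float>());
    std::cout << sumOfSquares << std::endl;  // prints 16
    return 0;
}
```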

The main code is as follows:

```cpp
int main(void)
{
    // use 50 independent seeds
    int M = 50;

    //Create some circles on the host
    thrust::host_vector<CIRCLE> dCircles;
    dCircles.push_back(CIRCLE(0.0f, 0.0f));
    dCircles.push_back(CIRCLE(1.0f, 0.0f));
    dCircles.push_back(CIRCLE(1.0f, 1.0f));
    dCircles.push_back(CIRCLE(0.0f, 1.0f));

    //The kernel can not access a host or device vector directly, therefore
    //get the device pointer to the circles to pass to the kernel
    thrust::device_vector<CIRCLE> circles = dCircles;
    CIRCLE * circleArray = thrust::raw_pointer_cast( &circles[0] );

    float estimate = thrust::transform_reduce(thrust::counting_iterator<int>(0),
                                              thrust::counting_iterator<int>(M),
                                              estimate_intersection(circleArray, circles.size()),
                                              0.0f, thrust::plus<float>());
    estimate /= M;

    std::cout << std::setprecision(6);
    //area calculated with geometry: (pi + 3 - 3*sqrt(3)) / 3 = 0.31515
    std::cout << "the area is estimated as " << estimate
              << ". It should be 0.31515." << std::endl;
    return 0;
}
```

The unary_op has the Monte Carlo simulation logic implemented in the `estimate_intersection` functor. `estimate_intersection` derives from the thrust::unary_function class and returns the estimated intersection area as a float. Using `estimate_intersection` in `transform_reduce` means estimating the intersection area for every data element provided to `transform_reduce`. For the data elements we use two thrust::counting_iterators. This creates a range filled with the sequence 0 to 49, without explicitly storing anything in memory. Using a sequence of numbers helps us assign a different thread id to every `estimate_intersection` call. This is important for generating a distinct seed for the random number generator of each simulation. (I mentioned random number generator seeds in my previous posts.)

For the reduction part of the `transform_reduce` we use the `thrust::plus<float>()` binary functor, which sums all results into one number. At the end we divide the result by 50 to find the average intersection area.

Our goal with this code is to run the simulation on the device (GPU) and retrieve the result back to the host. Therefore any data we are going to use in the simulation must be placed into device memory. That is exactly what happens before we call `thrust::transform_reduce`: we prepare the properties of all circles we will try to intersect using the `CIRCLE` struct defined below.

```cpp
struct CIRCLE
{
    float x, y;
    CIRCLE(float _x, float _y) : x(_x), y(_y) {}
};
```

With `thrust::host_vector<CIRCLE> dCircles;` in the main code, we define a vector object in host memory. Using a Thrust host vector object instead of custom memory management simplifies transferring data directly to the device with the `thrust::device_vector<CIRCLE> circles = dCircles;` call. As you may know, transferring data between device and host memory in CUDA C is handled with `cudaMemcpy`, but Thrust overloads the assignment operator, which lets you copy memory easily.
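A tiny standalone example of these copies (my illustration, not from the project):

```cpp
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>

int main() {
    thrust::host_vector<float> h(100, 1.0f);  // host memory
    thrust::device_vector<float> d = h;       // host-to-device copy (cudaMemcpy underneath)
    d[0] = 42.0f;                             // element access issues a tiny transfer
    thrust::host_vector<float> back = d;      // device-to-host copy
    return 0;
}
```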

Back in the main code, on the next line we access the raw pointer of the `circles` object with the help of the `thrust::raw_pointer_cast` method. We do this because the `estimate_intersection` functor can only accept a device pointer to the CIRCLE object array.

### Simulation Method

The `estimate_intersection` unary function implements the simulation logic. A unary function is a function which takes one argument, has a () operator overload, and returns one value. In our case the function takes the unique index number generated by the `thrust::counting_iterator` and returns the area of the intersection as a float. Another important part of the struct is the constructor (seen below), which takes the device pointer to the CIRCLE array and the length of the allocated memory.

```cpp
struct estimate_intersection : public thrust::unary_function<unsigned int, float>
{
    CIRCLE * Circles;
    int CircleCount;

    estimate_intersection(CIRCLE * circles, int circleCount)
        : Circles(circles), CircleCount(circleCount) {}

    __host__ __device__
    float operator()(unsigned int thread_id)
    {
        float sum = 0;
        unsigned int N = 30000; // samples per thread

        // seed a random number generator
        unsigned int seed = hash(thread_id);
        thrust::default_random_engine rng(seed);

        // create a mapping from random numbers to [0,1)
        thrust::uniform_real_distribution<float> u01(0, 1);

        // take N samples
        for(unsigned int i = 0; i < N; ++i)
        {
            // draw a sample from the unit square
            double x = u01(rng);
            double y = u01(rng);
            bool inside = false;

            //check if the point is inside all circles
            for(int k = 0; k < CircleCount; ++k)
            {
                //check if the point is further from
                //the center of the circle than the radius
                double dx = Circles[k].x - x;
                double dy = Circles[k].y - y;
                if ((dx*dx + dy*dy) <= 1)
                {
                    inside = true;
                }
                else
                {
                    inside = false;
                    break;
                }
            }

            if (inside)
                sum += 1.0f;
        }

        // divide by N
        return sum / N;
    }
};
```

In order to run the code on the device and call it from the host, the () operator overload has to be declared with `__host__ __device__`. The rest of the code is the Monte Carlo simulation logic, as follows:

1) Initialize the Thrust default random number generator, seeded from the thread id

2) Generate 30K random x and y sample points

3) Loop through all circles and check whether the point is inside each circle, by comparing its squared distance from the circle’s center with the squared radius

4) If the point is inside all circles, increase the inside-points count

5) Return the inside-points count divided by the number of samples

That’s it! I hope you enjoy it.

In addition to the code I included here, there are header includes and a hashing algorithm. You can download the code from here.

## About the Implementations

The Monte Carlo simulation I provided in this post is an example, and therefore I’m not guaranteeing that it will perform well enough in your particular solution. Also, for clarity there is almost no exception handling or logging implemented. This is not an API; my goal is to give you a high-level idea of how you can utilize the GPU for simulations. Therefore, it is important that you re-factor the code for your own use.

Some of the code is taken from the original sample and is under Apache License V2; the rest is my code, which is free to use without any restriction or obligation.

## Conclusion

Thrust is a powerful library that provides simple ways to accomplish complicated parallel computation tasks. There are many libraries like Thrust built on CUDA C. These libraries will save you many engineering hours of parallel algorithm implementation and let you concentrate on your real business problem. You can check out GPU Computing Webinars for presentations in this area.

## Massively Parallel RNG using CUDA C, Thrust and C#

## Introduction

In this post I’ll give you some simple examples of how to use the massively parallel GPU on your computer to generate uniformly distributed pseudo-random numbers. Why the GPU? Because it is orders of magnitude faster than the CPU for this, does not occupy your CPU time, and is already in all computers, among many other reasons I mentioned in my previous post. While there are maybe hundreds of ways to generate pseudo-random numbers, I cover only four ways to do it on NVIDIA cards using CUDA-related APIs:

1) A Basic Linear Congruential Generator (LCG) implementation using CUDA C

2) A Thrust C++ template library implementation

3) An NVIDIA CURAND implementation

4) A Mersenne Twister implementation using CUDA C

In order to demonstrate how to utilize the GPU, the implementations are provided as DLLs and used from a C# sample application. There are surely many other APIs and ways worth talking about for utilizing your GPU from your C# application, but this post’s scope is limited to the subjects mentioned above. I suggest you visit http://gpgpu.org/, if you have not already, to see the endless possibilities in this area.

While I was preparing these samples I found that visualizing the data is very important for understanding the algorithms. Therefore, I used Microsoft WPF in C# to visualize the generated random numbers. You can use your own application and copy the classes under the RNGLibs folder.

All code can be downloaded from this link: **RNGTests09182011.zip**.

## Background

### About Random Number Generators (RNG)

The generation of random numbers is important in many applications like simulation, cryptography, sampling, and especially statistics. A sequence of numbers is random when it does not have a recognizable pattern in it, or in other words, when it is non-deterministic. Although non-deterministic random numbers are ideal, computer-generated, deterministic random numbers can be statistically “random enough”. These are called pseudo-random numbers, and they can have easily identifiable patterns if the algorithm is not chosen wisely. ( http://en.wikipedia.org/wiki/Pseudorandom_number_generator )

There are many pseudo-random number generators and also many different implementations of them in sequential and parallel environments. ( http://en.wikipedia.org/wiki/List_of_pseudorandom_number_generators ) In this post I used only Linear Congruential Generators, Xorshift and Mersenne Twister. Therefore, I explain only these three algorithms, but you can use CUDA to develop other RNGs as well.

### Thrust

As I mentioned in my previous post, writing code against the CUDA API is very powerful in terms of controlling the hardware, but there are high-level libraries like the Thrust C++ template library which provide many fundamental parallel primitives, like sorting, prefix-sums, reductions, transformations etc. The best part is that Thrust consists only of header files and is distributed with the CUDA 4.0 installation.

## Projects

I used Visual Studio 2010 to host one C# Windows application and native C++ DLLs for the RNG implementations, as seen in the solution structure below:

- RNGVisual (Main C# application)
- CUDACRNGLib (CUDA C RNG implementation)
- CURANDRNGLib (CURAND RNG implementation)
- ThrustRNGLib (Thrust RNG Implementation)
- MersenneTwisterRNGLib (Mersenne Twister RNG implementation)

The only additional dependency is the NVIDIA CUDA Toolkit 4.0, which you will need to install along with a supported graphics device driver from the same link. Also do not forget to install the Build Customization BUG FIX Update from the same link or from here.

Once the CUDA Toolkit is installed, creating CUDA enabled projects is really simple. For those who are not familiar with native C++ CUDA enabled projects, please follow the steps below to create one:

- Create a Visual C++ console project in Visual Studio 2010 by selecting DLL and Empty project on the wizard,
- Open Build Customization from the C++ project context menu, and check the CUDA 4.0(.targets, .props) checkbox,
- Open the project properties, expand the Configuration Properties, Linker and select Input. Edit the additional dependencies and add cudart.lib.
- Add a new empty source file ending with .cu.

## Implementation

### WPF Application

The RNGVisual C# WPF application visualizes the random numbers in 2D and 3D. It allows you to select an RNG algorithm (.NET, CUDA C, CURAND, Thrust or Mersenne Twister) and to set some display and processor parameters. With any number count below 10K, all RNGs finish in about one millisecond, and most of the time is spent drawing the squares on the screen. Therefore, the timing should not confuse you in terms of performance comparison. You can run the algorithms with 100K numbers without the visualization and see the difference on your hardware. But please be aware that it is better to use CUDA events with cudaEventRecord to time GPU execution more precisely.

RNGVisual implements various proxy classes which use platform invoke to call the RNG methods exported from the native C++ DLLs. I used the same export and import technique in my previous post. The RNG libraries have the following two exports, one for the CPU implementation and one for the GPU implementation:

extern "C" __declspec(dllexport) void __cdecl GPU_RNG(float*, unsigned int, unsigned int); extern "C" __declspec(dllexport) void __cdecl CPU_RNG(float*, unsigned int, unsigned int);

The first argument is a pointer to the memory location to hold the random numbers. The second argument is the size of the array and the last argument is the initial seed.

An important point of random number generation is selecting the seed value, because the same seed will give the same result. While there are many different techniques studied, I used my own method of combining the current time, CPU load and available physical memory with the help of Windows Management Instrumentation (WMI); it still does not perform well in multi-threaded solutions, but it at least gives a better random start. The implementation is in the `CPUHelper` class of the RNGVisual application.

### A Linear Congruential Generator (LCG) implementation using CUDA C

The first RNG project uses the native CUDA Runtime API to implement one of the oldest and best-known pseudo-random number generator algorithms, the LCG. An LCG is fast and requires minimal memory to retain state; therefore it is very efficient for simulating multiple independent streams. But LCGs have some disadvantages and should not be used for applications where high-quality randomness is critical. In fact, the simple example I implemented repeats numbers within a very short period and should be enhanced with methods like those explained in GPU Gems 3 (37-4).

LCG is as simple as the formula below: starting with a seed Xn, the next random number is determined by `(a * seed + c) mod m`.

Xn+1 = (a * Xn + c) mod m

where 0 < m, 0 < a < m, 0 <= X0 < m and 0 <= c < m

Below is a sequential implementation of the LCG algorithm, which generates 100 pseudo-random numbers:

```cpp
random[0] = 123; //some initial seed
for(int i = 1; i < 100; i++)
    random[i] = (a * random[i-1] + c) % m;
```

The CUDACRNGLib project has a very basic implementation of LCG, distributing the work onto 256 threads. Because the same seed will result in the same random number sequence, we first generate a different random seed for every thread. When the kernel below is executed, every thread generates one section of the random number sequence:

```cpp
__global__ void RNGKernel(float * randomNumbers, unsigned int numberCount,
                          unsigned int * seeds, unsigned int c, unsigned int a,
                          unsigned int M)
{
    int startIdx = threadIdx.x * numberCount;
    unsigned int x = seeds[threadIdx.x];
    for(int i = 0; i < numberCount; i++)
    {
        x = (a * x + c) % M; //M is shown for purpose of example
        randomNumbers[startIdx + i] = (float)x / (float)M; //normalize
    }
}
```
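The host-side launch code is not shown in the post; a minimal sketch of what it could look like, with one block of 256 threads and illustrative names of my own, is:

```cpp
#include <cuda_runtime.h>

// Hypothetical host-side wrapper around RNGKernel: one block of 256 threads,
// each producing numbersPerThread values (names are mine, not the project's).
void RunRNGKernel(float *h_randomNumbers, unsigned int numbersPerThread,
                  unsigned int *h_seeds, unsigned int c, unsigned int a, unsigned int M) {
    const unsigned int threadCount = 256;
    const size_t outBytes  = threadCount * numbersPerThread * sizeof(float);
    const size_t seedBytes = threadCount * sizeof(unsigned int);

    float *d_randomNumbers; unsigned int *d_seeds;
    cudaMalloc((void**)&d_randomNumbers, outBytes);
    cudaMalloc((void**)&d_seeds, seedBytes);
    cudaMemcpy(d_seeds, h_seeds, seedBytes, cudaMemcpyHostToDevice);

    RNGKernel<<<1, threadCount>>>(d_randomNumbers, numbersPerThread, d_seeds, c, a, M);

    cudaMemcpy(h_randomNumbers, d_randomNumbers, outBytes, cudaMemcpyDeviceToHost);
    cudaFree(d_randomNumbers);
    cudaFree(d_seeds);
}
```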

As I mentioned before, this implementation is simplified to give you an idea how to start using CUDA C. It even has a static block count of one and a thread count of 256. If you plan to write production code, it is better to launch many blocks of threads. You may want to check out the better implementation in GPU Gems 3 (37-7), or check out Arnold and Meel’s implementation, which also provides better randomness.

### A Thrust C++ template library implementation

The Thrust library’s default random engine ( default_random_engine ) is a Linear Congruential Generator ( this may change in the future ) with a = 48271, c = 0 and m = 2^31 - 1. Because c equals zero, the algorithm is also called the multiplicative congruential method or Lehmer RNG.
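As a quick illustration (my example, not from the library documentation), the engine can be exercised on the host with just a few lines:

```cpp
#include <thrust/random.h>
#include <iostream>

int main() {
    thrust::default_random_engine rng(1234);            // seeded engine
    thrust::uniform_real_distribution<float> u01(0, 1); // map draws to [0,1)
    for (int i = 0; i < 5; ++i)
        std::cout << u01(rng) << std::endl;             // host-side draws
    return 0;
}
```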

The ThrustRNGLib has a very basic implementation of the Thrust default random engine, running the following functor to generate one random number. A functor is a C++ class that overloads *operator()* and therefore can be called like an ordinary function. Thrust provides `unary_function` and `binary_function` base structs; below I used `unary_function` because my functor requires one argument to be passed into the function:

```cpp
struct RandomNumberFunctor : public thrust::unary_function<unsigned int, float>
{
    unsigned int mainSeed;

    RandomNumberFunctor(unsigned int _mainSeed) : mainSeed(_mainSeed) {}

    __host__ __device__
    float operator()(unsigned int threadIdx)
    {
        unsigned int seed = hash(threadIdx) * mainSeed;

        // seed a random number generator
        thrust::default_random_engine rng(seed);

        // create a mapping from random numbers to [0,1)
        thrust::uniform_real_distribution<float> u01(0, 1);

        return u01(rng);
    }
};
```

Using Thrust to utilize the GPU is the simplest way to go. You can see the difference by comparing the GPU_RNG below with the CUDACRNGLib GPU_RNG implementation. Using CUDA C gives you full control of the toolkit, but it comes with the price of writing more code.

```cpp
extern void GPU_RNG(float * h_randomData, unsigned int dataCount, unsigned int mainSeed)
{
    //Allocate device vector
    thrust::device_vector<float> d_rngBuffer(dataCount);

    //generate random numbers
    thrust::transform(thrust::counting_iterator<int>(0),
                      thrust::counting_iterator<int>(dataCount),
                      d_rngBuffer.begin(), RandomNumberFunctor(mainSeed));

    //copy the random numbers back to the host
    thrust::copy(d_rngBuffer.begin(), d_rngBuffer.end(), h_randomData);
}
```

Another nice part of Thrust is that every algorithm (except copy) exists for the CPU as well as the GPU. The CPU implementation is also just three lines of code: this time `thrust::generate` generates the random numbers using the C++ standard library `rand` method, and then `thrust::transform` normalizes the integer results into floats with the help of the `[](float n) {return n / (RAND_MAX + 1);}` lambda expression. I used a lambda expression instead of a functor to show you this possibility as well. Lambda expressions play a big role in the upcoming Microsoft C++ AMP; they are handy in C++ as well as in C#, but they come at the price of giving up unit testing of the inline expression.

### An NVIDIA CURAND implementation

The NVIDIA CURAND library provides an API for simple and efficient generation of high-quality pseudo-random and quasi-random numbers. The CURAND library’s default pseudo-random engine is a XORWOW implementation of the Xorshift RNG (page 5), and it produces higher quality random numbers than an LCG.

In order to start using CURAND, you only need to include the `curand.h` header and add `curand.lib` to the additional dependencies in the Linker settings.

Like the ThrustRNGLib Thrust implementation, the CURANDRNGLib has a very basic implementation, running the following main code to generate a series of random numbers:

```cpp
....
//Create a new generator
curandCreateGenerator(&m_prng, CURAND_RNG_PSEUDO_DEFAULT);

//Set the generator options
curandSetPseudoRandomGeneratorSeed(m_prng, (unsigned long) mainSeed);

//Generate random numbers
curandGenerateUniform(m_prng, d_randomData, dataCount);
....
```
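The snippet above omits the device buffer management around the generator calls. A minimal sketch of the full flow (variable names mine, error checking omitted for brevity) could look like this:

```cpp
#include <cuda_runtime.h>
#include <curand.h>

// Sketch of the surrounding allocation, copy-back and cleanup.
void GenerateOnGPU(float *h_randomData, size_t dataCount, unsigned long mainSeed) {
    float *d_randomData;
    cudaMalloc((void**)&d_randomData, dataCount * sizeof(float));

    curandGenerator_t m_prng;
    curandCreateGenerator(&m_prng, CURAND_RNG_PSEUDO_DEFAULT);
    curandSetPseudoRandomGeneratorSeed(m_prng, mainSeed);
    curandGenerateUniform(m_prng, d_randomData, dataCount);

    cudaMemcpy(h_randomData, d_randomData, dataCount * sizeof(float),
               cudaMemcpyDeviceToHost);
    curandDestroyGenerator(m_prng);
    cudaFree(d_randomData);
}
```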

Besides the `curandCreateGenerator` method, CURAND provides the `curandCreateGeneratorHost` method to generate random numbers on the CPU instead of the GPU. Therefore the CPU part is as simple as the GPU part.

### A Mersenne Twister implementation using CUDA C

Mersenne Twister (MT) is an algorithm developed by Makoto Matsumoto and Takuji Nishimura that provides very fast generation of high-quality random numbers. (MT Home Page) A common Mersenne Twister implementation uses an LCG to generate its seed data.

There are two MT algorithms originally suited for use with CUDA: TinyMT and Mersenne Twister for Graphics Processors (MTGP). But I adapted part of the code from the NVIDIA CUDA Toolkit 4.0 MersenneTwister sample, which uses the original code from Makoto Matsumoto anyway.

The Mersenne Twister RNG is maybe the most complicated of the four implementations provided here, but in exchange you can look into the algorithm, unlike with CURAND. The `MersenneTwisterRNG.cpp` file in the `MersenneTwisterRNGLib` project is the entry point to the library and exports the same `GPU_RNG` and `CPU_RNG` methods as the other libraries. I simplified the host code as much as possible and placed all GPU logic into the `GPURNG.cu` file. The remaining host code can be seen below:

```cpp
extern void GPU_RNG(float * h_randomData, unsigned int dataCount, unsigned int mainSeed)
{
    float * d_randomData = 0;

    //load GPU twisters configuration
    if(!loadMTGPU("MersenneTwister.dat"))
        return;
    seedMTGPU(mainSeed);

    //find the rounded up data count,
    //because the generator generates in multiples of 4096
    int numbersPerRNG = iAlignUp(iDivUp(dataCount, MT_RNG_COUNT), 2);
    int randomDataCount = MT_RNG_COUNT * numbersPerRNG;

    //allocate device memory
    size_t randomDataSize = randomDataCount * sizeof(float);
    cudaMalloc((void**)&d_randomData, randomDataSize);

    //Call the generator
    RNGOnGPU(32, 128, d_randomData, numbersPerRNG);

    //Make sure all GPU work is done
    cudaDeviceSynchronize();

    //Copy memory back to the host
    cudaMemcpy(h_randomData, d_randomData, dataCount * sizeof(float), cudaMemcpyDeviceToHost);

    //free device memory
    cudaFree(d_randomData);
}
```

## About the Implementations

The pseudo-random number generators I provided in this post are widely used algorithms, but I’m still not guaranteeing that any of them will perform well enough in your particular solution. In fact, I left some generators poor on purpose, to point out the core algorithm and provide variation in randomness. Also, for the sake of clarity, there is almost no exception handling or logging implemented. This is not an API; my goal is to give you a high-level idea of how you can use Thrust, CUDA C and CURAND to generate pseudo-random numbers. Therefore, it is important that you research the algorithms online and re-factor the code for your own use.

Some of the code is taken from the original NVIDIA samples and carries their copyright notices; the rest is my code, which is free to use without any restriction or obligation.

## Conclusion

As in every field of computer science, there are many ways to solve a problem, and the possibilities expand exponentially. I just scratched the surface of using CUDA to add pseudo-random number generation to your C# application. I hope this post will help you in your project.