Massively Parallel Monte Carlo Simulation using GPU
Introduction
In my previous blog posts I explained how you can utilize the GPU on your computer to perform massively parallel computation with the help of NVIDIA CUDA and Thrust technologies. In this blog post I'm diving deeper into Thrust usage scenarios with a simple implementation of a Monte Carlo simulation.
My inspiration was the pi estimation sample on the Thrust web site. That sample runs a Monte Carlo simulation with 10K samples on a unit circle to estimate the number pi. You can visit this Wikipedia page if you are interested in how Monte Carlo simulation can be used to approximate pi; it is essentially a solution to the famous Buffon's Needle problem.
I'm taking the original example one step further to show you how to pass device variables to functors in Thrust methods, and I'm also using a slightly different problem. There are certainly other ways to implement the same logic, but in this post I'm concentrating on this specific implementation.
Background
About Monte Carlo Simulation
Monte Carlo simulation is an approach that solves a deterministic problem with a probabilistic analog, which is exactly what we are doing in our example: estimating the area of intersecting disks. Monte Carlo methods are especially useful for simulating systems with many coupled degrees of freedom, such as fluids, disordered materials, strongly coupled solids, and cellular structures.
Our simulation predicts the intersection area of four overlapping unit disks (the intersection of disks A, B, C and D) as seen in the image below. The problem can also be solved exactly with a bit of geometry, as explained here; the closed form gives an area of 0.31515, while the simulation estimated 0.3149.
About Thrust
Writing code against the CUDA API is very powerful in terms of controlling the hardware, but there are also high-level libraries like the Thrust C++ template library, which provides many fundamental parallel primitives like sorting, prefix sums, reductions, transformations etc. The best part is that Thrust consists only of header files and is distributed with the CUDA 4.0 installation.
If you are not familiar with terms like GPGPU and Thrust, I suggest you check out the background information in my previous posts.
Setup
The example is a console application written in C++, but you can easily transform it into a DLL to use from your C# application (see my previous posts).
I've used Visual Studio 2010 to create the C++ console application. If you have not already, you need to install the NVIDIA CUDA Toolkit 4.0 and a supported graphics device driver from the same link. The new CUDA Toolkit 4.1 RC1 is also available at the CUDA zone, but the project files are built on 4.0. Also, do not forget to install the Build Customization BUG FIX Update for CUDA Toolkit 4.0 from the same link.
Once the CUDA Toolkit is installed, creating CUDA enabled projects is really simple. For those who are not familiar with native C++ CUDA enabled projects, please follow the steps below to create one:
 Create a Visual C++ console project in Visual Studio 2010 by selecting Empty project on the wizard,
 Open Build Customization from the C++ project context menu, and check the CUDA 4.0(.targets, .props) checkbox,
 Open the project properties, expand the Configuration Properties, Linker and select Input. Edit the additional dependencies and add cudart.lib.
 Add a new empty source file ending with .cu.
You can also skip the steps above and download the example solution and project files directly from here.
Implementation
The main application uses a single thrust::transform_reduce call to run the intersection estimation simulation 50 times. transform_reduce performs a reduction on the transformation of the sequence [first, last) according to unary_op: unary_op is applied to each element of the sequence, and the results are then reduced to a single value with binary_op.
The main code is as follows:
int main(void)
{
    // use 50 independent seeds
    int M = 50;

    // create some circles on the host
    thrust::host_vector<CIRCLE> dCircles;
    dCircles.push_back(CIRCLE(0.0f, 0.0f));
    dCircles.push_back(CIRCLE(1.0f, 0.0f));
    dCircles.push_back(CIRCLE(1.0f, 1.0f));
    dCircles.push_back(CIRCLE(0.0f, 1.0f));

    // The kernel can not access a host or device vector directly,
    // therefore get the device pointer to the circles to pass to the kernel
    thrust::device_vector<CIRCLE> circles = dCircles;
    CIRCLE * circleArray = thrust::raw_pointer_cast(&circles[0]);

    float estimate = thrust::transform_reduce(
        thrust::counting_iterator<int>(0),
        thrust::counting_iterator<int>(M),
        estimate_intersection(circleArray, circles.size()),
        0.0f,
        thrust::plus<float>());
    estimate /= M;

    std::cout << std::setprecision(6);
    // area calculated with geometry: (pi + 3 - 3*sqrt(3)) / 3 = 0.31515
    std::cout << "the area is estimated as " << estimate
              << ". It should be 0.31515." << std::endl;
    return 0;
}
The unary_op has the Monte Carlo simulation logic implemented in the estimate_intersection functor. estimate_intersection derives from the thrust::unary_function class and returns the estimated intersection area as a float. Using estimate_intersection in transform_reduce means the intersection area is estimated once for every input element. For the input elements we use two thrust::counting_iterators, which describe a range filled with the sequence 0 to 49 without explicitly storing anything in memory. Using a sequence of numbers lets us assign a different thread id to every estimate_intersection call, which is important for generating a distinct seed for the random number generator of each simulation. (I've mentioned random number generator seeds in my previous posts.)
For the reduction part of transform_reduce we use the thrust::plus<float>() binary functor, which sums all results into one number. Finally, we divide the result by 50 to get the average intersection area.
Our goal with this code is to run the simulation on the device (GPU) and retrieve the result back to the host. Therefore any data the simulation uses must be placed into device memory, which is exactly what happens before we call thrust::transform_reduce: we prepare the properties of all circles we will intersect using the CIRCLE struct defined below.
struct CIRCLE
{
    float x, y;
    CIRCLE(float _x, float _y) : x(_x), y(_y) {}
};
With thrust::host_vector<CIRCLE> dCircles; in the main code, we define a vector object in host memory. Using a Thrust host vector instead of manually managed memory simplifies transferring the data to the device: the thrust::device_vector<CIRCLE> circles = dCircles; assignment performs the copy. As you may know, transferring data between device and host memory in CUDA C is handled with cudaMemcpy, but Thrust overloads the assignment operator, which allows you to copy memory easily.
On the next line we obtain the raw pointer of the circles object with the help of the thrust::raw_pointer_cast method. We do this because the estimate_intersection functor can only accept a raw device pointer to the CIRCLE array.
Simulation Method
The estimate_intersection unary function implements the simulation logic. A unary function is a function which takes one argument, has an operator() overload, and returns one value. In our case the function takes the unique index number generated by the thrust::counting_iterator and returns the area of the intersection as a float. Another important part is the constructor (seen below), which takes the device pointer to the CIRCLE array and the length of the allocated array.
struct estimate_intersection : public thrust::unary_function<unsigned int, float>
{
    CIRCLE * Circles;
    int CircleCount;

    estimate_intersection(CIRCLE * circles, int circleCount)
        : Circles(circles), CircleCount(circleCount) {}

    __host__ __device__
    float operator()(unsigned int thread_id)
    {
        float sum = 0;
        unsigned int N = 30000; // samples per thread

        // seed a random number generator
        unsigned int seed = hash(thread_id);
        thrust::default_random_engine rng(seed);

        // create a mapping from random numbers to [0,1)
        thrust::uniform_real_distribution<float> u01(0, 1);

        // take N samples
        for(unsigned int i = 0; i < N; ++i)
        {
            // draw a sample from the unit square
            double x = u01(rng);
            double y = u01(rng);

            bool inside = false;

            // check if the point is inside all circles
            for(int k = 0; k < CircleCount; ++k)
            {
                // check if the point is farther from the
                // center of the circle than the radius
                double dx = Circles[k].x - x;
                double dy = Circles[k].y - y;
                if ((dx * dx + dy * dy) <= 1)
                {
                    inside = true;
                }
                else
                {
                    inside = false;
                    break;
                }
            }

            if (inside)
                sum += 1.0f;
        }

        // divide by N
        return sum / N;
    }
};
In order to run the code on the device and call it from the host, the operator() overload has to be declared __host__ __device__. The rest of the code is the Monte Carlo simulation logic, as follows:
1) Initialize the Thrust default random number generator with a distinct seed
2) Generate 30K random x and y values (points in the unit square)
3) Loop through all circles and check whether the point lies within each circle by comparing its squared distance to the center against the squared radius
4) If the point is inside all circles, increase the hit count
5) Return the hit count divided by the number of samples
That’s it! I hope you enjoy it.
In addition to the code I included here, there are header includes and a hashing algorithm. You can download the code from here.
About the Implementations
The Monte Carlo simulation I provided in this post is an example, and therefore I'm not guaranteeing that it will perform well enough in your particular solution. Also, for clarity there is almost no exception handling or logging implemented. This is not an API; my goal is to give you a high level idea of how you can utilize the GPU for simulations. Therefore, it is important that you refactor the code for your own use.
Some of the code is taken from the original sample and is under the Apache License v2; the rest is my code, which is free to use without any restriction or obligation.
Conclusion
Thrust is a powerful library providing you with simple ways to accomplish complicated parallel computation tasks. There are many libraries like Thrust which are built on CUDA C. These libraries will save you many engineering hours on parallel algorithm implementation and allow you to concentrate on your real business problem. You can check out GPU Computing Webinars for presentations on this area.
Massively Parallel RNG using CUDA C, Thrust and C#
Introduction
In this post I'll give you some simple examples of how to use the massively parallel GPU on your computer to generate uniformly distributed pseudorandom numbers. Why the GPU? Because it is orders of magnitude faster than the CPU, it does not occupy your CPU time, it is already in all computers, and many other reasons I mentioned in my previous post. While there are probably hundreds of ways to generate pseudorandom numbers, I only cover four ways to do it on NVIDIA cards using CUDA related APIs:
1) A Basic Linear Congruential Generator (LCG) implementation using CUDA C
2) A Thrust C++ template library implementation
3) An NVIDIA CURAND implementation
4) A Mersenne Twister implementation using CUDA C
In order to demonstrate how to utilize the GPU, the implementations are provided as DLLs and used within a C# sample application. There are many other APIs and approaches worth talking about for utilizing your GPU from C#, but this post's scope is limited to the subject above. I suggest you visit http://gpgpu.org/, if you have not already, to see the endless possibilities in this area.
While I was preparing these samples I saw that visualizing the data is very important to understand the algorithms. Therefore, I used Microsoft WPF on C# to visualize the generated random numbers. You can use your own application and copy the classes under the RNGLibs folder.
All code can be downloaded from this link: RNGTests09182011.zip.
Background
About Random Number Generators (RNG)
The generation of random numbers is important in many applications like simulations, cryptography, sampling and, above all, statistics. A sequence of numbers is random when it does not have a recognizable pattern in it, or in other words, when it is nondeterministic. Although nondeterministic random numbers are ideal, computer generated, deterministic random numbers can be statistically "random enough". Such numbers are called pseudorandom numbers and can have easily identifiable patterns if the algorithm is not chosen wisely. ( http://en.wikipedia.org/wiki/Pseudorandom_number_generator )
There are many pseudorandom number generators and also many different implementations of them in sequential and parallel environments. ( http://en.wikipedia.org/wiki/List_of_pseudorandom_number_generators ) In this post I use only Linear Congruential Generators, Xorshift and Mersenne Twister. Therefore, I explain only these three algorithms, but you can use CUDA to develop other RNGs as well.
Thrust
As I mentioned in my previous post, writing code against the CUDA API is very powerful in terms of controlling the hardware, but there are high-level libraries like the Thrust C++ template library, which provides many fundamental parallel primitives like sorting, prefix sums, reductions, transformations etc. The best part is that Thrust consists only of header files and is distributed with the CUDA 4.0 installation.
Projects
I’ve used Visual Studio 2010 to host one C# Windows Application and native C++ dlls for RNG implementations as seen in the solution structure below:
 RNGVisual (Main C# application)
 CUDACRNGLib (CUDA C RNG implementation)
 CURANDRNGLib (CURAND RNG implementation)
 ThrustRNGLib (Thrust RNG Implementation)
 MersenneTwisterRNGLib (Mersenne Twister RNG implementation)
The only additional API is the NVIDIA CUDA Toolkit 4.0, which you will need to install along with a supported graphics device driver from the same link. Also do not forget to install the Build Customization BUG FIX Update from the same link or from here.
Once the CUDA Toolkit is installed, creating CUDA enabled projects is really simple. For those who are not familiar with native C++ CUDA enabled projects, please follow the steps below to create one:
 Create a Visual C++ console project in Visual Studio 2010 by selecting DLL and Empty project on the wizard,
 Open Build Customization from the C++ project context menu, and check the CUDA 4.0(.targets, .props) checkbox,
 Open the project properties, expand the Configuration Properties, Linker and select Input. Edit the additional dependencies and add cudart.lib.
 Add a new empty source file ending with .cu.
Implementation
WPF Application
The RNGVisual C# WPF application visualizes the random numbers in 2D and 3D. It lets you select an RNG algorithm (.Net, CUDA C, CURAND, Thrust or Mersenne Twister) and set some display and processor parameters. With any number count below 10K, all RNGs finish in about one millisecond, and most of the time is spent drawing the squares to the screen; the displayed time should therefore not be read as a performance comparison. You can run the algorithms with 100K numbers without the visualization and see the difference on your hardware. But please be aware that it is better to use CUDA events with cudaEventRecord to time GPU execution more precisely.
RNGVisual implements various proxy classes, which use platform invoke to call RNG methods exported from the native C++ DLLs. I used the same export and import technique in my previous post. The RNG libraries have the following two exports, one for the CPU implementation and one for the GPU implementation:
extern "C" __declspec(dllexport) void __cdecl GPU_RNG(float*, unsigned int, unsigned int);
extern "C" __declspec(dllexport) void __cdecl CPU_RNG(float*, unsigned int, unsigned int);
The first argument is a pointer to the memory location to hold the random numbers. The second argument is the size of the array and the last argument is the initial seed.
An important point of random number generation is selecting the seed value, because the same seed will give the same result. While there are many different techniques studied, I used my own method of combining the current time, CPU load and available physical memory with the help of Windows Management Instrumentation (WMI); it still does not perform well in multithreaded solutions, but it at least gives a better random start. The implementation is in the CPUHelper class of the RNGVisual application.
A Linear Congruential Generator (LCG) implementation using CUDA C
The first RNG project uses the native CUDA Runtime API to implement one of the oldest and best-known pseudorandom number generator algorithms, the LCG. An LCG is fast and requires minimal memory to retain state, which makes it very efficient for simulating multiple independent streams. But LCGs have disadvantages and should not be used for applications where high-quality randomness is critical. In fact, the simple example I implemented repeats numbers within a very short period and should be enhanced with methods like those explained in GPU Gems 3 (p. 374).
The LCG is as simple as the formula below; starting with a seed Xn, the next random number is determined as (a * Xn + c) mod m.
Xn+1 = (a * Xn + c) mod m
where 0 < m, 0 < a < m, 0 <= X0 < m and 0 <= c < m
Below is a sequential implementation of the LCG algorithm, which generates 100 pseudorandom numbers:
random[0] = 123; // some initial seed
for(int i = 1; i < 100; i++)
    random[i] = (a * random[i - 1] + c) % m;
The CUDACRNGLib project has a very basic implementation of the LCG that distributes the work onto 256 threads. Because the same seed will result in the same random number sequence, we first generate a different random seed for every thread. When the kernel below is executed, every thread generates one section of the random number sequence:
__global__ void RNGKernel(float * randomNumbers, unsigned int numberCount,
                          unsigned int * seeds, unsigned int c,
                          unsigned int a, unsigned int M)
{
    int startIdx = threadIdx.x * numberCount;
    unsigned int x = seeds[threadIdx.x];
    for(int i = 0; i < numberCount; i++)
    {
        x = (a * x + c) % M; // M is shown for purpose of example
        randomNumbers[startIdx + i] = (float)x / (float)M; // normalize
    }
}
As I mentioned before, this implementation is simplified to give you an idea how to start using CUDA C. It even uses a static block count of one and a thread count of 256. For production code it is better to launch many blocks of threads. You may want to check out the better implementation in GPU Gems 3 (p. 377), or Arnold and Meel's implementation, which also provides better randomness.
A Thrust C++ template library implementation
The Thrust library default random engine ( default_random_engine ) is a Linear Congruential Generator ( this may change in the future ) with a = 48271, c = 0 and m = 2^31 - 1. Because c equals zero, the algorithm is also called a multiplicative congruential generator or Lehmer RNG.
The ThrustRNGLib has a very basic implementation of the Thrust default random engine: it runs the following functor to generate each random number. A functor is a C++ class that overloads operator() and can therefore be called like an ordinary function. Thrust provides unary_function and binary_function base classes. Below I used unary_function because my functor takes one argument:
struct RandomNumberFunctor : public thrust::unary_function<unsigned int, float>
{
    unsigned int mainSeed;

    RandomNumberFunctor(unsigned int _mainSeed)
        : mainSeed(_mainSeed) {}

    __host__ __device__
    float operator()(unsigned int threadIdx)
    {
        unsigned int seed = hash(threadIdx) * mainSeed;

        // seed a random number generator
        thrust::default_random_engine rng(seed);

        // create a mapping from random numbers to [0,1)
        thrust::uniform_real_distribution<float> u01(0, 1);

        return u01(rng);
    }
};
Using Thrust to utilize the GPU is the simplest way to go. You can see the difference by comparing the GPU_RNG below with the CUDACRNGLib GPU_RNG implementation. Using CUDA C gives you full control of the toolkit, but it comes at the price of writing more code.
extern void GPU_RNG(float * h_randomData, unsigned int dataCount, unsigned int mainSeed)
{
    // allocate device vector
    thrust::device_vector<float> d_rngBuffer(dataCount);

    // generate random numbers
    thrust::transform(thrust::counting_iterator<int>(0),
                      thrust::counting_iterator<int>(dataCount),
                      d_rngBuffer.begin(),
                      RandomNumberFunctor(mainSeed));

    // copy the random numbers back to the host
    thrust::copy(d_rngBuffer.begin(), d_rngBuffer.end(), h_randomData);
}
Another good part of Thrust is that every algorithm (except copy) exists for the CPU as well as for the GPU. The CPU implementation is another three lines of code, this time using thrust::generate with the C++ standard library rand function, followed by thrust::transform to normalize the integer result into a float with the lambda expression [](float n) { return n / (RAND_MAX + 1); }. I used a lambda expression instead of a functor to show this possibility as well. Especially in the upcoming Microsoft C++ AMP, lambda expressions play a big role. Lambda expressions are handy in C++ as well as in C#, but they come with the price of giving up unit testing of the inline expression.
An NVIDIA CURAND implementation
The NVIDIA CURAND library provides an API for simple and efficient generation of high-quality pseudorandom and quasirandom numbers. The CURAND library default pseudorandom engine is XORWOW, an implementation of the Xorshift RNG (page 5), and it produces higher quality random numbers than an LCG.
In order to start using CURAND, you only need to include the curand.h header and add curand.lib to the additional dependencies in the Linker settings.
Like the Thrust implementation in ThrustRNGLib, the CURANDRNGLib has a very basic implementation that runs the following main code to generate a series of random numbers:
...
// create a new generator
curandCreateGenerator(&m_prng, CURAND_RNG_PSEUDO_DEFAULT);
// set the generator options
curandSetPseudoRandomGeneratorSeed(m_prng, (unsigned long) mainSeed);
// generate random numbers
curandGenerateUniform(m_prng, d_randomData, dataCount);
...
Besides the curandCreateGenerator method, CURAND provides the curandCreateGeneratorHost method to generate random numbers on the CPU instead of the GPU. Therefore the CPU part is as simple as the GPU part.
A Mersenne Twister implementation using CUDA C
Mersenne Twister ( MT ) is an algorithm developed by Makoto Matsumoto that provides very fast generation of high-quality random numbers. ( MT Home Page ) A common Mersenne Twister implementation uses an LCG to generate its seed data.
There are two MT variants originally suited to CUDA: TinyMT and Mersenne Twister for Graphics Processors (MTGP). But I adapted part of the code from the NVIDIA CUDA Toolkit 4.0 MersenneTwister sample, which uses the original code from Makoto Matsumoto anyway.
The Mersenne Twister RNG is perhaps the most complicated of the four implementations I provided, but unlike with CURAND, you can look into the algorithm. The MersenneTwisterRNG.cpp file in the MersenneTwisterRNGLib project is the entry point to the library and exports the same GPU_RNG and CPU_RNG methods as the other libraries. I simplified the host code as much as possible and placed all GPU logic into the GPURNG.cu file. The remaining host code can be seen below:
extern void GPU_RNG(float * h_randomData, unsigned int dataCount, unsigned int mainSeed)
{
    float * d_randomData = 0;

    // load GPU twisters configuration
    if(!loadMTGPU("MersenneTwister.dat"))
        return;
    seedMTGPU(mainSeed);

    // find the rounded up data count,
    // because the generator generates in multiples of 4096
    int numbersPerRNG = iAlignUp(iDivUp(dataCount, MT_RNG_COUNT), 2);
    int randomDataCount = MT_RNG_COUNT * numbersPerRNG;

    // allocate device memory
    size_t randomDataSize = randomDataCount * sizeof(float);
    cudaMalloc((void**)&d_randomData, randomDataSize);

    // call the generator
    RNGOnGPU(32, 128, d_randomData, numbersPerRNG);

    // make sure all GPU work is done
    cudaDeviceSynchronize();

    // copy memory back to the host
    cudaMemcpy(h_randomData, d_randomData, dataCount * sizeof(float), cudaMemcpyDeviceToHost);

    // free device memory
    cudaFree(d_randomData);
}
About the Implementations
The pseudorandom number generators I provided in this post are widely used algorithms, but I'm still not guaranteeing that any of them will perform well enough in your particular solution. In fact, I left some generators weak on purpose to highlight the core algorithm and provide variation in randomness. Also, for the sake of clarity there is almost no exception handling or logging implemented. This is not an API; my goal is to give you a high level idea of how you can use Thrust, CUDA C and CURAND to generate pseudorandom numbers. Therefore, it is important that you research the algorithms online and refactor the code for your own use.
Some of the code is taken from the original NVIDIA samples and carries their copyright notice; the rest is my code, which is free to use without any restriction or obligation.
Conclusion
As in every field of computer science, there are many ways to solve a problem and the possibilities expand exponentially. I just scratched the surface of the possibility of using CUDA to add pseudorandom number generation to your C# application. I hope this post will help you in your project.
Faster Sorting in C# by Utilizing GPU with NVIDIA CUDA
Introduction
In this post I would like to give an entry-level example of how you can use NVIDIA CUDA technology to achieve better performance in C# with the minimum possible amount of code. This is a very specific scenario where the main application is in C#, the sorting algorithm is provided in C++ as an API, and you have an NVIDIA CUDA capable graphics card. I hope this basic sample will give you some insight and encourage you to invest in parallel technologies.
Background
Parallel Computing
For more than two decades CPUs drove rapid performance increases and price reductions. With that, we got used to our code receiving a speed boost whenever it ran on a new piece of hardware. As you may recall, this relentless performance improvement is described by two words: Moore's law. While the continuation of Moore's law is debated on many platforms because of the slowdown that started around 2003, many of us have already jumped on the parallel bandwagon using systems based on SIMD or MIMD models. With parallel computing, application logic suitable for parallelization will most likely run faster without waiting for tomorrow's hardware. (It is "most likely" because there are theoretical limits explained by Amdahl's law.)
GPGPU
A GPU is a massively multithreaded processor which supports thousands of concurrent threads. Using the GPU for general purpose scientific and engineering computing is called GPGPU. NVIDIA is one of the leading GPU manufacturers, and with the architecture named CUDA they provide fully programmable GPU technology for high-level languages like C, C++ and Fortran. NVIDIA hardware uses the SIMT (single-instruction multiple-thread) execution model rather than the popular SIMD model. I gave a SIMD example in one of my previous posts, showing how to utilize an SSE4 CPU instruction within C#; in this post we will use the SIMT execution model through NVIDIA CUDA. (Under the cover, in SIMD multiple data elements for a single instruction are collected and packed into a single register. In SIMT, by contrast, all threads process data in their own registers.)
Thrust
Writing code against the CUDA API is very powerful in terms of controlling the hardware, but it requires you to handle memory and kernel execution yourself, which is outside the scope of this post. Therefore I'll use a high-level programming interface named Thrust. As described at http://code.google.com/p/thrust/, Thrust is a CUDA library of parallel algorithms with an interface resembling the C++ Standard Template Library (STL). The Thrust library provides many fundamental parallel primitives like sorting, prefix sums, reductions, transformations etc.
Thrust consists of header files and is distributed with CUDA 4.0 installation. Therefore, you don’t have to install it additionally.
Radix Sort
The reason I chose sorting is that it is a widely used fundamental computational building block for many problems. Radix sort is one of the oldest sorting algorithms and is very efficient for sorting small keys, with an algorithmic complexity of O(n). Radix sort is one of the two sorting algorithms Thrust provides, so Thrust fits our example very well. There is a paper about Designing Efficient Sorting Algorithms for Manycore GPUs which explains how the algorithm works on the GPU.
Setup
We will be using Visual Studio 2010 to host one C# application and one C++ library. The only additional API is the NVIDIA CUDA Toolkit 4.0. You will need to install the CUDA Toolkit and a supported graphics device driver from the same link.
Also do not forget to install the Build Customization BUG FIX Update from the same link or from here.
Implementation
The solution consists of one C# console application and a C++ native DLL. The C++ native DLL exports a method which has the code to run the sort algorithm on the GPU. The C# code uses platform invoke to call the exported method of the C++ DLL.
For comparison purposes I'm also providing a C# implementation of radix sort. The C# implementation takes an array of 32 bit integers and processes it in three steps for every 8-bit part:
1) Creating histogram of data,
2) Running prefix sum(scan),
3) Adding the data to bin.
In some implementations the histogram is called count or difference, but here I used the name histogram because there is a very good CUDA histogram implementation I may write about in a future post. This step basically finds out how many times every byte value occurs in the array. Because the maximum byte value is 0xFF, the size of the histogram is 256, reflecting the maximum number of different values.
The second step runs a prefix sum over the histogram, so that every element comes to contain the sum of all elements to its left. Incidentally, there is a very efficient way to do prefix sums in a parallel environment, explained in the paper Efficient Parallel Scan Algorithms for GPUs, which you may find useful for understanding the logic behind the GPU implementation.
The last step places each value into the correct bin: the prefix sum entry at the value's byte index gives the destination index in the temporary array, and is incremented afterwards.
The code is as follows:
public static void CPURadixSort(int[] sourceData)
{
    int size = sourceData.Length;
    // temporary array for every byte iteration
    int[] tempData = new int[size];
    // histogram of the current byte
    int[] histogram = new int[256];
    // the prefix sum of the histogram
    int[] prefixSum = new int[256];

    unsafe
    {
        fixed (int* pTempData = tempData)
        {
            fixed (int* pSourceData = sourceData)
            {
                int* pTemp = pTempData;
                int* pBck;
                int* pSource = pSourceData;

                // loop through every byte of the 4 byte integer
                for (int byteIdx = 0; byteIdx < 4; byteIdx++)
                {
                    int shift = byteIdx * 8; // total bits to shift the numbers

                    // calculate the histogram of the current byte of the data
                    for (int i = 0; i < size; i++)
                        histogram[(pSource[i] >> shift) & 0xFF]++;

                    // calculate the prefix sum of the histogram
                    prefixSum[0] = 0;
                    for (int i = 1; i < 256; i++)
                        prefixSum[i] = prefixSum[i - 1] + histogram[i - 1];

                    // get the prefix sum array index of the current byte and
                    // increase it by one; that gives us the index where we
                    // want to place the data
                    for (int i = 0; i < size; i++)
                        pTemp[prefixSum[(pSource[i] >> shift) & 0xFF]++] = pSource[i];

                    // swap the pointers
                    pBck = pSource;
                    pSource = pTemp;
                    pTemp = pBck;

                    // reset the histogram
                    for (int i = 0; i < 256; i++)
                        histogram[i] = 0;
                }
            }
        }
    }
}
One more detail about the code: I used unsafe code in order to swap the pointers for the temporary buffer and the source data, which gains another half second. It is possible to remove that part and use tempData.CopyTo(sourceData, 0) instead.
When I ran the code above on an Intel Xeon 2.40 GHz CPU for 2^25 (33,554,432) random integers, it executed in 3.038 seconds.
The next step is to create the sorting code on the GPU and call it from our C# code. For this we need to create a DLL in C++ by using the default .NET project wizard. Please follow these steps:
Step 1) Create a Visual C++ project named “RadixSortGPU.dll” in Visual Studio 2010 by selecting DLL and Empty project in the wizard.
Step 2) Open Build Customization from the C++ project context menu, and check the CUDA 4.0 (.targets, .props) checkbox.
Step 3) Open the project properties, expand Configuration Properties, Linker, and select Input. Edit Additional Dependencies and add cudart.lib.
Step 4) Add a new empty source file ending with .cu and paste the following code into it:
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <thrust/detail/type_traits.h>

extern "C" __declspec(dllexport) void __cdecl GPURadixSort(int*, unsigned int);

extern void GPURadixSort(int* data, unsigned int numElements)
{
    thrust::device_vector<int> d_data(data, data + numElements);
    thrust::stable_sort(d_data.begin(), d_data.end());
    thrust::copy(d_data.begin(), d_data.end(), data);
}
This code is the only program logic you need to sort a given integer array on the GPU. For the sake of this example I did not include any exception handling or validation, but for any production code you will most probably want to add them. As you can see, getting started with Thrust is as simple as including the required headers.
The dllexport definition makes GPURadixSort usable from another library; this is how we will call the method from our C# code. The GPURadixSort method itself simply copies the data array into a thrust::device_vector container, which places the data into GPU global memory, sorts the data on the GPU, and copies it back from GPU memory into our input array. The begin() and end() methods are iterators pointing to the beginning and end of the underlying data. For more information on how to use vectors and iterators, please visit the Thrust Quick Start Guide.
Step 5) Assuming you already have a C# project, paste the following code into your class:
[DllImport("RadixSortGPU.dll", CallingConvention = CallingConvention.Cdecl)]
public static extern void GPURadixSort(
    [MarshalAsAttribute(UnmanagedType.LPArray, ArraySubType = UnmanagedType.I4)]
    int[] data,
    uint numElements);
This code provides the information needed to call the GPURadixSort function exported from our native DLL.
Step 6) That’s it. Now you can call GPURadixSort within your C# code to sort your array.
Based on my environment with an NVIDIA Quadro 600 GPU, the sort takes about 1.211 seconds, almost 3x faster than the CPU version. The Quadro 600 I used is a low-profile graphics processor that costs about $150. Its 96 cores, 1 GB memory and 25.6 GB/s memory bandwidth may sound powerful, but it is modest compared to GPUs like the Quadro 6000, with 448 cores, 6 GB memory and 144 GB/s memory bandwidth. There is also the Tesla family of GPUs for high-performance computing needs, which you may want to check out at NVIDIA.
Beyond
Below I provide some code that checks whether the GPU memory is large enough to hold the given array, one of the important validations you will want to perform if you target heterogeneous environments.
using namespace std;
...
extern void GPURadixSort(int* data, unsigned int numElements)
{
    int deviceID = 1;
    if (cudaSuccess == cudaGetDevice(&deviceID))
    {
        cudaDeviceProp devprop;
        cudaGetDeviceProperties(&devprop, deviceID);
        size_t totalMem = numElements * sizeof(int);
        if (devprop.totalGlobalMem < totalMem)
        {
            //Handle error
        }
    }
    ...
}
As I mentioned before, the powerful CUDA API resides beneath the high-level Thrust library. If Thrust does not have the functionality you are looking for, you can always use CUDA directly. It is not as easy as using Thrust, but if you follow the resources below, it will not take you long to start using this great technology.
– You can visit the Syllabus for the CUDA Certification Exam for a quick start.
– I strongly recommend the following books:
– Programming Massively Parallel Processors: A Hands-on Approach
– CUDA by Example: An Introduction to General-Purpose GPU Programming
– The following online courses are also very valuable:
– Programming Massively Parallel Processors with CUDA by Stanford University
– High-Performance Computing for Engineering Applications by the University of Wisconsin
Conclusion
We mostly use parallelism to gain performance when a program runs slowly or has to handle a huge amount of calculation. But I believe parallelism is also an important software aspect that allows us to use the underlying hardware wisely and gain an advantage in a competitive market, where two applications with similar functionality can only compete on software qualities like performance. There are many parallel technologies that will allow you to achieve higher performance with the same hardware. One of them is definitely NVIDIA CUDA; some others you may want to look into are TPL, PPL, C++ AMP, OpenCL and MPI.