C# language, C++, CodeProject, GPGPU, Technical

Faster Sorting in C# by Utilizing GPU with NVIDIA CUDA


On this post I would like to give an entry level example how you can use NVIDIA CUDA technology to achieve better performance within C# with minimum possible amount of code. This is a very specific scenario where the main application is in C#, and the sorting algorithm is provided in C++ as an API and you have an NVIDIA CUDA capable graphics card. I hope this basic sample will give you some insight and encourage you to invest into parallel technologies.


Parallel Computing
For more than two decades CPUs drove rapid performance increase and price reductions. With that, we may get used to having speed boost of our code running on a new piece of hardware. As you may recall, this relentless performance improvement is described with two words: Moore’s law. While the continuation of Moore’s law is debated on many platforms, because of the slow down starting at 2003, many of us already jumped to the parallel bandwagon using systems based on SIMD or MIMD models. With parallel computing, application logic suitable for parallelization will most likely run faster without waiting for tomorrow’s hardware. (It is “most likely” because there are theoretical limits explained by Amdahl’s law.)

A GPU is a massively multithreaded processor which supports thousands of concurrent threads. Using GPU to do general purpose scientific and engineering computing is named GPGPU. NVIDIA is one of the leading GPU manufacturers and they are providing fully programmable GPU technology for high-level languages like C, C++ and Fortran with the architecture named CUDA. The NVIDIA hardware uses the SIMT (single-instruction multiple-thread) execution model rather than the popular SIMD model. I gave a SIMD example in one my previous posts, how to utilize an SSE4 CPU instruction within C#. And in this post we will be using SIMT execution model through NVIDIA CUDA. (Under the cover, on SIMD multiple data elements for a single instruction are collected and packed into a single register. On the other hand on SIMT, all threads process data in their own registers. )

Writing code using CUDA API is very powerful in terms of controlling the hardware, but it requires to handle memory and execution which is outside the scope of this post. Therefore I’ll use a high-level programming interface named Thrust. As describe at http://code.google.com/p/thrust/, Thrust is a CUDA library of parallel algorithms with an interface resembling the C++ Standard Template Library (STL). Thrust library provides many fundamental programming logic like sorting, prefix-sums, reductions, transformations etc..

Thrust consists of header files and is distributed with CUDA 4.0 installation. Therefore, you don’t have to install it additionally.

Radix Sort
The reason I choose sorting is that it is a widely used fundamental computational building block to many problems. Radix sort is one of the oldest algorithms and is very efficient for sorting small keys sequentially with the algorithmic complexity of O(n). One of the two sorting algorithms Thrust provides is Radix Sort, therefore thrust fits our example very well. There is a paper about Designing Efficient Sorting Algorithms for Manycore GPUs, which explains how the algorithm works on the GPU.


We will be using Visual Studio 2010 to host one C# application and one C++ library. The only additional API is the NVIDIA CUDA Toolkit 4.0. You will need to install the CUDA Toolkit and a supported graphics device driver from the same link.
Also do not forget to install the Build Customization BUG FIX Update from the same link or from here.


The solution consist of one C# console application and a C++ native dll. The C++ native dll exports a method which has the code to run the sort algorithm on the GPU. The C# code uses platform invoke to call the exported method of the C++ dll.

For comparison purpose I’m also providing a C# implementation of Radix Sort. The C# implementation takes an array of 32 bit integers and process it in three steps for every 8 bit part:
1) Creating histogram of data,
2) Running prefix sum(scan),
3) Adding the data to bin.

On some implementations the histogram is named as count or difference, but here I used the name histogram, because there is very good CUDA implementation I may write about in a future post. It is basically finding out how many of every byte is in the array. Because the max byte value is 0xFF, therefore the size of the histogram is 256, reflecting the maximum number of different values.

The second step is to run prefix sum on the histogram, which is simply adding one value to the right neighbor in the array. In that way every element in the array will contain it’s left neighbor. Actually, there is a very efficient way to to prefix sum on parallel environment explained by the paper Efficient Parallel Scan Algorithms for GPUs. Which you may find usefull for understanding the logic behind the GPU implementation.

The last step is to place the value into the correct bin by using the prefix sum result at the value index to determine the temporary array index.

The code is as follows:

public static void CPURadixSort(int[] sourceData)
 int size = sourceData.Length;
 //temporary array for every byte iteration
 int[] tempData = new int[size]; 
 //histogram of the last byte 
 int[] histogram = new int[256]; 
 //The prefix sum of the histogram
 int[] prefixSum = new int[256]; 
  fixed (int* pTempData = tempData)
   fixed (int* pSourceData = sourceData)
    int* pTemp = pTempData;
    int* pBck;
    int* pSource = pSourceData;
    //Loop through every byte of 4 byte integer
    for (int byteIdx = 0; byteIdx < 4; byteIdx++)
     	int shift = byteIdx * 8; //total bits to shift the numbers
     	//Calculate histogram of the last byte of the data
     	for (int i = 0; i < size; i++)
     	 histogram[(pSource[i] >> shift) & 0xFF]++;
     	//Calculate prefix-sum of the histogram
     	prefixSum[0] = 0;
     	for (int i = 1; i < 256; i++)
     	 prefixSum[i] = prefixSum[i - 1] + histogram[i - 1];
     	//Get the prefix-sum array index of the last byte, increase 
        //it by one. That gives us the the index we want to place 
        //the data
     	for (int i = 0; i < size; i++)
     	 pTemp[prefixSum[(pSource[i] >> shift) & 0xFF]++] = pSource[i];
     	//Swap the pointers
     	pBck = pSource;
     	pSource = pTemp;
     	pTemp = pBck;
     	//reset the histogram
     	for (int i = 0; i < 256; i++)
     	 histogram[i] = 0;

One more detail about the code is that I used unsafe code in order to swap the pointers for the temporary buffer and source data to gain another half second. But, it is possible to remove that part and use the tempData.CopyTo(sourceData, 0) instead.

When I run the code above on an Intel Xeon 2.40 Ghz CPU for 2^25 (33,554,432) random integers, it executed in 3.038 seconds.

Next step is to create the sorting code on the GPU and call it within our C# code. For this we need to create a DLL in C++ by using the default .Net project wizard. Please follow the steps to acomplish this step:
Step 1) Create a Visual C++ console project named “RadixSortGPU.dll” in Visual Studio 2010 by selecting DLL and Empty project on the wizard,

Step 2) Open Build Customization from the C++ project context menu, and ceck the CUDA 4.0(.targets, .props) checkbox,

Step 3) Open the project properties, expand the Configuration Properties, Linker and select Input. Edit the additional dependencies and add cudart.lib.

Step 4) Add a new empty source file ending with .cu and paste the following code into it:

#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <thrust/detail/type_traits.h>

extern "C" __declspec(dllexport) void __cdecl GPURadixSort(int*, unsigned int);

extern void GPURadixSort(int* data, unsigned int numElements)
	thrust::device_vector<int> d_data(data, data + numElements);

	thrust::stable_sort(d_data.begin(), d_data.end());

	thrust::copy(d_data.begin(), d_data.end(), data);

This code is the only program logic you need to sort a given integer array on the GPU. For the sake of this example I did not include any exception handling and validation. But for any production code you most probably want to add exception handling and validation logic.

As you can see in the code, starting using Thrust is very simple by just including the required headers.

The dllexport definition makes the GPURadixSort usable from within another library. In this way we will be able to call this method from our C# library.

The GPURadixSort method itself is simply copying the data array into a container named thrust::device_vector, resulting the data to be placed into the GPU global memory, sorting the data on the GPU and copying back the data from the GPU memory to our input array. and
the begin() and end() methods are iterators pointing to the data at the beginning or end of the array. For more information how to use vectors and iterators please visit the Thrust Quick Start Guide.

Step 5) I’m assuming you already have a C# project you can paste the following code into your class,

[DllImport("RadixSortGPU.dll", CallingConvention = CallingConvention.Cdecl)]
public static extern void GPURadixSort(
 [MarshalAsAttribute(UnmanagedType.LPArray, ArraySubType = UnmanagedType.I4)]
 int[] data, uint numElements);

This code provides the information needed to call the GPURadixSort function exported from our native DLL.

Step 6) That’s it. Now you can call the GPURadixSort within your C# code to sort your array.

Based on my environment with a NVIDIA Quadro 600 GPU, the sort takes about 1.211 seconds, which is almost 3x faster than the CPU version. The NVIDIA Quadro 600 GPU I used is a low profile graphics processor which cost about $150. 96 cores, 1GB memory and 25.6 GB/s memory bandwidth maybe sound powerful, but it is low profile compared to GPUs like Quadro 6000, with 448 cores, 6 GB memory and 144 GB/s memory bandwidth. There are also the Tesla family GPUs for high performance computing needs you may want to check out at NVIDIA.

Below I provided some code to check if the GPU memory is enough to hold the given array, which is one of the important validations you want to perform if you target heterogeneous environments.

using namespace std;
extern void GPURadixSort(int* data, unsigned int numElements)
 int deviceID = -1;
 if (cudaSuccess == cudaGetDevice(&deviceID))
     cudaDeviceProp devprop;
     cudaGetDeviceProperties(&devprop, deviceID);
     unsigned int totalMem = numElements * sizeof(int);
     if (devprop.totalGlobalMem < totalMem)
        //Handle error

As I mentioned before, under the high-level Thrust library the powerful CUDA API resides. If Thrust does not have the functionality you are looking for, you can always use CUDA directly. It is not as easy as using Thrust, but if you will follow the resources below, it will not take you much to start using this great technology.


We are using parallelism mostly to gain performance when a program runs slow or we need to handle huge amount of calculation. But, I believe that parallelism is a very important software aspect allowing us to use the underlying hardware wisely and gain advantage in this competitive market, where two functionality wise similar applications can only compete on software qualities like performance. There are many parallel technologies which will allow you to achieve higher performance with the same hardware. One of them is definitely NVIDIA CUDA, and some others you want to look into are TPL, PPL, C++ AMP, CUDA, OpenCL, MPI.