Archive

Archive for the ‘C# language’ Category

Large Scale Machine Learning using NVIDIA CUDA

February 25, 2012 Leave a comment

Introduction

You may have heard about Stanford University's machine learning on-line course given by Prof. Andrew Ng in 2011; it was a great course with lots of real-world examples. During the course I realized that GPUs are a perfect fit for large scale machine learning problems. In fact, there are many examples of supervised and unsupervised learning all around the internet. Being a fan of both GPGPU and Machine Learning technologies, I came up with my own perspective on running machine learning algorithms with huge amounts of data on GPUs.

I recently presented this solution at the South Florida Code Camp 2012. There was a lot of interest in these two subjects; therefore I decided to share it on my blog. The example in this post is neither the only solution nor the best solution. I hope it will help you one day solve your own machine learning problem.

There are many concepts in machine learning, and in this post I'm only scratching the surface. If you already know about GPGPU and Machine Learning you can go straight to the source code at this link, download the Visual Studio 2010 projects and try it out.

I've also prepared the same example using CUBLAS with a vectorized implementation of the polynomial regression algorithm, but the CUBLAS example would require a more in-depth explanation. Therefore I'm posting this simplified implementation first. If you are interested in the CUBLAS implementation please let me know and I can send you that copy.

Background

Machine Learning

If you are already familiar with machine learning you can skip this brief introduction and jump directly to the Large Scale Machine Learning section. If you want to learn more about machine learning, please follow the links or check out the Stanford course I mentioned at the beginning.

Machine learning algorithms allow computers to recognize complex patterns. They focus on prediction, based on known properties learned from training data. We use machine learning algorithms dozens of times every day, maybe without knowing it: every time we get a book or movie recommendation or every time we do a web search. In 1959 Arthur Samuel described machine learning as the "field of study that gives computers the ability to learn without being explicitly programmed." It has been a while since machine learning was first introduced, and it is gaining popularity again with the rise of Big Data.

Figure 1 shows how some machine learning processes work. In phase 1, given a data set, a machine learning algorithm can recognize complex patterns and come up with a model. In most cases this phase is the bulk of the computation. In the second phase, any given data can be run through the model to make a prediction. For example, if you have a data set of house prices by size, you could let the machine learn from the data set and then predict the price of a house of any given size.


Figure 1

It does this by recognizing the function that defines the relation between the different features of the problem. A linear problem with two dimensions, like house price by size (the house size is the feature and the house price is the label data), can be expressed with the model f(x) = ax + b. Figure 2 shows how one feature can be used in a linear regression problem to predict new house prices. The term "hypothesis" was used in the Stanford course to describe the model.


Figure 2

Depending on the data set, more complex functions can be used. In Figure 3 you can see how the complexity can easily grow from a two-dimensional linear function to a polynomial over hundreds of dimensions. In a spam filtering problem the different features could be the words in the email; in a face recognition problem the features could be the pixels of the image. In the house price prediction example, the features are the properties of the house which affect the price, e.g. size, room count, floors, neighborhood, crime rate, etc.


Figure 3

There are many machine learning algorithms for different problem types. The most common groups of these algorithms are Supervised Learning and Unsupervised Learning. Supervised learning is used on problems where we can provide the output values for the learning algorithm. For example, the house price is the output value for a given set of house features; therefore house price prediction is a supervised learning problem. Data with these output values is called "labeled data". Unsupervised learning, on the other hand, does not require output values; patterns or hidden structures can be recognized from the feature data alone. For example, clustering social data to determine groups of people by interest does not require defining any output value; therefore it is an unsupervised learning problem.

Gradient Descent

In supervised learning problems, the machine can learn the model and come up with a hypothesis by evaluating the hypothesis with different parameter values and testing whether the result is close to the provided labels (calculating the error). Figure 4 shows how the training data is plotted and the error calculated. An optimization algorithm named Gradient Descent (Figure 5) can be used to find the optimum hypothesis. In this simple two-dimensional problem, the algorithm iteratively adjusts the values of “a” and “b” to minimize the total error.


Figure 4

The pseudo code below shows how the gradient descent algorithm in Figure 5 works:

loop until converged
    errors = 0
    for i = 1 to data.length
        fx = a * data[i] + b
        errors += (fx - labelData[i]) * data[i]
    end for
    a = a - learningRate * (1 / data.length) * errors
    (b is updated the same way, using the error term without the data[i] factor)
end loop


Figure 5 (From ml-class.org)
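
To make this concrete, here is a minimal sequential C++ sketch of the same idea for the two-parameter hypothesis f(x) = a*x + b. The function name, variable names and learning rate are illustrative only and are not taken from the sample project:

#include <vector>

// Minimal batch gradient descent for f(x) = a*x + b (illustrative sketch)
void GradientDescent(const std::vector<float>& x, const std::vector<float>& y,
                     float& a, float& b, float learningRate, int iterations)
{
    const float m = static_cast<float>(x.size());
    for (int it = 0; it < iterations; it++)
    {
        float errorA = 0.0f;
        float errorB = 0.0f;
        for (size_t i = 0; i < x.size(); i++)
        {
            float fx = a * x[i] + b;       // current hypothesis
            errorA += (fx - y[i]) * x[i];  // gradient term for a
            errorB += (fx - y[i]);         // gradient term for b
        }
        a -= learningRate * errorA / m;    // simultaneous update of both parameters
        b -= learningRate * errorB / m;
    }
}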

Large Scale Machine Learning

Machine learning problems become computationally expensive when the complexity (dimensions and polynomial degree) increases and/or when the amount of data increases. Especially with big data sources containing hundreds of millions of samples, the time to run the optimization algorithm increases dramatically. That's why we look for parallelization opportunities in the algorithms. The error summation of the gradient descent algorithm is a perfect candidate for parallelization. We could split the data into multiple parts and run gradient descent on these parts in parallel. In Figure 6 you can see how the data is split into four parts and fed into four different processors. In the next step the results are gathered together to run the rest of the algorithm.


Figure 6
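
To make the idea in Figure 6 concrete, here is a minimal host-side C++11 sketch using std::thread; the fixed four-way split and all names are illustrative only (the sample project does this step on the GPU with Thrust instead):

#include <thread>
#include <vector>

// Illustrative only: split the error summation of one gradient descent step
// across four host threads and combine the partial sums, as in Figure 6.
float ParallelErrorSum(const std::vector<float>& x, const std::vector<float>& y,
                       float a, float b)
{
    const int partCount = 4;
    std::vector<float> partialSums(partCount, 0.0f);
    std::vector<std::thread> workers;

    size_t partSize = x.size() / partCount;
    for (int p = 0; p < partCount; p++)
    {
        size_t begin = p * partSize;
        size_t end = (p == partCount - 1) ? x.size() : begin + partSize;
        workers.emplace_back([&, p, begin, end]() {
            float sum = 0.0f;
            for (size_t i = begin; i < end; i++)
                sum += ((a * x[i] + b) - y[i]) * x[i];  // error term of one sample
            partialSums[p] = sum;                       // each thread writes its own slot
        });
    }

    for (auto& w : workers)
        w.join();  // gather step: wait for all parts

    float total = 0.0f;
    for (float s : partialSums)
        total += s;
    return total;
}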

Clearly, this approach would speed up the machine learning computation by almost four times. But what if we had more cores and could split the data further? That is where GPUs step into the solution. With GPUs we can parallelize in two layers: across multiple GPUs and across the many cores in every GPU. Assuming a configuration with 4 GPUs of 512 cores each, we could split each of the four parts further into 512 pieces. Figure 7 shows this configuration along with the parallelized part on the GPU cores.


Figure 7

GPGPU

Utilizing GPUs to achieve dramatic increases in performance for general purpose scientific and engineering computing is named GPGPU. NVIDIA provides a parallel computing platform and programming model named CUDA to develop GPGPU software in C, C++ or Fortran that can run on any CUDA-capable NVIDIA GPU. NVIDIA CUDA comes with many high level APIs and libraries like basic linear algebra, FFT, imaging etc., allowing you to concentrate on the business logic rather than re-writing well known algorithms.

You can visit my previous blog posts where I’ve explained how to use NVIDIA CUDA capable GPUs to perform massively parallel computations. The examples include Monte Carlo simulation, random number generators and sorting algorithms.

House Price Prediction Example

In this post I'll show you how you can implement house price prediction on NVIDIA CUDA. Given a house price data set based on bedrooms, square feet and year built, it is possible to let the machine learn from this data set and provide us with a model for future predictions. Because the error calculation part of the Gradient Descent algorithm is highly parallelizable, we can offload it to the GPUs.

The machine learning algorithm in this example is Polynomial Regression, a form of the well known Linear Regression algorithm. In Polynomial Regression the model is fit with a higher order polynomial function. In our case we will be using bedrooms, square feet, year built, the square root of bedrooms, the square root of square feet, the square root of year built and the product of bedrooms and square feet. The reason we added the four polynomial terms to the function is the nature of our data. Fitting the curve correctly is the main idea behind building a model for our machine learning problem. Logically, house prices do not increase with these features in a linear or exponential way, and they don't drop after a certain peak. Therefore the graph looks more like a square root function, where the house price increases less and less as a feature keeps increasing.
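
As a rough illustration, the seven features for one house could be assembled as below; the function and parameter names are made up for this post, and the sample project actually keeps the training data in flat float arrays:

#include <cmath>

// Illustrative only: build the seven polynomial features for one house.
// x1..x7 correspond to the hypothesis terms described above.
void BuildFeatures(float bedrooms, float squareFeet, float yearBuilt, float features[7])
{
    features[0] = bedrooms;               // x1
    features[1] = squareFeet;             // x2
    features[2] = yearBuilt;              // x3
    features[3] = std::sqrt(bedrooms);    // x4
    features[4] = std::sqrt(squareFeet);  // x5
    features[5] = std::sqrt(yearBuilt);   // x6
    features[6] = bedrooms * squareFeet;  // x7
}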

Finding the right polynomial terms is very important for the success of the machine learning algorithm: a very complex, tightly fitting function would generate too specific a model and end up overfitting, while a very simple function, like a straight line, would generate too general a model and end up underfitting. Therefore we use additional methods, like adding a regularization term, to provide a better fit to the data. Figure 8 shows the gradient descent algorithm including the regularization term lambda.


Figure 8 (From ml-class.org)
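
For reference, the regularized update from the course can be written out in plain text as below, where m is the number of training samples; the bias term (the constant H) is typically not regularized:

theta_j = theta_j - learningRate * ( (1/m) * sum_i( (f(x_i) - y_i) * x_i_j ) + (lambda/m) * theta_j )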

Application Architecture

The sample application consists of a C++ native DLL named LR_GPULib for the machine learning implementation on the GPU, and a C# Windows application named TestLRApp for the user interface. The DLL implements Data Normalization and Polynomial Regression using Thrust, the high level parallel algorithms library for NVIDIA CUDA. I covered Thrust in more detail in my previous blog posts, therefore I'm not going into much detail here. Figure 9 shows the application architecture and the program flow, from loading the training data all the way down to making a prediction.


Figure 9

The application provides the UI shown below in Figure 10 to load the data, train, and make a prediction with a new data set. The UI also shows the hypothesis at the bottom of the dialog with all constants and features.


Figure 10

Implementation

The LR_GPU_Functors.cu file in the DLL contains the functors used as kernels by the Thrust methods. The LR_GPU.cu file in the DLL contains the normalization, learning and prediction methods. The Learn method accepts the training data and the label data, which are all the features and all the prices, in two float arrays. The first thing the Learn method does is allocate memory, add the bias term and normalize the features. The reason we add the bias term is to simplify the gradient loop, and the reason we normalize the features is that the data ranges are too different; e.g. square feet is four digits and bedrooms is a single digit. By normalizing the features we bring them into the same range, roughly between zero and one. Normalization is also executed on the GPU, using the NormalizeFeatures method. Normalization requires the mean and standard deviation (std), therefore the mean and std are calculated first and provided to the NormalizeFeaturesByMeanAndStd method, which performs the mean normalization.

void NormalizeFeaturesByMeanAndStd(unsigned int trainingDataCount, float * d_trainingData, 
thrust::device_vector<float> dv_mean, thrust::device_vector<float> dv_std)
{
	//Calculate mean norm: (x - mean) / std
	unsigned int featureCount = dv_mean.size();
	float * dvp_Mean = thrust::raw_pointer_cast( &dv_mean[0] );
	float * dvp_Std = thrust::raw_pointer_cast( &dv_std[0] );
	FeatureNormalizationgFunctor featureNormalizationgFunctor(dvp_Mean, dvp_Std, featureCount);
	thrust::device_ptr<float> dvp_trainingData(d_trainingData);
	thrust::transform(thrust::counting_iterator<int>(0),
		thrust::counting_iterator<int>(trainingDataCount * featureCount),
		dvp_trainingData, dvp_trainingData, featureNormalizationgFunctor);
}

The normalization code running on the GPU is implemented in the FeatureNormalizationgFunctor functor, which simply calculates (data - mean) / std in parallel for every element of the data, as seen below:

...
  __host__ __device__
  float operator()(int tid, float trainingData)
  {
	  int columnIdx = tid % featureCount;
	  float fnorm = trainingData - meanValue[columnIdx];
	  if (stdValue[columnIdx] > 0.0)
		fnorm /= stdValue[columnIdx];
	  return fnorm;
  }
...

In the next step of the Learn method, gradient descent is calculated within the for(int i = 0; i < gdIterationCount; i++) loop. As I mentioned before, the error calculation part of gradient descent is executed in parallel, but the rest is calculated sequentially. thrust::transform is used with the TrainFunctor to calculate f(x) - y in parallel for every sample. f(x) is simply the hypothesis A*x1 + B*x2 + C*x3 + D*x4 + E*x5 + F*x6 + G*x7 + H, where x1 through x7 are the features (x1 = bedrooms, x2 = square feet, x3 = year built, x4 = square root of bedrooms, x5 = square root of square feet, x6 = square root of year built and x7 = the product of bedrooms and square feet) and A through H are the constants which gradient descent will find. This is shown with the green square in Figure 11. The TrainFunctor code snippet and its usage are shown below:


Figure 11 (From ml-class.org)

...
__host__ __device__
  float operator()(int tid, float labelData)
  {
		float h = 0;
		for (int f = 0; f < featureCount; f++)
			h += hypothesis[f] * trainingData[tid * featureCount + f];
		return h - labelData;
  }
...
...
thrust::transform(thrust::counting_iterator<int>(0), 
	thrust::counting_iterator<int>(trainingDataCount),  
	dv_labelData.begin(), dv_costData.begin(), tf);
...

thrust::transform_reduce is used with the TrainFunctor2 to apply the features to the error result and sum them all up. This is shown with the code snippet below and the red square in Figure 11. The rest of the Learn method calculates the gradient descent part marked with the blue square in Figure 11.

float totalCost = thrust::transform_reduce(thrust::counting_iterator<int>(0), 
	thrust::counting_iterator<int>(trainingDataCount),  tf2, 0.0f, thrust::plus<float>());

Once gradient descent converges, the constants A through H of the hypothesis are returned back to TestLRApp in the result array.

As you may guess, the prediction works by plugging the constants and new sample data into the hypothesis. This is done using the Predict method in the LR_GPULib library. As seen below, the Predict method normalizes the given feature set and calculates the hypothesis using the constants and the normalized data with the help of the PredictFunctor. The result is the predicted house price for the given features.

...
	NormalizeFeaturesByMeanAndStd(testDataCount, pdv_testData, dv_mean, dv_std);

	//Predict
	PredictFunctor predictFunctor(pdv_testData, pdv_hypothesis, featureCount);
	thrust::transform(thrust::counting_iterator<int>(0), 
		thrust::counting_iterator<int>(testDataCount), dv_result.begin(), predictFunctor);
...
struct PredictFunctor : public thrust::unary_function<int, float>
{
	float * testData;
	float * hypothesis;
	unsigned int featureCount;

	PredictFunctor(float * _testData, float * _hypothesis, unsigned int _featureCount) 
		: testData(_testData), hypothesis(_hypothesis), featureCount(_featureCount)
	{}

  __host__ __device__
  float operator()(int tid)
  {
	  float sum = 0;
	  for(int i = 0; i < featureCount; i++)
		sum += testData[tid * featureCount + i] * hypothesis[i];
	  return sum;
  }
};

Conclusion

GPGPU, Machine Learning and Big Data are three rising fields in the IT industry. There is much more to these fields than what I'm covering in this post. The deeper I get into them, the more I see how well they fit together. I hope this sample gave you a basic idea, and maybe just one perspective, of how you can easily use NVIDIA CUDA on machine learning problems. As with any other software solution, this example is not the only way to do polynomial regression for house price prediction with GPUs. In fact, an enhancement would be supporting multiple GPUs and splitting the data set into more parts.

Massively Parallel RNG using CUDA C, Thrust and C#

September 18, 2011 2 comments

Introduction

In this post I'll give you some simple examples of how to use the massively parallel GPU in your computer to generate uniformly distributed pseudo-random numbers. Why the GPU? Because it is orders of magnitude faster than the CPU, does not occupy your CPU time, is already in all computers, and for many other reasons I mentioned in my previous post. While there are maybe hundreds of ways to generate pseudo-random numbers, I only cover four ways to do it on NVIDIA cards using CUDA related APIs:

1) A Basic Linear Congruential Generator (LCG) implementation using CUDA C
2) A Thrust C++ template library implementation
3) An NVIDIA CURAND implementation
4) A Mersenne Twister implementation using CUDA C

In order to demonstrate how to utilize the GPU, the implementations are provided as DLLs and used from a C# sample application. There are many other APIs and ways worth talking about to utilize your GPU within your C# application, but this post's scope is limited to the subjects I mentioned above. I suggest you visit http://gpgpu.org/, if you have not already, to see the endless possibilities in this area.

While I was preparing these samples I saw that visualizing the data is very important for understanding the algorithms. Therefore, I used Microsoft WPF in C# to visualize the generated random numbers. You can use your own application and copy the classes under the RNGLibs folder.

All code can be downloaded from this link: RNGTests09182011.zip.

Background

About Random Number Generators (RNG)

The generation of random numbers is important in many applications like simulations, cryptography, sampling and especially statistics. A sequence of numbers is random when it does not have a recognizable pattern in it, or in other words, when it is non-deterministic. Although non-deterministic random numbers are ideal, computer generated, deterministic random numbers can be statistically "random enough". These numbers are called pseudo-random numbers and can have easily identifiable patterns if the algorithm is not chosen wisely. ( http://en.wikipedia.org/wiki/Pseudorandom_number_generator )

There are many pseudo-random number generators and also many different implementations of them for sequential and parallel environments. ( http://en.wikipedia.org/wiki/List_of_pseudorandom_number_generators ) In this post I used only Linear Congruential Generators, Xorshift and the Mersenne Twister. Therefore, I explain only these three algorithms, but you can use CUDA to develop other RNGs as well.

Thrust

As I mentioned in my previous post, writing code against the CUDA API is very powerful in terms of controlling the hardware, but there are high level libraries like the Thrust C++ template library which provide many fundamental building blocks like sorting, prefix-sums, reductions, transformations, etc. The best part is that Thrust consists only of header files and is distributed with the CUDA 4.0 installation.

Projects

I used Visual Studio 2010 to host one C# Windows application and native C++ DLLs for the RNG implementations, as seen in the solution structure below:

  • RNGVisual (Main C# application)
  • CUDACRNGLib (CUDA C RNG implementation)
  • CURANDRNGLib (CURAND RNG implementation)
  • ThrustRNGLib (Thrust RNG Implementation)
  • MersenneTwisterRNGLib (Mersenne Twister RNG implementation)

The only additional API is the NVIDIA CUDA Toolkit 4.0, which you will need to install along with a supported graphics device driver from the same link. Also do not forget to install the Build Customization BUG FIX Update from the same link or from here.

Once the CUDA Toolkit is installed, creating CUDA enabled projects is really simple. For those who are not familiar with creating native C++ CUDA enabled projects, please follow the steps below to create one:

  • Create a Visual C++ console project in Visual Studio 2010 by selecting DLL and Empty project on the wizard,
  • Open Build Customization from the C++ project context menu, and check the CUDA 4.0(.targets, .props) checkbox,
  • Open the project properties, expand the Configuration Properties, Linker and select Input. Edit the additional dependencies and add cudart.lib.
  • Add a new empty source file ending with .cu.

Implementation

WPF Application

The RNGVisual C# WPF application provides visualization of the random numbers in 2D and 3D. It allows you to select an RNG algorithm (.Net, CUDA C, CURAND, Thrust or Mersenne Twister) and set some display parameters and processor parameters. With any number count below 10K, all RNGs finish in about one millisecond and most of the time is spent drawing the squares to the screen. Therefore, the displayed time should not confuse you in terms of performance comparison. You can run the algorithms with 100K numbers without the visualization and see the difference on your hardware. But please be aware that it is better to use CUDA events with cudaEventRecord to time GPU execution more precisely.
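
For reference, CUDA event based timing looks roughly like the minimal sketch below (error checking omitted):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
//... the kernel launches or CUDA calls to be timed ...
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

float elapsedMs = 0.0f;
cudaEventElapsedTime(&elapsedMs, start, stop); //elapsed GPU time in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);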

RNGVisual


RNGVisual implements various proxy classes which use platform invoke to call the RNG methods exported from the native C++ DLLs. I used the same export and import technique in my previous post. The RNG libraries have the following two exports, one for the CPU implementation and one for the GPU implementation:

extern "C" __declspec(dllexport) void __cdecl
    GPU_RNG(float*, unsigned int, unsigned int);

extern "C" __declspec(dllexport) void __cdecl
    CPU_RNG(float*, unsigned int, unsigned int);

The first argument is a pointer to the memory location to hold the random numbers. The second argument is the size of the array and the last argument is the initial seed.

An important point of random number generation is selecting the seed value, because the same seed will give the same result. While many different techniques have been studied, I used my own method of combining the current time, CPU load and available physical memory with the help of Windows Management Instrumentation (WMI); it still does not perform well in multi-threaded solutions, but it at least gives a better random start. The implementation is in the CPUHelper class of the RNGVisual application.

A Linear Congruential Generator (LCG) implementation using CUDA C

The first RNG project uses the native CUDA Runtime API to implement one of the oldest and best-known pseudo-random number generator algorithms, the LCG. LCG is fast and requires minimal memory to retain state. Therefore, LCG is very efficient for simulating multiple independent streams. But LCGs have some disadvantages and should not be used in applications where high-quality randomness is critical. In fact, the simple example I implemented repeats numbers over a very short period and should be enhanced with methods like those explained in GPU Gems 3 (37-4).

LCG is as simple as the formula below; starting with a seed (Xn), the next random number is determined by (a * seed + c) mod m.
Xn+1 = (a * Xn + c) (mod m)
where 0 < m, 0 < a < m, 0 <= X0 < m and 0 <= c < m

Below is a sequential implementation of the LCG algorithm, which generates 100 pseudo-random numbers:

unsigned int a = 1103515245, c = 12345, m = 1u << 31; //example LCG constants, for illustration only
unsigned int random[100];
random[0] = 123; //some initial seed
for (int i = 1; i < 100; i++)
    random[i] = (a * random[i - 1] + c) % m;
 


The CUDACRNGLib project has a very basic implementation of LCG, distributing the work onto 256 threads. Because the same seed would result in the same random numbers, we first generate a different random seed for every thread. When the kernel below is executed, every thread generates one section of the random number sequence:

__global__ void RNGKernel(float * randomNumbers, unsigned int numberCount,
    unsigned int * seeds, unsigned int c, unsigned int a, unsigned int M) 
{ 
    int startIdx = threadIdx.x * numberCount; 
    unsigned int x = seeds[threadIdx.x]; 
    for(int i=0; i < numberCount; i++) { 
        x = (a * x + c) % M; //M is shown for purpose of example 
        randomNumbers[startIdx + i]= (float)x / (float)M; //normalize  
    } 
} 

As I mentioned before, this implementation is simplified to give you an idea of how you can start using CUDA C. It even has a static block count of one and a thread count of 256. If you plan to go for production code, it is better to launch many blocks of threads, as sketched below. You may want to check out a better implementation in GPU Gems 3 (37-7) or check out Arnold and Meel's implementation, which also provides better randomness.
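
A multi-block variant of the kernel above could look roughly like the sketch below; the seeds array would then need one entry per global thread, and the host launch would become RNGKernel<<<blockCount, threadsPerBlock>>>(...). This is illustrative only, not the code shipped in CUDACRNGLib:

__global__ void RNGKernelMultiBlock(float * randomNumbers, unsigned int numberCount,
    unsigned int * seeds, unsigned int c, unsigned int a, unsigned int M)
{
    //global thread index across all blocks
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int startIdx = tid * numberCount;
    unsigned int x = seeds[tid];
    for (int i = 0; i < numberCount; i++) {
        x = (a * x + c) % M;
        randomNumbers[startIdx + i] = (float)x / (float)M; //normalize
    }
}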

A Thrust C++ template library implementation

The Thrust library's default random engine ( default_random_engine ) is a Linear Congruential Generator ( this may change in the future ) with a = 48271, c = 0 and m = 2^31 - 1. Because c equals zero, the algorithm is also called the multiplicative congruential method or Lehmer RNG.

The ThrustRNGLib has a very basic implementation of the Thrust default random engine, running the following functor to generate one random number. A functor is a class in C++ that overloads operator() and can therefore be called like an ordinary function. Thrust provides unary_function and binary_function base types. Below I used unary_function because my functor requires one argument to be passed into the function:

struct RandomNumberFunctor : 
    public thrust::unary_function<unsigned int, float>
{
    unsigned int mainSeed;
    RandomNumberFunctor(unsigned int _mainSeed) : 
        mainSeed(_mainSeed) {}

    __host__ __device__
        float operator()(unsigned int threadIdx) 
    {
        unsigned int seed = hash(threadIdx) * mainSeed;

        // seed a random number generator
        thrust::default_random_engine rng(seed);

        // create a mapping from random numbers to [0,1)
        thrust::uniform_real_distribution<float> u01(0,1);

        return u01(rng);
    }
};

Using Thrust to utilize the GPU is the simplest way to go. You can see the difference by comparing the GPU_RNG below to the CUDACRNGLib GPU_RNG implementation. Using CUDA C gives you full control of the toolkit, but it comes at the price of writing more code.

extern void GPU_RNG(float * h_randomData, unsigned int dataCount, unsigned int mainSeed)
{
    //Allocate device vector
    thrust::device_vector<float> d_rngBuffer(dataCount);

    //generate random numbers
    thrust::transform(thrust::counting_iterator<int>(0), 
        thrust::counting_iterator<int>(dataCount), 
        d_rngBuffer.begin(), RandomNumberFunctor(mainSeed));

    //copy the random mask back to host
    thrust::copy(d_rngBuffer.begin(), d_rngBuffer.end(), h_randomData);
}

Another good part of Thrust is that every implementation (except copy) exists for the CPU as well as for the GPU. The CPU implementation is also just another three lines of code, this time using thrust::generate to generate the random numbers with the C++ standard library rand method and then thrust::transform to normalize the integer result into a float with the help of the [](float n) {return n / (RAND_MAX + 1);} lambda expression. I used a lambda expression instead of a functor to show you this possibility as well; a sketch of this CPU path follows below. Especially in the upcoming Microsoft C++ AMP, lambda expressions play a big role. Lambda expressions are handy in C++ as well as in C#, but they come at the price of giving up unit testing of the inline expression.
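
Reconstructed from the description above (so the actual ThrustRNGLib code may differ in details), the CPU path could look roughly like this:

#include <cstdlib>
#include <thrust/host_vector.h>
#include <thrust/generate.h>
#include <thrust/transform.h>

extern void CPU_RNG(float * h_randomData, unsigned int dataCount, unsigned int mainSeed)
{
    srand(mainSeed);

    //fill a host vector with raw rand() results
    thrust::host_vector<float> h_rngBuffer(dataCount);
    thrust::generate(h_rngBuffer.begin(), h_rngBuffer.end(), rand);

    //normalize the integer results into [0,1) with a lambda expression
    thrust::transform(h_rngBuffer.begin(), h_rngBuffer.end(), h_randomData,
        [](float n) { return n / (RAND_MAX + 1.0f); });
}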

An NVIDIA CURAND implementation

The NVIDIA CURAND library provides an API for simple and efficient generation of high-quality pseudorandom and quasirandom numbers. The CURAND library's default pseudorandom engine is XORWOW, an implementation of the Xorshift RNG (page 5), and it produces higher quality random numbers than an LCG. In order to start using CURAND, you only need to include the curand.h header and add curand.lib to the additional dependencies in the Linker settings.

Like the ThrustRNGLib Thrust implementation, the CURANDRNGLib has a very basic implementation, running the following main code to generate a series of random numbers:

....
    //Create a new generator
    curandCreateGenerator(&m_prng, CURAND_RNG_PSEUDO_DEFAULT);
    //Set the generator options
    curandSetPseudoRandomGeneratorSeed(m_prng, (unsigned long) mainSeed);
    //Generate random numbers
    curandGenerateUniform(m_prng, d_randomData, dataCount);
....
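
For context, a more complete version of the GPU path could look like the sketch below; the allocation, copy-back and cleanup details are reconstructed here for illustration and may differ from the actual CURANDRNGLib code:

#include <cuda_runtime.h>
#include <curand.h>

//Illustrative sketch of the full GPU path with CURAND (no error handling)
extern void GPU_RNG(float * h_randomData, unsigned int dataCount, unsigned int mainSeed)
{
    float * d_randomData = 0;
    cudaMalloc((void**)&d_randomData, dataCount * sizeof(float));

    curandGenerator_t prng;
    curandCreateGenerator(&prng, CURAND_RNG_PSEUDO_DEFAULT);
    curandSetPseudoRandomGeneratorSeed(prng, (unsigned long long) mainSeed);
    curandGenerateUniform(prng, d_randomData, dataCount);

    //copy the generated numbers back to the host
    cudaMemcpy(h_randomData, d_randomData, dataCount * sizeof(float), cudaMemcpyDeviceToHost);

    curandDestroyGenerator(prng);
    cudaFree(d_randomData);
}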

CURAND provides the curandCreateGeneratorHost method besides the curandCreateGenerator method, to generate random numbers on the CPU instead of the GPU. Therefore the CPU part is as simple as the GPU part.

A Mersenne Twister implementation using CUDA C

Mersenne Twister ( MT ) is an algorithm developed by Makoto Matsumoto and provides very fast generation of high-quality random numbers. ( MT Home Page ) A common Mersenne Twister implementation uses an LCG to generate the seed data. Originally there are two MT algorithms suitable for use with CUDA: TinyMT and Mersenne Twister for Graphics Processors (MTGP). But I adapted part of the code from the NVIDIA CUDA Toolkit 4.0 MersenneTwister sample, which uses the original code from Makoto Matsumoto anyway.

The Mersenne Twister RNG is maybe the most complicated of the four RNG implementations I provided, but with it you can look into the algorithm, unlike with CURAND. The MersenneTwisterRNG.cpp file in the MersenneTwisterRNGLib project is the entry point to the library and exports the same GPU_RNG and CPU_RNG methods as the other libraries. I simplified the host code as much as possible and placed all the GPU logic into the GPURNG.cu file. The remaining host code can be seen below:

extern void GPU_RNG(float * h_randomData, unsigned int dataCount, 
    unsigned int mainSeed)
{
	float * d_randomData = 0;

	//load GPU twisters configuration
	if(!loadMTGPU("MersenneTwister.dat"))
		return;
    seedMTGPU(mainSeed);

	//find the rounded up data count 
    //because the generator generates in multiples of 4096
	int numbersPerRNG = iAlignUp(iDivUp(dataCount, MT_RNG_COUNT), 2);
	int randomDataCount = MT_RNG_COUNT * numbersPerRNG;

	//allocate device memory
	size_t randomDataSize = randomDataCount * sizeof(float);
	cudaMalloc((void**)&d_randomData, randomDataSize);


	//Call the generator
	RNGOnGPU(32, 128, d_randomData, numbersPerRNG);

	//Make sure all GPU work is done
	cudaDeviceSynchronize();

	//Copy memory back to the host
	cudaMemcpy(h_randomData, d_randomData, dataCount * sizeof(float), 
        cudaMemcpyDeviceToHost);

	//free device memory
	cudaFree(d_randomData);
}

About the Implementations

The pseudo-random number generators I provided in this post are widely used algorithms, but I'm still not guaranteeing that any of them will perform well enough in your particular solution. In fact, I left some generators poor on purpose, to point out the core algorithm and provide variation in randomness. Also, for the sake of clarity there is almost no exception handling or logging implemented. This is not an API; my goal is to give you a high level idea of how you can use Thrust, CUDA C and CURAND to generate pseudo-random numbers. Therefore, it is important that you research the algorithms on-line and re-factor the code for your own use.

Some of the code is taken from original NVIDIA samples and has their copyright notice on it; the rest is my code, which is free to use without any restriction or obligation.

Conclusion

As in every field of computer science, there are many ways to solve a problem and the possibilities expand exponentially. I just scratched the surface of the possibility of using CUDA to add pseudorandom number generation to your C# application. I hope this post will help you in your project.

Faster Sorting in C# by Utilizing GPU with NVIDIA CUDA

July 27, 2011 Leave a comment

Introduction

In this post I would like to give an entry level example of how you can use NVIDIA CUDA technology to achieve better performance in C# with the minimum possible amount of code. This is a very specific scenario where the main application is in C#, the sorting algorithm is provided in C++ as an API, and you have an NVIDIA CUDA capable graphics card. I hope this basic sample will give you some insight and encourage you to invest in parallel technologies.

Background

Parallel Computing
For more than two decades CPUs drove rapid performance increases and price reductions. With that, we got used to our code getting a speed boost simply by running on a new piece of hardware. As you may recall, this relentless performance improvement is described with two words: Moore's law. While the continuation of Moore's law is debated on many platforms because of the slowdown starting around 2003, many of us have already jumped on the parallel bandwagon using systems based on the SIMD or MIMD models. With parallel computing, application logic suitable for parallelization will most likely run faster without waiting for tomorrow's hardware. (It is "most likely" because there are theoretical limits explained by Amdahl's law, illustrated below.)
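
For reference, Amdahl's law can be written in the plain form below, where P is the fraction of the work that can be parallelized and N is the number of processors; the numeric example is mine, added for illustration:

Speedup = 1 / ( (1 - P) + P / N )

For example, with P = 0.9 and N = 8, the speedup is limited to about 1 / (0.1 + 0.1125) ≈ 4.7x, no matter how fast the parallel part runs.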

GPGPU
A GPU is a massively multithreaded processor which supports thousands of concurrent threads. Using the GPU to do general purpose scientific and engineering computing is named GPGPU. NVIDIA is one of the leading GPU manufacturers and they provide a fully programmable GPU technology for high-level languages like C, C++ and Fortran with the architecture named CUDA. NVIDIA hardware uses the SIMT (single-instruction multiple-thread) execution model rather than the popular SIMD model. In one of my previous posts I gave a SIMD example of how to utilize an SSE4 CPU instruction within C#. In this post we will be using the SIMT execution model through NVIDIA CUDA. (Under the cover, with SIMD, multiple data elements for a single instruction are collected and packed into a single register. With SIMT, on the other hand, all threads process data in their own registers.)

Thrust
Writing code using the CUDA API is very powerful in terms of controlling the hardware, but it requires handling memory and execution yourself, which is outside the scope of this post. Therefore I'll use a high-level programming interface named Thrust. As described at http://code.google.com/p/thrust/, Thrust is a CUDA library of parallel algorithms with an interface resembling the C++ Standard Template Library (STL). The Thrust library provides many fundamental building blocks like sorting, prefix-sums, reductions, transformations, etc.

Thrust consists only of header files and is distributed with the CUDA 4.0 installation. Therefore, you don't have to install it separately.

Radix Sort
The reason I chose sorting is that it is a widely used fundamental computational building block for many problems. Radix sort is one of the oldest algorithms and is very efficient for sorting small keys sequentially, with an algorithmic complexity of O(n). One of the two sorting algorithms Thrust provides is radix sort, therefore Thrust fits our example very well. There is a paper, Designing Efficient Sorting Algorithms for Manycore GPUs, which explains how the algorithm works on the GPU.

Setup

We will be using Visual Studio 2010 to host one C# application and one C++ library. The only additional API is the NVIDIA CUDA Toolkit 4.0. You will need to install the CUDA Toolkit and a supported graphics device driver from the same link.
Also do not forget to install the Build Customization BUG FIX Update from the same link or from here.

Implementation

The solution consists of one C# console application and a C++ native DLL. The C++ native DLL exports a method which contains the code to run the sort algorithm on the GPU. The C# code uses platform invoke to call the exported method of the C++ DLL.

For comparison purposes I'm also providing a C# implementation of radix sort. The C# implementation takes an array of 32 bit integers and processes it in three steps for every 8 bit part:
1) Creating a histogram of the data,
2) Running the prefix sum (scan),
3) Adding the data to the bins.

In some implementations the histogram is named count or difference, but here I used the name histogram, because there is a very good CUDA histogram implementation I may write about in a future post. This step basically counts how many times each byte value occurs in the array. Because the maximum byte value is 0xFF, the size of the histogram is 256, reflecting the maximum number of different values.

The second step is to run a prefix sum on the histogram, which simply adds each value to its right neighbor as it walks the array; in that way every element ends up containing the sum of all elements to its left, as in the small example below. Actually, there is a very efficient way to do a prefix sum in a parallel environment, explained in the paper Efficient Parallel Scan Algorithms for GPUs, which you may find useful for understanding the logic behind the GPU implementation.
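
Here is a tiny worked example of this exclusive scan, written in C-style code to mirror the C# loop further down; the sample values are made up:

//Illustrative only: exclusive prefix sum (scan) of a tiny 4-bin histogram
int histogram[4] = { 2, 0, 3, 1 };
int prefixSum[4];
prefixSum[0] = 0;
for (int i = 1; i < 4; i++)
    prefixSum[i] = prefixSum[i - 1] + histogram[i - 1];
//prefixSum is now { 0, 2, 2, 5 }: the starting index of each bin in the output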

The last step is to place the value into the correct bin by using the prefix sum result at the value index to determine the temporary array index.

The code is as follows:

public static void CPURadixSort(int[] sourceData)
{
 int size = sourceData.Length;
 //temporary array for every byte iteration
 int[] tempData = new int[size]; 
 //histogram of the last byte 
 int[] histogram = new int[256]; 
 //The prefix sum of the histogram
 int[] prefixSum = new int[256]; 
 			
 unsafe
 {
  fixed (int* pTempData = tempData)
  {
   fixed (int* pSourceData = sourceData)
   {
    int* pTemp = pTempData;
    int* pBck;
    int* pSource = pSourceData;
     
    //Loop through every byte of 4 byte integer
    for (int byteIdx = 0; byteIdx < 4; byteIdx++)
    {
     	int shift = byteIdx * 8; //total bits to shift the numbers
     
     	//Calculate histogram of the last byte of the data
     	for (int i = 0; i < size; i++)
     	 histogram[(pSource[i] >> shift) & 0xFF]++;
     
     	//Calculate prefix-sum of the histogram
     	prefixSum[0] = 0;
     	for (int i = 1; i < 256; i++)
     	 prefixSum[i] = prefixSum[i - 1] + histogram[i - 1];
     
     	//Get the prefix-sum array index of the last byte, increase 
        //it by one. That gives us the the index we want to place 
        //the data
     	for (int i = 0; i < size; i++)
     	 pTemp[prefixSum[(pSource[i] >> shift) & 0xFF]++] = pSource[i];
     
     	//Swap the pointers
     	pBck = pSource;
     	pSource = pTemp;
     	pTemp = pBck;
     
     	//reset the histogram
     	for (int i = 0; i < 256; i++)
     	 histogram[i] = 0;
    }
   }
  }
 }
}

One more detail about the code: I used unsafe code in order to swap the pointers for the temporary buffer and the source data, which gains another half second. But it is possible to remove that part and use tempData.CopyTo(sourceData, 0) instead.

When I ran the code above on an Intel Xeon 2.40 GHz CPU for 2^25 (33,554,432) random integers, it executed in 3.038 seconds.

The next step is to create the sorting code on the GPU and call it from our C# code. For this we need to create a DLL in C++ by using the default .Net project wizard. Please follow the steps below to accomplish this:
Step 1) Create a Visual C++ console project named “RadixSortGPU.dll” in Visual Studio 2010 by selecting DLL and Empty project on the wizard,

Step 2) Open Build Customization from the C++ project context menu, and check the CUDA 4.0(.targets, .props) checkbox,

Step 3) Open the project properties, expand the Configuration Properties, Linker and select Input. Edit the additional dependencies and add cudart.lib.

Step 4) Add a new empty source file ending with .cu and paste the following code into it:

#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <thrust/detail/type_traits.h>

extern "C" __declspec(dllexport) void __cdecl GPURadixSort(int*, unsigned int);

extern void GPURadixSort(int* data, unsigned int numElements)
{
	thrust::device_vector<int> d_data(data, data + numElements);

	thrust::stable_sort(d_data.begin(), d_data.end());

	thrust::copy(d_data.begin(), d_data.end(), data);
}

This code is the only program logic you need to sort a given integer array on the GPU. For the sake of this example I did not include any exception handling and validation. But for any production code you most probably want to add exception handling and validation logic.

As you can see in the code, getting started with Thrust is very simple: just include the required headers.

The dllexport definition makes GPURadixSort usable from within another library. This way we will be able to call this method from our C# code.

The GPURadixSort method itself simply copies the data array into a container named thrust::device_vector, which places the data into GPU global memory, sorts the data on the GPU, and copies the data back from GPU memory into our input array. The begin() and end() methods are iterators pointing to the data at the beginning and end of the vector. For more information on how to use vectors and iterators please visit the Thrust Quick Start Guide.

Step 5) I'm assuming you already have a C# project; paste the following code into your class,

[DllImport("RadixSortGPU.dll", CallingConvention = CallingConvention.Cdecl)]
public static extern void GPURadixSort(
 [MarshalAsAttribute(UnmanagedType.LPArray, ArraySubType = UnmanagedType.I4)]
 int[] data, uint numElements);


This code provides the information needed to call the GPURadixSort function exported from our native DLL.

Step 6) That’s it. Now you can call the GPURadixSort within your C# code to sort your array.

Based on my environment with an NVIDIA Quadro 600 GPU, the sort takes about 1.211 seconds, which is almost 3x faster than the CPU version. The NVIDIA Quadro 600 I used is a low profile graphics processor which costs about $150. 96 cores, 1 GB memory and 25.6 GB/s memory bandwidth may sound powerful, but it is low profile compared to GPUs like the Quadro 6000, with 448 cores, 6 GB memory and 144 GB/s memory bandwidth. There is also the Tesla family of GPUs for high performance computing needs, which you may want to check out at NVIDIA.

Beyond
Below I provide some code to check whether the GPU memory is enough to hold the given array, which is one of the important validations you want to perform if you target heterogeneous environments.

using namespace std;
...
extern void GPURadixSort(int* data, unsigned int numElements)
{
 int deviceID = -1;
 if (cudaSuccess == cudaGetDevice(&deviceID))
 {
     cudaDeviceProp devprop;
     cudaGetDeviceProperties(&devprop, deviceID);
     unsigned int totalMem = numElements * sizeof(int);
     if (devprop.totalGlobalMem < totalMem)
     {
        //Handle error
     }
 }
 ...
}

As I mentioned before, the powerful CUDA API resides under the high-level Thrust library. If Thrust does not have the functionality you are looking for, you can always use CUDA directly. It is not as easy as using Thrust, but if you follow the resources below, it will not take you long to start using this great technology.

Conclusion

We mostly use parallelism to gain performance when a program runs slowly or when we need to handle a huge amount of computation. But I believe that parallelism is a very important software aspect, allowing us to use the underlying hardware wisely and gain an advantage in this competitive market, where two functionally similar applications can only compete on software qualities like performance. There are many parallel technologies which will allow you to achieve higher performance with the same hardware. One of them is definitely NVIDIA CUDA, and some others you may want to look into are TPL, PPL, C++ AMP, OpenCL and MPI.

How to use CPU instructions in C# to gain performance

February 26, 2011 Leave a comment

Introduction

Today, the .Net Framework and C# have become very common for developing even the most complex applications very easily. I remember that before getting our hands on C# in 2002 we were using all kinds of programming languages for different purposes, ranging from Assembly and C++ to PowerBuilder. But I also remember the power of using assembly or C++ to squeeze every last drop of performance out of your hardware resources. Every once in a while I get to a project where using regular framework functionality puts my computer on the grill for a couple of days to calculate something. In those cases I go back to the good old C++ or Assembly routines to use the full power of my computer. In this post I will show you the simplest way to take advantage of your hardware without introducing much code complexity.

Background

I believe that samples are the best teacher; therefore I'll be using a sample CPU instruction from the Streaming SIMD Extensions (SSE). SSE is just one of the many instruction set extensions to the x86 architecture. I'll be using an instruction named PTEST from the SSE4 instruction set, which is present in almost all Intel and AMD CPUs today. You can visit the links above for supported CPUs. PTEST helps us perform a bitwise comparison between two 128-bit parameters. I picked this one because it is a good sample that also uses data structures. You can easily look up any other instruction set online for your project requirements.

We will be using unmanaged C++, wrap it with managed C++ and call it from C#. Don't worry, it is easier than it sounds. Thanks to MSDN we have all the necessary information in the Alphabetical Listing of Intrinsic Functions. The PTEST intrinsic _mm_testc_si128 is documented at http://msdn.microsoft.com/en-us/library/bb513983.aspx. If we were using plain C++ we would end up with the code from the MSDN link:

#include <stdio.h>
#include <smmintrin.h>

int main ()
{
    __m128i a, b;

    a.m128i_u64[0] = 0xAAAA55551111FFFF;
    b.m128i_u64[0] = 0xAAAA55551111FFFF;

    a.m128i_u64[1] = 0xFEDCBA9876543210;
    b.m128i_u64[1] = 0xFEDCBA9876543210;

    int res1 = _mm_testc_si128(a, b);

    a.m128i_u64[0] = 0xAAAA55551011FFFF;

    int res2 = _mm_testc_si128(a, b);

    printf_s("First result should be 1: %d\nSecond result should be 0: %d\n",
                res1, res2);

    return 0;
}

I would like to point out that there are many ways to develop software and the one I'm providing here is maybe not the best solution for your requirements. I'm just providing one way to accomplish the task; it is up to you to fit it into your solution.

Ok, let's start with a fresh new solution. (I'm assuming you have Visual Studio 2010 with the C++ language installed.)

1 ) Add a new C# console application to your solution, for the testing purpose
2 ) Add a new Visual C++ > CLR > Class Library project to your solution, named TestCPUInt
3 ) Add a reference to your console application and select TestCPUInt from the projects

Now we are ready to code.
4 ) Open the TestCPUInt.h file, if it is not already open
5 ) Insert the following code right at the top of the file

#include <smmintrin.h>
#include <memory.h>

#pragma unmanaged

class SSE4_CPP
{
//Code here
};

#pragma managed

This is the infrastructure for our unmanaged C++ code which will make the SSE4 call. As you can see, I placed our unmanaged code between #pragma unmanaged and #pragma managed. I think it is a great feature to be able to write unmanaged and managed code together. You may prefer to place the unmanaged code in another file, but I used one file for clarity. We use two include header files here, smmintrin.h and memory.h: the first one is for the SSE4 intrinsics and the other one is for the method I used to copy memory.
6 ) Now paste the following code at the //Code here location:

public:
	int PTEST( __int16* bufferA, __int16* bufferB)
	{
		__m128i a, b;

		//transfer the buffers to the _m128i data type, because we do not want to handle with that in managed code
		memcpy(a.m128i_i16, bufferA, sizeof(a.m128i_i16));
		memcpy(b.m128i_i16, bufferB, sizeof(b.m128i_i16));

		//Call the SSE4 PTEST instructions 
		return _mm_testc_si128(a, b);
	}

The _mm_testc_si128 intrinsic will emit the SSE4 PTEST instruction. We have a little memory operation right before it to fill out the __m128i data structures in the C++ code. I used memcpy to transfer the data from the bufferA and bufferB arguments to the __m128i data structures before pushing them to PTEST. I preferred to do this here to keep the whole SSE4 specific implementation separate. I could also have passed the __m128i structures into the PTEST method, but that would be more complex.

As I mentioned before, in this example I used the PTEST sample with a data structure; you may run into other instructions which require only a pointer, in which case you don't need the memcpy operation. There may be some challenges if you are not familiar with C++, especially since IntelliSense is removed for VC++ in Visual Studio 2010, but you can always check for answers online. For example, the __m128i data structure is visible in the emmintrin.h file, which is located somewhere under [Program Files]\Microsoft Visual Studio 10.0\VC\include. Or you can check all the fundamental data types if you are not sure what to use instead of __int16*.

7 ) Now paste the following code at the top of your managed C++ code, which is at the bottom of your TestCPUInt.h file in the namespace section.

namespace TestCPUInt {

	public ref class SSE4
	{
	public:
		int PTestWPointer(__int16* pBufferA, __int16* pBufferB)
		{
			SSE4_CPP * sse4_cpp = new SSE4_CPP();
			return sse4_cpp->PTEST(pBufferA, pBufferB);
		}

	};
}

What we did here is pass forward the pointers pBufferA and pBufferB, which we will provide from C#, into the unmanaged code. For those who are not familiar with pointers, the * sign defines a pointer: __int16* means a pointer to a 16 bit integer. In our case that is the address of the first element of an array. There are also ways to call a dynamic library without using managed C++, but as I mentioned before, I'm showing only the simplest way for a C# developer.

8 ) Let’s go to our C# code in the console application to use this functionality.
9 ) First we have to allow unsafe code in the application. To do that, go to the project properties and check "Allow unsafe code" under the Build tab.
10 ) Add the following code to your Program.cs file:

static int TestCPUWithPointer(short[] bufferA, short[] bufferB)
{
	SSE4 sse4 = new SSE4();
	unsafe
	{
		//fix the buffer variables in memory to prevent from getting moved by the garbage collector
		fixed (short* pBufferA = bufferA)
		{
			fixed (short* pBufferB = bufferB)
			{
				return sse4.PTestWPointer(pBufferA, pBufferB);
			}
		}
	}
			
}

If you have never used unsafe code before, you can check out unsafe (C# Reference). Actually, it is fairly simple: PTestWPointer requires a pointer to an array, and the only way to get the pointer to an array is to use the fixed statement. The fixed statement pins my buffer array in memory in order to prevent the garbage collector from moving it around. But that comes with a cost: in one of my projects the system was slowing down because of too many fixed objects in memory. You may have to experiment for your own project.
That’s it!

But we will not stop here; for comparison I did the same operation in C#, as seen below:

static int TestCLR(short[] bufferA, short[] bufferB)
{
	//We want to test if all bits set in bufferB are also set in bufferA
	for (int i = 0; i < bufferA.Length; i++)
	{
		if ((bufferA[i] & bufferB[i]) != bufferB[i])
			return 0;
	}
	return 1;
}

Here I simply check whether every bit set in bufferB is also set in bufferA; PTEST does the same. For example, if bufferA holds 1011 and bufferB holds 0011 (in binary) the result is 1, but if bufferB holds 0100 the result is 0.

In the rest of the application I compared the performance of these two methods. Below is the code which does the comparison for the sake of testing:

static void Main(string[] args)
{
	int testCount = 10000000;
	short[] buffer1 = new short[8];
	short[] buffer2 = new short[8];

	for (int i = 0; i < 8; i++)
	{
		buffer1[i] = 32100;
		buffer2[i] = 32100;
	}
					
	Stopwatch sw = new Stopwatch();
	sw.Start();
	int testResult = 0;
	for (int i = 0; i < testCount; i++)
		testResult = TestCPUWithPointer(buffer1, buffer2);	
	sw.Stop();
	Console.WriteLine("SSE4 PTEST took {0:G} and returned {1}", sw.Elapsed, testResult);

	sw.Restart(); //restart the stopwatch so the C# test is timed separately
	for (int i = 0; i < testCount; i++)
		testResult = TestCLR(buffer1, buffer2);
	sw.Stop();
	Console.WriteLine("C# Test took {0:G} and returned {1}", sw.Elapsed, testResult);

	Console.ReadKey();
}

In my environment I gained about 20% in performance. In some of my projects I have gained up to 20-fold performance.
One last thing I would like to do is show you how to move the fixed usage from C# to managed C++. That makes your code a little cleaner, like the one below:

static int TestCPU(short[] bufferA, short[] bufferB)
{
	SSE4 sse4 = new SSE4();
	return sse4.PTest(bufferA, bufferB);
}

As you can see, it is only a method call. In order to do this we have to add the following code to the SSE4 ref class in the TestCPUInt.h file:

int PTest(array<__int16>^ bufferA, array<__int16>^ bufferB)
{
	pin_ptr<__int16> pinnedBufferA = &bufferA[0]; // pin pointer to first element in arr
	__int16* pBufferA = (__int16*)pinnedBufferA; // pointer to the first element in arr
	pin_ptr<__int16> pinnedBufferB = &bufferB[0]; // pin pointer to first element in arr
	__int16* pBufferB = (__int16*)pinnedBufferB; // pointer to the first element in arr

	SSE4_CPP * sse4_cpp = new SSE4_CPP();
	return sse4_cpp->PTEST(pBufferA, pBufferB);
}

This time our method takes a managed array object instead of an __int16 pointer and pins it in memory, like we did with fixed in C#.

Conclusion

I believe that no matter how high level the frameworks we use become, there will always be situations where we have to use our hardware resources more wisely. Sometimes these performance improvements save us a big amount of hardware investment.
Please go ahead and look into CPU instructions, parallel computing, GPGPU, etc. and understand the hardware; knowing your tools better will make you a better software architect.

Careful with Optional Arguments in C#4.0

February 20, 2011 2 comments

Introduction

Some time has passed since optional arguments were introduced with Visual C# 2010 and we have all gotten used to the convenience of not having to define method overloads for every different method signature. But recently I came across a limitation of using optional arguments in enterprise solutions, and now I'm using them with care.

The limitation is that if you use optional arguments across libraries, the compiler will hard-code the default value into the consumer and prevent you from re-deploying the provider library separately. It is a very common scenario in enterprise applications to have many libraries with different versions married to each other, and this limitation makes it impossible to re-deploy only one DLL without re-deploying all related libraries.

In this post I will explain the details of this limitation. As I mentioned in my previous post, Under the hood of anonymous methods in C#, it is very important to know the underlying mechanics of a feature you are using. This post is similar, explaining the mechanics of optional arguments.

Background

I will not explain what optional arguments are, because they have been around for a while now. But I will give a quick reference to the description for those who are new to C#:

“Optional arguments enable you to omit arguments for some parameters. … Each optional parameter has a default value as part of its definition. If no argument is sent for that parameter, the default value is used. Default values must be constants.” (Named and Optional Arguments (C# Programming Guide))

I ran into this limitation while using some functionality from a secondary library, injected through an IoC container. The secondary library was accessed through an interface where some methods had optional arguments. I had everything deployed and working until I had to make some changes to the secondary library and alter an optional argument. After re-deploying the secondary library I noticed that the changes did not take effect, and when I went into the IL code I found that the main library had the constant hard-coded into it.

In my situation I had interfaces, injection and a lot of complexity; to better picture the situation I will use the simplest form of the limitation, as in the following sample:

  • ProjectB has a method named Test with the following signature:


public void Test(string arg1 = "none")
{
}

  • ProjectA is referencing ProjectB and using the Test method with its default argument by making the following call:


static void Main(string[] args)
{
Class1 class1 = new Class1();
class1.Test();
}

This works very well, because ProjectA is just using the arg1 value as "none". Now let's look at what is happening behind the scenes:

If we compile ProjectA along with ProjectB and analyze the IL code, we will see that the optional argument is nothing more than a compiler feature. Because calling Test() is no different from calling Test("none"), the compiler compiles our code as Test("none"). That can be seen in the IL code and the disassembled C# code below; the string constant "none" is hard-coded into ProjectA.


.method private hidebysig static void Main(string[] args) cil managed
{
...
L_0008: ldstr "none"
L_000d: callvirt instance void [ProjectB]ProjectB.Class1::Test(string)
...
}


private static void Main(string[] args)
{
new Class1().Test("none");
}

For tightly coupled libraries or in-library usage it is good that the compiler helps us eliminate some code and makes our life easier. But this comes with a price:

Let’s say we had to modify the Test method in ProjectB as Test( string arg1 = “something” ) and re-deploy it without re-deploying ProjectA. In this case ProjectA would still be calling the Test method with “none”.

Conclusion

Knowing this, it is good to use the optional arguments with caution across libraries when you have to support deployment scenarios with partial solutions.

Under the hood of anonymous methods in C#

January 24, 2011 Leave a comment

Introduction

In this post I'll help you make better decisions about using C# anonymous methods. We all have our favorite or familiar ways to program certain solutions. But I believe that providing the right solution for the problem requires us to free our minds of all biases and think in the context of the problem. Every feature, including GOTO, has a place in this endless solution universe; therefore I'm not telling you what not to use, I'm just showing what's under the hood, so you can make your own decision based on your unique solution.

Background

There are situations where anonymous methods are really useful, especially if you have a small operation you want to do inline. On the other hand, you lose Readability and Testability, two important qualities of enterprise software. In smaller teams or developer-owned projects, these qualities are sometimes omitted. (Nothing bad with that; as I said, there is a place for every feature.) In bigger teams, especially if you have to review or continue another programmer's code, it is really hard to read a line with complex inline code. I have faced this problem many times during code reviews and code ownership changes.

So, let me show you what is going on under the hood, and you can make your own decision:

class Anonymous
{
	public void Method()
	{
		bool test = true;
		Action a = ()=> test = false;
	}
}

Figure – 1

In the example I used the anonymous method with an Action to keep the disassembled IL code simple. The code looks very straightforward and is simple to write; there is only one delegate, which captures the boolean variable test and sets its value to false.

I kept the code simple to show the structural changes. There is not even a call to the delegate; structurally that does not have any effect. What happens to the structure in the background after you compile is shown in the disassembled MSIL code below (IL bodies collapsed on purpose, to point out the structure):

.class private auto ansi beforefieldinit Anonymous
    extends [mscorlib]System.Object
{
    .method public hidebysig specialname rtspecialname instance void .ctor() cil managed
    {
    }

    .method public hidebysig instance void Method() cil managed
    {
    }

    .class auto ansi sealed nested private beforefieldinit <>c__DisplayClass1
        extends [mscorlib]System.Object
    {
        .custom instance void [mscorlib]System.Runtime.CompilerServices.CompilerGeneratedAttribute::.ctor()
        .method public hidebysig specialname rtspecialname instance void .ctor() cil managed
        {
        }

        .method public hidebysig instance void b__0() cil managed
        {
        }

        .field public bool test

    }
}

Figure – 2

As you can see, a private sealed nested class named <>c__DisplayClass1 is created with one method named b__0 for our anonymous method. In the collapsed code, our original Method is modified to create a new instance of <>c__DisplayClass1 and use its b__0 method for the delegate. For those who are new to IL, I manually translated it back into C# code below:

class ILAssembled
{
	public void Method()
	{
		nested nt = new nested();
		nt.test = true;
	}

	private sealed class nested
	{
		public bool test;

		public void MethodNested()
		{
			test = false;
		}
	}
}

Figure – 3

The compiled IL code of the C# code above is exactly the same (except for some unimportant small differences). It is important to know that this IL result is for the code above; in fact, if I moved the local variable test to an instance variable, a method would be created instead of a nested class. Therefore, it is always safer to take a look at the disassembled IL code. My favorite tool is Red Gate's .NET Reflector, which can disassemble into many languages. Another big advantage of looking into the compiled code is to study the .NET Framework namespaces and learn good practices.

Let's go back to our code. This does not mean that you could just write the code in Figure 3 instead of the code in Figure 1 and save some compile time :). The point is that the magic is in the compiler, in exchange for testability, readability and maybe some performance, because you end up with a nested class instead of a method.

Instead, if I wrote the code below, I would gain all the qualities I was looking for in my unique solution:

public void Method()
{
	bool test = true;
	MyAction(test);
}
private void MyAction(bool test)
{
	test = false;
}

Figure – 4

As you can see, I created a method named MyAction to implement the functionality. Now I can create a unit test for MyAction and my colleagues can also understand the code at first glance. It is really important to be able to understand code at first glance if you are reviewing tens of classes; cryptic variable and method names, overusing the implicit type var and overusing anonymous methods double or triple the review time.

For clarity I also provide below the disassembled IL code structure of Figure 4 (IL bodies collapsed on purpose, to point out the structure):

.class private auto ansi beforefieldinit NonAnonymous
    extends [mscorlib]System.Object
{
    .method public hidebysig specialname rtspecialname instance void .ctor() cil managed
    {
    }

    .method public hidebysig instance void Method() cil managed
    {
    }

    .method private hidebysig instance void MyAction(bool test) cil managed
    {
    }

}

Figure – 5

As you can see, there is no nested class declared to run the anonymous method.

Conclusion

I believe knowing what you are using is the key to making the right decision for your unique solution.