Business, Machine Learning

Alien, Zombie, Asteroid and now “AI Bias”

In 2016, a team of scientists from Microsoft Research and Boston University studied how machine learning runs the risk of amplifying biases present in data, especially gender bias (“Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings”). The research team revealed that word embeddings trained even on Google News articles exhibit female/male gender stereotypes to a disturbing extent.

Until 2009, Amazon was de-ranking LGBT books by mistakenly classifying them as “adult literature.” Amazon stated: “We recently discovered a glitch to our Amazon sales rank feature that is in the process of being fixed. We’re working to correct the problem as quickly as possible.”

Amazon may have fixed the problem in one algorithm, but in 2016 Bloomberg analysts revealed that Amazon Prime same-day delivery service areas excluded ZIP codes along racial lines to varying degrees.

Even today, in 2018, we still see gender bias in machine-learning-powered tools such as Google Translate. The underlying algorithm associates doctors with men and nurses with women when translating between gender-neutral and gender-inclusive languages.

I’m confident that most of these solutions don’t intend to include biases, but ignorance is not the same as innocence. Humanity has a long history of violence and discrimination, and the default tendency of a machine-learned system based on human data is to inherit these biases, causing disastrous effects in data-hungry Artificial Intelligence (AI).

The problem is much too broad to be solved just by changing the algorithms; it touches every part of a business, from engineers to product managers and executives. In this article, I’ll surface some of the causes of the “bias” problem and provide a few suggestions to prevent it. I’ll be using the term “bias” to mean disadvantageous treatment or consideration of anybody or any group; that is, being treated worse than others for some arbitrary reason.


From the civic announcements at the Agora in ancient Greece to the news notifications on your mobile app, information has historically been “served” to you. This means that unless you did your own scientific research and experimentation, the information you receive always has some bias. And, unlike the observational error you would otherwise encounter, the bias in served information has much more complicated causes. To start, I’ll divide the bias in served information into two categories: Intentional and Unintentional.

An example of unintentional bias is the gender bias seen in the relationships between words taken from Google News. An example of intentional bias is the incredible amount of cryptocurrency news or one-sided political news I see nowadays.

While information came in manageable amounts, the only source of intentional and unintentional bias was human. But once we crossed the line where we had more information than we could consume, we gave rise to technologies that added their own error and significantly amplified the source bias. Today, these technologies are in our search engines, news feeds, social feeds, translators, and many more tools.

One of these technologies is the recommendation engine. Recommendation engines solve a massive problem of the digital age: information overload. Even though they can also introduce a filter bubble, they are still the best technology available today. For example, at Yahoo, the recommendation engine we developed was able to select a handful of news articles with outstanding relevance out of a million articles, within single-digit milliseconds, for each of a billion users. In the absence of a recommendation engine, you would need to read through a million news articles every day to find the ones relevant to your liking. The same approach powers your LinkedIn, Twitter, and Facebook feeds, and even the search engines you use for general, hotel, and flight searches.

Based on my observation, all the information I receive on the internet through standard tools comes from one search or recommendation machine learning algorithm or another. And, underneath the surface of these machine learning algorithms, I see four different areas where we need to monitor and control bias:

  1. Data source
  2. Data processing
  3. Model
  4. Inference

Since machine learning algorithms model data, which can be anything from digitized information to the environment and digital bits, identifying the bias in the source information is crucial for the downstream systems to function fairly toward humans.

Drilling a little more into the concept of data sources, I see at least three types of machine learning data sources when it comes to serving information to humans: Content, Context, and Activity. Content is data produced with the purpose of being presented directly to a human. Context is the state of the environment relative to the content. And Activity data is generated as a result of the interaction between a user and content in a context.

Every one of us carries a temporal, subjective, and intersubjective worldview into everything we create: our articles, photos, movies, songs, paintings, software, and more. On the intentional side, we are consciously aware of the worldviews we choose, but there are also worldviews we are oblivious to, the unintentional ones, which originate from our paradigms and our social circles. These biases can arise from cognitive error, conflicts of interest, context/environment, or prejudice. For example, an analysis of user comments on daily popular news articles revealed that the average user comment has a negative tone regardless of the news topic. On the contrary, most of those people would disagree that they are negative.

Given the problems in human history, these biases are not surprising, but things get dangerous when we use this unintentionally biased content to create machine learning models; the models inherit the biases with a degree of error and operate on them. Unfortunately, the biases carried into machine learning algorithms are not visible to the human eye unless we deliberately expose them.

Black box AI is the name we give to machine learning models we don’t care to understand. I say “we don’t care” because in most cases, such as deep learning, it can become very labor-intensive to explain every factor; understanding the bias alone is a project of its own.

Systems like recommendation engines mainly try to predict users’ behavior based on historical and collaborative internet activity. This approach causes algorithms to create an information isolation named the Filter Bubble. The nature of this isolation depends on the representations of the user activities in the system and can be anything from social, cultural, economic, and ideological to behavioral. If not given attention, filter bubbles can be intentionally or unintentionally used to steer public opinion toward a particular bias. For example, in 2013, Yahoo researchers found that web browsing on Yahoo Finance can anticipate stock trading volumes. This means a bias in the financial news ranking could affect user activity and hence affect stock trading volumes.



It is every data scientist’s, product manager’s, and engineer’s responsibility to have a robust strategy to detect, expose, and remove the biases in AI products and services. While there are hundreds of possible biases, I think the following critical biases are a good start for every content-based machine learning system:

  • Racism
  • Sexism
  • Cynicism
  • Framing
  • Bullying
  • Favoritism
  • Lobbying
  • Classism
  • Polarity

One of the ways to detect these biases is to model the bias using a class of NLP (Natural Language Processing) techniques named Sentiment Analysis. Today, sentiment analysis is possible using human-provided training data (e.g., sentiment labels) as well as unsupervised learning techniques like the Unsupervised Sentiment Neuron. Also, in recent years, RNN (Recurrent Neural Network) algorithms have become very popular in solving NLP problems.
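To make the idea concrete, here is a minimal lexicon-based sentiment scorer. The word lists and scoring rule are toy assumptions for illustration, not a trained model:

```python
# Minimal lexicon-based sentiment scorer (illustrative sketch).
# The word lists below are made up; a real system would use a trained model
# or a curated lexicon, not a handful of hard-coded words.
POSITIVE = {"good", "great", "excellent", "fair", "helpful"}
NEGATIVE = {"bad", "terrible", "unfair", "biased", "harmful"}

def sentiment_score(text: str) -> float:
    """Return a score in [-1, 1]; negative values suggest negative tone."""
    words = text.lower().split()
    hits = [1 if w in POSITIVE else -1
            for w in words if w in POSITIVE or w in NEGATIVE]
    return sum(hits) / len(hits) if hits else 0.0

print(sentiment_score("great and helpful article"))    # positive score
print(sentiment_score("terrible and unfair coverage")) # negative score
```

Running such a scorer over a corpus of comments is the simplest way to surface the kind of systematic negativity mentioned above.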

Preventing Filter Bubbles

One approach to avoiding filter bubbles is building exploration-exploitation tradeoff strategies. The exploration-exploitation tradeoff allows the system to strike a balance between serving information “from outside” and “more about” the filter bubble. Some techniques address the problem using multi-armed bandit solutions.
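As a sketch of the idea, the following epsilon-greedy bandit balances exploring new content against exploiting what the user already clicks on; the arms and click-through rates are invented for illustration:

```python
import random

# Epsilon-greedy multi-armed bandit sketch: with probability eps the system
# explores (serves content outside the bubble); otherwise it exploits the
# best-known arm. The three "arms" stand in for hypothetical content sources.
def epsilon_greedy(values, eps=0.1):
    if random.random() < eps:
        return random.randrange(len(values))                 # explore
    return max(range(len(values)), key=lambda i: values[i])  # exploit

def update(counts, values, arm, reward):
    counts[arm] += 1
    # incremental mean of observed rewards for this arm
    values[arm] += (reward - values[arm]) / counts[arm]

# toy simulation: arm 1 has the highest true click-through rate
true_ctr = [0.05, 0.20, 0.10]
counts, values = [0, 0, 0], [0.0, 0.0, 0.0]
random.seed(42)
for _ in range(5000):
    arm = epsilon_greedy(values)
    reward = 1 if random.random() < true_ctr[arm] else 0
    update(counts, values, arm, reward)

print(max(range(3), key=lambda i: values[i]))  # the best arm found
```

Even while exploiting the best arm, roughly 10% of recommendations still come from outside it, which is exactly the “from outside the bubble” traffic the tradeoff is meant to guarantee.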

Glass Box AI

Today, we see more and more researchers and companies moving into this area and creating technologies to explain machine learning models. One of these technologies is LIME, which can currently explain any black box classifier with two or more classes.
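The idea behind such explainers can be illustrated with a simplified perturbation-based sketch. This is not the LIME algorithm itself, and the black-box model below is a made-up stand-in; the point is only that perturbing inputs and watching the output exposes which features drive a prediction:

```python
import random

# Simplified illustration of perturbation-based explanation: nudge each input
# feature, observe how the black-box output moves, and rank features by impact.
def black_box(features):
    # hypothetical model: weights feature 0 heavily and ignores feature 2
    return 3.0 * features[0] + 1.0 * features[1] + 0.0 * features[2]

def feature_impacts(model, x, n_samples=200, scale=1.0):
    random.seed(0)
    base = model(x)
    impacts = [0.0] * len(x)
    for _ in range(n_samples):
        for i in range(len(x)):
            perturbed = list(x)
            perturbed[i] += random.gauss(0, scale)
            impacts[i] += abs(model(perturbed) - base)
    return [v / n_samples for v in impacts]

impacts = feature_impacts(black_box, [1.0, 1.0, 1.0])
print(impacts)  # feature 0 has the largest impact, feature 2 none
```

Real tools like LIME refine this with locally weighted surrogate models, but the diagnostic principle is the same.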

Another step toward transparency is DARPA’s Explainable AI (XAI) program, which aims to produce “glass box” models that are explainable to a “human-in-the-loop.” Also, leading researchers like Kate Crawford are studying the social implications of AI and bringing more and more awareness to the industry.

On the commercial side, companies like Optimizing Mind are developing technologies to understand how deep learning models interpret each component of the input.


While we are introducing more and more AI technologies into our processes, it is everybody’s responsibility to understand the bias issues and take the necessary precautions.

In this article, I presented just a few aspects of the dangers of artificial intelligence solutions. In our AI courses for Product Managers and C-level executives, we provide strategies to prevent AI issues in organizations, products, and services. If you are interested, please email me or check out our courses.


Product Management skills to survive in the AI era

In 2016, tech giants such as Baidu and Google alone spent $20B-$30B on AI, and 62% of all enterprises expect to hire a Chief AI Officer in the future. The share of jobs requiring Artificial Intelligence skills in the US has grown 450% since 2013, and corporations are relentlessly seeking technical professionals as well as product leaders who can apply AI technologies to their products and services to improve the company’s bottom line or top line. This is called the Fourth Industrial Revolution, and it is happening right now, right here.

However, as a product manager (PM) or a potential product manager, how do you gain the necessary knowledge to analyze, understand, plan, and design products based on Artificial Intelligence technologies? Since you cannot yet get a college degree in AI Product Management, how do you adapt to this rapid change?

As an AI consultant in Silicon Valley, I get to talk to many C-level execs and product managers, as well as software engineers who want to move into product management. These are professionals from companies of all sizes, and they have two things in common: First, they see the fourth industrial revolution happening and want to make smart moves into AI domains and technologies at the right time. Second, they have a hard time defining a framework and where their responsibilities start and end in this new domain.

Regardless of whether your organization is willing to use AI technologies in its products, services, or internally, the PM role alone is hugely impacted, and drawing the lines between technical responsibilities and product responsibilities is not easy. To help you with both of these questions, and to guide your way to becoming an AI Product Manager (AIPM), I want to introduce a four-step framework.

In the rest of this article, I give an overview of this framework; if you are interested in more details, please feel free to contact me. Alternatively, please check out our AIPM certification course, where we give the entire training in depth.

AI Product Manager’s Skills:

  1. Have solid core Product Management skills
  2. Have industry specific business domain expertise
  3. Gain specific AI solution understanding
  4. Gain AI Product Lifecycle knowledge

1. Solid Core Product Management Skills

In the AI era, it is crucial to success not just to understand the technology but also to have core PM skills. Since AI solutions can touch the vision, strategy, team, product, marketing, partnerships, and support, it is essential to understand how these business aspects operate together. For example, when trying to implement an AI solution without the know-how to balance customer needs, team capabilities, and business constraints, it is almost guaranteed that the time-features-cost-quality equation will be unbalanced.

Product Managers Need To Be Able To Balance Contradicting Business Aspects

The ways to gain core Product Management skills are beyond the scope of this article. In addition to online product management courses, I have seen that the Business Model Canvas (by Strategyzer) and similar frameworks and techniques help enormously to see the bigger picture and to put on the second CEO hat in any organization.

Business Model Canvas

2. Industry Specific Business Domain Expertise

Without specific domain knowledge, a PM will not be able to ideate, design, create and release viable products in that domain. The domain is, in this case, a specific industry or product. This requirement is no different with AI Product Management; the market, the regulations and the business model of the organization need to be understood.

However, even with a core business model understanding, it is not always possible to implement AI solutions in those businesses. The reason is that, unlike other technologies, AI can bring change to every aspect of the organization and can require different business perspectives. Therefore, it is common that business processes need analysis from an AI perspective. Take, for example, a visual quality inspection process. The solution is not as isolated as it sounds; we could integrate a feedback mechanism and make the whole production line automatically optimize itself. In such a case, not just the final step but also the rest of the processes need evaluation.

Below I will go into more detail on how to analyze from an AI perspective, but in general, a Business Analysis Framework like the one in the following diagram helps organize the gathered information.

Business Analysis Framework

Based on my experience, I have seen that for any given business process there are at least four AI opportunities. It is the product manager’s role to go over each business process and identify which of these opportunities are available:

  1. Automation Opportunities
  2. Optimization Opportunities
  3. Expansion Opportunities
  4. Innovation Opportunities

Automation opportunities exist in proven and well-working business processes. One cannot automate a broken process; therefore, it is essential to understand the requirements and performance metrics before an automation decision. Some examples are processes where human error is high, human performance is too low, or the recall rate is low.

Optimization opportunities exist in well-working automated processes where the software and hardware technology is usually old and new alternatives are available. PMs need to know the baseline key metrics and the goal, and be able to walk through them. During optimization projects, the interaction with the science team is usually more frequent than during automation projects. I will say more about the AI product lifecycle later, but at a very high level, the PM needs to be able to follow a rapid experimentation cycle with various AI solutions. Also, optimization efforts get more difficult as the actual performance approaches the Bayes error, which is the lowest possible error rate for your AI solution. For example, the object detection task in the Large Scale Visual Recognition Challenge (LSVRC) competition has already exceeded human performance, and improving such an algorithm further requires significant effort and new approaches.

Expansion opportunities arise when the goal is to apply working automated processes to different geographical regions or to different products or services. These opportunities are common in large organizations, where in some cases AI capabilities are underutilized and newer technology is available. For example, applying a chatbot solution from one product to another after making small changes, or expanding an e-commerce recommendation engine to international markets.

Innovation opportunities arise when a new and perhaps unproven business process is needed. These opportunities are comparable to creating a new product or starting a new start-up, where there is a continuous search for a model. It is an iterative process of defining, measuring, and validating various hypotheses to achieve the desired goal. On the technical side, in most cases a new approach or algorithm is needed, which increases uncertainty and overall project complexity.

Complexity is a significant factor when deciding on a technology, methodology, and team structure in every AI project. It is usually the case that a research-oriented solution is more complex in the areas of organization, technology, process, and regulation. The complexity of the four AI opportunities is shown in the diagram below:

Complexity In AI Projects

It is also essential to know the driving factors of an AI project. Since these factors differ from organization to organization, and even from project to project, it is the PM’s responsibility to identify them. Below are the top five elements I’ve seen:

  1. Competition
  2. Customer Demands
  3. Market
  4. Corporate Goals
  5. Venture Capital

3. Specific AI Solution Understanding

Today, there is endless information available about AI solutions on the Internet, with articles ranging from marketing solutions to how to train an image recognition algorithm. But the commonly missing information is the strategy and explanation of how a PM can design a solution for their specific business. The truth is that there is no one-size-fits-all solution, and the PM needs to gain the necessary understanding and equip themselves with the right framework and techniques to build a custom strategy on a per-project basis.

Below is an excellent framework to follow to match any business process to the AI solution space:

  1. Study the relevant AI technology landscape
  2. Study the corresponding AI solution domain
  3. Evaluate AI solution alternatives

The first step is to study the AI landscape to understand where the AI technologies related to the business stand. When we look at different companies and even different business processes, we see that technology did not evolve equally in every area. For example, fraud detection and advertising are using very advanced algorithms today. One of the reasons is that these industries have been competing for a decade. On the other hand, some industries, like healthcare and especially drug development, are not using AI as much as advertising.

We dive into these aspects in our AIPM course, but to give an idea, I have provided below a high-level timeline for PMs to determine the stage of a business process.

The AI Timeline

The second step is to study the relevant AI solution domains. AI technologies can be utilized to play the following roles in any business process:

  • Task Processing
  • Decision Support
  • Decision Making

These roles map directly to the AI stages shown in the previous diagram and are not always viable for every business process. For example, it looks like critical healthcare systems will never play the third role. On the other hand, today’s online advertising automatic bidding systems make a substantial number of decisions every second. Therefore, the PM’s role is to consider these roles when looking for an AI opportunity and to analyze aspects like operability, reliability, and compliance related to each of these roles during requirement analysis.

Nowadays, we categorize the AI solution domain based on three fundamental techniques:

  • Supervised Learning
  • Unsupervised Learning
  • Reinforcement learning

The PM needs to understand the business solutions in each of these categories and be able to foresee the project lifecycle involved. For example, a solution to predict diabetes cases from historical patient lab data falls under supervised learning algorithms; the nature of this category is to require labeled data, and therefore the PM has to plan for the labeling effort.
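To ground the diabetes example, here is a toy supervised classifier over hypothetical, hand-labeled lab measurements. The numbers and labels are made up; the point is that without the labeling effort there is no training set at all:

```python
# Toy supervised-learning sketch: a 1-nearest-neighbor classifier over
# hypothetical (glucose, BMI) measurements. Real projects need a rigorous
# labeling plan and far more data; this only illustrates the mechanics.
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def predict(train, point):
    # classify a new patient by the label of the closest labeled example
    nearest = min(train, key=lambda row: euclidean(row[0], point))
    return nearest[1]

labeled_data = [  # (features, label) pairs produced by the labeling effort
    ((85, 22.0), "no_diabetes"),
    ((90, 24.5), "no_diabetes"),
    ((160, 31.0), "diabetes"),
    ((170, 33.5), "diabetes"),
]
print(predict(labeled_data, (165, 32.0)))
```

Every row in `labeled_data` represents work someone had to do: a clinician or annotator attaching an outcome to a patient record, which is exactly the planning item the PM owns.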

4. AI Product Lifecycle knowledge

If you are an experienced PM, then you most probably know that when it comes to project methodologies, frameworks, and techniques, there is no one-size-fits-all solution. As soon as I say I’ve seen it all, another method comes along. I think the most crucial PM trait is knowing the pros and cons of each methodology and then working with the project manager.

What is the right project management methodology, framework or technique?

AI projects tend to be different from regular software projects by having procedures like data analysis, model research, and experimentation, etc. Therefore, given the requirements and business case, the PM is responsible for identifying the right method for the project.

Below is an example AI product lifecycle with research and A/B test iterations. I also find it valuable to adopt an agile model and adhere to the agile software development manifesto.

AI Product Lifecycle

In this article, I presented just a few aspects of the broader frameworks and methods we teach in our AIPM course. If you are interested in an AIPM career, please email me. Also, please join our LinkedIn group, “From Engineering to PM.”

Business, E-commerce

If Your E-commerce Sales Didn’t Grow Last Year By 30% Here Is How Artificial Intelligence Can Help


At $410 billion, e-commerce sales are only 9% of all retail sales in the U.S., and they keep increasing by a whopping 14% YoY. They are predicted to grow to $600 billion by 2020. Perhaps you did not see a 32% growth in your e-commerce revenues like Amazon did last year, but did you at least hit the 14% average? If not, here are a few possible reasons why, and how Artificial Intelligence can help.

You probably have thousands of products, a website with the latest e-commerce features, and a mobile app. You may also have invested millions in brand awareness, user acquisition, and user behavior analysis but did not see any significant lift in your top line.

When it comes to marketing funnels, there isn’t really a magic formula; success comes from doing small things in a big way. There are countless aspects you need to get right: customer service, return policies, deals, customer reviews, seller reviews, autocomplete suggestions, finding items, displaying customer reviews, product promotions, price comparison, aggregated listings, rich product information, discounted shipping, and much more.

In fact, take a look at the solutions of the top e-commerce retailers today. Their product offerings, websites, and apps are very similar on the surface, yet Amazon has a 43.5% market share compared to 6.8% for its closest competitor, eBay (source: Recode):

  • 43.5% Amazon
  • 6.8% eBay
  • 3.6% Apple
  • 3.6% Walmart
  • 1.5% Home Depot
  • 1.4% Best Buy
  • 1.2% Macy’s
  • 0.9% Wayfair
  • 0.9% Costco

But what are these small differences between solutions, and how can you improve upon them? Let’s take a very simple feature, “search autocomplete suggestions,” and see how this technology differs. Below are the results of the “search autocomplete suggestions” feature for the “Desk lamp” keyword on three different websites:

One easy-to-spot difference is that Wayfair is expanding the query by prefixing categories, whereas Amazon and Walmart are using some kind of intelligence. At a very high level, the intelligence seems to be a combination of past public query patterns and maybe some additional logic. Since “Desk lamp white” is not even a correct English phrase, I would say there is a flavor of similarity based on language understanding. The “Desk lamp for an office in home” top suggestion of Walmart is a very niche category and isn’t anything I’m looking for. It is also surprising that none of the solutions seems to have personalization built in; regardless of the account, the autosuggest results are the same. It would be possible to fix the irrelevant suggestions with a personalization solution that includes the user’s own query and purchase history.

There are also many more data points we can associate with a search term to make it more intelligent. These include:

  1. the quality and quantity of the product data related to the search term,
  2. the cumulative product review scores related to the search term,
  3. the cumulative product sales volumes related to the search term.

In summary, the optimization objective of “search autocomplete suggestions” is maximizing sales, and that requires finding the right balance between many attributes.

Another example of an intelligent feature on Amazon is its real-time price comparison. We know that Amazon does not always have the lowest prices, but it has the intelligence to fine-tune prices in real time such that it produces wider margins on less competitive items and private-label goods, where the user is less likely to compare prices.

This kind of intelligence requires a vast amount of data extracted from pricing, customer reviews, product rank, customer-product behavior, collaborative aspects, etc., and a vast amount of processing power to support a volume of 183 million monthly active users. This is where artificial intelligence solutions come into the picture.
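A heavily simplified version of such margin tuning might look like this. The thresholds, multipliers, and the comparison-likelihood input are invented for illustration; real systems estimate these from behavioral data:

```python
# Hedged sketch of real-time margin tuning: stay competitive on items shoppers
# compare, and take a wider margin on items (e.g. private label) they don't.
def suggested_price(cost, competitor_price, comparison_likelihood):
    """Return a price; the 0.7 threshold and multipliers are illustrative."""
    if comparison_likelihood >= 0.7:
        # heavily compared item: undercut the competitor slightly, never below cost
        return max(round(competitor_price * 0.99, 2), cost)
    # rarely compared item: take a wider margin
    return round(cost * 1.40, 2)

print(suggested_price(cost=10.0, competitor_price=15.0, comparison_likelihood=0.9))
print(suggested_price(cost=10.0, competitor_price=15.0, comparison_likelihood=0.2))
```

The same cost and competitor price yield different prices purely because of the predicted comparison behavior, which is the behavioral-data dependence described above.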

In addition to Amazon, many large tech companies like Apple, Google, Facebook, Intel, and Microsoft have invested decades into artificial intelligence, and now they are riding the wave and reaping the benefits. The sales gap between Amazon and its competitors is currently growing because of Amazon’s ability to collect vast amounts of data, use artificial intelligence to understand it, and then feed it back into the marketing and sales channels. Hopefully, this is not a complete surprise to you, but perhaps you feel held back from jumping on the artificial intelligence bandwagon by constraints such as domain expertise, resources, and costs.

(Image from AI Index 2017 Annual Report)


According to Stanford University’s 2017 AI Index Report, between the years 2000 and 2017 the number of AI papers increased 8-fold, annual VC investments into AI startups increased 6-fold, and the number of startups increased 14-fold. In 2017 alone, a record $6.5B of capital was deployed across 650+ deals, surpassing all 2016 numbers.

Today AI is increasingly being viewed as the 4th industrial revolution; it is the new electricity.
(Image from AI Index 2017 Annual Report)


As with any new technology, especially at this scale, we see more questions than answers. The most common question is “How can I move my company into the AI era?” One of the reasons for the abundance of questions is the lack of AI expertise in the market. The demand for AI experts is high, but there are not enough of them out there. The AI Index report sheds light on this expertise shortage and shows that Stanford had only 1,500 AI/ML course enrollments in the last year. Comparing this number to the 3.8 million software engineers in the US, it seems it will take quite a bit of time for AI skills to become commonly available. Also, you will be competing for such talent with large tech companies like Apple, Google, Amazon, Facebook, Intel, and Microsoft. This actually brings us to another shortage: AI product and service companies. AI is trending, and this has affected the start-up ecosystem as well. According to CB Insights, in 2017 alone over 55 private AI companies were acquired, mainly by the most active large tech companies: Google, Apple, Facebook, Intel, Microsoft, and Amazon. Until the large companies are saturated with enough AI startups, we will have a shortage.

So, what should you do?

Luckily, even if you don’t have an AI lab with hundreds of experts with PhDs, most AI components have now become mainstream cloud technologies, and you can work with an AI integrator to speed up your integration. Today, e-commerce solutions like the ones below are readily available to you at affordable prices:

  • Personalized search
  • Optimized rankings
  • Email recommendations
  • Behavioral triggers
  • Smart content personalization
  • Content sequencing
  • Product recommendations
  • Personalized engagement
  • Persistent visitor intelligence
  • Actionable insights
  • Conversion rate optimization
  • Omnichannel marketing
  • Price Optimization
  • Product insight

Ironically, you can compete with Amazon by using another Amazon product, namely Amazon Web Services (AWS). Besides its broad solution space, AWS has a set of AI solutions that can be combined and integrated easily in a matter of weeks. Going back to our example, AWS even has a commercial “search autocomplete suggestions” solution you can integrate into your system.

As a final note, I want to share with you a list of almost all Amazon AI services ready for integration as of today. Of course, just as there are other cloud AI service providers, there is also a great deal more AI technology on the horizon.

Amazon Rekognition Solutions

  • Image Object detection
  • Image Face detection, search, tracking, compare
  • Image Celebrity detection
  • Image Unsafe content detection
  • Video Object detection
  • Video Face detection, search, tracking
  • Video Celebrity detection
  • Video Unsafe content detection
  • Video Person detection

Amazon Comprehend Solutions

  • Key phrases extraction
  • Sentiment Analysis
  • Entity Recognition: places, people, brands, events
  • Language Detection
  • Topic Modeling

Amazon Lex (Chatbot) Solutions

  • Automatic speech recognition (ASR) (speech to text)
  • Natural language understanding (NLU)

Amazon Polly Solutions (~25 Languages)

  • Text-to-speech
  • Text-to-speech marks

Other Amazon AI Solutions

In my next article, I’ll cover more of the solution space and examples. If you want to learn more about how your business can benefit from AI, please visit “Move to AI” or stop by one of our Silicon Valley Executive meetups.

C# language, C++, CodeProject, GPGPU, Machine Learning, Technical

Large Scale Machine Learning using NVIDIA CUDA


You may have heard about Stanford University’s machine learning online course given by Prof. Andrew Ng in 2011; it was a great course with lots of real-world examples. During the course, I realized that GPUs are the perfect solution for large-scale machine learning problems. In fact, there are many examples of supervised and unsupervised learning all around the internet. Being a fan of both GPGPU and machine learning technologies, I came up with my own perspective on running machine learning algorithms with huge amounts of data on GPUs.

I recently presented this solution at the South Florida Code Camp 2012. Everybody was very interested in these two subjects; therefore, I’ve decided to share it on my blog. The example in this post is neither the only solution nor the best solution. I hope it will help you one day solve your own machine learning problem.

There are a lot of concepts in machine learning, but in this post I’m only scratching the surface. If you already know about GPGPU and machine learning, you can go straight to the source code at this link, download the Visual Studio 2010 projects, and try it out.

I’ve also prepared the same example using CUBLAS with a vectorized implementation of the polynomial regression algorithm, but the CUBLAS example would require more in-depth explanations. Therefore, I’m posting this simplified implementation first. If you are interested in the CUBLAS implementation, please let me know and I can send you that copy.




Machine Learning

If you are already familiar with machine learning you can skip the brief introduction and jump directly to the Large Scale Machine Learning section. Or if you want to learn more about machine learning please follow the links or check out the Stanford course I’ve mentioned at the beginning.

Machine learning algorithms allow computers to recognize complex patterns. They focus on prediction, based on known properties learned from the training data. We use machine learning algorithms dozens of times every day, perhaps unknowingly: every time we get a book or movie recommendation, or every time we do a web search. In 1959, Arthur Samuel described machine learning as the "field of study that gives computers the ability to learn without being explicitly programmed". Machine learning has been around for a while, and it is gaining popularity again with the rise of Big Data.

Figure 1 shows how a typical machine learning process works. In phase 1, given a data set, a machine learning algorithm can recognize complex patterns and come up with a model. In most cases, this phase is the bulk of the computation. In the second phase, any given data can be run through the model to make a prediction. For example, if you have a data set of house prices by size, you could let the machine learn from the data set and have it predict the house price for any given size.

Figure 1

It does this by recognizing the function that defines the relation between the different features of the problem. A linear problem with two dimensions, like house price (the house size is the feature and the house price is the label data), can be expressed with the model f(x) = ax + b. Figure 2 shows how one feature can be used in a linear regression problem to predict new house prices. The term "hypothesis" was used in the Stanford course to describe the model.

Figure 2

Depending on the data set, more complex functions can be used. In Figure 3 you can see how the complexity can easily grow from two-dimensional linear to hundreds of dimensions polynomial. In a spam filtering problem the different features could be the words in the email; in a face recognition problem the features could be the pixels of the image. In the house price prediction example, the features are the properties of the house that affect the price, e.g. size, room count, floors, neighborhood, crime rate etc.

Figure 3

There are many machine learning algorithms for different problem types. The most common groups of these algorithms are Supervised Learning and Unsupervised Learning. Supervised learning is used on problems where we can provide the output values for the learning algorithm. For example, house prices for some house features are the output values, therefore house price prediction is a supervised learning problem. Data with these output values is called "labeled data". On the other hand, unsupervised learning does not require output values; patterns or hidden structures can be recognized with feature data alone. For example, clustering social data to determine groups of people by interest does not require defining any output value, therefore it is an unsupervised learning problem.

Gradient Descent

In supervised learning problems, the machine can learn the model and come up with a hypothesis by evaluating the hypothesis with different constants and testing how close the results are to the provided labels (calculating the error). Figure 4 shows how the training data is plotted and the error calculated. An optimization algorithm named Gradient Descent (Figure 5) can be used to find the optimum hypothesis. In this simple two-dimensional problem, the algorithm would iterate over different values of "a" and "b", trying to find the minimum total error.

Figure 4

The pseudocode below shows how the gradient descent algorithm in Figure 5 works:

repeat until convergence
    errorsA = 0, errorsB = 0
    for i = 1 to data.length
        fx = a * data[i] + b
        errorsA += (fx - labelData[i]) * data[i]
        errorsB += (fx - labelData[i])
    end for
    a = a - learningRate * (1 / data.length) * errorsA
    b = b - learningRate * (1 / data.length) * errorsB
end repeat

Figure 5

Large Scale Machine Learning

Machine learning problems become computationally expensive when the complexity (dimensions and polynomial degree) increases and/or when the amount of data increases. Especially on big data sources with hundreds of millions of samples, the time to run optimization algorithms increases dramatically. That's why we look for parallelization opportunities in the algorithms. The error summation of the gradient descent algorithm is a perfect candidate for parallelization. We could split the data into multiple parts and run the summation on these parts in parallel. In Figure 6 you can see how the data is split into four parts and fed into four different processors. In the next step the results are gathered together to run the rest of the algorithm.

Figure 6

Clearly, this approach can speed up the machine learning computation by almost four times. But what if we had more cores and could split the data further? That is where GPUs step in. With GPUs we can parallelize in two layers: multiple GPUs, and multiple cores in every GPU. Assuming a configuration with 4 GPUs of 512 cores each, we could split each of the four parts into 512 more pieces. Figure 7 shows this configuration along with the part parallelized on the GPU cores.

Figure 7



Utilizing GPUs to enable dramatic increases in the computing performance of general-purpose scientific and engineering computing is called GPGPU. NVIDIA provides a parallel computing platform and programming model named CUDA for developing GPGPU software in C, C++ or Fortran that can run on any NVIDIA GPU. NVIDIA CUDA comes with many high-level APIs and libraries, like basic linear algebra, FFT, imaging etc., to let you concentrate on the business logic rather than re-writing well-known algorithms.

You can visit my previous blog posts where I’ve explained how to use NVIDIA CUDA capable GPUs to perform massively parallel computations. The examples include Monte Carlo simulation, random number generators and sorting algorithms.


House Price Prediction Example

In this post I'll show you how to implement house price prediction on NVIDIA CUDA. Given a house price data set based on bedrooms, square feet and year built, it is possible to let the machine learn from this data set and provide us with a model for future predictions. Because the error calculation part of the Gradient Descent algorithm is highly parallelizable, we can offload it to the GPUs.

The machine learning algorithm in this example is Polynomial Regression, a form of the well-known Linear Regression algorithm. In Polynomial Regression the model is fit to a higher-order polynomial function. In our case we will be using bedrooms, square feet, year built, the square roots of bedrooms, square feet and year built, and the product of bedrooms and square feet. The reason we add the four polynomial terms to the function is the nature of our data; fitting the curve correctly is the main idea behind building a model for our machine learning problem. Logically, house prices increase with these features neither linearly nor exponentially, and they don't drop after a certain peak. Therefore the graph is more like a square root function, where house prices increase less and less as any feature increases.

Finding the right polynomial terms is very important for the success of the machine learning algorithm: a very complex, tightly fitting function generates too specific a model and ends up overfitting, while a very simple function, like a straight line, generates too general a model and ends up underfitting. Therefore we use additional methods, like adding a regularization term, to provide a better fit to the data. Figure 8 shows the gradient descent algorithm including the regularization term lambda.

Figure 8

Application Architecture

The sample application consists of a C++ native DLL named LR_GPULib for the machine learning implementation on the GPU, and a C# Windows application named TestLRApp for the user interface. The DLL implements Data Normalization and Polynomial Regression using Thrust, the high-level parallel algorithms library on NVIDIA CUDA. I've covered Thrust in more detail in my previous blog posts, therefore I'm not going into much detail here. Figure 9 shows the application architecture and the program flow, from loading the training data all the way down to making a prediction.

Figure 9

The application provides the UI shown in Figure 10 below to load the data, train, and make a prediction with a new data set. The UI also shows the hypothesis at the bottom of the dialog with all constants and features.

Figure 10


The file in the DLL contains the functors used as kernels in Thrust methods. The file in the DLL contains the normalization, learning and prediction methods. The Learn method accepts the training data and the label data, which are all the features and all the prices in two float arrays. The first thing the Learn method does is allocate memory, add the bias term and normalize the features. We add the bias term to simplify the gradient loop, and we normalize the features because the data ranges are very different: e.g. square feet is four digits while bedrooms is a single digit. Normalizing the features brings them onto a comparable scale. Normalization is also executed on the GPU, using NormalizeFeatures. But normalization requires the mean and standard deviation (std), therefore the mean and std are calculated first and provided to the NormalizeFeaturesByMeanAndStd method to calculate the mean normalization.

void NormalizeFeaturesByMeanAndStd(unsigned int trainingDataCount, float * d_trainingData,
	thrust::device_vector<float> dv_mean, thrust::device_vector<float> dv_std)
{
	//Calculate mean norm: (x - mean) / std
	unsigned int featureCount = dv_mean.size();
	float * dvp_Mean = thrust::raw_pointer_cast( &dv_mean[0] );
	float * dvp_Std = thrust::raw_pointer_cast( &dv_std[0] );
	FeatureNormalizationgFunctor featureNormalizationgFunctor(dvp_Mean, dvp_Std, featureCount);
	thrust::device_ptr<float> dvp_trainingData(d_trainingData);
	thrust::transform(thrust::counting_iterator<int>(0),
		thrust::counting_iterator<int>(trainingDataCount * featureCount),
		dvp_trainingData, dvp_trainingData, featureNormalizationgFunctor);
}

The normalization code running on the GPU is implemented in the FeatureNormalizationgFunctor functor, which simply calculates (data - mean) / std in parallel for every element of the data, as seen below:

  __host__ __device__
  float operator()(int tid, float trainingData)
  {
	  int columnIdx = tid % featureCount;
	  float fnorm = trainingData - meanValue[columnIdx];
	  if (stdValue[columnIdx] > 0.0)
		fnorm /= stdValue[columnIdx];
	  return fnorm;
  }

In the next step in the Learn method, the gradient descent is calculated with the for(int i = 0; i < gdIterationCount; i++) loop. As I mentioned before, the error calculation part of the gradient descent is executed in parallel, but the rest is calculated sequentially. thrust::transform is used with the TrainFunctor to calculate f(x) - y in parallel for every sample. f(x) is simply the hypothesis A*x1 + B*x2 + C*x3 + D*x4 + E*x5 + F*x6 + G*x7 + H, where x1 through x7 are the features (x1 = bedrooms, x2 = square feet, x3 = year built, x4 = square root of bedrooms, x5 = square root of square feet, x6 = square root of year built and x7 = the product of bedrooms and square feet) and A through H are the constants which gradient descent will find. This is shown with the green square in Figure 11. The TrainFunctor code snippet and the usage code snippet are shown below:

Figure 11

  __host__ __device__
  float operator()(int tid, float labelData)
  {
		float h = 0;
		for (int f = 0; f < featureCount; f++)
			h += hypothesis[f] * trainingData[tid * featureCount + f];
		return h - labelData;
  }

thrust::transform(thrust::counting_iterator<int>(0),
	thrust::counting_iterator<int>(trainingDataCount),
	dv_labelData.begin(), dv_costData.begin(), tf);

thrust::transform_reduce is used with the TrainFunctor2 to apply the features to the error result and sum them all up. This is shown with the code snippet below and the red square in Figure 11. The rest of the Learn method calculates the gradient descent part marked with the blue square in Figure 11.

float totalCost = thrust::transform_reduce(thrust::counting_iterator<int>(0), 
	thrust::counting_iterator<int>(trainingDataCount),  tf2, 0.0f, thrust::plus<float>());

Once gradient descent converges, the constants A through H of the hypothesis are returned back to TestLRApp in the result array.

As you may guess, prediction works by applying the constants to new sample data in the hypothesis. This is done using the Predict method in the LR_GPULib library. As seen below, the Predict method normalizes the given feature set and calculates the hypothesis using the constants and the normalized data with the help of the PredictFunctor. The result is the predicted house price for the given features.

	NormalizeFeaturesByMeanAndStd(testDataCount, pdv_testData, dv_mean, dv_std);

	PredictFunctor predictFunctor(pdv_testData, pdv_hypothesis, featureCount);
	thrust::transform(thrust::counting_iterator<int>(0),
		thrust::counting_iterator<int>(testDataCount),
		dv_result.begin(), predictFunctor);

struct PredictFunctor : public thrust::unary_function<int, float>
{
	float * testData;
	float * hypothesis;
	unsigned int featureCount;

	PredictFunctor(float * _testData, float * _hypothesis, unsigned int _featureCount)
		: testData(_testData), hypothesis(_hypothesis), featureCount(_featureCount)
	{}

	__host__ __device__
	float operator()(int tid)
	{
		float sum = 0;
		for(unsigned int i = 0; i < featureCount; i++)
			sum += testData[tid * featureCount + i] * hypothesis[i];
		return sum;
	}
};


GPGPU, Machine Learning and Big Data are three rising fields in the IT industry, and there is much more to them than what I'm covering in this post. The deeper I get into these fields, the more I see how well they fit together. I hope this sample gave you a basic idea and at least one perspective on how you can easily use NVIDIA CUDA on machine learning problems. As with any other software solution, this example is not the only way to do polynomial regression for house price prediction on GPUs. In fact, one enhancement would be supporting multiple GPUs and splitting the data set into more parts.

AWS, C++, GPGPU, Technical

How to set up Amazon EC2 Windows GPU instance for NVIDIA CUDA development


The Amazon Elastic Compute Cloud web service provides a very useful platform in the cloud, especially for software developers who don't have access to expensive hardware. Some time ago, as I was looking for a better CUDA-enabled GPU solution than my MacBook Pro, I realized it was time to switch from a laptop to a desktop. But luckily, Amazon introduced the GPU instances a couple of months ago, running on the Windows Server 2008 OS. I've been using the scalable and cost-efficient Amazon EC2 for a couple of years without any problems, and now that they provide a platform with two Tesla M2050s to test my CUDA apps, I just want to say: Thank You Amazon.

In this post I want to share my experience of setting up a full NVIDIA CUDA development environment on a Windows EC2 GPU instance. I'll also walk you through a couple of CUDA examples.

If you were following my previous blog posts and were not able to try them out because you didn't have CUDA-capable hardware, you will have a chance to do so after reading this blog.

One of the reasons I'm providing this blog post is to use this information in our HPC & GPU Supercomputing group of South Florida hands-on lab meetups. If you are from the group, you've most probably already received the AMI, so you can skip the setup part.


About Amazon EC2 GPU Instances

Amazon Elastic Compute Cloud (Amazon EC2) is a web service that provides resizable compute capacity in the cloud. It is designed to make web-scale computing easier for developers.

The GPU instances provide general-purpose graphics processing units (GPUs) with proportionally high CPU and increased network performance for applications benefiting from highly parallelized processing, including HPC, rendering and media processing applications. The Windows GPU instance is named Cluster GPU Quadruple Extra Large instance and has:

  • 22 GB memory
  • 33.5 EC2 Compute Units
  • 2 x NVIDIA Tesla “Fermi” M2050 GPUs
  • 1690 GB of local instance storage
  • 64-bit platform
  • 10 Gigabit Ethernet


Utilizing GPUs to do general-purpose scientific and engineering computing is called GPGPU. You can visit my previous blog posts, where I've explained how to use NVIDIA CUDA-capable GPUs to perform massively parallel computations.


Browse to and click the link on the top of the page saying “Sign in to the AWS Management Console”.

Please be aware that Amazon will charge you for the usage of their services. Therefore, check for running objects before leaving the AWS Management Console. Please check the Amazon pricing web page for more information.

The next couple of paragraphs explain how to create your AWS account and set up your environment. You can skip this section if you already have an account and are familiar with Amazon EC2.

Registering for Amazon AWS

If you already have an Amazon account you can use it to log in; otherwise you can create a new account from the same screen. Once you are logged into the AWS console, it may ask you to sign up for an Amazon S3 account. In that case just follow the links to finish the sign-up. Once it is done, you should receive a confirmation email. Now, log in to your account to finish the registration and go through phone verification.

Setting up your AWS environment

Log in to your Amazon AWS account and the AWS Management Console will show up. Select the EC2 tab at the top to see your EC2 dashboard. We will create a security group and a key pair for later use.

First, click the Key Pairs link on the right, then click the Create Key Pair button. Enter a name for your private key file, like My_KeyPair, and then save the .pem file somewhere to use later. You will also see the new key pair on the screen.

Go back to the EC2 dashboard and click the Security Groups link on the right. This will open the security group console. Click the Create Security Group button and create a group named GPGPU_SecurityGroup. Select the Inbound tab for the new group and the rule editor will open. Add an RDP rule by selecting RDP from the rules drop-down and clicking the Add Rule button. Now click the Apply Rule Changes button to save the changes.

Creating the GPU EC2 Instance

  1. Go to the EC2 dashboard and click the Launch Instance button.
  2. Select the Launch Classic Wizard and click Continue.
  3. Find the Microsoft Windows 2008 R2 64-bit for Cluster Instances (AMI Id: ami-c7d81aae) in the list and click the Select button right next to it.
  4. Select Cluster GPU (cg1.4xlarge, 22GB) from the Instance Type drop-down and click Continue. If you have other instances and are planning to transfer data between them, I suggest selecting the same region for all of them to prevent in-cloud data transfer charges.
  5. Select Continue on the Advanced Instance Options page.
  6. Give a name to your instance. e.g. GPGPU.
  7. Select the Key Pair you have created and click the continue button.
  8. Select the Security Group you have created and click the continue button.
  9. Click the launch button to finish the wizard.

Running the GPU EC2 Instance

  1. Click the Instances link in the left-hand Navigation menu to see the instance you've just created. The instance will stay in the pending state for a while until it boots up completely.
  2. Right-click the newly created instance and select Get Windows Password. You may have to come back after a couple of minutes if the password generation is pending.
  3. Paste the content of the .pem file you saved while creating the key pair into the Private Key field of the password retrieval dialog and click the Decrypt Password button.
  4. Copy the decrypted password to use later when logging into the instance.

Connecting to the Instance using RDP

In order to connect to the newly created instance :

  1. Right click on it and select Connect.
  2. Click “Download shortcut file” link and save the RDP shortcut to your local machine.
  3. Open the saved RDP shortcut and log on to the instance by entering the retrieved password.
  4. Change your randomly generated password from the Control Panel / User Accounts section.

Installing GPGPU Developer Tools

Go to the CUDA Downloads website to see the available downloads. At this time we will download the 4.1 RC2 version from the CUDA Toolkit 4.1 web site.
Download and install the following items in this order:

  1. Visual Studio C++ 2010 Express.
  2. CUDA Toolkit.
  3. GPU Computing SDK.
  4. Developer Drivers for WinVista and Win7 (285.86). The default drivers coming
  5. (Optional) Parallel Nsight 2.1RC2. In order to download this you have to sign up for the Parallel Nsight Registered Developer Program.

Backup the GPU EC2 instance

You will get charged for any instance which is not terminated, even those in the stopped state. Therefore, it is good practice to back up to S3 and terminate your instance once you are done testing, to prevent any charges during downtime. You can do this in two ways: detach the EBS volume (storage) and terminate the instance, or take a snapshot and delete the instance and volume. As of today the EBS volume costs $0.10 per GB-month and the snapshot costs $0.14 per GB-month. You can visit the Amazon EC2 pricing web site for more up-to-date pricing.

Please follow the steps below for a snapshot backup:

  1. Click the volumes link on the navigation bar on the left hand side. You will see the volume ( storage ) attached to your EC2 instance.
  2. Right click on the volume and select Create Snapshot.
  3. Provide a name for the new snapshot and click the Yes, Create button.
  4. Go to the Snapshots section from the navigation menu and click refresh. You should see the new snapshot in pending mode. It will take a while to create the snapshot.

Running CUDA Samples

Now you are ready to compile and run a CUDA sample from the GPU Computing SDK. Please follow these steps :

  1. Login to the instance using the RDP shortcut.
  2. The samples require cutil32d.lib in order to function, therefore you need to compile the cutil project first. For that browse to the C:\ProgramData\NVIDIA Corporation\NVIDIA GPU Computing SDK 4.1\C\common folder and open the cutil_vs2010.sln visual studio solution file. Compile the solution.
  3. It is convenient to have syntax highlighting on .cu files. Therefore go to C:\ProgramData\NVIDIA Corporation\NVIDIA GPU Computing SDK 4.1\C\doc\syntax_highlighting\visual_studio_8 folder and follow the instructions in the readme.txt file.
  4. Our first example is the deviceQuery, which shows the properties of your GPU. Browse to the C:\ProgramData\NVIDIA Corporation\NVIDIA GPU Computing SDK 4.1\C\src\deviceQuery folder and open the deviceQuery_vs2010.sln. Compile the solution.
  5. The output executable will be placed in the C:\ProgramData\NVIDIA Corporation\NVIDIA GPU Computing SDK 4.1\C\bin\win32\Debug folder. Open an administrative command prompt and run deviceQuery.exe.
  6. You should see two Tesla M2050 devices, each with compute capability 2.0, 448 CUDA cores, 3GB memory, 515 GFlops, and 148 GB/sec memory bandwidth. This feels like 400hp under the hood!

Let’s run one more sample to see the performance difference of our GPUs. The sample we are going to run is matrixMul, located under the same C:\ProgramData\NVIDIA Corporation\NVIDIA GPU Computing SDK 4.1\C\src root folder. On Tesla M2050 this sample will multiply a 640 x 640 matrix with a 640 x 960 matrix to generate a 640 x 960 matrix.

Open the solution, go to the project properties and add the C:\ProgramData\NVIDIA Corporation\NVIDIA GPU Computing SDK 4.1\shared\inc path to the Include Directories under the VC++ Directories configuration properties. (I've noticed that otherwise the path cannot be found.)

Compile and run the project in a command window. You should see about 0.001 sec for the CUBLAS kernel execution and 0.021 sec for the CUDA execution. CUBLAS is CUDA's Basic Linear Algebra library with optimized algorithms.

Let's compare the GPU with the Intel Xeon 2.93GHz CPU of the current instance. In order to do this we need to modify the code a little:

  1. Open the file.
  2. Add the #include <time.h> at line 41, under the kernel include.
  3. Find the line with computeGold(reference, h_A, h_B, uiHA, uiWA, uiWB); ( around line 417) and replace it with the following code.

    clock_t startTime, endTime;
    startTime = clock() * CLK_TCK;   // clock ticks to microseconds (CLK_TCK is CLOCKS_PER_SEC)
    computeGold(reference, h_A, h_B, uiHA, uiWA, uiWB);
    endTime = clock() * CLK_TCK;
    shrLogEx(LOGBOTH | MASTER, 0, "> Host matrixMul Time = %.5f s\n",
    				(double)(endTime - startTime) / 1000000.0 );
  4. Compile the code and execute it. You should see something around 3.463 sec, which means the CUBLAS GPU version is about 3500x faster than this single-core CPU version. A fairer comparison with all cores utilized can be found on the CUBLAS web site, which reports about 6-17x.


GPGPU has been rising for the last couple of years, and now that Amazon provides a Windows GPU instance, it is much easier for a Windows developer to jump onto the massively parallel software track.

C++, CodeProject, GPGPU, Technical

Massively Parallel Monte Carlo Simulation using GPU


In my previous blog posts I explained how you can utilize the GPU in your computer to perform massively parallel computation with the help of NVIDIA CUDA and Thrust. In this blog post I'm diving deeper into Thrust usage scenarios with a simple implementation of Monte Carlo simulation.

My inspiration was the PI estimation sample on the Thrust web site. That sample runs a Monte Carlo simulation with 10K samples on a unit circle to estimate the number PI. You can visit this Wikipedia page if you are interested in how Monte Carlo simulation can be used to approximate PI; it is related to the famous Buffon's Needle problem.

I'm taking the original example one step further to show you how to pass device variables to functors in Thrust methods, using a slightly different problem. There are surely other ways to implement the same logic, but in this blog post I'm concentrating on this specific implementation.


About Monte Carlo Simulation

Monte Carlo simulation is an approach that solves deterministic problems with a probabilistic analog, which is exactly what we are doing in our example: estimating the area of intersecting disks. Monte Carlo methods are especially useful for simulating systems with many coupled degrees of freedom, such as fluids, disordered materials, strongly coupled solids, and cellular structures.

Our simulation predicts the intersection area of four overlapping unit disks, as seen in the image below (the intersection of disks A, B, C and D). The problem can actually also be solved easily with the help of geometry, as explained here. I've calculated the area as 0.31515; the simulation estimated 0.3149.

About Thrust

Writing code against the CUDA API is very powerful in terms of controlling the hardware, but there are high-level libraries like the Thrust C++ template library which provide many fundamental algorithms like sorting, prefix-sums, reductions, transformations etc. The best part is that Thrust consists only of header files and is distributed with the CUDA 4.0 installation.

If you are not familiar with terms like GPGPU and Thrust, I suggest you check out the background information in my previous posts.


The example is a console application written in C++, but you can easily transform it into a DLL to use from your C# application (see my previous posts).

I used Visual Studio 2010 to create the C++ console application. If you have not already, you need to install the NVIDIA CUDA Toolkit 4.0 and a supported graphics device driver from the same link. The new CUDA Toolkit 4.1 RC1 is also available at the CUDA zone, but the project files are built on 4.0. Also, do not forget to install the Build Customization BUG FIX Update for CUDA Toolkit 4.0 from the same link.

Once the CUDA Toolkit is installed, creating CUDA enabled projects is really simple. For those who are not familiar using native C++ CUDA enabled projects, please follow the steps below to create one:

  • Create a Visual C++ console project in Visual Studio 2010 by selecting Empty project on the wizard,
  • Open Build Customization from the C++ project context menu, and check the CUDA 4.0(.targets, .props) checkbox,
  • Open the project properties, expand the Configuration Properties, Linker and select Input. Edit the additional dependencies and add cudart.lib.
  • Add a new empty source file ending with .cu.

You can also skip the steps above and download the example solution and project files directly from here.


The main application calls thrust::transform_reduce, which runs the intersection estimation simulation 50 times with independent seeds. transform_reduce performs a reduction on the transformation of the sequence [first, last) according to unary_op: the unary_op is applied to each element of the sequence and the result is then reduced to a single value with binary_op.

The main code is as follows:

int main(void)
{
  // use 50 independent seeds
  int M = 50;

  //Create some circles on the host
  thrust::host_vector<CIRCLE> dCircles;
  dCircles.push_back(CIRCLE(0.0f, 0.0f));
  dCircles.push_back(CIRCLE(1.0f, 0.0f));
  dCircles.push_back(CIRCLE(1.0f, 1.0f));
  dCircles.push_back(CIRCLE(0.0f, 1.0f));

  //The kernel can not access a host or device vector directly,
  //therefore get the device pointer to the circles to pass to the kernel
  thrust::device_vector<CIRCLE> circles = dCircles;
  CIRCLE * circleArray = thrust::raw_pointer_cast( &circles[0] );

  float estimate = thrust::transform_reduce(
           thrust::counting_iterator<int>(0),
           thrust::counting_iterator<int>(M),
           estimate_intersection(circleArray, circles.size()),
           0.0f,
           thrust::plus<float>());
  estimate /= M;

  std::cout << std::setprecision(6);
  //calculate area with geometry : (pi + 3 - 3*sqrt(3)) / 3 = 0.31515
  std::cout << "the area is estimated as " << estimate
            << ". It should be 0.31515." << std::endl;
  return 0;
}

The unary_op holds the Monte Carlo simulation logic, implemented in the estimate_intersection functor. estimate_intersection derives from the thrust::unary_function class and returns the estimated intersection area as a float. Using estimate_intersection in transform_reduce means estimating the intersection area for every data element provided to transform_reduce. For the data elements we use two thrust::counting_iterators. This creates a range filled with a sequence of 50 numbers, without explicitly storing anything in memory. Using a sequence of numbers lets us assign a different thread id to every estimate_intersection call. This is important for generating a distinct seed for the random number generator of each simulation. (I've mentioned random number generator seeds in my previous posts.)

For the reduction part of the transform_reduce we use the thrust::plus<float>() binary functor, which sums all results into one number. At last we divide the result by 50 to find the average intersection area value.

Our goal with this code is to run the simulation on the device (GPU) and retrieve the result back to the host. Therefore any data we use in the simulation must be placed into device memory. That is exactly what happens before we call thrust::transform_reduce: we prepare the properties of all circles we will try to intersect, using the CIRCLE object defined below.

struct CIRCLE{
   float x,y;
   CIRCLE(float _x, float _y) : x(_x), y(_y){}
} ;

With thrust::host_vector<CIRCLE> dCircles; in the main code, we define a vector object in host memory. Using a Thrust host vector object over custom memory management simplifies transferring data directly to the device with the thrust::device_vector<CIRCLE> circles = dCircles; call. As you may know, transferring data between device and host memory in CUDA C is handled with cudaMemcpy, but Thrust overloads the assignment operator, which allows you to copy memory easily.

On the next line we access the raw pointer of the circles object with the thrust::raw_pointer_cast method. We do this because the estimate_intersection functor can only accept a raw device pointer to the CIRCLE array.

Simulation Method

The estimate_intersection unary function implements the simulation logic. A unary function is a function that takes one argument, has a () operator overload and returns one value. In our case the function takes the unique index number generated by the thrust::counting_iterator and returns the area of the intersection as a float. Another important part of the struct is the constructor (seen below), which takes the device pointer to the CIRCLE array and the length of the allocated memory.

struct estimate_intersection : public thrust::unary_function<unsigned int, float>
{
  CIRCLE * Circles;
  int CircleCount;

  estimate_intersection(CIRCLE * circles, int circleCount) :
    Circles(circles), CircleCount(circleCount) {}

  __host__ __device__
  float operator()(unsigned int thread_id)
  {
    float sum = 0;
    unsigned int N = 30000; // samples per thread
    unsigned int seed = hash(thread_id);
    // seed a random number generator
    thrust::default_random_engine rng(seed);
    // create a mapping from random numbers to [0,1)
    thrust::uniform_real_distribution<float> u01(0, 1);

    // take N samples
    for (unsigned int i = 0; i < N; ++i)
    {
      // draw a sample from the unit square
      float x = u01(rng);
      float y = u01(rng);
      bool inside = true;

      // the point must fall inside all circles
      for (int k = 0; k < CircleCount; ++k)
      {
        // the point is outside if it is further from
        // the center of the circle than the radius (1)
        float dx = Circles[k].x - x;
        float dy = Circles[k].y - y;
        if ((dx * dx + dy * dy) > 1.0f)
        {
          inside = false;
          break;
        }
      }
      if (inside)
        sum += 1.0f;
    }
    // divide by N
    return sum / N;
  }
};

In order to run the code on the device and call it from the host, the () operator overload has to be defined as __host__ __device__. The rest of the code is the Monte Carlo simulation logic as follows:

1) Initialize the Thrust default random number generator

2) Generate 30K random x and y values

3) Loop through all circles and check whether the point is inside each circle by comparing its squared distance to the center against the squared radius

4) If the point falls inside all circles, increase the hit count

5) Return the hit count divided by the number of samples, which estimates the intersection area
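The five steps above can be sketched as plain C++ on the CPU; std::mt19937 stands in for the Thrust engine, and the circle radius is 1, as in the kernel:

```cpp
#include <cmath>
#include <random>
#include <vector>

struct Circle { float x, y; };

// CPU sketch of the five steps: sample the unit square and count the
// points that fall inside every circle (each circle has radius 1).
double estimate_intersection_cpu(const std::vector<Circle>& circles,
                                 unsigned int seed, unsigned int samples) {
    std::mt19937 rng(seed);                       // step 1: seed the RNG
    std::uniform_real_distribution<double> u01(0.0, 1.0);
    unsigned int hits = 0;
    for (unsigned int i = 0; i < samples; ++i) {  // step 2: random x, y
        double x = u01(rng), y = u01(rng);
        bool inside = true;                       // steps 3-4: inside all circles?
        for (const Circle& c : circles) {
            double dx = c.x - x, dy = c.y - y;
            if (dx * dx + dy * dy > 1.0) { inside = false; break; }
        }
        if (inside) ++hits;
    }
    return double(hits) / samples;                // step 5: fraction inside
}
```

A quick sanity check: with a single circle centered at the origin, the fraction of unit-square samples landing inside approximates pi/4 (about 0.785).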

That’s it! I hope you enjoy it.

In addition to the code I included here, there are header includes and a hashing algorithm. You can download the code from here.

About the Implementations

The Monte Carlo simulation I provided in this post is an example, and therefore I'm not guaranteeing that it will perform well enough in your particular solution. Also, for clarity there is almost no exception handling or logging implemented. This is not an API; my goal is to give you a high-level idea of how you can utilize the GPU for simulations. Therefore, it is important that you re-factor the code for your own use.

Some of the code is taken from the original NVIDIA sample and is under the Apache License v2; the rest is my code, which is free to use without any restriction or obligation.


Thrust is a powerful library that provides simple ways to accomplish complicated parallel computation tasks. There are many libraries like Thrust built on CUDA C. These libraries will save you many engineering hours of parallel algorithm implementation and let you concentrate on your real business problem. You can check out the GPU Computing Webinars for presentations in this area.

C# language, C++, CodeProject, GPGPU, Technical

Massively Parallel RNG using CUDA C, Thrust and C#


In this post I'll give you some simple examples of how to use the massively parallel GPU in your computer to generate uniformly distributed pseudo-random numbers. Why the GPU? Because it is orders of magnitude faster than the CPU for this kind of work, it does not occupy your CPU time, and it is already in all computers, among many other reasons I mentioned in my previous post. While there are perhaps hundreds of ways to generate pseudo-random numbers, I cover only four ways to do it on NVIDIA cards using CUDA-related APIs:

1) A Basic Linear Congruential Generator (LCG) implementation using CUDA C
2) A Thrust C++ template library implementation
3) An NVIDIA CURAND implementation
4) A Mersenne Twister implementation using CUDA C

In order to demonstrate how to utilize the GPU, the implementations are provided as DLLs and used within a C# sample application. There are many other APIs and ways worth talking about to utilize your GPU from a C# application, but this post's scope is limited to the subjects mentioned above. I suggest you visit, if you have not already, to see the endless possibilities in this area.

While I was preparing these samples I realized that visualizing the data is very important for understanding the algorithms. Therefore, I used Microsoft WPF with C# to visualize the generated random numbers. You can use your own application and copy the classes under the RNGLibs folder.

All code can be downloaded from this link:


About Random Number Generators (RNG)

The generation of random numbers is important in many applications like simulations, cryptography, sampling and, above all, statistics. A sequence of numbers is random when it has no recognizable pattern in it, or in other words, when it is non-deterministic. Although non-deterministic random numbers are ideal, computer-generated, deterministic random numbers can be statistically "random enough". These numbers are called pseudo-random numbers and can have easily identifiable patterns if the algorithm is not chosen wisely.

There are many pseudo-random number generators, and also many different implementations of them in sequential and parallel environments. In this post I used only Linear Congruential Generators, Xorshift and the Mersenne Twister; therefore, I explain only these three algorithms, but you can use CUDA to implement other RNGs as well.


As I mentioned in my previous post, writing code against the CUDA API is very powerful in terms of controlling the hardware, but there are high-level libraries like the Thrust C++ template library that provide many fundamental parallel primitives like sorting, prefix sums, reductions, transformations etc. The best part is that Thrust consists only of header files and is distributed with the CUDA 4.0 installation.


I’ve used Visual Studio 2010 to host one C# Windows Application and native C++ dlls for RNG implementations as seen in the solution structure below:

  • RNGVisual (Main C# application)
  • CUDACRNGLib (CUDA C RNG implementation)
  • CURANDRNGLib (CURAND RNG implementation)
  • ThrustRNGLib (Thrust RNG Implementation)
  • MersenneTwisterRNGLib (Mersenne Twister RNG implementation)

The only additional API is the NVIDIA CUDA Toolkit 4.0, which you will need to install along with a supported graphics device driver from the same link. Also do not forget to install the Build Customization BUG FIX Update from the same link or from here.

Once the CUDA Toolkit is installed, creating CUDA-enabled projects is really simple. For those who are not familiar with creating native C++ CUDA-enabled projects, follow the steps below to create one:

  • Create a Visual C++ console project in Visual Studio 2010 by selecting DLL and Empty project on the wizard,
  • Open Build Customization from the C++ project context menu, and check the CUDA 4.0(.targets, .props) checkbox,
  • Open the project properties, expand the Configuration Properties, Linker and select Input. Edit the additional dependencies and add cudart.lib.
  • Add a new empty source file ending with .cu.


WPF Application

The RNGVisual C# WPF application provides visualization of the random numbers in 2D and 3D. It allows you to select an RNG algorithm (.NET, CUDA C, CURAND, Thrust or Mersenne Twister) and to set some display and processor parameters. With any number count below 10K, all RNGs finish in about one millisecond, and most of the time is spent drawing the squares on the screen. Therefore, the displayed time should not mislead you in terms of performance comparison. You can run the algorithms with 100K numbers without the visualization and see the difference on your hardware. But please be aware that it is better to use CUDA events with cudaEventRecord to time GPU execution more precisely.



RNGVisual implements various proxy classes, which use platform invoke (P/Invoke) to call the RNG methods exported from the native C++ DLLs. I used the same export and import technique in my previous post. The RNG libraries have the following two exports, one for the CPU implementation and one for the GPU implementation:

extern "C" __declspec(dllexport) void __cdecl
    GPU_RNG(float*, unsigned int, unsigned int);

extern "C" __declspec(dllexport) void __cdecl
    CPU_RNG(float*, unsigned int, unsigned int);

The first argument is a pointer to the memory location to hold the random numbers. The second argument is the size of the array and the last argument is the initial seed.

An important point of random number generation is selecting the seed value, because the same seed will give the same result. While many different techniques have been studied, I used my own method of combining the current time, CPU load and available physical memory with the help of Windows Management Instrumentation (WMI); it still does not perform well in multi-threaded solutions, but it at least gives a better random start. The implementation is in the CPUHelper class of the RNGVisual application.

A Linear Congruential Generator (LCG) implementation using CUDA C

The first RNG project uses the native CUDA Runtime API to implement one of the oldest and best-known pseudo-random number generator algorithms, the LCG. An LCG is fast and requires minimal memory to retain state; therefore, it is very efficient for simulating multiple independent streams. But LCGs have some disadvantages and should not be used for applications where high-quality randomness is critical. In fact, the simple example I implemented repeats numbers within a very short period and should be enhanced with methods like those explained in GPU Gems 3 (37-4).

The LCG is as simple as the formula below: starting with a seed X0, each number is determined from the previous one as (a * Xn + c) mod m.

Xn+1 = (a Xn + c) (mod m)
where 0 < m, 0 < a < m, 0 <= X0 < m and 0 <= c < m

Below is a sequential implementation of the LCG algorithm, which generates 100 pseudo-random numbers:

random[0] = 123; //some initial seed
for(int i = 1; i < 100; i++)
 random[i] = ( a * random[i-1] + c) % m;
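A self-contained CPU version of that loop looks like this; the constants used here are the MINSTD parameters (a = 48271, c = 0, m = 2^31 - 1), chosen purely for illustration:

```cpp
#include <cstdint>
#include <vector>

// Sequential LCG: x_{n+1} = (a * x_n + c) mod m, normalized to [0,1).
// The MINSTD constants below are just an example parameter choice.
std::vector<double> lcg_sequence(uint64_t seed, int count) {
    const uint64_t a = 48271, c = 0, m = 2147483647ULL; // m = 2^31 - 1
    std::vector<double> out;
    out.reserve(count);
    uint64_t x = seed;
    for (int i = 0; i < count; ++i) {
        x = (a * x + c) % m;
        out.push_back((double)x / (double)m); // normalize to [0,1)
    }
    return out;
}
```

The same seed always reproduces the same sequence, which is exactly why each GPU thread needs its own seed in the kernel below.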

The CUDACRNGLib project has a very basic implementation of LCG by distributing the work onto 256 threads. Because the same seed will result in the same random number, first we generate different random seeds for every thread. When the kernel below is executed, every thread generates one section of the random number sequence:

__global__ void RNGKernel(float * randomNumbers, unsigned int numberCount,
    unsigned int * seeds, unsigned int c, unsigned int a, unsigned int M)
{
    int startIdx = threadIdx.x * numberCount;
    unsigned int x = seeds[threadIdx.x];
    for (unsigned int i = 0; i < numberCount; i++) {
        x = (a * x + c) % M; //M is shown for purpose of example
        randomNumbers[startIdx + i] = (float)x / (float)M; //normalize
    }
}

As I mentioned before, this implementation is simplified to give you an idea of how you can start using CUDA C. It even has a static block count of one and a thread count of 256. If you plan to go to production code, it is better to launch many blocks of threads. You may want to check out the better implementation in GPU Gems 3 (37-7) or check out Arnold and Meel's implementation, which also provides better randomness.
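With multiple blocks, each thread's global index is derived from both the block and thread indices. A CPU model of that indexing (the parameter names mirror the CUDA built-ins, but here they are ordinary arguments):

```cpp
// CPU model of CUDA's global thread indexing: in a grid of blocks,
// thread (blockIdx, threadIdx) gets the unique id
// blockIdx * blockDim + threadIdx, which can then pick a seed
// and an output slice, as in the single-block kernel above.
int globalThreadId(int blockIdx, int blockDim, int threadIdx) {
    return blockIdx * blockDim + threadIdx;
}
```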

A Thrust C++ template library implementation

The Thrust library default random engine ( default_random_engine ) is a Linear Congruential Generator ( this may change in the future ) with a = 48271, c = 0 and m = 2^31 - 1. Because c equals zero, the algorithm is also called the multiplicative congruential method or Lehmer RNG.
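You can verify those constants against the C++ standard library, whose std::minstd_rand is the same Lehmer generator:

```cpp
#include <random>

// std::minstd_rand uses a = 48271, c = 0, m = 2^31 - 1.
// Seeded with 1, its first output is a * 1 mod m = 48271.
unsigned long first_minstd_output(unsigned long seed) {
    std::minstd_rand rng(seed);
    return rng();
}
```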

The ThrustRNGLib has a very basic implementation of the Thrust default random engine, running the following functor to generate one random number. A functor is a type of class in C++ that overloads operator() and therefore can be called like an ordinary function. Thrust provides unary_function and binary_function base classes. Below I used unary_function because my functor requires one argument to be passed into the function:

struct RandomNumberFunctor :
    public thrust::unary_function<unsigned int, float>
{
    unsigned int mainSeed;

    RandomNumberFunctor(unsigned int _mainSeed) :
        mainSeed(_mainSeed) {}

    __host__ __device__
    float operator()(unsigned int threadIdx)
    {
        unsigned int seed = hash(threadIdx) * mainSeed;

        // seed a random number generator
        thrust::default_random_engine rng(seed);

        // create a mapping from random numbers to [0,1)
        thrust::uniform_real_distribution<float> u01(0,1);

        return u01(rng);
    }
};

Using Thrust to utilize the GPU is the simplest way to go. You can see the difference by comparing the GPU_RNG below with the CUDACRNGLib GPU_RNG implementation. Using CUDA C gives you full control of the toolkit, but it comes at the price of writing more code.

extern void GPU_RNG(float * h_randomData, unsigned int dataCount, unsigned int mainSeed)
{
    //Allocate device vector
    thrust::device_vector<float> d_rngBuffer(dataCount);

    //generate random numbers
    thrust::transform(thrust::counting_iterator<unsigned int>(0),
        thrust::counting_iterator<unsigned int>(dataCount),
        d_rngBuffer.begin(), RandomNumberFunctor(mainSeed));

    //copy the random numbers back to host
    thrust::copy(d_rngBuffer.begin(), d_rngBuffer.end(), h_randomData);
}

Another good part of Thrust is that every implementation (except copy) exists for the GPU as well as for the CPU. The CPU implementation is another three lines of code, this time using thrust::generate to produce the random numbers with the C++ standard library rand method, and then thrust::transform to normalize the integer result into a float with the help of the [](float n) {return n / (RAND_MAX + 1);} lambda expression. I used the lambda expression instead of a functor to show you this possibility as well. Especially in the upcoming Microsoft C++ AMP, lambda expressions play a big role. Lambda expressions are handy in C++ as well as in C#, but they come at the price of giving up unit testing of the inline expression.
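That CPU path translates directly to the standard library. Here is a hedged sketch using std::generate and std::transform instead of the Thrust versions; the division is done in double to stay safe on platforms where RAND_MAX equals INT_MAX:

```cpp
#include <algorithm>
#include <cstdlib>
#include <vector>

// Fill a buffer with rand() and normalize each value into [0,1)
// with a lambda, mirroring the thrust::generate + thrust::transform pair.
std::vector<float> cpu_rng(unsigned int count, unsigned int seed) {
    std::srand(seed);
    std::vector<float> data(count);
    std::generate(data.begin(), data.end(),
                  []() { return (float)std::rand(); });
    std::transform(data.begin(), data.end(), data.begin(),
                   [](float n) { return (float)(n / ((double)RAND_MAX + 1.0)); });
    return data;
}
```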

An NVIDIA CURAND implementation

The NVIDIA CURAND library provides an API for simple and efficient generation of high-quality pseudo-random and quasi-random numbers. The CURAND library's default pseudo-random engine is XORWOW, an implementation of the Xorshift RNG (page 5), and it produces higher quality random numbers than an LCG.
In order to start using CURAND, you only need to include the curand.h header and add curand.lib to the additional dependencies in the Linker settings.

Like the ThrustRNGLib Thrust implementation, the CURANDRNGLib has a very basic implementation, running the following main code to generate a series of random numbers:

    //Create a new generator
    curandCreateGenerator(&m_prng, CURAND_RNG_PSEUDO_DEFAULT);
    //Set the generator options
    curandSetPseudoRandomGeneratorSeed(m_prng, (unsigned long) mainSeed);
    //Generate random numbers
    curandGenerateUniform(m_prng, d_randomData, dataCount);

CURAND provides the curandCreateGeneratorHost method besides the curandCreateGenerator method, to generate random numbers on the CPU instead of the GPU. Therefore the CPU part is as simple as the GPU part.

A Mersenne Twister implementation using CUDA C

The Mersenne Twister ( MT ) is an algorithm developed by Makoto Matsumoto and Takuji Nishimura that provides very fast generation of high-quality random numbers. ( MT Home Page ) A common Mersenne Twister implementation uses an LCG to generate its seed data.
There are two MT algorithms originally suited for use with CUDA: TinyMT and Mersenne Twister for Graphics Processors (MTGP). But I implemented part of the code from the NVIDIA CUDA Toolkit 4.0 MersenneTwister sample, which uses the original code from Makoto Matsumoto anyway.

The Mersenne Twister RNG is maybe the most complicated of the four RNG implementations I provide, but unlike with CURAND, you can look into the algorithm. The MersenneTwisterRNG.cpp file in the MersenneTwisterRNGLib project is the entry point to the library and exports the same GPU_RNG and CPU_RNG methods as the other libraries. I simplified the host code as much as possible and placed all GPU logic into its own file. The remaining host code can be seen below:

extern void GPU_RNG(float * h_randomData, unsigned int dataCount,
    unsigned int mainSeed)
{
	float * d_randomData = 0;

	//load GPU twisters configuration

	//find the rounded up data count,
	//because the generator generates in multiples of 4096
	int numbersPerRNG = iAlignUp(iDivUp(dataCount, MT_RNG_COUNT), 2);
	int randomDataCount = MT_RNG_COUNT * numbersPerRNG;

	//allocate device memory
	size_t randomDataSize = randomDataCount * sizeof(float);
	cudaMalloc((void**)&d_randomData, randomDataSize);

	//call the generator
	RNGOnGPU(32, 128, d_randomData, numbersPerRNG);

	//make sure all GPU work is done
	cudaThreadSynchronize();

	//copy memory back to the host
	cudaMemcpy(h_randomData, d_randomData, dataCount * sizeof(float),
		cudaMemcpyDeviceToHost);

	//free device memory
	cudaFree(d_randomData);
}
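The iDivUp and iAlignUp helpers come from the NVIDIA sample's utility code; their behavior is just ceiling division and rounding up to a multiple, equivalent to:

```cpp
// Equivalents of the NVIDIA sample's rounding helpers used above:
// iDivUp(a, b)   divides a by b, rounding up;
// iAlignUp(a, b) rounds a up to the nearest multiple of b.
int iDivUp(int a, int b)   { return (a + b - 1) / b; }
int iAlignUp(int a, int b) { return (a % b != 0) ? (a - a % b + b) : a; }
```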

About the Implementations

The pseudo-random number generators I provided in this post are widely used algorithms, but still I'm not guaranteeing that any of them will perform well enough in your particular solution. In fact, I left some generators poor on purpose, to point out the core algorithm and provide variation in randomness. Also, for the sake of clarity there is almost no exception handling or logging implemented. This is not an API; my goal is to give you a high-level idea of how you can use Thrust, CUDA C and CURAND to generate pseudo-random numbers. Therefore, it is important that you research the algorithms online and re-factor the code for your own use.

Some of the code is taken from the original NVIDIA samples and carries their copyright notice; the rest is my code, which is free to use without any restriction or obligation.


As in every field of computer science, there are many ways to solve a problem, and the possibilities expand exponentially. I have just scratched the surface of what is possible when using CUDA to add pseudo-random number generation to your C# application. I hope this post will help you in your project.