
How to set up Amazon EC2 Windows GPU instance for NVIDIA CUDA development


The Amazon Elastic Compute Cloud web service provides a very useful platform in the cloud, especially for software developers who don’t have access to expensive hardware. Some time ago, while looking for a better CUDA-enabled GPU solution than my MacBook Pro, I realized that it was time to switch from a laptop to a desktop. Luckily, a couple of months ago Amazon introduced GPU instances running Windows Server 2008. I’ve been using the scalable and cost-efficient Amazon EC2 instances for a couple of years without any problems, and now that they provide a platform with two Tesla M2050s to test my CUDA apps, I just want to say: thank you, Amazon.

In this post I want to share my experience of setting up a full NVIDIA CUDA development environment on a Windows EC2 GPU instance. I’ll also walk you through a couple of CUDA examples.

If you have been following my previous blog posts but could not try them out because you lack CUDA-capable hardware, you will have a chance to do so after reading this post.

One of the reasons I’m writing this post is to use it in the hands-on lab meetups of our HPC & GPU Supercomputing group of South Florida. If you are from the group, you have most probably already received the AMI, so you can skip the setup part.


About Amazon EC2 GPU Instances

Amazon Elastic Compute Cloud (Amazon EC2) is a web service that provides resizable compute capacity in the cloud. It is designed to make web-scale computing easier for developers.

The GPU instances provide general-purpose graphics processing units (GPUs) with proportionally high CPU and increased network performance for applications that benefit from highly parallelized processing, including HPC, rendering and media processing applications. The Windows GPU instance is named the Cluster GPU Quadruple Extra Large instance and has the following specifications:

22 GB memory, 33.5 EC2 Compute Units, 2 x NVIDIA Tesla “Fermi” M2050 GPUs, 1690 GB of local instance storage, 64-bit platform, 10 Gigabit Ethernet.


Using GPUs for general-purpose scientific and engineering computing is called GPGPU. You can visit my previous blog posts, where I explain how to use NVIDIA CUDA-capable GPUs to perform massively parallel computations.


Browse to the Amazon Web Services website and click the link at the top of the page that says “Sign in to the AWS Management Console”.

Please be aware that Amazon will charge you for the usage of their services. Therefore, check for running resources before leaving the AWS Management Console, and see the Amazon pricing web page for more information.

The next couple of paragraphs explain how to create your AWS account and set up your environment. You can skip this section if you already have an account and are familiar with Amazon EC2.

Registering for Amazon AWS

If you already have an Amazon account you can use it to log in; otherwise you can create a new account from the same screen. Once you have logged into the AWS console, it may ask you to sign up for an Amazon S3 account. In that case, just follow the links to finish the sign-up. Once it is done, you should receive a confirmation email. Now log in to your account to finish the registration and go through the phone verification.

Setting up your AWS environment

Log in to your Amazon AWS account and the AWS Management Console will show up. Select the EC2 tab at the top to see your EC2 dashboard. We will create a security group and a key pair for later use.

First, click the Key Pairs link on the right and then click the Create Key Pair button. Enter a name for your private key file, such as My_KeyPair, and save the .pem file somewhere so you can use it later. You will also see the new key pair on the screen.

Go back to the EC2 dashboard and click the Security Group link on the right. This will open the security group console. Click the Create Security Group button and create a group named GPGPU_SecurityGroup. Select the Inbound tab for the new group and the rule editor will open. Add an RDP rule by selecting RDP from the rules drop-down and clicking the Add Rule button. Now click the Apply Rule Changes button to save the changes.

Creating the GPU EC2 Instance

  1. Go to the EC2 dashboard and click the Launch Instance button.
  2. Select the Launch Classic Wizard and click Continue.
  3. Find the Microsoft Windows 2008 R2 64-bit for Cluster Instances (AMI Id: ami-c7d81aae) in the list and click the Select button right next to it.
  4. Select Cluster GPU (cg1.4xlarge, 22GB) from the Instance Type drop-down and click Continue. If you have other instances and plan to transfer data between them, I suggest selecting the same region for all of them to avoid in-cloud data transfer charges.
  5. Select Continue on the Advanced Instance Options page.
  6. Give your instance a name, e.g. GPGPU.
  7. Select the Key Pair you have created and click the Continue button.
  8. Select the Security Group you have created and click the Continue button.
  9. Click the Launch button to finish the wizard.

Running the GPU EC2 Instance

You can click the Instances link in the left-hand navigation menu to see the instance you’ve just created. The instance will stay in the pending state for a while until it has booted up completely.
Right-click the newly created instance and select Get Windows Password. You may have to come back after a couple of minutes if the password generation is still pending.
Paste the content of the .pem file you received while creating the key pair into the Private Key field of the password retrieval dialog and click the Decrypt Password button.
Copy the Decrypted Password so you can use it later to log into the instance.

Connecting to the Instance using RDP

To connect to the newly created instance:

  1. Right-click it and select Connect.
  2. Click the “Download shortcut file” link and save the RDP shortcut to your local machine.
  3. Open the saved RDP shortcut and log on to the instance by entering the retrieved password.
  4. Change your randomly generated password from the Control Panel / User Accounts section.

Installing GPGPU Developer Tools

Go to the CUDA Downloads website to see the available downloads. At the time of writing, we will download the 4.1 RC2 version from the CUDA Toolkit 4.1 web site.
Download and install the following items in the order listed:

  1. Visual Studio C++ 2010 Express.
  2. CUDA Toolkit.
  3. GPU Computing SDK.
  4. Developer Drivers for WinVista and Win7 (285.86). The default drivers coming
  5. (Optional) Parallel Nsight 2.1RC2. In order to download this you have to sign up for the Parallel Nsight Registered Developer Program.

Backing up the GPU EC2 Instance

You keep paying for the storage of any instance that is not terminated, even one in the stopped state. It is therefore good practice to back up to S3 and terminate your instance once you are done testing, to avoid charges during downtime. You can do this in two ways: detach the EBS volume (storage) and terminate the instance, or take a snapshot and delete both the instance and the volume. As of today, an EBS volume costs $0.10 per GB-month and a snapshot costs $0.14 per GB-month. Visit the Amazon EC2 pricing web site for up-to-date pricing.

Please follow the steps below for a snapshot backup:

  1. Click the Volumes link on the navigation bar on the left-hand side. You will see the volume (storage) attached to your EC2 instance.
  2. Right click on the volume and select Create Snapshot.
  3. Provide a name for the new snapshot and click the Yes, Create button.
  4. Go to the Snapshots section from the navigation menu and click Refresh. You should see the new snapshot in the pending state. It will take a while to create the snapshot.

Running CUDA Samples

Now you are ready to compile and run a CUDA sample from the GPU Computing SDK. Please follow these steps:

  1. Log in to the instance using the RDP shortcut.
  2. The samples require cutil32d.lib in order to function, so you need to compile the cutil project first. To do that, browse to the C:\ProgramData\NVIDIA Corporation\NVIDIA GPU Computing SDK 4.1\C\common folder and open the cutil_vs2010.sln Visual Studio solution file. Compile the solution.
  3. It is convenient to have syntax highlighting for .cu files. To enable it, go to the C:\ProgramData\NVIDIA Corporation\NVIDIA GPU Computing SDK 4.1\C\doc\syntax_highlighting\visual_studio_8 folder and follow the instructions in the readme.txt file.
  4. Our first example is deviceQuery, which shows the properties of your GPU. Browse to the C:\ProgramData\NVIDIA Corporation\NVIDIA GPU Computing SDK 4.1\C\src\deviceQuery folder and open deviceQuery_vs2010.sln. Compile the solution.
  5. The output executable will be placed into the C:\ProgramData\NVIDIA Corporation\NVIDIA GPU Computing SDK 4.1\C\bin\win32\Debug folder. Open an administrative command prompt and run deviceQuery.exe.
  6. You should see two Tesla M2050 devices, each with compute capability 2.0, 448 CUDA cores, 3GB memory, 515 GFlops, 148 GB/sec memory bandwidth. This feels like 400hp under the hood!

Let’s run one more sample to see the performance of our GPUs. The sample we are going to run is matrixMul, located under the same C:\ProgramData\NVIDIA Corporation\NVIDIA GPU Computing SDK 4.1\C\src root folder. On the Tesla M2050 this sample multiplies a 640 x 640 matrix by a 640 x 960 matrix to produce a 640 x 960 matrix.

Open the solution, go to the project properties and add the C:\ProgramData\NVIDIA Corporation\NVIDIA GPU Computing SDK 4.1\shared\inc path to the Include Directories under the VC++ Directories configuration properties. (I noticed that this path is not found otherwise.)

Compile the project and run it in a command window. You should see 0.001 sec for the CUBLAS kernel execution and 0.021 sec for the CUDA execution. CUBLAS is CUDA’s Basic Linear Algebra Subprograms (BLAS) library, with optimized algorithms.

Let’s compare the GPU with the Intel Xeon 2.93 GHz CPU of the current instance. In order to do this we need to modify the code a little:

  1. Open the file.
  2. Add the #include <time.h> at line 41, under the kernel include.
  3. Find the line with computeGold(reference, h_A, h_B, uiHA, uiWA, uiWB); (around line 417) and replace it with the following code.

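    clock_t startTime, endTime;
    startTime = clock();  // clock ticks before the host reference run
    computeGold(reference, h_A, h_B, uiHA, uiWA, uiWB);
    endTime = clock();    // clock ticks after the host reference run
    // CLOCKS_PER_SEC converts ticks to seconds (replaces the obsolete CLK_TCK)
    shrLogEx(LOGBOTH | MASTER, 0, "> Host matrixMul Time = %.5f s\n",
                    (double)(endTime - startTime) / CLOCKS_PER_SEC);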
  4. Compile the code and execute it. You should see something around 3.463 sec. This means that the CUBLAS GPU version is about 3500x faster than the single-core CPU version. A fairer comparison with all CPU cores utilized can be found on the CUBLAS web site, which reports a speedup of about 6-17x.


GPGPU has been on the rise for the last couple of years, and now that Amazon provides a Windows GPU instance, it is much easier for a Windows developer to jump onto the massively parallel software track.