CS 385 Lab 9, Spring 2009

In this lab, you will find the Euclidean distance between two points in N-dimensional space using a combination of GPU-based and host-based computation. Along the way, you'll explore more of the CUDA API.

The Problem

You may have already learned how to compute the distance between two points in 2-dimensional space. If point p has components px and py, and point q has components qx and qy, then the distance between p and q is sqrt[(px-qx)^2 + (py-qy)^2].

In N-dimensional space, we just tack on terms for the additional components: sqrt[(p1-q1)^2 + (p2-q2)^2 + ... + (pN-qN)^2].
For prettier formatting of these equations, see the Wikipedia article on Euclidean distance.

The easiest way to think about a point in N-dimensional space for very large N is probably not visualizing it physically. Rather, think of each dimension representing some quantity that is independent of all the other dimensions. For example, we could think about each point representing a person's various likes and dislikes: the first dimension might represent how much you like pizza on a scale from 1 to 10, the second dimension might represent how much you like Grey's Anatomy, and so on for N different preferences. Finding the distance between two points would, in this case, give us a measure of the overall similarity or dissimilarity between two people's likes and dislikes.

We are going to compute the Euclidean distance between two N-dimensional points, where N is at least 1 million. This computation consists of a fully data-parallel phase (computing the square of the difference of the components for each dimension), a reduction (adding together all of these squares), and finally taking the square root of the reduced sum. For this assignment, you are required to do the data-parallel phase on the GPU, but you may do the reduction on the host. We will soon find that parallel reduction in CUDA requires more work from the programmer than it does in OpenMP and TBB.

Infrastructure

Our server is up and running! It works best if you log into your cwolf account and then ssh to rickroll.cs.sonoma.edu from cwolf. All your cwolf files are visible on rickroll. You may also do this assignment at home if you have a CUDA-capable graphics card; if you do, mention in your initial comment that you worked at home and note the model of your graphics card (e.g. NVidia GeForce 9800 GT).

To compile programs, type nvcc -o outputfile sourcefile.cu
If you run into problems other than syntax errors, the likely culprit is your environment variables. You may need to source your ~/.bash_profile again.

Directions

Your file should be called yourlastnameL9.cu and should contain the following:

Timing functionality and results (see below)
A function called ComputeDistance that takes as input two numerical arrays of size N and computes the Euclidean distance entirely on the host according to the formula above. It should return the Euclidean distance as a double.
A function called InitArray that initializes an integer array with random numbers. You should constrain these random numbers to fall within a fairly small range (0 to 99 is fine).
A CUDA kernel that takes as input pointers to 3 numerical arrays of size N. Call the input arrays A and B and the output array C. The kernel will take a pair of elements from A and B and compute the square of A[i] - B[i].
A main function that does the following:
- Declares the arrays (host and device). All arrays should be dynamically allocated; the host arrays can be allocated either with malloc or new, while the device arrays should be allocated with cudaMalloc.
- Prints the number of CUDA-enabled hardware devices attached to the system by calling cudaGetDeviceCount. You can find documentation for this function in the "Device Management" section of the CUDA Programming Guide.
- Prints at least 3 interesting properties of Device 0, including the device name, by calling cudaGetDeviceProperties. The first argument to this function is a pointer to a struct of type cudaDeviceProp. The fields of this struct are listed in the earlier version of the CUDA programming guide.
- Calls InitArray to initialize the host arrays.
- Calls PrintArray to print the original values of the host arrays.
- Calls cudaMemcpy to copy the host input arrays to the device.
- Calls the CUDA kernel.
- Calls cudaMemcpy to copy the device output array back to the host.
- Adds all the elements of the output array and takes the square root of the sum.
- Prints the calculated result.
- Calls ComputeDistance to compute the distance sequentially, and prints the result.
- Frees the memory of the arrays. The host arrays can be deallocated using delete or free, while the device arrays should be deallocated with cudaFree.

Check that your code works for values of N that are both large and small, as well as values that do not evenly divide the number of thread blocks or threads.

Timing

You should insert code to time your sequential and parallel computations:

The sequential timing code can be copied from previous labs; start the timer in main, just before the call to ComputeDistance, and stop it immediately after the call.
The parallel code should use CUDA's timing functions (see "Event Management" in the Programming Guide). Your measurement should capture the time to do the computation (both the CUDA kernel and the reduction) and the time to copy the arrays from the host to the device and back. It should not capture the time taken by memory allocations and deallocations.

In your initial comment, include the following time measurements (and list the value of N you used for them):

3 time measurements for the sequential code and their average
3 time measurements and their average for each of at 2 different CUDA configurations (different block dimensions/grid dimensions)

Submitting your file

Make sure that your file is correctly named and then copy it to ~srivoire/cs385/submit. Wait 2 minutes, and then check that it was correctly submitted by visiting http://rivoire.cs.sonoma.edu/cs385/lab9sub.txt.