CS 425 Homework 5, Fall 2015


Due date

Submit your files as a .tar, .zip, or .tgz file on Moodle by Monday, Nov. 23, at 10:45 AM. You are welcome to work in groups of 3 and submit one assignment for the group.

Problem 1: Vector addition [20 points]

In this problem, you will look at the vector addition

for (int i = 0; i < N; i++) {
	A[i] += B[i];
}

for two arrays A and B of size N.

You should read Section 4.2.1 of CUDA by Example before proceeding.
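
For reference, a minimal one-value-per-thread kernel for this loop might look like the following sketch, in the style of the book's examples (the kernel name, float element type, and one-dimensional launch are illustrative assumptions):

__global__ void add(int N, float *A, float *B) {
    /* Each thread computes one global index i. */
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)  /* guard threads that fall past the end of the arrays */
        A[i] += B[i];
}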

  1. [8 points] As the examples in that section show, each CUDA thread normally handles one value of i. Write (on paper) code for a CUDA kernel that handles two values of i per thread. Explain how you decided which elements to assign to each thread, and why.
  2. [6 points] You are writing the C host code to invoke the kernel you wrote in Part 1 for a vector of 50,000 elements. For your version of CUDA, the maximum block size is 1024 threads. How many blocks will you create, and how many threads per block will you use?
  3. [6 points] For the kernel that you wrote in Part 1, do you expect any of the warps (groups of 32 threads with consecutive IDs) to take divergent paths through the code? If so, how many, and how do you expect this divergence to affect the performance of the kernel?

Submit your answer to this problem in PDF form as part of your submission archive.

Problem 2: Euclidean distance [80 points]

Although you will be writing a CUDA program, I don't expect you to have access to CUDA hardware or software tools, and I will be lenient on small CUDA syntax errors. However, your host code (normal C/C++) should be correct and debugged.

If you don't have access to CUDA, surround all of the CUDA-specific parts of your code with

#ifdef CUDA
    ... C code ...
#endif

When I test your code on CUDA-capable hardware, I will define CUDA (and again, it's OK if there are small errors).
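
For example, device allocation might be guarded like this sketch (devA is a placeholder name):

#ifdef CUDA
    float *devA;
    cudaMalloc((void **)&devA, N * sizeof(float));
    /* ... cudaMemcpy calls, kernel launches, etc. ... */
    cudaFree(devA);
#endif

On a CUDA machine, these blocks are then enabled by compiling with the CUDA macro defined (e.g., nvcc -DCUDA).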

Problem Description

You may have already learned how to compute the distance between two points in 2-dimensional space. If point $p$ has components $p_{x}$ and $p_{y}$, and point $q$ has components $q_{x}$ and $q_{y}$, then the distance between $p$ and $q$ is

$$\sqrt{(p_{x}-q_{x})^{2} + (p_{y}-q_{y})^{2}}$$
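
For example, if $p = (1, 2)$ and $q = (4, 6)$, this gives $\sqrt{(1-4)^{2} + (2-6)^{2}} = \sqrt{9 + 16} = 5$.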

In N-dimensional space, we just tack on terms for the additional components:

$$\sqrt{(p_{1}-q_{1})^{2} + (p_{2}-q_{2})^{2} + \dots + (p_{N}-q_{N})^{2}}$$

The easiest way to think about a point in N-dimensional space for very large N is probably not to visualize it physically. Rather, think of each dimension as representing some quantity that is independent of all the other dimensions. For example, we could think of each point as representing a person's various likes and dislikes: the first dimension might represent how much you like pizza on a scale from 1 to 10, the second dimension might represent your love of Beyoncé on the same scale, and so on for N different preferences. Finding the distance between two points would, in this case, give us a measure of the overall similarity or dissimilarity between two people's likes and dislikes.

You are going to compute the Euclidean distance between two N-dimensional points, where N is at least 1 million. This computation consists of a fully data-parallel phase (computing the squared difference of the components in each dimension), a reduction (adding together all of these squares), and a final square root of the reduced sum.
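
The sequential host version is just a single loop over the dimensions. A sketch (assuming float arrays and the ComputeDistance name used in the directions below) might look like:

#include <math.h>

float ComputeDistance(int N, float *A, float *B) {
    float sum = 0.0f;
    for (int i = 0; i < N; i++) {
        float d = A[i] - B[i];  /* difference in dimension i */
        sum += d * d;           /* accumulate the squared difference */
    }
    return sqrtf(sum);          /* square root of the reduced sum */
}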

Directions

Your host program should make sure that the following happens:

  1. Declares the arrays (host and device). All arrays should be dynamically allocated; the host arrays can be allocated either with malloc or new, while the device arrays should be allocated with cudaMalloc.
  2. Calls InitArray to initialize the host arrays.
  3. Calls cudaMemcpy to copy the host input arrays to the device.
  4. Calls ComputeDistance to compute the distance sequentially on the host, and prints the result, in exactly this format (where ____ is the default output with no special formatting):
    Host result: ______
  5. Invokes the first CUDA kernel to get the square of the difference of each pair of elements into a third array C (a sketch of such a kernel appears after this list).
  6. Calls cudaMemcpy to copy the device output array back to the host.
  7. Adds all the elements of the output array and takes the square root of the sum.
  8. Prints the calculated result, in exactly this format:
    CUDA result with host reduction: ______
  9. Calls the second CUDA kernel to compute the full result, and calls cudaMemcpy to copy it back to the host.
  10. Prints the calculated result, in exactly this format:
    CUDA result with device reduction: ______
  11. Frees the memory of the arrays. The host arrays can be deallocated using delete or free, while the device arrays should be deallocated with cudaFree.
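
As a starting point, the first kernel (step 5) is a small variation on the vector-addition kernel from Problem 1. A hedged sketch, with illustrative names and float element type:

__global__ void SquareDiff(int N, float *A, float *B, float *C) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {                 /* guard against extra threads */
        float d = A[i] - B[i];
        C[i] = d * d;            /* squared difference for dimension i */
    }
}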

For the CUDA kernel that does the second reduction, Mark Harris's CUDA reduction slides are a FANTASTIC resource. You only need to understand the first and second versions of his code. Remember to include a text file explaining any divergence among warps: you don't need to actually minimize divergence, but you should explain whether or not any exists.
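
To give the flavor of those slides, here is a sketch along the lines of Harris's first version (interleaved addressing, with a divergent branch). Treat it as a starting point that assumes the input length is a multiple of the block size, not a drop-in solution:

__global__ void Reduce(float *in, float *out) {
    extern __shared__ float sdata[];  /* size is set at launch time */
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = in[i];               /* each thread loads one element */
    __syncthreads();

    /* Tree-based reduction in shared memory. */
    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        if (tid % (2 * s) == 0)       /* this branch is the source of warp divergence */
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        out[blockIdx.x] = sdata[0];   /* one partial sum per block */
}

Launched as Reduce<<<numBlocks, threadsPerBlock, threadsPerBlock * sizeof(float)>>>(...), this leaves one partial sum per block, so a full device-side reduction runs the kernel again on the partial sums before taking the final square root.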

Design your code to work for values of N both large and small, including values of N that are not evenly divisible by the number of threads per block or by the total number of threads.
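
A common way to handle such values of N is to round the block count up and rely on the bounds checks in the kernels, as in this sketch (the threadsPerBlock choice and the device pointer names are illustrative):

int threadsPerBlock = 256;
int numBlocks = (N + threadsPerBlock - 1) / threadsPerBlock;  /* round up */
#ifdef CUDA
    SquareDiff<<<numBlocks, threadsPerBlock>>>(N, devA, devB, devC);
#endif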

Tar or zip these files along with your PDF for Problem 1, and submit on Moodle.