In this lab, you will find the Euclidean distance between two points in *N*-dimensional space using a combination of GPU-based and host-based computation. Along the way, you'll explore more of the CUDA API.

You may have already learned how to compute the distance between two points in 2-dimensional space. If point *p* has components *px* and *py*, and point *q* has components *qx* and *qy*, then the distance between *p* and *q* is `sqrt[(px-qx)^2 + (py-qy)^2]`.

In *N*-dimensional space, we just tack on terms for the additional components:
`sqrt[(p1-q1)^2 + (p2-q2)^2 + ... + (pN-qN)^2]`.

For prettier formatting of these equations, see the Wikipedia article on Euclidean distance.

The easiest way to think about a point in *N*-dimensional space for very large *N* is probably not visualizing it physically. Rather, think of each dimension representing some quantity that is independent of all the other dimensions. For example, we could think about each point representing a person's various likes and dislikes: the first dimension might represent how much you like pizza on a scale from 1 to 10, the second dimension might represent how much you like *Grey's Anatomy*, and so on for *N* different preferences. Finding the distance between two points would, in this case, give us a measure of the overall similarity or dissimilarity between two people's likes and dislikes.

We are going to compute the Euclidean distance between two *N*-dimensional points, where *N* is at least 1 million. This computation consists of a fully data-parallel phase (computing the square of the difference of the components for each dimension), a reduction (adding together all of these squares), and finally taking the square root of the reduced sum. For this assignment, you are required to do the data-parallel phase on the GPU, but you may do the reduction on the host. We will soon find that parallel reduction in CUDA requires more work from the programmer than it does in OpenMP and TBB.

Our server is up and running! It works best if you log into your `cwolf` account and then `ssh` to `rickroll.cs.sonoma.edu` from `cwolf`. All your `cwolf` files are visible on `rickroll`. You may also do this assignment at home if you have a CUDA-capable graphics card; if you do, mention in your initial comment that you worked at home and note the model of your graphics card (e.g. NVidia GeForce 9800 GT).

To compile programs, type `nvcc -o outputfile sourcefile.cu`

If you run into problems other than syntax errors, the likely culprit is your environment variables. You may need to add the CUDA toolkit's `bin` directory to your `PATH` and its library directory to your `LD_LIBRARY_PATH`.
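A common fix is to put the toolkit's `bin` and library directories on your search paths. The `/usr/local/cuda` prefix below is an assumption about a typical install location; the actual path on `rickroll` may differ:

```shell
# Add the CUDA compiler to PATH and its libraries to the loader path.
# /usr/local/cuda is a typical install prefix -- adjust if yours differs.
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
```

You can put these lines in your shell startup file (e.g. `~/.bashrc`) so they take effect on every login.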

Your file should be called `yourlastnameL9.cu` and should contain the following:

- Timing functionality and results (see below)
- A function called `ComputeDistance` that takes as input two numerical arrays of size *N* and computes the Euclidean distance *entirely on the host* according to the formula above. It should return the Euclidean distance as a `double`.
- A function called `InitArray` that initializes an integer array with random numbers. You should constrain these random numbers to fall within a fairly small range (0 to 99 is fine).
- A CUDA kernel that takes as input pointers to 3 numerical arrays of size *N*. Call the input arrays `A` and `B` and the output array `C`. The kernel will take a pair of elements from `A` and `B` and compute the square of `A[i] - B[i]`.
- A main function that does the following:
  - Declares the arrays (host and device). All arrays should be dynamically allocated; the host arrays can be allocated either with `malloc` or `new`, while the device arrays should be allocated with `cudaMalloc`.
  - Prints the number of CUDA-enabled hardware devices attached to the system by calling `cudaGetDeviceCount`. You can find documentation for this function in the "Device Management" section of the CUDA Programming Guide.
  - Prints at least 3 interesting properties of Device 0, including the device name, by calling `cudaGetDeviceProperties`. The first argument to this function is a pointer to a `struct` of type `cudaDeviceProp`. The fields of this `struct` are listed in the earlier version of the CUDA Programming Guide.
  - Calls `InitArray` to initialize the host arrays.
  - Calls `PrintArray` to print the original values of the host arrays.
  - Calls `cudaMemcpy` to copy the host input arrays to the device.
  - Calls the CUDA kernel.
  - Calls `cudaMemcpy` to copy the device output array back to the host.
  - Adds all the elements of the output array and takes the square root of the sum.
  - Prints the calculated result.
  - Calls `ComputeDistance` to compute the distance sequentially, and prints the result.
  - Frees the memory of the arrays. The host arrays can be deallocated using `delete` or `free`, while the device arrays should be deallocated with `cudaFree`.
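One possible shape for such a kernel — the names, types, and launch configuration below are illustrative, not required:

```cuda
// Each thread computes the squared difference for one index.
// The bounds check matters when N is not a multiple of the block size.
__global__ void SquaredDiff(const int *A, const int *B, int *C, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        C[i] = (A[i] - B[i]) * (A[i] - B[i]);
}

// A launch from main might then look like:
//   SquaredDiff<<<numBlocks, threadsPerBlock>>>(devA, devB, devC, N);
```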

Check that your code works for values of *N* that are both large and small, including values of *N* that are not evenly divisible by the number of threads per block (so the total number of threads does not exactly match the number of elements).

You should insert code to time your sequential and parallel computations:

- The sequential timing code can be copied from previous labs: start the timer in `main`, just before the call to `ComputeDistance`, and stop it immediately after the call.
- The parallel code should use CUDA's timing functions (see "Event Management" in the Programming Guide). Your measurement should capture the time to do the computation (both the CUDA kernel and the reduction) and the time to copy the arrays from the host to the device and back. It should not capture the time taken by memory allocations and deallocations.
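A minimal sketch of the event-based timing; the comment marks the region where your copies, kernel launch, and reduction would go:

```cuda
// Time a region of GPU-related work with CUDA events.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
// ... cudaMemcpy to device, kernel launch, cudaMemcpy back, reduction ...
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);   // wait until the stop event has completed

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
printf("GPU phase took %f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```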

In your initial comment, include the following time measurements (and list the value of N you used for them):

- 3 time measurements for the sequential code and their average
- 3 time measurements and their average for each of at least 2 different CUDA configurations (different block dimensions/grid dimensions)

Make sure that your file is correctly named and then copy it to `~srivoire/cs385/submit`. Wait 2 minutes, and then check that it was correctly submitted by visiting http://rivoire.cs.sonoma.edu/cs385/lab9sub.txt.