Published: by

# The Problem

Suppose we have a vector $\{x_i: i = 1, \dots, N\}$ and a collection of $\{s_j: j = 1, \dots, M\}$ random samples. For each $x_j$, we want to calculate

# Naive R Solution

If you were new to R, you might do something like this:

If you’ve had any experience with R, you probably recoiled in horror at the sight of those nested for loops. For loops are notoriously slow in R, and the function above is doing $N \times M$ iterations with them. In fact, that’s probably the single worst way I could have written that function. If $N$ and $M$ are small then it doesn’t matter much, but even if $N=M=1000$ (not big at all for practical applications), look what happens to the runtime:

Unit: seconds
expr      min       lq     mean  median      uq      max neval
func1(x, s) 1.608869 1.650864 1.718337 1.67365 1.77996 2.070576   100


It takes about 1.7 seconds on average to run. This is not ideal, especially in my research, where I need $N=80000$ and $M=10000$ at least, and then I need it to run tens of thousands of times.

# A Faster R Solution

Let’s try a different approach. R has vectorized functions, like mean(x), which are much, much faster than doing the equivalent calculation with a loop.

The calculation is taken care of in one line, mean(cos(x*s)). The function is wrapped in Vectorize, which tells R to apply the function to every element of x, one at a time. If we run this one with $N=M=1000$, we get

Unit: milliseconds
expr      min       lq     mean   median       uq      max neval
func2(x, s) 27.50761 29.18643 31.93724 31.30861 33.46475 50.32115   100


That’s more like it. This runs in about 30 milliseconds, 50 times faster than the first function. Let’s bump up the sizes to $N=M=10000$ and see what happens.

Unit: seconds
expr      min       lq     mean   median       uq      max neval
func2(x, s) 1.956879 2.039474 2.186676 2.151696 2.296424 3.011084   100


Hmm. The evaluation time has climbed up to more than 2 seconds. It doesn’t sound like much, but it really adds up when you’re calling this function thousands of times.

# Parallelization with CUDA

We could rewrite the function in C, a lower-level language than R, but this only speeds things up by a factor of 2 or 3 at best. The main reason why this is so slow is that R (and ordinary C as well) is single-threaded. This means that R can only do one thing at a time. It takes $x_1$, calculates $\frac{1}{M}\sum^M_{j=1}\cos(x_1s_j)$, then takes $x_2$, calculates $\frac{1}{M}\sum^M_{j=1}\cos(x_2s_j)$, and so on until it hits $x_N$. But there’s no reason that things must be evaluated so sequentially. It would save a lot of time if we could do that calculation for lots of different $x_i$’s simultaneously. This is called parallelization.

CUDA is an API developed by Nvidia for doing parallel computing on a GPU (graphics processing unit). GPUs are designed with parallel computing in mind. They are essentially made up of thousands of little processors that are organized in such a way that they can split up a task, execute part of it, and recombine the results. Here is a CUDA file that accomplishes our task. I named it mc.cu.

# Using CUDA in R

To use this in R, we need to compile it as a shared library object. I do this in two steps, although I'm not sure if it's strictly necessary. First I compile the source code (the mc.cu file above) into an object file called mc.o.

Then, in order for R to use it, I turn it into a shared library file called mc.so.

I used a makefile so I wouldn't have to remember what to do every time. The makefile looks like this:

At this point, here is the structure of my directory.

.
├── bin
│   └── mc.o
├── makefile
├── mc.so
└── src
├── mc.cu
└── mc.h


Now, in an R session, first we need to load that compiled library with dyn.load(), and then we can write a wrapper function around the call to C.

Finally, we can use this function just like any other regular R function. When I tried this with $N=80200$ and $M=10000$, it ran in about half a second. Pretty amazing if you ask me.

   user  system elapsed
0.284   0.176   0.463


NOTE: If you don’t have an Nvidia GPU in your computer, this will not work for you. I don’t have one in mine, so I needed to use Penn State’s ICS-ACI computing cluster, which was an adventure in itself. I have written another blog post about how to do that.