Suppose we have a vector and a collection of random samples. For each , we want to calculate
Naive R Solution
If you were new to R, you might do something like this:
If you’ve had any experience with R, you probably recoiled in horror at the sight of those nested for loops. For loops are notoriously slow in R, and the function above is doing iterations with them. In fact, that’s probably the single worst way I could have written that function. If and are small then it doesn’t matter much, but even if (not big at all for practical applications), look what happens to the runtime:
Unit: seconds expr min lq mean median uq max neval func1(x, s) 1.608869 1.650864 1.718337 1.67365 1.77996 2.070576 100
It takes about 1.7 seconds on average to run. This is not ideal, especially in my research, where I need and at least, and then I need it to run tens of thousands of times.
A Faster R Solution
Let’s try a different approach. R has vectorized functions, like
mean(x), which are much, much faster than doing the equivalent calculation with a loop.
The calculation is taken care of in one line,
mean(cos(x*s)). The function is wrapped in
Vectorize, which tells R to apply the function to every element of
x, one at a time. If we run this one with , we get
Unit: milliseconds expr min lq mean median uq max neval func2(x, s) 27.50761 29.18643 31.93724 31.30861 33.46475 50.32115 100
That’s more like it. This runs in about 30 milliseconds, 50 times faster than the first function. Let’s bump up the sizes to and see what happens.
Unit: seconds expr min lq mean median uq max neval func2(x, s) 1.956879 2.039474 2.186676 2.151696 2.296424 3.011084 100
Hmm. The evaluation time has climbed up to more than 2 seconds. It doesn’t sound like much, but it really adds up when you’re calling this function thousands of times.
Parallelization with CUDA
We could rewrite the function in C, a lower-level language than R, but this only speeds things up by a factor of 2 or 3 at best. The main reason why this is so slow is that R (and ordinary C as well) is single-threaded. This means that R can only do one thing at a time. It takes , calculates , then takes , calculates , and so on until it hits . But there’s no reason that things must be evaluated so sequentially. It would save a lot of time if we could do that calculation for lots of different ’s simultaneously. This is called parallelization.
CUDA is an API developed by Nvidia for doing parallel computing on a GPU (graphics processing unit). GPUs are designed with parallel computing in mind. They are essentially made up of thousands of little processors that are organized in such a way that they can split up a task, execute part of it, and recombine the results. Here is a CUDA file that accomplishes our task. I named it
Using CUDA in R
To use this in R, we need to compile it as a shared library object. I do this in two steps, although I'm not sure if it's strictly necessary. First I compile the source code (the
mc.cu file above) into an object file called
Then, in order for R to use it, I turn it into a shared library file called
I used a makefile so I wouldn't have to remember what to do every time. The makefile looks like this:
At this point, here is the structure of my directory.
. ├── bin │ └── mc.o ├── makefile ├── mc.so └── src ├── mc.cu └── mc.h
Now, in an R session, first we need to load that compiled library with
dyn.load(), and then we can write a wrapper function around the call to C.
Finally, we can use this function just like any other regular R function. When I tried this with and , it ran in about half a second. Pretty amazing if you ask me.
user system elapsed 0.284 0.176 0.463
NOTE: If you don’t have an Nvidia GPU in your computer, this will not work for you. I don’t have one in mine, so I needed to use Penn State’s ICS-ACI computing cluster, which was an adventure in itself. I have written another blog post about how to do that.