# Replacing Slow R Code with CUDA

# The Problem

Suppose we have a vector and a collection of random samples. For each , we want to calculate

# Naive R Solution

If you were new to R, you might do something like this:

If you’ve had any experience with R, you probably recoiled in horror at the sight of those nested for loops. For loops are notoriously slow in R, and the function above is doing iterations with them. In fact, that’s probably the single worst way I could have written that function. If and are small then it doesn’t matter much, but even if (not big at all for practical applications), look what happens to the runtime:

```
Unit: seconds
expr min lq mean median uq max neval
func1(x, s) 1.608869 1.650864 1.718337 1.67365 1.77996 2.070576 100
```

It takes about 1.7 seconds on average to run. This is not ideal, especially in my research, where I need and at least, and then I need it to run tens of thousands of times.

# A Faster R Solution

Let’s try a different approach. R has *vectorized* functions, like `mean(x)`

, which are much, much faster than doing the equivalent calculation with a loop.

The calculation is taken care of in one line, `mean(cos(x*s))`

. The function is wrapped in `Vectorize`

, which tells R to apply the function to every element of `x`

, one at a time. If we run this one with , we get

```
Unit: milliseconds
expr min lq mean median uq max neval
func2(x, s) 27.50761 29.18643 31.93724 31.30861 33.46475 50.32115 100
```

That’s more like it. This runs in about 30 milliseconds, 50 times faster than the first function. Let’s bump up the sizes to and see what happens.

```
Unit: seconds
expr min lq mean median uq max neval
func2(x, s) 1.956879 2.039474 2.186676 2.151696 2.296424 3.011084 100
```

Hmm. The evaluation time has climbed up to more than 2 seconds. It doesn’t sound like much, but it really adds up when you’re calling this function thousands of times.

# Parallelization with CUDA

We could rewrite the function in C, a lower-level language than R, but this only speeds things up by a factor of 2 or 3 at best. The main reason why this is so slow is that R (and ordinary C as well) is *single-threaded*. This means that R can only do one thing at a time. It takes , calculates , then takes , calculates , and so on until it hits . But there’s no reason that things *must* be evaluated so sequentially. It would save a lot of time if we could do that calculation for lots of different ’s simultaneously. This is called *parallelization*.

CUDA is an API developed by Nvidia for doing parallel computing on a GPU (graphics processing unit). GPUs are designed with parallel computing in mind. They are essentially made up of thousands of little processors that are organized in such a way that they can split up a task, execute part of it, and recombine the results. Here is a CUDA file that accomplishes our task. I named it `mc.cu`

.

# Using CUDA in R

To use this in R, we need to compile it as a shared library object. I do this in two steps, although I'm not sure if it's strictly necessary. First I compile the source code (the `mc.cu`

file above) into an object file called `mc.o`

.

Then, in order for R to use it, I turn it into a shared library file called `mc.so`

.

I used a *makefile* so I wouldn't have to remember what to do every time. The makefile looks like this:

At this point, here is the structure of my directory.

```
.
├── bin
│ └── mc.o
├── makefile
├── mc.so
└── src
├── mc.cu
└── mc.h
```

Now, in an R session, first we need to load that compiled library with `dyn.load()`

, and then we can write a wrapper function around the call to C.

Finally, we can use this function just like any other regular R function. When I tried this with and , it ran in about half a second. Pretty amazing if you ask me.

```
user system elapsed
0.284 0.176 0.463
```

**NOTE:** If you don’t have an Nvidia GPU in your computer, this will not work for you. I don’t have one in mine, so I needed to use Penn State’s ICS-ACI computing cluster, which was an adventure in itself. I have written another blog post about how to do that.