Analysis of statistical algorithms can generate workloads that run for hours, if not days, tying up a single computer. Many statisticians and data scientists write complex simulations and statistical analyses in the R statistical computing environment, and these programs often have very long run times. Given how much time R programmers can spend waiting for results, it makes sense to exploit the parallelism in the computation and in the available hardware.
In a previous post on the Teraproc blog, I discussed the value of parallelism for long-running R models, and showed how multi-co...

# R-blog

# Scaling R clusters? AWS Spot Pricing is your new best friend

An elastic infrastructure for distributed R
Most of us recall the notion of elasticity from Economics 101. Markets are about supply and demand, and when there is an abundance of supply, prices usually go down. Elasticity is a measure of how responsive one economic variable is to another, and in an elastic market the response is proportionately greater than the change in input.
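As a reminder, the standard textbook form of this idea (not taken from the post itself) is the price elasticity of demand. The symbols $Q$ for quantity demanded and $P$ for price are our own notation here:

$$
E_d = \frac{\Delta Q / Q}{\Delta P / P}
$$

Demand is called elastic when $|E_d| > 1$, that is, when the percentage change in quantity is larger than the percentage change in price that caused it.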
It turns out that cloud pricing, on the margin at least, is pretty elastic. Like bananas in a supermarket, CPU cycles are a perishable commodity. If capacity sits idle and doesn’t get used, it goes away. Your cloud...

# Accelerating R with multi-node parallelism – Rmpi, BatchJobs and OpenLava

Gord Sissons, Feng Li
In a previous post, we showed how to use the R BatchJobs package with OpenLava to accelerate a single-threaded k-means calculation by breaking the workload into chunks and running them as serial jobs.
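The chunking idea can be illustrated with a small, language-neutral sketch. This is written in plain Python rather than R, it does not use the BatchJobs or OpenLava APIs, and it assumes that the chunks are independent random restarts of k-means, which this summary does not actually specify. All function names (`kmeans_1d`, `run_chunk`, `chunked_kmeans`) are hypothetical and exist only for this illustration.

```python
import random


def kmeans_1d(points, k, seed, iters=20):
    """Run one k-means fit on 1-D data from a seeded random start.

    Returns (cost, sorted_centers), where cost is the total squared
    distance from each point to its nearest center.
    """
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: (p - centers[i]) ** 2)
            clusters[nearest].append(p)
        # Keep the previous center if a cluster ends up empty.
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    cost = sum(min((p - c) ** 2 for c in centers) for p in points)
    return cost, sorted(centers)


def run_chunk(points, k, seeds):
    """One serial 'job': try several random starts and keep the best fit."""
    return min(kmeans_1d(points, k, s) for s in seeds)


def chunked_kmeans(points, k, n_starts=12, n_chunks=4):
    """Split the random starts into chunks, run each chunk, keep the best."""
    seeds = list(range(n_starts))
    chunks = [seeds[i::n_chunks] for i in range(n_chunks)]
    # Map step: each chunk is independent, so a batch scheduler could run
    # these as separate jobs. Here they simply run one after another.
    results = [run_chunk(points, k, chunk) for chunk in chunks]
    # Reduce step: the lowest-cost fit across all chunks wins.
    return min(results)


if __name__ == "__main__":
    data = [0.0, 0.1, 0.2, 10.0, 10.1, 10.2]
    cost, centers = chunked_kmeans(data, k=2)
    print(cost, centers)
```

Because each chunk depends only on its own seeds, the chunks can be dispatched to different machines without any communication between them; only the final comparison needs all the results in one place.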
R users frequently need ways to parallelize workloads. While multicore and socket-level parallelism work well for some problems, for large problems there is nothing like a distributed cluster.
The Message Passing Interface (MPI) is a staple technique among HPC aficionados for achieving parallelism. MPI is meant to op...

# Seeing the Forest and the Trees – a parallel machine learning example

Parallelizing Random Forests in R with BatchJobs and OpenLava
By: Gord Sissons and Feng Li
In his series of blog posts about machine learning, Trevor Stephens focuses on a survival model from the Titanic disaster and provides a tutorial explaining how decision trees tend to over-fit, yielding anomalous predictions.
How do we build a better predictive model? The answer, as Trevor observes, is to grow a whole forest of decision trees, let the models grow as deep as they will, and let these randomized models vote on the outcome. It turns out that a large collection of imperfect models lead...

# Parallel R with BatchJobs

Parallelizing R with BatchJobs - An example using k-means
Gord Sissons, Feng Li
Many simulations in R are long running. Analysis of statistical algorithms can generate workloads that run for hours, if not days, tying up a single computer. Given the amount of time R programmers can spend waiting for results, getting acquainted with parallelism makes sense.
In this first post in a series, we describe an approach to achieving parallelism in R using BatchJobs, a framework that provides Map, Reduce and Filter variants to generate jobs on batch computing systems running across clustered comput...
