Analysis of statistical algorithms can generate workloads that run for hours, if not days, tying up a single computer. Many statisticians and data scientists write complex simulations and statistical analyses using the R statistical computing environment. Often these programs have a very long run time. Given the amount of time R programmers can spend waiting for results, it makes sense to take advantage of parallelism in the computation and the available hardware.
In a previous post on the Te...

# Teraproc Blog

# Scaling R clusters? AWS Spot Pricing is your new best friend

An elastic infrastructure for distributed R
Most of us recall the notion of elasticity from Economics 101. Markets are about supply and demand, and when there is an abundance of supply, prices usually go down. Elasticity is a measure of how responsive one economic variable is to another, and in an elastic market the response is proportionately greater than the change in input.
It turns out that cloud pricing, on the margin at least, is pretty elastic. Like bananas in a supermarket, CPU cyc...

# Why HPC Clusters are like Bananas

Realizing a more cost-efficient infrastructure
Most of us recall the notion of elasticity from Economics 101. Markets are about supply and demand, and when there is an abundance of supply, prices usually go down. Elasticity is a measure of how responsive one economic variable is to another, and in an elastic market the response is proportionately greater than the change in input.
What does this have to do with HPC or analytic clusters, you ask? It turns out that cloud pricing, on the margin a...

# Accelerating R with multi-node parallelism – Rmpi, BatchJobs and OpenLava

Gord Sissons, Feng Li
In a previous blog we showed how we could use the R BatchJobs package with OpenLava to accelerate a single-threaded k-means calculation by breaking the workload into chunks and running them as serial jobs.
R users frequently need to find ways to parallelize workloads, and while approaches like multicore and socket-level parallelism are good for some problems, when it comes to large problems there is nothing like a distributed cluster.
The message passing inter...

# Seeing the Forest and the Trees – a parallel machine learning example

Parallelizing Random Forests in R with BatchJobs and OpenLava
By: Gord Sissons and Feng Li
In his series of blogs about machine learning, Trevor Stephens focuses on a survival model from the Titanic disaster and provides a tutorial explaining how decision trees tend to over-fit models, yielding anomalous predictions.
How do we build a better predictive model? The answer, as Trevor observes, is to grow a whole forest of decision trees, let the models grow as deep as they will, and let these ...

# Parallel R with BatchJobs

Parallelizing R with BatchJobs - An example using k-means
Gord Sissons, Feng Li
Many simulations in R are long running. Analysis of statistical algorithms can generate workloads that run for hours, if not days, tying up a single computer. Given the amount of time R programmers can spend waiting for results, getting acquainted with parallelism makes sense.
In this first in a series of blogs, we describe an approach to achieving parallelism in R using BatchJobs, a framework that provides Map, Re...

# Early access for R CaaS

Teraproc announces early registration for our R Cluster-as-a-Service offering. It's the eleventh hour so hurry up and secure your space!
Learn more about the service here.
As data scientists and statisticians know, R is an excellent language for analytic problems. For large-scale problems, configuring distributed Hadoop or compute clusters can be a challenge. Talented technical people can spend days or weeks building out distributed clusters, assembling all the needed software components a...
