Getting Started with Teraproc R Cluster-as-a-Service

Overview

This guide explains how to get started with the beta release of Teraproc’s R Analytics Cluster-as-a-Service. The service lets users quickly deploy a complete, ready-to-run R environment in the cloud that uses the familiar RStudio web interface.

At a high level, the required steps are:

  1. Obtain an Amazon Web Services (AWS) account
  2. Register for the Teraproc service
  3. Set up an Identity and Access Management (IAM) Role in AWS
  4. Create your Teraproc R cluster
  5. Log in and use the cluster

Chances are you can perform many of these steps with little or no guidance, but in case additional clarification is required, we offer it here. If you have any difficulties, you can also send e-mail to rcaas-support@teraproc.com or click on the support icon within the Teraproc cluster management screens.

Note: The service requires features not found in old versions of web browsers. The supported web browsers are: IE 11 and above, Firefox 30 and above, Chrome, and Safari 8.

Obtain an Amazon Web Services (AWS) account

There has been a lot written about creating AWS accounts, so there is no sense repeating it here. The process involves a few steps, but is straightforward.

If you are just getting started with Amazon Web Services, you may want to start with the Free Tier. The AWS Free Tier allows you to experiment with servers in the cloud at no cost, subject to some limitations that will become clear as you go through the registration process. You can easily expand your usage from the free account to use larger clusters in future. Start the registration process at the link below:

http://aws.amazon.com/free/

The Amazon service you will be using is the Elastic Compute Cloud service (referred to as EC2). You don’t need to know much about EC2 because the Teraproc service will manage the process of creating and releasing clusters and machine instances in the cloud on your behalf. To get started, click on the Create a Free Account button as shown on the Amazon website and follow the registration process. You will need to enter a credit card; however, as long as you use the free instances and don’t exceed the usage maximums, you will not need to pay for the service for the first year.

While creating an account you will be asked to create an Amazon EC2 key pair. It is a good idea to do so; however, the Teraproc cluster service will manage keys for you, so you won’t actually need this key pair to use the Teraproc service.

The process of creating an Amazon Web Services account is discussed in detail here: http://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/AMS5.0CreatingAnAWSAccount.html

aws_free_tier

Once you have your login credentials for Amazon Web Services, you are ready to move to the next step.

Register for the Teraproc Service

Once you have an Amazon Web Services account you can register for the Teraproc R cluster-as-a-service offering. Visit http://www.teraproc.com and click the button that says “Launch the Service”.

You can also visit http://rcluster.teraproc.com directly and register an account.

If you have previously registered for the Teraproc service, you can log in at this point. First-time users should click the Register button.

When prompted, enter your desired username, e-mail, password and organization as shown. The username that you create should be in lower case. The value that you enter here will become your default login to R Studio and the Linux shell (if you choose to log in to the shell) once the cluster is deployed.

teraproc_registration

After you’ve signed up for your Teraproc account, you should be able to log in. Simply visit http://rcluster.teraproc.com and log in using the username and password you created in the step above.

teraproc_login

Set up an Identity and Access Management Role

After you’ve created your AWS account and registered with Teraproc, you need to permit Teraproc to access your AWS account so that machine instances can be created and managed on your behalf.

In order to allow this, you need to configure an AWS Identity and Access Management (IAM) Role.

This step is a little tedious, but it is a one-time effort. Once you have set up the IAM role, you will be able to create and remove as many clusters as you like without repeating this process.

Follow the steps below to create an IAM Role. These are well documented within the Teraproc web interface and you can access further instructions by clicking on the help icon beside the field titled “ARN Role”. The steps are detailed here as well.

  • Log in to the AWS management console (http://aws.amazon.com) with the user account you created previously
  • Next, find the Identity and Access Management (IAM) screen in the AWS console. Clicking on the orange cube in AWS will bring you to the console. Look for the IAM icon under Administration & Security, shown below, and click on Identity & Access Management.

iam_icon

  • Create a new role in the Identity and Access Management screen (IAM) by selecting Roles from the options on the left and clicking on Create New Role.
  • Give the role a name, for example teraproc-access. Write down the name of the role because you’ll need it in later steps.
  • When invited to “Select Role Type” check the box beside Role for Cross-Account Access. Click the Select button beside the option that says “Allows IAM users from a 3rd party AWS account to access this account” (the second option shown below).

select_role_type

  • Enter Teraproc’s unique Account ID which is 122931797421. In the field labeled External ID enter the text provision-R-cluster. You can leave the option Require MFA unchecked and click on Next Step in the lower right corner of the screen.
  • Skip the Attach Policy page by clicking Next Step in the lower right again.
  • You should now see the Review page. If everything looks correct, click Create Role.
  • After this you are returned to the roles page. You should now see the new role that you created. Click on the name of the role (teraproc-access in our example) to edit the role.
  • In the Permissions section click on Inline Policies and create a new policy.
  • On the Set Permissions page, select Custom Policy and click Select.
  • Give the policy a name such as teraproc-policy, and copy the template policy from this link: http://rcluster.teraproc.com/iam_role.policy. An example is shown below. After this you can validate and apply the policy.
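
The cross-account settings entered in the earlier steps (Teraproc’s account ID and the external ID) end up in the role’s trust relationship, which is separate from the permissions policy pasted in the last step. As a reading aid only, the Python sketch below builds a trust policy document of the general shape AWS uses for third-party cross-account roles. The function name is hypothetical, and the exact document the AWS console writes for you may differ, so this is not something you need to paste into AWS.

```python
import json

# Values taken from the role-creation steps in this guide.
TERAPROC_ACCOUNT_ID = "122931797421"
EXTERNAL_ID = "provision-R-cluster"


def build_trust_policy(account_id: str, external_id: str) -> dict:
    """Return a cross-account trust policy as a plain dictionary.

    Hypothetical helper for illustration; the AWS console creates the
    real trust relationship when you create the role.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # The third-party account that is allowed to assume the role.
                "Principal": {"AWS": f"arn:aws:iam::{account_id}:root"},
                "Action": "sts:AssumeRole",
                # The caller must present this external ID when assuming the role.
                "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
            }
        ],
    }


if __name__ == "__main__":
    policy = build_trust_policy(TERAPROC_ACCOUNT_ID, EXTERNAL_ID)
    print(json.dumps(policy, indent=2))
```

If you open the Trust Relationships tab of the role in the AWS console, you should find a document along these lines; comparing the account ID and external ID there is a quick way to catch typing mistakes.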

review_policy

Congratulations. You’ve managed to set up an IAM role. The critical piece of information you will need is the role’s Amazon Resource Name (ARN). It is visible under Roles when you click on the name of the role that you created.

The Role ARN will be of the form:

arn:aws:iam::123456789012:role/teraproc-access
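
Because the ARN is copied by hand into the Teraproc form, it can help to check its shape first. The Python sketch below is a minimal check based on the form shown above; the pattern is a simplification (it does not cover every character AWS permits in role names or paths), and the function name is made up for this example.

```python
import re
from typing import Tuple

# arn:aws:iam::<12-digit account ID>:role/<role name>
# Simplified pattern for illustration; AWS's own rules are more detailed.
ROLE_ARN_PATTERN = re.compile(r"^arn:aws:iam::(\d{12}):role/([\w+=,.@/-]+)$")


def parse_role_arn(arn: str) -> Tuple[str, str]:
    """Split a role ARN into (account_id, role_name).

    Raises ValueError when the text does not look like an IAM role ARN.
    """
    match = ROLE_ARN_PATTERN.match(arn.strip())
    if match is None:
        raise ValueError(f"not a valid IAM role ARN: {arn!r}")
    return match.group(1), match.group(2)


if __name__ == "__main__":
    account_id, role_name = parse_role_arn(
        "arn:aws:iam::123456789012:role/teraproc-access"
    )
    print(account_id, role_name)
```

Note that the account ID in the ARN is your own AWS account, not Teraproc’s account ID entered during role creation.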

Create your R Cluster

Now return to the Teraproc R cluster-as-a-service setup by logging in to http://rcluster.teraproc.com using the username and password you created earlier.

To create your cluster, click the green create cluster button.

create_cluster

If you want the cluster to be free, set up a three-node cluster using the t2.micro instance type on Amazon, accepting the defaults for the volume type and the shared home directory size. Make sure the cluster has no more than three nodes; otherwise, charges will apply.

Teraproc supports several different AWS machine types, providing a variety of price points and performance levels as shown below. If you are developing applications that can take advantage of NVIDIA general-purpose GPUs, you may want to select the g2.2xlarge instance for optimal performance. You can learn more about R and GPUs at http://www.r-tutor.com/gpu-computing.

You can learn about managing GPU jobs with the OpenLava scheduler at https://dokuwiki.wesleyan.edu/doku.php?id=cluster:119

machine_types

In the example below, we create a cluster of four nodes composed of Amazon c3.xlarge instances with 4 vCPUs and 14 GB of memory, which at the time of this writing cost $0.21 per hour.
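
To get a feel for what such a cluster costs, the arithmetic can be sketched in a few lines of Python. This assumes the quoted $0.21 is the on-demand price per instance per hour and ignores storage, data transfer, and any other charges, so treat the result as a rough lower bound, not a bill estimate. The function name is hypothetical.

```python
from decimal import Decimal


def cluster_cost(nodes: int, price_per_instance_hour: str, hours: int = 1) -> Decimal:
    """Estimate instance charges for a cluster.

    Assumes every node is billed at the same hourly price; prices are
    passed as strings so Decimal keeps the cents exact.
    """
    if nodes < 1 or hours < 0:
        raise ValueError("nodes must be >= 1 and hours must be >= 0")
    return Decimal(price_per_instance_hour) * nodes * hours


if __name__ == "__main__":
    print(cluster_cost(4, "0.21"))            # 0.84 (one hour, four nodes)
    print(cluster_cost(4, "0.21", hours=24))  # 20.16 (one full day)
```

Under these assumptions, the four-node example cluster costs about $0.84 for each hour it runs, which is a good reason to pause or terminate clusters you are not using.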

Copy the “Role ARN” shown on the Summary screen of the role you created in AWS in the previous step into this form, as shown below.

create_cluster_form

Enter your desired RStudio login name and RStudio password in the space provided. You can ignore the setting labelled spot instances for now. You can set this up in future if you want to expand the cluster to take advantage of spot pricing on Amazon EC2.

After you have filled in the create cluster form, press the “create cluster” button.

After several seconds, a screen like the one below will appear.

create_in_progress

While not necessary, you can log into your Amazon account, and select the view called EC2 Instances to watch the machine images you requested being created.

aws_status

After a few minutes, the Teraproc interface shows that the cluster called “mycluster” is available and ready for use. To access the cluster, expand the cluster by clicking on the blue arrow that points to the right.

cluster_available

When you expand the cluster view, you see the screen below. Note that the RStudio service URL is provided for you. You can click on this link at any time to access your cluster in the cloud. As long as your cluster is up and running, this link should be available.

running_clusters

You can click on the link to access your personal RStudio instance in the cloud. You can also perform operations like scaling your cluster up or down, pausing the cluster (so that you do not incur charges on Amazon) or terminating the cluster altogether, removing it from your Amazon account.

Log in and use the cluster

When you visit the link provided, you should see the RStudio login. Log in using the credentials you provided when you created the cluster.

rstudio_login

After this, you should be in R Studio. This interface will be very familiar to most R users.

To verify that behind-the-scenes cluster components such as OpenMPI and OpenLava are installed and configured properly, load the test script called “kmeans-batch.R” and run it by entering source("kmeans-batch.R") in the console pane. This script exercises the BatchJobs components and the OpenLava workload manager, validating that everything is working properly.

To exercise OpenMPI and run the same k-means example using Rmpi, use the “kmeans-mpi.R” script.

Please note: After the cluster starts, it may take some time for all of the OpenLava cluster hosts to come online, especially on larger clusters. You may need to wait a short time after logging in before the example programs work. Also, if you deploy a free cluster (with a maximum of three t2.micro instances), make sure that you run no more than two MPI slaves. Check the source code of the script to make sure that the number of MPI slaves corresponds to the size of the cluster that you have deployed.
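
The free-cluster rule above (three nodes, at most two MPI slaves) suggests a simple “nodes minus one” relationship, with one node left for the master process. The Python sketch below encodes that reading. It is an assumption generalized from the single example in this guide, not a documented Teraproc or Rmpi rule, and the function name is made up for illustration.

```python
def max_mpi_slaves(nodes: int) -> int:
    """Return the largest slave count under the 'nodes minus one' assumption.

    Assumption (not from Teraproc documentation): each node runs one MPI
    process and one of them is the master, so the rest can be slaves.
    """
    if nodes < 1:
        raise ValueError("a cluster needs at least one node")
    return nodes - 1


if __name__ == "__main__":
    # The free three-node cluster described above.
    print(max_mpi_slaves(3))  # 2
```

If your instances have several vCPUs, the real limit may well be higher; the point is only that the slave count requested in the script must fit the cluster you actually deployed.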

 

rstudio_console

If the k-means R scripts complete successfully, you should see the image below appear in the plots tab, confirming that the cluster is working and responding to requests. The sample program has run a k-means algorithm in parallel on the cluster to group 250,000 data points around their cluster centers.
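
For readers who have not met k-means before, the following Python sketch shows the basic idea on a small synthetic data set. It is not the contents of kmeans-batch.R or kmeans-mpi.R: it runs serially on one machine, uses far fewer points, and every name in it is invented for this illustration.

```python
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def squared_distance(a: Point, b: Point) -> float:
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def farthest_first_centers(points: Sequence[Point], k: int) -> List[Point]:
    """Pick k starting centers: the first point, then repeatedly the point
    that is farthest from every center chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centers))
        )
    return centers


def kmeans(points: Sequence[Point], k: int,
           max_iterations: int = 100) -> Tuple[List[Point], List[int]]:
    """Plain (Lloyd's) k-means: alternate assignment and center updates."""
    centers = farthest_first_centers(points, k)
    labels: List[int] = []
    for _ in range(max_iterations):
        # Assignment step: attach each point to its nearest center.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # No point changed cluster, so the result is stable.
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centers[j] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return centers, labels


def make_blobs(true_centers: Sequence[Point], points_per_center: int,
               seed: int = 0) -> List[Point]:
    """Generate Gaussian clouds (standard deviation 1) around each center."""
    rng = random.Random(seed)
    return [
        (rng.gauss(cx, 1.0), rng.gauss(cy, 1.0))
        for cx, cy in true_centers
        for _ in range(points_per_center)
    ]


if __name__ == "__main__":
    data = make_blobs([(0.0, 0.0), (10.0, 10.0), (-10.0, 10.0)], 300)
    found_centers, found_labels = kmeans(data, k=3)
    for center in found_centers:
        print(f"center at ({center[0]:.2f}, {center[1]:.2f})")
```

The assignment step is the expensive part and is independent for each point, which is why the work divides naturally across the nodes of a cluster.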

plot

If you’ve gotten this far, congratulations! Your R cluster is now up and running and ready for use.

Transferring data to and from the cluster

As you start using the cluster, at some point you will want to move data back and forth between your local computer and the Teraproc cluster. Advanced users may also wish to log in via secure shell (ssh) and run R or other commands directly on cluster nodes. Before you can log in, you will need to obtain the private RSA key that was created for you when the cluster was deployed. Follow the steps below to retrieve your private key.

  • Log in to RStudio
  • Under the Tools menu, select Shell to open a command shell on the cluster head-node
  • Change to the .ssh directory under your home directory and view the contents of your private key file. Your private key is stored in the file id_rsa in the directory ~/.ssh.

shell_in_rstudio

  • Copy the contents of this file (from the beginning of the line -----BEGIN RSA PRIVATE KEY----- through the end of the line -----END RSA PRIVATE KEY-----) using your browser’s copy function (Control-C on most Windows browsers).
  • Paste the contents of this file on your local machine into a file called “tproc_priv_key.pem” or some other name that you will remember on your local system.
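
If you would rather script the last step than paste into a text editor, the Python sketch below shows one way to do it. It checks that the pasted text starts and ends with the expected RSA private key marker lines and then writes the file with owner-only permissions, which OpenSSH clients generally insist on. The function name is hypothetical, and on Windows the permission call has only a limited effect.

```python
import os
import stat
from pathlib import Path

BEGIN_MARKER = "-----BEGIN RSA PRIVATE KEY-----"
END_MARKER = "-----END RSA PRIVATE KEY-----"


def save_private_key(pem_text: str, path: str = "tproc_priv_key.pem") -> Path:
    """Validate pasted key text and save it readable by the owner only."""
    # Drop stray whitespace that copy and paste tends to add.
    lines = [line.strip() for line in pem_text.strip().splitlines()]
    if not lines or lines[0] != BEGIN_MARKER or lines[-1] != END_MARKER:
        raise ValueError("pasted text is not a complete RSA private key block")
    target = Path(path)
    target.write_text("\n".join(lines) + "\n")
    # Restrict the file to read/write for the owner (mode 0600).
    os.chmod(target, stat.S_IRUSR | stat.S_IWUSR)
    return target
```

A call such as save_private_key(pasted_text), where pasted_text holds what you copied from the RStudio shell, leaves the key in tproc_priv_key.pem in the current directory.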

You will need an ssh client on your local machine for remote login and a program that supports scp or sftp protocols for secure copy.

If you are comfortable with command-line utilities, good choices on Windows are PuTTY or Cygwin. You should download PuTTY regardless of the tool you are using, because other programs on Windows may rely on utilities provided with the open source PuTTY package. If you prefer a tool on Windows with a graphical interface, you can use WinSCP. For Mac OS X users, OS X should include OpenSSH by default. Other open source SSH implementations are listed here: http://www.openssh.com/macos.html
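
For OpenSSH users, logging in and copying a file come down to two commands that take the key file with the -i option. The Python sketch below only assembles those command lines so you can see their shape; it does not run them. The user name, file name, and IP address are placeholders (203.0.113.10 is a documentation-only address), and the helper names are invented for this example.

```python
import shlex
from typing import List


def ssh_command(key_file: str, user: str, host: str) -> List[str]:
    """Build an OpenSSH login command that uses the saved private key."""
    return ["ssh", "-i", key_file, f"{user}@{host}"]


def scp_upload_command(key_file: str, local_path: str, user: str, host: str,
                       remote_path: str = "") -> List[str]:
    """Build an scp command that copies a local file to the cluster.

    An empty remote path means the user's home directory on the cluster.
    """
    return ["scp", "-i", key_file, local_path, f"{user}@{host}:{remote_path}"]


if __name__ == "__main__":
    # Placeholder values; substitute your own login and the RStudio server address.
    print(shlex.join(ssh_command("tproc_priv_key.pem", "alice", "203.0.113.10")))
    print(shlex.join(
        scp_upload_command("tproc_priv_key.pem", "data.csv", "alice", "203.0.113.10")
    ))
```

The printed lines can be pasted into a terminal, or the lists can be handed to subprocess.run if you want to drive the transfer from a script.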

  • When setting up your login session in your secure copy program (we use WinSCP in this example), the hostname will be the IP address of the RStudio server and the login credentials will be the same credentials you use to log in to RStudio. Enter them as shown.

winscp_setup

  • Next, you’ll need to save your private key in a format that your client program recognizes. WinSCP and PuTTY use the same private key format, and the PuTTYgen program converts your private key into that format. Save the private key using the interface below as “tproc_pkey.ppk”. WinSCP calls the PuTTY key generator for you, making this process straightforward.

putty_key_generator

  • In the settings for your file transfer program, make sure that you specify the location of the private key file associated with your account, and save the definition of your scp session.

advanced_settings

  • If you have followed these steps correctly, you should now be able to drag and drop files between your local machine and the Teraproc cluster host as shown below.

winscp_session