Managing Parallel Jobs in OpenLava

Blaunch – a new parallel job remote task launcher in OpenLava 4.0. HPC environments are often complex by nature, with many moving parts, and this is especially true of parallel workloads. Making MPI jobs run reliably and predictably under the control of a workload manager can go a long way toward alleviating a range of potential problems and making the HPC environment more reliable and productive. In a perfect world, the process of launching and managing MPI tasks would be consistent across all workload managers and MPI implementations. In the real world, however, things are not always so simple. Th...
More

License Scheduling in OpenLava

In June of 2016 at the annual DAC conference, Teraproc previewed new license scheduling capabilities in OpenLava, explaining how EDA licenses could be shared among different users, design teams and projects on the same cluster. With the release of OpenLava 4.0, resource-based preemption has arrived, making OpenLava a much more compelling choice for EDA firms concerned about license management. New functionality in OpenLava supports not only flexible license sharing on the same cluster, but license sharing across multiple clusters as well. In this technical update to our earlier article ...
More

What’s New in OpenLava 4.0

OpenLava 4.0 is a significant new release that builds on the scalability improvements in previous releases, adding many new features. OpenLava has been enhanced in the following areas: more flexible resource limits; enhanced parallel job management; improved software license management and preemption; NUMA features and enhanced processor affinity controls; cluster management enhancements; fairshare scheduling enhancements; and support for job groups. Below, we provide a high-level overview of what’s changed for OpenLava cluster administrators and technical readers interested in understandin...
More

Teraproc OpenLava at DAC 2016

Thanks to all of our wonderful friends and customers for taking the time to stop by and say hello at this year's DAC conference in Austin, Texas. It was a highly worthwhile event! Thanks also to the new clients who have adopted OpenLava and are realizing the benefits of flexible, IBM® Spectrum LSF™ compatible open-source workload management for electronic design environments. In case you missed our announcements and technology demonstrations, or are looking for details about the latest OpenLava release, a few pointers below: OpenLava 3.3 new features EDA License Optimization in Open...
More

OpenLava 3.3 – New Features

Performance and scalability enhancements: As with previous releases, performance and scalability are a significant focus in OpenLava. OpenLava 3.3 provides enhancements that keep large clusters responsive even while processing large numbers of jobs. Some specific enhancements in OpenLava 3.3 are described below. Administrators familiar with OpenLava will be aware that the lsb.events file is used to log any change in status associated with hosts, jobs or queues. Similarly, the master batch daemon (mbatchd) generates a record in lsb.acct for accounting purposes whenever a...
More

Preview of License Optimization in OpenLava Enterprise Edition

Application licenses are precious resources in the design environment. It is always desirable to maintain a high level of license utilization while ensuring that high-priority jobs get licenses before low-priority jobs. A common approach to managing application licenses with a workload scheduler is to configure and track licenses as resources. Users then specify license requirements when submitting jobs. The scheduler holds the job in a queue if the resource (license) is not available. This simple approach prevents jobs from being dispatched from the queue only to fail at run time when li...
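The queue-and-hold approach described above can be illustrated with a minimal sketch (this is illustrative Python, not OpenLava source; the job and license names are hypothetical):

```python
# Minimal sketch of license-as-resource dispatch: each license is a
# counted resource, and a job is dispatched only when its requested
# license is available; otherwise it stays queued rather than failing
# at run time.
def dispatch(queue, licenses):
    """queue: list of (job_name, license_name) in priority order.
    licenses: {license_name: available_count}.
    Returns (dispatched, still_pending)."""
    dispatched, pending = [], []
    for job, lic in queue:
        if licenses.get(lic, 0) > 0:
            licenses[lic] -= 1        # reserve one license for this job
            dispatched.append(job)
        else:
            pending.append(job)       # hold in queue until a license frees up
    return dispatched, pending

# Three jobs requesting a hypothetical "hspice" license, two available:
ran, held = dispatch(
    [("sim1", "hspice"), ("sim2", "hspice"), ("sim3", "hspice")],
    {"hspice": 2},
)
# ran == ["sim1", "sim2"], held == ["sim3"]
```

Because the queue is walked in priority order, higher-priority jobs naturally claim licenses first, which matches the goal stated in the article.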
More

OpenLava 3.3 – Benchmarking one million jobs on a 100,000 core cluster

Note to reader: This article supersedes an earlier article on scalability testing featured in October 2015 on the Teraproc blog. With OpenLava 3.3, scalability has been enhanced significantly. As customers deploy OpenLava in ever larger environments, scalability, throughput and performance become increasingly important. To help meet customer requirements in these areas, OpenLava release 3.3 provides a number of important enhancements: parallelized job event handling to speed cluster start-up and minimize downtime; enhanced inter-daemon communications for improved efficiency and p...
More

Configuring OpenLava 3.2 for large clusters

Overview: OpenLava 3.2 is more scalable than previous OpenLava versions. When an OpenLava cluster has over 1,000 job slots, default configuration settings are no longer suitable, and you should tune the OS and OpenLava configuration parameters on the master host. This document discusses how to configure OpenLava for large environments. These recommendations were implemented in a recent test involving a 500-node cluster with 50,000 cores running 1,000,000 jobs of various durations. OpenLava master hardware recommendation: To enable OpenLava to schedule a large number of jobs, it is recommended...
More

New Features in OpenLava 3.2

A technical update on OpenLava 3.2 for cluster administrators. 1. Scalability improvements: 50k+ slots, 500k+ jobs per cluster. With OpenLava 3.2, Teraproc is increasing the supported job and slot limits for OpenLava Enterprise Edition. Any workload manager can claim to support large clusters, but what really matters is the ability to drive workload throughput and keep a large cluster fully utilized - a key point sometimes missed. With OpenLava 3.2, Teraproc is pleased to support these new limits, backed by a recent full-scale performance benchmark conducted on Amazon Web Ser...
More

Webinar: What’s new in OpenLava – March 3rd, 2016

OpenLava is an open-source, IBM® Platform LSF® compatible workload manager. Over the past two years, the pace of OpenLava development has been nothing short of amazing. Whether you are using IBM Platform LSF, OpenLava, or a competing workload manager, you'll want to learn about recent advances in OpenLava. Register now for our webinar on March 3rd, 2016 at 11:00 AM Eastern Time. In this free seminar, sponsored by Teraproc Inc. and OpenLava.org, you will learn about the advantages of open-source software and understand why top-tier global firms are using OpenLava to manage large clusters...
More

Meet OpenLava.org’s founder: Dave Bigagli

As anyone who has managed a development effort knows, building quality software takes focus, dedication and collaboration. In open-source development this is especially true. It is often said that “it takes a village”, so we thought it would be nice to take a moment to acknowledge the contributions of the respected “mayor” of our own OpenLava.org village, David Bigagli. David was a senior architect at Platform Computing between 1996 and 2010 and has also worked with other leading software firms in HPC and cluster computing, including SchedMD and Bright Computing. Since David started OpenLava....
More

OpenLava gets a new WebGUI

I’ve often heard from experienced Linux sysadmins that GUIs get in the way of doing real work. As someone who doesn’t always know the right commands to use, though, a well-designed web interface can be pretty handy. Recently I had a chance to look at the new web interface now offered with OpenLava Enterprise Edition, Teraproc’s commercially supported version of OpenLava. Teraproc continues to make investments in OpenLava and contribute improvements back to the open-source base (http://github.com/openlava). With all the investments Teraproc is making, it is understandable that they’ll want...
More

OpenLava enjoys new momentum

Since OpenLava was first offered as an open source tool almost a decade ago, there have been a little over 6,000 downloads. While accurate metrics are hard to come by, especially in the early years, the OpenLava.org site saw a steady rate of one or two downloads per day, not including clones of the sources from GitHub. In the last year, we've devised better ways to gather metrics, and Teraproc has been tracking free downloads of the compiled RPMs for OpenLava 2.2, 3.0 and now 3.1 releases. With new enhancements in OpenLava, we've measured a steady increase in traffic and a 300% in...
More

Testing OpenLava at Scale

Note to reader: This result has been superseded by a more recent benchmark on OpenLava 3.3. For the latest results, please see this new 1,000,000 job benchmark instead. To validate the latest OpenLava 3.1 release at scale, Teraproc recently ran a significant benchmark on its HPC Cluster-as-a-Service. The benchmark was designed to stress the OpenLava scheduler with a large workload representative of what OpenLava users might run in production. The goals of the benchmark were to demonstrate that: OpenLava can manage workloads at scale – 500,000 jobs scheduling across 5,600 co...
More

What’s New in OpenLava 3.1?

A technical update on the OpenLava 3.1 release for cluster administrators. 1. Resource requirement enhancements to support multiple application licenses (e.g., availability of License A or License B). The resource requirement string in OpenLava 3.1 has been enhanced to support an OR operator. In design and simulation environments, users may have multiple versions of the same software tool where license usage is metered by FlexLM. While some simulations require a particular version of the licensed tool, others may run equally well with multiple versions. In this latter case it is useful to indicate...
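As a hypothetical illustration of the OR operator described above (the resource names licA and licB are invented for this sketch, and the exact requirement-string syntax should be checked against the OpenLava 3.1 documentation):

```shell
# Submit a simulation that can run with either of two license versions,
# tracked as numeric resources licA and licB (names illustrative).
# The OR in the select clause lets the scheduler dispatch the job as
# soon as either license becomes available.
bsub -R "select[licA>0 || licB>0]" -J sim_run ./run_simulation.sh
```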
More

Ten ways to reduce the cost of your EDA infrastructure

As semiconductor design firms know well, infrastructure for EDA (Electronic Design Automation) is an expensive business. Traditional rules of thumb for IT costs don’t apply when the cost of tools and design talent dwarfs infrastructure costs. When it comes to EDA, productivity and efficiency are jobs one, two and three! If you’re managing a design environment, you’re probably running commercial tools from the likes of Cadence®, Synopsys® and Mentor Graphics. Thorough device simulation is critical to success at tape-out, so you’re probably also operating compute clusters comprised of the most ...
More

Running MPI Jobs with OpenLava

Introduction: OpenLava is an open-source, IBM® Platform LSF® compatible workload manager that can schedule both serial and parallel jobs. MPI (Message Passing Interface) is a programming interface widely used in High-Performance Computing (HPC) applications to parallelize the execution of large-scale problems. There are multiple commonly used MPI implementations. This document describes how to run MPI applications with OpenLava. Most MPI implementations support integration with commonly used workload managers. For the most part, these integrations use a workload-manager-specific remote tas...
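A typical submission under this model looks like the following sketch (the application name is hypothetical, and the exact launcher integration depends on the MPI implementation, as the article discusses):

```shell
# Request 16 slots from the scheduler and launch the MPI program with
# mpirun; an integrated MPI implementation picks up the allocated hosts
# from the scheduler rather than from a static hostfile.
# %J in the output file name expands to the job ID.
bsub -n 16 -o mpi.%J.out mpirun ./my_mpi_app
```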
More

GPU-Accelerated R in the Cloud with Teraproc Cluster-as-a-Service

Analysis of statistical algorithms can generate workloads that run for hours, if not days, tying up a single computer. Many statisticians and data scientists write complex simulations and statistical analysis using the R statistical computing environment. Often these programs have a very long run time. Given the amount of time R programmers can spend waiting for results, it makes sense to take advantage of parallelism in the computation and the available hardware. In a previous post on the Teraproc blog, I discussed the value of parallelism for long-running R models, and showed how multi-co...
More

Scaling R clusters? AWS Spot Pricing is your new best friend

An elastic infrastructure for distributed R Most of us recall the notion of elasticity from Economics 101. Markets are about supply and demand, and when there is an abundance of supply, prices usually go down. Elasticity is a measure of how responsive one economic variable is to another, and in an elastic market the response is proportionately greater than the change in input. It turns out that cloud pricing, on the margin at least, is pretty elastic. Like bananas in a supermarket, CPU cycles are a perishable commodity. If capacity sits idle and doesn’t get used, it goes away. Your cloud...
More

Why HPC Clusters are like Bananas

Realizing a more cost-efficient infrastructure Most of us recall the notion of elasticity from Economics 101. Markets are about supply and demand, and when there is an abundance of supply, prices usually go down. Elasticity is a measure of how responsive one economic variable is to another, and in an elastic market the response is proportionately greater than the change in input. What does this have to do with HPC or analytic clusters you ask? It turns out that cloud pricing, on the margin at least, is pretty elastic. Like bananas in a supermarket, CPU cycles are a perishable commodity. If...
More

Accelerating R with multi-node parallelism – Rmpi, BatchJobs and OpenLava

Gord Sissons, Feng Li. In a previous blog we showed how we could use the R BatchJobs package with OpenLava to accelerate a single-threaded k-means calculation by breaking the workload into chunks and running them as serial jobs. R users frequently need to find solutions to parallelize workloads, and while approaches like multicore and socket-level parallelism are good for some problems, when it comes to large problems there is nothing like a distributed cluster. The message passing interface (MPI) is a staple technique among HPC aficionados for achieving parallelism. MPI is meant to op...
More

Seeing the Forest and the Trees – a parallel machine learning example

Parallelizing Random Forests in R with BatchJobs and OpenLava. By Gord Sissons and Feng Li. In his series of blogs about machine learning, Trevor Stephens focuses on a survival model from the Titanic disaster and provides a tutorial explaining how decision trees tend to over-fit models, yielding anomalous predictions. How do we build a better predictive model? The answer, as Trevor observes, is to grow a whole forest of decision trees, let the models grow as deep as they will, and let these randomized models vote on the outcome. It turns out that a large collection of imperfect models lead...
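The voting step described above can be sketched in a few lines (illustrative Python rather than the article's R code; the sample labels are invented):

```python
# Majority voting over a collection of independently grown decision
# trees: each tree predicts a label per sample, and the forest's answer
# for each sample is the most common prediction.
from collections import Counter

def forest_vote(tree_predictions):
    """tree_predictions: list of per-tree prediction lists, one entry
    per sample. Returns the majority-vote label for each sample."""
    votes_per_sample = zip(*tree_predictions)   # group votes by sample
    return [Counter(votes).most_common(1)[0][0] for votes in votes_per_sample]

# Three imperfect trees voting on four samples:
preds = [
    ["survived", "died", "died", "survived"],
    ["survived", "survived", "died", "died"],
    ["survived", "died", "died", "survived"],
]
print(forest_vote(preds))  # → ['survived', 'died', 'died', 'survived']
```

Each individual tree errs on some samples, but the majority vote corrects them, which is the intuition behind growing "a whole forest" rather than one tree.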
More

Parallel R with BatchJobs

Parallelizing R with BatchJobs - an example using k-means. Gord Sissons, Feng Li. Many simulations in R are long running. Analysis of statistical algorithms can generate workloads that run for hours, if not days, tying up a single computer. Given the amount of time R programmers can spend waiting for results, getting acquainted with parallelism makes sense. In this first in a series of blogs, we describe an approach to achieving parallelism in R using BatchJobs, a framework that provides Map, Reduce and Filter variants to generate jobs on batch computing systems running across clustered comput...
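The Map/Reduce pattern described above can be sketched in a language-agnostic way (illustrative Python, not the article's R/BatchJobs code; the "restart" function is a stand-in for one k-means run):

```python
# Map/Reduce over independent restarts: map runs each restart as its
# own task (BatchJobs would submit each as a separate batch job),
# reduce keeps the best (lowest-cost) result.
from concurrent.futures import ThreadPoolExecutor
import random

def one_restart(seed):
    """Stand-in for a single k-means run with a given random seed:
    returns (cost, model). Cost imitates a within-cluster sum of squares."""
    rng = random.Random(seed)
    cost = rng.uniform(10, 100)
    return cost, {"seed": seed}

def parallel_best(n_restarts, workers=4):
    # Map: each restart is an independent, embarrassingly parallel task.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(one_restart, range(n_restarts)))
    # Reduce: keep the restart with the lowest cost.
    return min(results, key=lambda r: r[0])

best_cost, best_model = parallel_best(8)
```

On a real cluster the map step would fan the restarts out across nodes, which is exactly what BatchJobs does when it submits each chunk as a serial OpenLava job.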
More

Exploring Fairshare Scheduling in OpenLava

Sharing is good. Whether we're sharing a soda, an apartment or an HPC cluster, chances are good that sharing can save us money. As readers of my previous blog will know, I've been doing some playing around with OpenLava. OpenLava is an LSF-compatible workload manager that is free to use and downloadable from http://openlava.org or http://teraproc.com. One of the new features in OpenLava 3.0 is fairshare scheduling. I know a lot of clients see value in this, so I decided to set up another free cluster in the cloud for the purpose of trying out OpenLava 3.0's new fairshare scheduler. For tho...
More

Early access for R CaaS

Teraproc announces early registration for our R Cluster-as-a-Service offering. It's the eleventh hour so hurry up and secure your space! Learn more about the service here. As data scientists and statisticians know, R is an excellent language for analytic problems. For large scale problems, configuring distributed Hadoop or compute clusters can be a challenge. Talented technical people can spend days or weeks building out distributed clusters, assembling all the needed software components and re-inventing wheels. Getting all the components to work together can be a daunting task. ...
More