Realizing a more cost-efficient infrastructure
Most of us recall the notion of elasticity from Economics 101. Markets are about supply and demand, and when there is an abundance of supply, prices usually go down. Elasticity is a measure of how responsive one economic variable is to another, and in an elastic market the response is proportionately greater than the change in input.
What does this have to do with HPC or analytic clusters, you ask? It turns out that cloud pricing, on the margin at least, is pretty elastic. Like bananas in a supermarket, CPU cycles are a perishable commodity. If capacity sits idle and doesn’t get used, it simply goes away. Cloud providers are no dummies, and much like store owners who mark down the price of ripe bananas before they spoil, they would rather sell idle capacity for pennies on the dollar than have it go to waste.
To arbitrate the price of excess capacity, Amazon Web Services has created a spot market for purchasing machine cycles in addition to its on-demand and reserved instance pricing. You can essentially bid on machine capacity, and as long as your bid price exceeds the spot price, you get to use the machine. If I’m running an ERP system or a transactional database this market is probably not for me, but if I’m running an HPC cluster for research or non-critical simulations, it can make a lot of sense to buy machine cycles on the discount rack.
Savings can be dramatic
Consider a particular Amazon EC2 machine type – a c4.xlarge instance with four virtual CPUs and 7.5 GB of RAM. At the time of this writing, the on-demand price for this instance is $0.232 per hour. The benefit of paying “retail” (on-demand) is that the machine is yours and won’t go away.
If you’re willing to take your chances and bid for this machine type on the open market, however, the savings can be dramatic. The blue line on the chart below represents the spot price history for this machine on Amazon’s east coast servers. With a bid of just $0.04 per hour (4 cents), I would have been above the spot price, and so held onto my machine instances, 95 percent of the time, representing savings of over 80%. At a more conservative 15 cent bid price (roughly a 35% discount from on-demand rates) I would have had no interruption in service in this scenario. Markets are dynamic and complicated, but Amazon indicates that on average spot prices are 86% lower than on-demand prices. As with any market, prices can fluctuate (in some cases wildly), so one needs to be careful to bid only what one is willing to pay, but there are big savings to be had.
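To make the arithmetic behind these figures concrete, here is a minimal Python sketch. It assumes the bids are hourly prices comparable to the $0.232 on-demand rate, and it uses a made-up list of spot prices (sample_history) purely for illustration; the function names are hypothetical and the numbers are not Amazon's actual price history.

```python
ON_DEMAND_PRICE = 0.232  # USD per hour for a c4.xlarge, as quoted in the text


def discount_vs_on_demand(bid, on_demand=ON_DEMAND_PRICE):
    """Fraction saved if the bid price were paid instead of the on-demand price."""
    return 1.0 - bid / on_demand


def fraction_of_time_held(bid, spot_prices):
    """Share of price samples in which the bid exceeds the spot price."""
    held = sum(1 for price in spot_prices if bid > price)
    return held / len(spot_prices)


# Hypothetical hourly spot prices (illustrative only, with one price spike).
sample_history = [0.031, 0.033, 0.035, 0.032, 0.036,
                  0.034, 0.038, 0.033, 0.120, 0.035]

for bid in (0.04, 0.15):
    saved = discount_vs_on_demand(bid)
    held = fraction_of_time_held(bid, sample_history)
    print(f"bid ${bid:.2f}/hour: {saved:.0%} below on-demand, "
          f"held in {held:.0%} of samples")
```

A 4 cent bid works out to roughly 83% below the on-demand rate and a 15 cent bid to roughly 35% below it, which matches the discounts quoted above. How often a given bid would actually hold an instance depends entirely on the real price history, not on this sample.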
The importance of being elastic
HPC applications are often well suited to take advantage of spot pricing. While computer simulations may be important, they are not necessarily urgent. If I have a job that is going to run for a few hours on ten computers, it may not matter whether I run the job this morning or this afternoon. If the job gets interrupted, I can just start it again. Amazon’s Elastic Compute Cloud (EC2) service reflects elasticity in the name, and while the infrastructure is elastic, it does not necessarily follow that a deployed cluster is elastic. Software environments are often brittle and unable to tolerate dynamic changes in the underlying infrastructure while applications or models are running. Also, if it takes hours to manually re-constitute the software stack after spot-priced nodes are inevitably preempted, it makes more sense to use Amazon’s on-demand pricing. To take advantage of spot pricing, there are two essential prerequisites:
- An elastic software infrastructure – the application environment needs to be tolerant of shifts in available capacity at run-time
- Agility – in the event that cluster nodes are preempted by shifts in market pricing, there needs to be sufficient automation to rapidly re-constitute the running environment
Leveraging spot pricing with cluster-as-a-service offerings
Fortunately for HPC application users, services like Cycle Computing and Teraproc cluster-as-a-service check both of these boxes. For most distributed workloads (serial batch jobs, job arrays, and parametric sweeps) the OpenLava workload manager is tolerant of one or more nodes being lost at run-time. Cluster nodes can be added or removed at run-time, and interrupted jobs can be re-run automatically based on configurable policies.
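The idea of re-running interrupted jobs under a configurable policy can be sketched in a few lines of Python. This is a toy simulation, not OpenLava's actual interface or configuration: the function run_with_requeue, the preempted_attempts set, and the max_retries parameter are all hypothetical names chosen for illustration.

```python
from collections import deque


def run_with_requeue(jobs, preempted_attempts, max_retries=3):
    """Simulate running jobs in order, re-queuing any that are interrupted.

    preempted_attempts is a set of (job_name, attempt_number) pairs that are
    treated as interrupted by node preemption. A job is re-queued until it
    has been re-run max_retries times, after which it is reported as failed.
    """
    queue = deque((name, 1) for name in jobs)
    completed, failed = [], []
    while queue:
        name, attempt = queue.popleft()
        if (name, attempt) in preempted_attempts:
            if attempt <= max_retries:
                # Interrupted: put the job back at the end of the queue.
                queue.append((name, attempt + 1))
            else:
                failed.append(name)
        else:
            completed.append(name)
    return completed, failed


# Hypothetical example: the first attempt of "sim-2" loses its node.
done, lost = run_with_requeue(["sim-1", "sim-2", "sim-3"], {("sim-2", 1)})
print("completed:", done)
print("failed:", lost)
```

In this sketch the interrupted job simply goes to the back of the queue and finishes on its second attempt. A real workload manager would also have to track which nodes are still available and where job output lives.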
Even more importantly, in the event that spot-priced cluster nodes are preempted, the cluster head-node that houses applications and data continues to run, but as a single node cluster. This means that my applications, data and user accounts still exist. I can simply flex the cluster up again, naming a new price for nodes on the spot market. As shown in the Teraproc cluster-as-a-service provisioning interface below, AWS spot pricing is an integral part of the service. You can learn how clusters are automatically provisioned in the Teraproc service’s getting started guide.
The moral of the story? When it comes to infrastructure for HPC and analytic applications, it is not just the infrastructure that is elastic – the pricing model and the management software are elastic as well. The key to achieving financial savings is to have a similarly elastic and agile software environment. And when it comes to running distributed compute clusters in the cloud, it can pay to buy your bananas bruised!