As semiconductor design firms know well, infrastructure for EDA (Electronic Design Automation) is an expensive business. Traditional rules of thumb for IT costs don’t apply when the cost of tools and design talent dwarfs infrastructure costs. When it comes to EDA, productivity and efficiency are jobs one, two and three!
If you’re managing a design environment, you’re probably running commercial tools from the likes of Cadence®, Synopsys® and Mentor Graphics. Thorough device simulation is critical to success at tape-out, so you’re probably also operating compute clusters built from the most powerful hardware you can afford. You may also be running IBM Platform LSF as a workload scheduler, a popular choice for EDA firms. If this sounds like your environment, read on, as chances are good that at least one of the cost savers presented here will be relevant to you.
1) Save money with faster hardware – This sounds counter-intuitive, but spending a little more on hardware can save a lot on licenses. EDA tools are commonly licensed based on features that are checked out when a tool runs. Feature usage is generally metered by FlexNet licensing software (formerly FLEXlm) from Flexera® Software. Simulation and regression tests are by far the most resource-intensive workloads, and for these jobs performance matters. Costs vary, but high-value license features can cost between $20K and $50K per year for a single feature, and many firms license tens or even hundreds of features for widely used tools like Synopsys VCS or Mentor Graphics ModelSim. Consider a regression run made up of 100,000 simulation jobs. Let’s assume my annual budget for a particular license feature costing $30K per feature per year is $3M. This means I am constrained to run 100 jobs concurrently no matter how much hardware I have. Let’s further assume that each job takes 20 seconds (of license checkout time) to execute, so each license can process 180 jobs per hour. Running 100,000 jobs under ideal conditions would then take 100,000 jobs / (180 jobs per hour per license × 100 concurrent licenses) = 5.6 hours. Now imagine the servers were twice as fast and could complete a job in just 10 seconds. A design firm using faster servers would have the choice of either doubling its productivity (running the workload in 2.8 hours) or finishing in the same 5.6 hours with 50 fewer license features, saving up to $1.5 million annually, not counting un-needed infrastructure. A hundred concurrent jobs is not a lot of hardware given modern multi-core architectures – probably no more than five to ten servers. $1.5 million in software license avoidance leaves a lot of money to upgrade a few servers. The calculus in production environments will be more complex, and this example misses some finer points to be sure, but when license costs dominate, faster hardware is almost always the better buy.
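The arithmetic above can be sketched in a few lines of Python. This is a simplified model under the same ideal conditions as the example (every license busy all the time, no queueing or startup overhead). The function names are illustrative and not part of any licensing tool.

```python
import math


def makespan_seconds(total_jobs, seconds_per_job, licenses):
    """Wall-clock seconds to run every job when concurrency is capped by licenses."""
    return total_jobs * seconds_per_job / licenses


def licenses_for_deadline(total_jobs, seconds_per_job, deadline_seconds):
    """Fewest licenses that still finish all jobs within the deadline."""
    return math.ceil(total_jobs * seconds_per_job / deadline_seconds)


# Figures taken from the example in the text.
LICENSE_COST = 30_000   # dollars per feature per year
BUDGET = 3_000_000      # dollars per year
TOTAL_JOBS = 100_000

licenses = BUDGET // LICENSE_COST                       # concurrent jobs allowed
baseline = makespan_seconds(TOTAL_JOBS, 20, licenses)   # 20-second jobs
faster = makespan_seconds(TOTAL_JOBS, 10, licenses)     # 10-second jobs

# Option B: keep the original turnaround time and give back licenses instead.
needed = licenses_for_deadline(TOTAL_JOBS, 10, baseline)
savings = (licenses - needed) * LICENSE_COST

print(f"licenses in budget: {licenses}")
print(f"baseline run:       {baseline / 3600:.1f} hours")
print(f"2x faster servers:  {faster / 3600:.1f} hours")
print(f"licenses needed to match baseline: {needed}")
print(f"annual license savings: ${savings:,}")
```

Changing the job duration or license cost shows how sensitive the savings are to your own workload; real clusters will see less than the ideal figure because licenses are rarely 100% utilized.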
2) Revisit the cloud – If you’re managing a design environment, you may be jaded about the economics of cloud for good reason. Cloud services are often sold as less costly, but for organizations with a mature IT function, running workloads in the cloud is often considerably more expensive than running servers on premises, even factoring in the cost of personnel, power and facilities. The point above about the cost of EDA tools should give us pause, however. What matters in EDA environments is access to state-of-the-art large-memory servers. AWS and other cloud providers introduce new infrastructure constantly, which makes it easier to stay current with the latest server hardware. If on average I can be running on servers that are 25% to 30% faster, lower average spending on software licenses alone may justify the “cloud tax” – not to mention savings that accrue from avoiding the hassle and risk of managing and constantly upgrading your own infrastructure. Companies doing TCO comparisons often depreciate on-premises hardware assets over three years and compare this to the cost of cloud infrastructure. This may not be an apples-to-apples comparison, though. To be fair, you might want to consider the cost of upgrading your infrastructure every year or so, something that is easy to do in the cloud. Suddenly the cloud may start to look price competitive once more. Your tools vendor may not allow their simulation licenses to be used in the cloud, but if they do, or if you are using open source tools, the cloud is worth a look. You can try an open source cluster in the cloud for free using Teraproc’s HPC Cluster-as-a-service offering at http://teraproc.com
3) Reduce workload management costs – While the costs of tools and infrastructure are dominant, workload management can be a significant cost as well. Platform Computing, the developer of the Platform LSF scheduler, was acquired by IBM in 2012. Platform LSF is the choice of EDA firms for good reason. It is simply the best solution when it comes to reliably managing large clusters with high job throughput and resource utilization. It also provides an expressive resource selection syntax that lets administrators be granular in selecting hardware resources to optimize run-time efficiency and keep infrastructure and license utilization high. Features like fair-share scheduling and pre-emption ensure that scarce resources are assigned to the most business-critical jobs. For smaller firms that need an LSF-compatible scheduler with similar capabilities, but without all the bells and whistles, OpenLava is worth a look. OpenLava is a free, open source scheduler that is command-line and file-format compatible with LSF, meaning that the majority of open-source and commercial EDA application integrations are readily portable to OpenLava without modification. Commercial support is now available from Teraproc – a firm made up of ex-Platform Computing engineers with deep knowledge of the open source version of IBM’s flagship scheduler. Using a scheduler that is now and forever “free” represents a significant savings opportunity, especially when savings are projected over multiple years.
4) Avoid costly and disruptive migrations – Faced with the cost challenges described above, some organizations have been considering migrations from LSF to alternative workload schedulers including Univa Grid Engine, Moab or Condor. As any IT person knows, touching old code and scripts can have unforeseen consequences, and migration projects almost always turn out to be harder and more complex than they seem at first. The beauty of migrating to OpenLava is that migration costs and business risk can both be minimized. While there will likely be some issues, EDA users migrating from IBM Platform LSF to free, open-source OpenLava can expect a far smoother migration than clients moving to an incompatible workload manager, which demands that internal scripts be re-written and that users and cluster administrators be re-trained.
5) Bend the cost curve with cheaper node-based support pricing – As EDA customers know well, server nodes are getting denser, supporting more cores and threads per socket. If you are replacing your previous generation of servers with new machines using Intel’s new Haswell E5-2600 v3 series processors, chances are you’re moving from 4-8 cores per socket to servers supporting up to 18 cores per socket, depending on the specific hardware and requirements of your environment. While license terms will vary, IBM normally licenses Platform LSF on a per-core metric – resource-value units or RVUs in IBM’s terminology. This means that doubling the number of cores per server doubles your LSF support price per server. If you grow the number of cores, you may even be forced to acquire new perpetual licenses to accommodate the growth, even though the size of your cluster in terms of physical nodes remains the same. OpenLava support pricing is node based, meaning that you are protected from this stealth form of price inflation as compute nodes become more powerful. In fact, as processors become more powerful on a per-socket basis and are able to support more concurrent jobs, the economics turn in your favor and support costs per job actually decrease as technology advances.
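To make the per-core versus per-node comparison concrete, here is a rough sketch. The prices, the ten-node cluster, the two-socket servers and the one-job-per-core assumption are all hypothetical placeholders, and a flat per-core price is a simplification of how RVU-based terms work in practice; substitute the figures from your own contracts.

```python
def per_core_support(nodes, sockets_per_node, cores_per_socket, price_per_core):
    """Annual support cost when pricing scales with the total core count."""
    return nodes * sockets_per_node * cores_per_socket * price_per_core


def per_node_support(nodes, price_per_node):
    """Annual support cost when pricing depends only on the number of nodes."""
    return nodes * price_per_node


# Hypothetical placeholder values, not vendor list prices.
NODES = 10
SOCKETS_PER_NODE = 2
PRICE_PER_CORE = 100     # dollars per core per year (assumed)
PRICE_PER_NODE = 1_000   # dollars per node per year (assumed)

for cores_per_socket in (8, 18):
    job_slots = NODES * SOCKETS_PER_NODE * cores_per_socket  # one job per core
    core_based = per_core_support(NODES, SOCKETS_PER_NODE,
                                  cores_per_socket, PRICE_PER_CORE)
    node_based = per_node_support(NODES, PRICE_PER_NODE)
    print(f"{cores_per_socket:>2} cores/socket: "
          f"per-core ${core_based:,}, per-node ${node_based:,}, "
          f"per-node cost per job slot ${node_based / job_slots:.2f}")
```

Under these assumptions the per-core bill grows in step with core density while the per-node bill stays flat, so the per-node cost per job slot falls as servers get denser.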
6) Provision for average rather than peak demand – We’ve all considered the idea of bursting into the cloud for specific workloads, and of course the trade-offs are complex. The practicality of this will depend on your level of comfort pushing proprietary design data into the cloud and on the nature of your application licenses, along with other considerations like dataset sizes. For firms with “spiky” design cycles it can make good business sense to maintain a core amount of capacity on premises and burst to the cloud only when needed. This assumes, however, that your firm has the expertise to rapidly and seamlessly scale up capacity with cloud infrastructure as needed. Fortunately, firms like Cycle Computing and Teraproc can help here, making it seamless to tap cloud-based resources and expand your on-premises cluster. Before you discard this idea based on security concerns, remember that IP packets don’t read the addresses on the brick-and-mortar buildings they traverse. For smaller firms running their own infrastructure, it may be that storing data in the public cloud is actually more secure owing to the cloud provider’s management expertise and security practices. Cloud-bursting will certainly not be appropriate for many applications, but there may be some workloads in your environment that can benefit from low-cost capacity on demand, including training clusters, test and development clusters and the like.
7) Provision clusters faster, and avoid paying for non-productive cycles – Let’s assume for a moment you’ve decided that running peak demand workloads in the cloud to augment your onsite cluster makes sense. If you need 1,000 cores to run a large series of verification or regression tests for four hours, the last thing you want to do is spend two days setting up your cluster for four hours’ worth of work. This is a rather obvious but under-appreciated consideration that dramatically impacts cost. In this plausible example, if the cluster is running for the whole two-day setup, we pay for 52 hours of capacity to get four hours of useful work, overspending by more than an order of magnitude. Efficient cloud-bursting demands agile systems that can provision on-demand clusters in minutes, not hours, to keep costs low. Whereas these solutions were hard to come by a few years ago, there are many good solutions today. Combine agile provisioning with spot pricing (discussed below) and now you’re cooking with gas and on your way to saving significant cost in the cloud.
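The effect of setup time on cost can be sketched as follows. This assumes the full cluster is running (and billed) for the entire setup period, and the ten-minute provisioning figure is a hypothetical target rather than a measured result.

```python
def billed_core_hours(cores, setup_hours, run_hours):
    """Core-hours paid for when the cluster is up during both setup and the run."""
    return cores * (setup_hours + run_hours)


def overspend_factor(setup_hours, run_hours):
    """Ratio of billed time to productive time (1.0 means no waste)."""
    return (setup_hours + run_hours) / run_hours


CORES = 1_000
RUN_HOURS = 4

scenarios = {
    "two-day manual setup": 48,            # hours, from the example in the text
    "ten-minute provisioning": 10 / 60,    # hours, hypothetical target
}

for label, setup_hours in scenarios.items():
    total = billed_core_hours(CORES, setup_hours, RUN_HOURS)
    factor = overspend_factor(setup_hours, RUN_HOURS)
    print(f"{label}: {total:,.0f} core-hours billed, "
          f"{factor:.2f}x the productive core-hours")
```

Multiplying the billed core-hours by your provider’s hourly rate gives the dollar figure; the ratio itself does not depend on the price.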
8) Consider taking advantage of spot pricing – For those not familiar with spot pricing, this is a pricing mechanism available on Amazon Web Services that allows customers to “bid” for server capacity and name their own price in an open market. For EDA sites the considerations are complex: EDA simulations are mission critical and software licenses are expensive, so you definitely don’t want your workload pre-empted when it is 90% done. According to Amazon, the average spot pricing user can save 80% on their infrastructure costs. For open-source EDA tasks this can make a lot of sense depending on dataset sizes, but licensed jobs need more consideration. The savings may still be compelling, however. By choosing an appropriate spot price you can obtain server capacity below Amazon’s standard rates and save money. See Teraproc’s blog article “AWS spot pricing is your new best friend” for a discussion of the dynamics of spot pricing and the pros and cons.
9) Consolidate clusters for big savings – Many EDA firms run separate compute clusters to serve different locations. While fast networks make it increasingly practical to share global infrastructure, you may lack the ability to enforce policy across sites around how expensive license features are shared. Consider the case of a cluster in San Jose and a cluster in Bangalore sharing 100 licenses. Within each cluster, administrators can define policies to ensure that licenses are shared effectively for near 100% utilization. Sharing license resources across clusters, with appropriate pre-emption logic to ensure that low-value jobs do not consume high-value assets, is much harder. Consolidating to a single cluster, whether at a single location or in the cloud, can solve this problem, helping licenses be shared more efficiently for potentially dramatic savings.
10) Use License Scheduling – One of the key reasons that IBM Platform LSF is so widely used in EDA companies is that it provides license scheduling capabilities, allowing license features to be allocated based on policies. License features may be borrowed when idle and reclaimed by entitled users or groups when they are needed. License scheduling is a complex business to be sure, but often the 80/20 rule applies. Sometimes more efficient sharing of just a single license feature is all that’s needed to dramatically reduce costs. Because License Scheduling is new to OpenLava, users requiring this capability may want to consult with Teraproc to ensure that evolving functionality in this area of OpenLava supports the specific use cases they have in mind.