If you want to skip the details and deploy this solution in Azure, here is a link to my GitHub repo with everything used in this post: https://github.com/covid19folding/AzureSpotVMWorker
Azure Spot VMs are a special Azure offering that provides major savings over traditional VMs (sometimes up to 90%!) but comes with no guaranteed up-time. Think of them as the leftover resources in an Azure datacenter that don’t get allocated – you get to use those at a fraction of the cost until someone else needs them. Spot VMs won’t be a good fit for many applications, but they are excellent for things like batch jobs, rendering farms, or compute workloads that can be stopped and resumed.
How does Spot VM eviction work? A limit is set for every Spot VM when it is deployed, and the VM instance gets evicted (deallocated) when that limit is breached. There are two types of limits. The first is a limit on availability – when capacity for your VM’s SKU runs out in a particular Azure region, your Spot VM will be the first to go. You get no guarantee on availability with Spot Instances, but that’s why they are a fraction of the price. The second type of limit is based on price. You can set a maximum price you want to pay per hour for a Spot VM, and as demand goes up for that SKU, so will the price. If your price limit is breached or if the SKU becomes unavailable, then the VM gets deallocated.
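To make the two limits concrete, here is a minimal Python sketch of the eviction decision. It is a toy model of the rules described above, not Azure's actual logic. The names `SpotVM` and `eviction_reason` are my own, and the prices are made-up numbers. The one real detail is the value -1, which Azure uses as the maximum price to mean "never evict on price".

```python
from dataclasses import dataclass
from typing import Optional

# Azure uses -1 as the max price to mean "evict on capacity only".
CAPACITY_ONLY = -1.0


@dataclass
class SpotVM:
    name: str
    max_price_per_hour: float = CAPACITY_ONLY  # USD per hour, or CAPACITY_ONLY


def eviction_reason(vm: SpotVM, current_price: float,
                    capacity_available: bool) -> Optional[str]:
    """Return why the VM would be evicted, or None if it keeps running."""
    # Availability limit: applies to every Spot VM, whatever its price setting.
    if not capacity_available:
        return "capacity"
    # Price limit: applies only when a maximum price was set.
    if vm.max_price_per_hour != CAPACITY_ONLY and current_price > vm.max_price_per_hour:
        return "price"
    return None


# Illustrative prices only, not real Azure rates.
priced = SpotVM("worker-01", max_price_per_hour=0.05)
unpriced = SpotVM("worker-02")

print(eviction_reason(priced, current_price=0.03, capacity_available=True))     # None
print(eviction_reason(priced, current_price=0.08, capacity_available=True))     # price
print(eviction_reason(unpriced, current_price=0.08, capacity_available=True))   # None
print(eviction_reason(unpriced, current_price=0.08, capacity_available=False))  # capacity
```

Note that a VM without a price limit still pays the going Spot rate as it rises; it just never gets evicted for that reason.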
There is a lot more info, including a FAQ, on Spot VMs on Microsoft’s official site linked above.
Given current events with the global pandemic of the COVID-19 Coronavirus, I have decided to deploy Azure Spot VMs to run client workloads for the Folding@home project at Stanford University. If you haven’t heard of Folding@home, it is a distributed computing project that leverages client CPU and GPU cycles to simulate protein folding. The project has been directly responsible for published papers and progress on cures for Alzheimer’s, cancer, other diseases, and now the Coronavirus. It’s a project that is very worthy of your support if you have some extra CPU/GPU cycles available.
Spot VMs are a great fit for Folding@home clients because they are so inexpensive and don’t require 100% up-time. It wouldn’t be economical to run folding workloads on standard VMs, but if we can run them at 1/10th of the cost for 23 hours out of the day, that is quite a bargain. Even more, there are now GPU SKUs available for Spot Instances, which are much more efficient at folding than CPUs alone. The folding client works well in this scenario as it runs on its own and resumes workloads automatically.
Before writing this post, I deployed 20 Folding@home workers in Azure to evaluate cost and performance and see whether this would even be economical. I used the following configuration on my Spot VMs to maximize performance versus cost. I won’t cover these in depth.
- VM size: Standard NV4as_v4 (4 vcpus, 14 GiB memory) – Azure Spot SKU with GPU.
- Standard HDD disks – the cheapest disk available which will not impact CPU/GPU folding performance.
- Gen 2 – this configuration allows for faster VM start/stop times.
- Azure Bastion – this service allows use of a single IP to remotely access all VMs through the Azure portal.
- OS: Windows 10 Pro – to optimize GPU performance with client OS drivers. No server infrastructure required.
- Azure Region: South Central US – only 2 regions currently support Spot VMs, and South Central currently has the lowest demand.
- Automation: a simple automation runbook from the Azure gallery will be used to start all available VMs every hour in case they get evicted.
- Virtual Network – a simple virtual network with a single subnet was used for this deployment.
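For reference, the Spot-specific choices in this list map onto a handful of properties on the VM resource in an ARM template. The sketch below builds that fragment as a Python dictionary and prints it as JSON. The property names (`priority`, `evictionPolicy`, `billingProfile.maxPrice`, `storageAccountType`) come from the ARM schema for virtual machines, but the helper `spot_vm_properties` is my own illustration and the template in the repo may be laid out differently.

```python
import json


def spot_vm_properties(vm_size: str = "Standard_NV4as_v4",
                       max_price: float = -1) -> dict:
    """Build the Spot-related part of a VM resource's "properties" section."""
    return {
        "hardwareProfile": {"vmSize": vm_size},
        # "Spot" is what turns a regular VM into a Spot VM.
        "priority": "Spot",
        # Deallocate on eviction so the VM and its disk can be restarted later.
        "evictionPolicy": "Deallocate",
        # -1 means evict on capacity only; a positive value is a USD/hour cap.
        "billingProfile": {"maxPrice": max_price},
        "storageProfile": {
            "osDisk": {
                "createOption": "FromImage",
                # Standard_LRS is the Standard HDD disk type.
                "managedDisk": {"storageAccountType": "Standard_LRS"},
            }
        },
    }


print(json.dumps(spot_vm_properties(), indent=2))
```

This is only a fragment: a deployable template also needs the OS profile, image reference, and network interface.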
Deploying 20 workers manually through the Azure portal would take quite a bit of time. Using ARM templates for deployment is much faster and will help keep your configuration consistent. The ARM template and parameter JSON file I used for this deployment can be found in the GitHub repo here: https://github.com/covid19folding/AzureSpotVMWorker
You may find it easier to deploy this template from the Azure portal where you can easily modify settings and view the deployment status, but know there are several different ways to deploy these templates. To deploy the template from the Azure portal, search for “template” in the top search bar and open Templates.
Click Add Template and give it a name and description.
Paste the template.JSON contents into the ARM Template.
Click Add to complete the template creation. Select the new template and choose Deploy.
At the custom deployment screen, open “Edit Parameters.” Paste or import the parameters.JSON in the link above for the easiest configuration. You will need to update these parameters to match your own Azure resources – the networkSecurityGroupId, virtualNetworkId, virtualMachineRG, and subnetName. You should also update the networkInterfaceName and virtualMachineName to follow your own naming scheme. Save the parameters when done.
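If you are deploying many workers, editing the parameters by hand for each one gets tedious. Here is a minimal Python sketch that generates one parameter set per worker. It assumes the standard ARM parameter file layout, where each value is wrapped as {"value": ...}. The parameter names are the ones listed above, while the `fold-worker-NN` naming scheme, the helper `worker_parameters`, and the placeholder values are hypothetical and should be replaced with your own.

```python
import copy
import json


def worker_parameters(base: dict, index: int, overrides: dict) -> dict:
    """Return a copy of an ARM parameter file customized for one worker VM."""
    params = copy.deepcopy(base)  # leave the caller's base file untouched
    values = dict(overrides)
    # Hypothetical naming scheme: fold-worker-01, fold-worker-01-nic, ...
    values["virtualMachineName"] = f"fold-worker-{index:02d}"
    values["networkInterfaceName"] = f"fold-worker-{index:02d}-nic"
    for name, value in values.items():
        # ARM parameter files wrap every value as {"value": ...}.
        params["parameters"][name] = {"value": value}
    return params


# In practice, load the base from parameters.JSON with json.load().
base = {
    "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentParameters.json#",
    "contentVersion": "1.0.0.0",
    "parameters": {},
}

# Placeholders: substitute the IDs and names of your own Azure resources.
shared = {
    "networkSecurityGroupId": "<your-nsg-resource-id>",
    "virtualNetworkId": "<your-vnet-resource-id>",
    "virtualMachineRG": "<your-resource-group>",
    "subnetName": "<your-subnet-name>",
}

all_workers = [worker_parameters(base, i, shared) for i in range(1, 21)]
print(json.dumps(all_workers[0]["parameters"], indent=2))
```

Each entry in `all_workers` can be written to its own file with `json.dump` and pasted or imported at the Edit Parameters step.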
Fill in the Resource Group and enter a Password for your login. When ready, accept the terms and conditions and purchase.
You can view the deployment status in the notification area. Open it to view the details. Any errors will also show here.
Note: this deployment will not configure a Public IP to access your VM. This is to attain the lowest cost possible for worker performance. You may need to add a Public IP to your VM for remote access. If deploying several VMs, you may want to use Azure Bastion to access all VMs remotely in a secure fashion with a single Public IP.
Go to your new VM resource when the deployment is complete.
Here you can see the folding worker VM was deployed using an Azure Spot instance, in the proper region, with the proper size and name from the provided parameters.
The last step is to create an Azure Automation account to handle the automatic start of this VM and any others added to the resource group. This is essential for Spot VMs, since they can be evicted and deallocated at any time if there’s no availability. Create an Automation resource from the portal.
Open your new Automation resource and find the “Start Azure V2 VMs” runbook in the Runbook gallery.
Import this runbook to use it in your environment. This particular runbook has the ability to start all VMs in your folding resource group.
Find your new runbook and open it.
Create a new schedule for your runbook and enter the details for your resource group, start date, and recurrence. I used an hourly recurrence, which was the most frequent available.
You can verify that your runbook is working properly by reviewing the Jobs blade. Each run will start any deallocated worker VM if capacity is available.
Your VM will now be able to stop and start on its own in an automated fashion. This is perfect for an Azure Spot workload.
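The effect of each hourly run can be sketched in a few lines of Python. This is not the gallery runbook's code (that one is a PowerShell runbook); it only illustrates the selection step, assuming you already have each VM's name, resource group, and power state. The dictionary keys and the function `vms_to_start` are hypothetical, while the `PowerState/...` strings are the status codes Azure reports for VMs.

```python
from typing import Iterable, List, Mapping

DEALLOCATED = "PowerState/deallocated"


def vms_to_start(vms: Iterable[Mapping[str, str]], resource_group: str) -> List[str]:
    """Pick the VMs in one resource group that are currently deallocated."""
    return [
        vm["name"]
        for vm in vms
        # Resource group names are case-insensitive in Azure.
        if vm["resourceGroup"].lower() == resource_group.lower()
        and vm["powerState"] == DEALLOCATED
    ]


# Made-up inventory for illustration.
inventory = [
    {"name": "fold-worker-01", "resourceGroup": "folding-rg", "powerState": "PowerState/running"},
    {"name": "fold-worker-02", "resourceGroup": "folding-rg", "powerState": "PowerState/deallocated"},
    {"name": "other-vm", "resourceGroup": "other-rg", "powerState": "PowerState/deallocated"},
]

print(vms_to_start(inventory, "Folding-RG"))  # ['fold-worker-02']
```

A start request for an evicted Spot VM can still fail when there is no capacity, which is why running this on a recurring schedule matters: the next run simply tries again.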
VM Configuration for Folding
I won’t go into too much detail in this section because it’s pretty basic, especially for anyone who has run the Folding@home client before. The software install should be automated using PowerShell or a configuration management tool for anything at scale. You could also build it into a custom VM image. These VMs are essentially being deployed as containers for the folding workload.
First, you’ll need to connect to your VM using RDP or Bastion. Some configuration may be required for this.
The latest GPU drivers for the NV4as_v4 instance are available from Microsoft here: https://docs.microsoft.com/en-us/azure/virtual-machines/windows/n-series-amd-driver-setup
Since these are Azure Spot VMs that may be evicted at any time and then restarted from our automation runbook, we want these to log in and start working automatically. To take care of the automatic login, run the Sysinternals Autologon tool found here on the VM: https://docs.microsoft.com/en-us/sysinternals/downloads/autologon
Finally, you can install the Folding@home client for the VM here: https://foldingathome.org/start-folding/ . I found that the Express install works fine, but you may need to configure the client to start working automatically. If you need a team to join, feel free to join ours – 236437!
So what’s the performance like? Folding@home uses a point system where you earn points for completed CPU/GPU workloads on your local client. Each worker VM in this tutorial earns around 12,000 folding points per day when under full load. This is quite a bit lower than a physical workstation with a modern CPU/GPU – those typically earn closer to 100,000-200,000 points per day. But when you factor in the scale you can achieve through cloud computing combined with the low cost of Spot VMs, you can see how easily this can grow into a powerful contributor to the Folding@home project.
It’s important to note that Azure Spot pricing is variable and totally based on regional demand. While it will never exceed the cost of a standard non-Spot VM, you probably don’t want to operate anywhere near that cost if possible. To mitigate this, you can deploy your VM with a price limit instead of the capacity-only limit used in this tutorial. You can see this option when deploying a Spot VM manually through the portal. This also allows you to compare pricing across regions for the selected VM size.
In this environment with 20 Spot VMs running in South Central, I was able to get the cost per VM down to around $0.40/day. For comparison, a non-Spot VM at the same size is currently priced around $5.60/day. That is a massive amount of savings! Even lower costs could be achieved by using a smaller custom disk for each VM compared to what is offered in the Azure gallery. I recommend monitoring daily costs for Spot VMs in Azure Cost Management.
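To put those numbers side by side, here is the arithmetic as a short Python sketch. It uses only the figures quoted in this post (20 workers, roughly $0.40 versus $5.60 per VM per day, and about 12,000 points per worker per day); actual Spot prices vary, so treat the results as a snapshot.

```python
WORKERS = 20
SPOT_COST_PER_DAY = 0.40      # USD per VM per day (approximate, from this post)
STANDARD_COST_PER_DAY = 5.60  # USD per VM per day (approximate, from this post)
POINTS_PER_DAY = 12_000       # folding points per worker per day at full load


def savings_percent(spot: float, standard: float) -> float:
    """Percentage saved by paying the Spot price instead of the standard price."""
    return (1 - spot / standard) * 100


fleet_spot = WORKERS * SPOT_COST_PER_DAY
fleet_standard = WORKERS * STANDARD_COST_PER_DAY
points_per_dollar = POINTS_PER_DAY / SPOT_COST_PER_DAY

print(f"Fleet cost on Spot:     ${fleet_spot:.2f}/day")      # $8.00/day
print(f"Fleet cost on standard: ${fleet_standard:.2f}/day")  # $112.00/day
print(f"Savings: {savings_percent(SPOT_COST_PER_DAY, STANDARD_COST_PER_DAY):.1f}%")  # 92.9%
print(f"Points per dollar on Spot: {points_per_dollar:,.0f}")  # 30,000
```

In other words, the whole 20-worker fleet costs about as much per day as one and a half standard VMs of the same size.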
Another interesting note is that, as of this writing, none of my 20 worker nodes in the South Central US region have been evicted. I deployed the same workers in East US and they were evicted within a day – probably due to less GPU Spot VM availability in that region. It’s advisable to set up alerts for these evictions for visibility – see my Azure Monitor post for more info on how to do that.
With proper management and automation, Azure Spot VMs can be a powerful, cost-effective tool for your cloud infrastructure workloads.