"High-ish" availability cross-region architectures
By Shahid Iqbal on 31 August 2018
The architectures of a cross-region highly available (HA) application and a single-region application are often seen as quite distinct. This post shows that it doesn’t have to be that clear cut: with a few small changes we can deal with a region outage more quickly, without the expense of running redundant copies of our solution.
High availability is a big topic, but I want to cover one specific aspect of it: configuring architectures across datacentre regions. This post doesn’t cover other important aspects of HA and resilience, nor recovery point objectives.
When you’re creating Azure architectures you should always consider your high availability options, especially in the context of your agreed Recovery Time Objective (RTO). Many services in Azure already have a degree of resilience within a region. However, to protect against a whole-region outage you may want to consider a cross-region HA deployment.
Thankfully, whole-region outages are becoming rare, and with the rollout of Availability Zones we should have even more resilience within a single region.
To set up cross-region HA you typically provision multiple instances of your architecture across regions and route the traffic as required, e.g. Active-Passive or Active-Active.
However, if your RTO allows it, it may be more cost effective to set up in a single region but be prepared to quickly provision infrastructure into another region in the event of an outage. Deciding to go single region doesn’t stop you from taking steps that make failing over to another region easier, without incurring the same level of cost as a fully redundant solution.
Cross-region HA setup
The diagram above shows a typical cross-region HA deployment for a web app. In this architecture, Azure Traffic Manager is responsible for directing traffic to whichever region you choose. Because the SQL database is replicated automatically, in the event of a region outage you would be able to fail over to the secondary region in a matter of minutes.
Single region setup
Without the cross-region requirement your architecture can be simplified.
However, in this scenario you need to manually “fail over” to a secondary region in the event of a region outage. Beyond having to do this manually, you may also encounter some challenges which can delay coming back online (and therefore put your RTO at risk).
The steps required to get up and running in a new region would be:
1) Provision new infrastructure in the secondary region.
2) Restore the SQL database(s).
3) Update DNS records to point to the secondary region (assuming you’re using a custom domain in an external registry).
Step 1 should be relatively straightforward, especially if you have the ARM templates ready to go and good support for application deployment.
Step 2 should be manageable (assuming you have the relevant SQL backups at the correct frequency), but restoring large databases can be time consuming.
Step 3, however, could be problematic, especially if you don’t have direct control of the DNS records. You will likely need to raise a change with whoever controls the DNS records and then wait for the changes to flush through, all of which will be eating into your RTO.
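To make the impact on your RTO concrete, the three steps above can be added up and compared against the agreed objective. The following is only a minimal sketch: the `FailoverStep` type, the helper functions and every duration are hypothetical placeholders rather than measurements, and it assumes the steps run one after another.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class FailoverStep:
    name: str
    estimate: timedelta  # hypothetical estimate; replace with timings from a real drill


def total_recovery_time(steps):
    """Sum the step estimates, assuming the steps run sequentially."""
    return sum((step.estimate for step in steps), timedelta())


def fits_rto(steps, rto):
    """Return True if the sequential failover fits within the agreed RTO."""
    return total_recovery_time(steps) <= rto


# All durations below are made-up placeholders, not measurements.
manual_failover = [
    FailoverStep("Provision infrastructure in secondary region", timedelta(minutes=45)),
    FailoverStep("Restore SQL database(s)", timedelta(hours=2)),
    FailoverStep("Update DNS records and wait for propagation", timedelta(hours=4)),
]

rto = timedelta(hours=4)  # hypothetical agreed RTO

print(total_recovery_time(manual_failover))  # 6:45:00
print(fits_rto(manual_failover, rto))        # False
```

With these placeholder numbers the DNS step alone uses up the whole RTO, which is why the rest of this post concentrates on removing or shortening steps 2 and 3.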
Mitigate DNS issues using Traffic Manager
The simplest option for mitigating the DNS challenges of step 3 is to retain the Traffic Manager component. This allows you to use Traffic Manager to direct traffic to the secondary region once it’s set up, without needing to update the external DNS records (which you may not control) and wait for them to flush through. You just need to update Traffic Manager with the new endpoint once it’s created and you should be up and running.
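The idea can be illustrated with a small toy model. This is not the Azure SDK or the Traffic Manager API: `Endpoint`, `TrafficProfile` and the hostnames are hypothetical names invented for this sketch, and it assumes a priority-style routing method in which the healthy endpoint with the lowest priority number wins.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Endpoint:
    name: str
    target: str        # hostname the profile would resolve to (hypothetical)
    priority: int      # lower number = preferred
    healthy: bool = True


@dataclass
class TrafficProfile:
    """Toy stand-in for a priority-routed profile; not the Azure SDK."""
    endpoints: List[Endpoint] = field(default_factory=list)

    def add_endpoint(self, endpoint: Endpoint) -> None:
        self.endpoints.append(endpoint)

    def resolve(self) -> Optional[str]:
        """Return the target of the healthy endpoint with the lowest priority number."""
        healthy = [e for e in self.endpoints if e.healthy]
        if not healthy:
            return None
        return min(healthy, key=lambda e: e.priority).target


profile = TrafficProfile()
primary = Endpoint("primary", "myapp-westeurope.example.net", priority=1)
profile.add_endpoint(primary)

# Region outage: the health check marks the primary endpoint as down.
primary.healthy = False
print(profile.resolve())  # None

# Once the secondary region is provisioned, register it as a new endpoint.
profile.add_endpoint(
    Endpoint("secondary", "myapp-northeurope.example.net", priority=2)
)
print(profile.resolve())  # myapp-northeurope.example.net
```

The point of the sketch is that the custom domain keeps pointing at the profile throughout. Only the profile’s list of endpoints changes, so nothing has to be raised with whoever controls the external DNS records.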
If you’re curious about how much this approach would cost, it depends on the number of DNS queries resolved by Traffic Manager. To give you an indication, 5 million DNS queries/month plus 1 Azure health check costs approximately £2/month for a Traffic Manager in West Europe.
You may also be wondering whether Traffic Manager is a single point of failure in the architecture above. It should come as no surprise that Traffic Manager is itself highly resilient and can tolerate a region outage.
Retaining Traffic Manager and SQL geo-replication
If you also retain the SQL geo-replication, you should be able to get up and running a lot quicker, as this removes the challenges and time involved in restoring your database(s). You do obviously have the additional cost of the SQL geo-replication, but you could justify this cost by allowing your application to use the geo-replicated database as a read-only replica, thereby reducing the load on your primary database.
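One way an application might take advantage of the read-only replica is to pick a database server per operation. The snippet below is a minimal sketch under stated assumptions: `SqlEndpoints`, `choose_server` and the server names are hypothetical, and real connection handling (drivers, credentials, retries) is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SqlEndpoints:
    primary: str     # read-write database server
    secondary: str   # geo-replicated, read-only copy


def choose_server(endpoints: SqlEndpoints, read_only: bool) -> str:
    """Send read-only work to the replica and everything else to the primary."""
    return endpoints.secondary if read_only else endpoints.primary


# Hypothetical server names, for illustration only.
servers = SqlEndpoints(
    primary="myapp-sql-westeurope.example.net",
    secondary="myapp-sql-northeurope.example.net",
)

reporting_server = choose_server(servers, read_only=True)   # replica
orders_server = choose_server(servers, read_only=False)     # primary
```

Because geo-replication is asynchronous, reads served from the replica may lag slightly behind the primary, so this kind of routing suits reporting or other work that can tolerate slightly stale data.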
Deciding between single region and cross-region architecture doesn’t have to be black or white. With a bit of creativity you can find a middle ground which is cost effective and allows you to recover from disaster with less effort and more reliably.
If you’d like to discuss your current Azure architecture, get in touch with us using the contact form. We can perform an Azure Health check, reviewing both your Azure architecture and your application architecture.