Business Continuity Planning
Let’s face it, stuff happens! Even the most resilient systems can fail. The trick with business continuity is identifying the processes and components that could fail, then weighing their risk and impact against the cost of mitigation before deciding whether to accept, insure against, or mitigate each risk.
One of the most valuable assets of a business, after its people of course, is often its data. If that data is stored in a database, you had better make sure it is protected. Losing your entire customer database could put you out of business. At best, assuming you have a reliable backup, the time and effort required to restore the data can still be considerable.
When planning for business continuity there are two key measures related to backups:
- Recovery time objective (RTO)
- Recovery point objective (RPO)
RTO is the maximum acceptable time taken to recover the system after a disruptive event. RPO, on the other hand, is the maximum amount of data loss the business can tolerate, measured as the time between the last backed-up data and the disruptive event, as depicted below:
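The two measures above can be sketched as simple time differences. This is just an illustration; the function and variable names here are invented, not part of any standard API:

```python
from datetime import datetime, timedelta

def recovery_point(last_backup: datetime, failure: datetime) -> timedelta:
    """RPO actually achieved: data written after the last backup is lost."""
    return failure - last_backup

def recovery_time(failure: datetime, service_restored: datetime) -> timedelta:
    """RTO actually achieved: how long the system was unavailable."""
    return service_restored - failure

failure = datetime(2024, 3, 1, 10, 30)
data_lost = recovery_point(datetime(2024, 3, 1, 10, 0), failure)  # 30 minutes of data lost
downtime = recovery_time(failure, datetime(2024, 3, 1, 11, 15))   # 45 minutes of downtime
```

When the business sets target values for RPO and RTO, these measured values must come in under those targets.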
Once the business has defined acceptable values for these two measures you can start to design a plan that meets those objectives within the available budget.
Defining a Backup Strategy
Databases are typically backed up using a combination of different backup types as shown below:
| Backup Type | Description |
| --- | --- |
| Full | Backs up the entire database |
| Differential | Changes since the last full backup |
| Transaction log | Transactions that have occurred since the last differential or transaction log backup |
As you can imagine, the weekly full backup can get quite large and take a while to run. The differential and transaction log backups are smaller, as they are essentially just backing up any changes. Using the right combination of backup types and frequencies allows you to bring your RPO down to an acceptable level. In this example, because the transaction logs are taken hourly, the most data you would lose is an hour's worth, so you can pretty much determine your Recovery Point Objective by setting the frequency of your transaction log backups. The time taken to recover (RTO), however, includes performing the physical restore, which could take an hour or more depending on the size of your database, and will grow over time as the database grows. And if you need to restore the database to a different server, you will also need to update the connection strings in any apps or reports that use the database, which adds further effort and time to the overall RTO.
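A restore using this scheme applies the latest full backup, then the latest differential taken after it, then every transaction log backup up to the target time. A minimal sketch of that selection logic, assuming each backup is represented as a `(timestamp, type)` tuple (the representation and function name are invented for illustration):

```python
from datetime import datetime

def restore_chain(backups, target):
    """Pick the backups needed to recover to `target`: the latest full
    backup before the target, the latest differential taken after that
    full, then every transaction log backup up to the target."""
    before = [b for b in backups if b[0] <= target]
    full = max(b for b in before if b[1] == "full")
    diffs = [b for b in before if b[1] == "diff" and b[0] > full[0]]
    base = max(diffs) if diffs else full
    logs = [b for b in before if b[1] == "log" and b[0] > base[0]]
    return [full] + ([base] if base != full else []) + sorted(logs)

backups = [
    (datetime(2024, 3, 1, 0, 0), "full"),   # weekly full
    (datetime(2024, 3, 2, 0, 0), "diff"),   # differential
    (datetime(2024, 3, 2, 9, 0), "log"),    # hourly transaction logs
    (datetime(2024, 3, 2, 10, 0), "log"),
]
chain = restore_chain(backups, datetime(2024, 3, 2, 10, 30))
```

Here the chain is one full, one differential and two log backups; any writes after the 10:00 log backup are lost, which is exactly the RPO exposure described above.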
There is a way to dramatically reduce your RTO and RPO. Using database replication, you can set up a copy of your database in a different location to your primary. Every change to the primary database is replicated to the secondary in near real-time, so you always have an up-to-date copy of your production database. In the event of a disaster, you would simply failover to the secondary database. At that point, the secondary becomes the primary and the old primary is taken offline until it is recovered. This all happens in a matter of seconds, and if correctly configured, there is no need to update any connection strings. The result is that you have a standby database with an RTO of seconds and an RPO of possibly a few minutes (any transactions that were in progress when the failure occurred will be lost).
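The failover decision itself can be sketched in a few lines. This is a simplified illustration of the general pattern, not how any particular database product implements it; the `DatabaseEndpoint` class and the thresholds are invented. Requiring several consecutive failed health checks avoids failing over on a transient network blip:

```python
import time

class DatabaseEndpoint:
    """Toy stand-in for a database server's health-check endpoint."""
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def ping(self):
        return self.healthy

def failover(primary, secondary, checks=3, interval=0.01):
    """Promote the secondary only after `checks` consecutive failed
    health checks against the primary."""
    for _ in range(checks):
        if primary.ping():
            return primary  # primary still serving traffic
        time.sleep(interval)
    return secondary

active = failover(DatabaseEndpoint("primary", healthy=False),
                  DatabaseEndpoint("secondary"))
```

In a real deployment the applications connect through a stable listener address that follows whichever endpoint is currently primary, which is why no connection strings need to change.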
So why do we need backups at all if we can just set up a replica? Well, there are four main reasons you might need to recover data:
- Hardware or software failures affecting the database such as a disk failure.
- Datacentre outage, possibly caused by a natural disaster.
- Data corruption or deletion typically caused by an application bug or human error.
- Upgrade or maintenance errors, unanticipated issues that occur during planned infrastructure maintenance or upgrades.
Hardware failures can usually be mitigated through the use of redundant systems but may in some instances require full restoration from the latest backup or failover to a secondary. Datacentre outages will also benefit from replication.
Data corruption and upgrade or maintenance errors require a restore from a backup taken prior to the event. This is for two reasons: first, you might only need a small subset of the data; second, the corruption is likely to have been replicated to the secondary in any event.
DebtView Backup Strategy
Our cloud-based debt collection system, DebtView, is built on an Azure SQL database. Each customer gets their own separate database, so their data is never shared with other customers. Each subscription is configured with the following backup strategy as standard, with a default retention period of 35 days:
| Backup Type | Frequency |
| --- | --- |
| Transaction log | Every 5-10 minutes |
The configuration above gives you a maximum RPO of 5-10 minutes (the time since the last transaction log backup) with the ability to restore to a specific point in time over the previous 35 days. As discussed above, the RTO depends on the size of your database plus the time taken to change the connection string in DebtView and redeploy the application. This can typically be an hour or more.
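Whether a given point in time can be restored follows directly from those two numbers. A small sketch, with the constants mirroring the strategy described above and the function name invented for illustration:

```python
from datetime import datetime, timedelta

RETENTION = timedelta(days=35)
MAX_LOG_GAP = timedelta(minutes=10)  # worst-case gap between log backups

def can_restore_to(requested: datetime, now: datetime) -> bool:
    """A point-in-time restore only works inside the retention window,
    and the latest recoverable point is the most recent log backup."""
    return now - RETENTION <= requested <= now - MAX_LOG_GAP

now = datetime(2024, 3, 31, 12, 0)
ok = can_restore_to(now - timedelta(days=10), now)       # within the window
too_old = can_restore_to(now - timedelta(days=40), now)  # beyond retention
```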
In the event of an incident, customers would need to raise a support ticket for a database restore.
DebtView Geo Restore with Auto Failover
For those customers that need it, there is the option of a fully automated database failover service. This provides a replicated secondary database, always up to date, in a physically different region. In the event of an incident, the database will fail over automatically within an hour, although this can be expedited manually if required. The service also means that no reconfiguration of the DebtView app is required.
The image above shows a geo-replicated database with the primary being hosted in North Central US and the secondary in South Central US.
Auto Failover is a billable option in DebtView, and pricing is dependent on the size of your database. Please contact us for more information.