Azure Database for PostgreSQL Disaster Recovery

Summary

To ensure the resiliency of data stored in the Azure Database for PostgreSQL servers within the environment, the server instances are configured to replicate data via geo-redundant storage (GRS) to the paired secondary region, in this case UAE Central. This page describes the capabilities that Azure Database for PostgreSQL provides for business continuity and disaster recovery, the options for recovering from disruptive events that could cause data loss or make your database and application unavailable, and what to do when a user or application error affects data integrity, an Azure region has an outage, or your application requires maintenance.

Introduction

Microsoft strives to ensure Azure services are always available; however, unplanned service outages may occur. An important part of a disaster recovery plan is preparing to fail over to the secondary endpoint if the primary endpoint becomes unavailable.

Azure Database for PostgreSQL provides business continuity features that include geo-redundant backups, with the ability to initiate a geo-restore, and read replicas deployed in a different region. Each has different characteristics for recovery time and potential data loss. With the geo-restore feature, a new server is created from backup data that is replicated from another region. The overall time to restore and recover depends on the size of the database and the amount of logs to recover, and varies from a few minutes to a few hours. With read replicas, transaction logs from the primary are asynchronously streamed to the replica. In the event of a primary database outage due to a zone-level or region-level fault, failing over to the replica provides a shorter RTO and reduced data loss.

The following table compares RTO and RPO in a typical workload scenario:

FEATURES THAT YOU CAN USE TO PROVIDE BUSINESS CONTINUITY
Capability | Basic | General Purpose | Memory Optimized
Point-in-time restore from backup | Any restore point within the retention period; RTO: varies; RPO < 15 min | Any restore point within the retention period; RTO: varies; RPO < 15 min | Any restore point within the retention period; RTO: varies; RPO < 15 min
Geo-restore from geo-replicated backups | Not supported | RTO: varies; RPO < 1 h | RTO: varies; RPO < 1 h
Read replicas | RTO: minutes*; RPO < 5 min* | RTO: minutes*; RPO < 5 min* | RTO: minutes*; RPO < 5 min*

Source: https://docs.microsoft.com/en-us/azure/postgresql/concepts-business-continuity

* RTO and RPO can be much higher in some cases, depending on factors such as the latency between sites, the amount of data to be transmitted, and, importantly, the write workload on the primary database.
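As a rough planning aid, the table above can be encoded and queried for the capabilities whose typical RPO bound fits a target. This is only a sketch: the dictionary and function names are hypothetical, and the numbers are the typical-workload figures from the table, not guarantees.

```python
from typing import Dict, List, Optional

# Typical RPO upper bounds in minutes, copied from the table above.
# None means the capability is not supported on that pricing tier.
RPO_MINUTES: Dict[str, Dict[str, Optional[int]]] = {
    "point-in-time restore": {"Basic": 15, "General Purpose": 15, "Memory Optimized": 15},
    "geo-restore": {"Basic": None, "General Purpose": 60, "Memory Optimized": 60},
    "read replicas": {"Basic": 5, "General Purpose": 5, "Memory Optimized": 5},
}


def options_within_rpo(tier: str, target_rpo_minutes: int) -> List[str]:
    """Return the capabilities whose typical RPO bound fits the target."""
    matches = []
    for capability, by_tier in RPO_MINUTES.items():
        bound = by_tier[tier]
        # Skip unsupported capabilities and those with a looser bound.
        if bound is not None and bound <= target_rpo_minutes:
            matches.append(capability)
    return matches
```

For example, `options_within_rpo("General Purpose", 10)` keeps only read replicas, and the footnote above still applies: heavy write workloads can push the real figures well past these bounds.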

Backup Architecture

Azure Database for PostgreSQL servers are backed up periodically to enable the restore features. From these backups, we can restore the server and all its databases to an earlier point in time, on a new server.

Scope

The options described apply to the Azure Database for PostgreSQL databases in the environment.

Out-of-scope topics include the failover of Azure Database for PostgreSQL servers hosted in the environment.

Responsibilities

Assumed: Operations/Platform/"DBA"

Limitation

PostgreSQL is available in UAE Central; however, the UAE Central region is not visible in the Azure portal when creating a new resource.

Deleted servers cannot be restored. If you delete the server, all databases that belong to the server are also deleted and cannot be recovered.

The lag between the primary and the replica depends on the latency between the sites, the amount of data to be transmitted and most importantly on the write workload of the primary server. Heavy write workloads can generate significant lag.

Because replication to read replicas is asynchronous, they should not be considered a high availability (HA) solution, since higher lag can mean higher RTO and RPO. Read replicas can act as an HA alternative only for workloads where the lag remains small through both peak and non-peak times. Otherwise, read replicas are intended for read scale-out of read-heavy workloads and for disaster recovery (DR) scenarios.
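One way to judge whether lag "remains small" is to sample it on the replica over peak and non-peak periods and compare every sample with a threshold. The sketch below assumes the samples come from running the query shown (pg_last_xact_replay_timestamp() is a built-in PostgreSQL function); the function name and the threshold are hypothetical. Note that on an idle primary this query can overstate the lag, because no new transactions are being replayed.

```python
from typing import Iterable

# Run on the replica: seconds since the last replayed transaction was committed.
REPLICA_LAG_SQL = (
    "SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) "
    "AS lag_seconds;"
)


def lag_stays_within(samples_seconds: Iterable[float], max_lag_seconds: float) -> bool:
    """Return True only if every lag sample is at or below the threshold."""
    samples = list(samples_seconds)
    if not samples:
        # No data is not evidence of low lag.
        raise ValueError("at least one lag sample is required")
    return max(samples) <= max_lag_seconds
```

A single peak-time sample above the threshold is enough to return False, which matches the requirement that lag stay small through both peak and non-peak times.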

Recovery Scenarios

Recover from an Azure data center outage

Although rare, an Azure data center can have an outage. When an outage occurs, it causes a business disruption that might only last a few minutes but could last for hours.

One option is to wait for the servers to come back online when the data center outage is over. Because we cannot know how long an outage will last, this only works for applications that can afford to have the server offline until the outage is resolved and services are restored, for example a development environment.

Point-in-time Restore

Independent of the backup redundancy option configured, we can perform a restore to any point in time within the backup retention period. A new server is created in the same Azure region as the original server. It is created with the original server’s configuration for the pricing tier, compute generation, number of vCores, storage size, backup retention period, and backup redundancy option.

Point-in-time restore is useful in multiple scenarios, for example when a user accidentally deletes data or drops an important table or database, or when an application defect causes good data to be overwritten with bad data.

We may need to wait for the next transaction log backup to be taken before we can restore to a point in time within the last five minutes.
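Before requesting a restore, it can help to check the chosen restore point against the two constraints described above: it must fall within the backup retention period, and points within the last five minutes may not be restorable yet. A minimal sketch, with a hypothetical function name, assuming timezone-aware datetimes and ignoring other limits such as the server's creation time:

```python
from datetime import datetime, timedelta, timezone

# Points newer than this may have to wait for the next transaction log backup.
LOG_BACKUP_WINDOW = timedelta(minutes=5)


def check_restore_point(restore_point: datetime, now: datetime, retention_days: int) -> str:
    """Classify a restore point as 'ok', 'wait', 'too old' or 'in the future'."""
    if restore_point > now:
        return "in the future"
    if restore_point < now - timedelta(days=retention_days):
        return "too old"
    if restore_point > now - LOG_BACKUP_WINDOW:
        return "wait"
    return "ok"


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # A point one hour ago with 7 days of retention.
    print(check_restore_point(now - timedelta(hours=1), now, retention_days=7))
```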

To restore a dropped table:

  • Restore the source server to a new server using the point-in-time method.
  • Dump the table from the restored server using pg_dump.
  • Rename the source table on the original server, if it still exists.
  • Import the table on the original server using the psql command line.
  • Optionally, delete the restored server.
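The dump, rename, and import steps above can be sketched as command lines. The helper below only builds the pg_dump and psql argument lists; it does not run them. The server names, the "_old" suffix, the dump file name, and the public-cloud host suffix are assumptions for illustration, and the table name is assumed to be unqualified (no schema prefix).

```python
import shlex
from typing import List

# Host suffix for the Azure public cloud (assumption; other clouds differ).
AZURE_PG_SUFFIX = ".postgres.database.azure.com"


def dropped_table_commands(restored_server: str, original_server: str,
                           admin_user: str, database: str, table: str,
                           dump_file: str = "dropped_table.sql") -> List[List[str]]:
    """Build the pg_dump and psql commands for recovering one table."""

    def conn(server: str) -> List[str]:
        # Logins use the username@server-name form.
        return ["--host", server + AZURE_PG_SUFFIX,
                "--username", f"{admin_user}@{server}",
                "--dbname", database]

    dump = ["pg_dump", *conn(restored_server), "--table", table, "--file", dump_file]
    # IF EXISTS makes the rename a no-op when the table was dropped outright.
    rename = ["psql", *conn(original_server), "--command",
              f"ALTER TABLE IF EXISTS {table} RENAME TO {table}_old;"]
    load = ["psql", *conn(original_server), "--file", dump_file]
    return [dump, rename, load]


if __name__ == "__main__":
    for cmd in dropped_table_commands("pg-restored", "pg-prod", "pgadmin", "appdb", "orders"):
        print(shlex.join(cmd))
```

Keeping the commands as lists avoids shell-quoting problems if they are later passed to subprocess.run; review the printed commands before running them against a production server.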

Geo-restore

We can restore a server to another Azure region where the service is available by using the geo-redundant backups. Servers that support up to 4 TB of storage can be restored to the geo-paired region, or to any region that supports up to 16 TB of storage. Servers that support up to 16 TB of storage can be geo-restored to any region that also supports 16 TB servers. Review the Azure Database for PostgreSQL pricing tiers for the list of supported regions.

Geo-restore is the default recovery option when your server is unavailable because of an incident in the region where the server is hosted. If a large-scale incident in a region makes your database application unavailable, you can restore a server from the geo-redundant backups to a server in any other region. There is a delay between when a backup is taken and when it is replicated to a different region. This delay can be up to an hour, so if a disaster occurs, up to one hour of data can be lost.

During geo-restore, the server configurations that can be changed include compute generation, vCore, backup retention period, and backup redundancy options. Changing pricing tier (Basic, General Purpose, or Memory Optimized) or storage size is not supported.
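The rule above can be expressed as a small pre-flight check that compares a requested configuration with the source server and reports any setting that geo-restore cannot change. The setting keys and the function name are hypothetical; only the split between changeable and fixed settings comes from the text.

```python
from typing import Dict, List

# Settings the text above lists as changeable during geo-restore.
CHANGEABLE = {"compute_generation", "vcores", "backup_retention_days", "backup_redundancy"}


def rejected_changes(source: Dict[str, object], requested: Dict[str, object]) -> List[str]:
    """Return the settings that differ from the source but cannot be changed."""
    return sorted(
        key for key, value in requested.items()
        if source.get(key) != value and key not in CHANGEABLE
    )
```

For instance, requesting a different pricing tier or storage size returns those keys, while changing only the vCore count returns an empty list.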

Perform post-restore tasks

After a restore from either recovery mechanism, the platform or ops team member should perform the following tasks to get the users and applications back up and running:

  • If the new server is meant to replace the original server, redirect clients and client applications to the new server, and change the login username to username@new-restored-server-name.
  • Ensure appropriate server-level firewall and VNet rules are in place for users to connect. These rules are not copied over from the original server.
  • Ensure appropriate logins and database-level permissions are in place.
  • Configure alerts, as appropriate.
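The username change in the first task above is mechanical: keep the user part and swap the server part. A minimal sketch, with a hypothetical function name and example server names:

```python
def login_for_restored_server(login: str, new_server: str) -> str:
    """Rewrite 'user@old-server' (or a bare 'user') to 'user@new-server'."""
    # Split on the last '@' so only the server suffix is replaced.
    user = login.rsplit("@", 1)[0]
    return f"{user}@{new_server}"
```

For example, `login_for_restored_server("pgadmin@pg-prod", "pg-prod-restored")` gives `pgadmin@pg-prod-restored`. Any connection strings or secrets that embed the old login need the same update.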

How to initiate a Point-in-Time Restore from Azure Portal

  • Log into https://portal.azure.com
  • In the Azure portal, select the Azure Database for PostgreSQL server that we want to restore.
  • In the toolbar of the server’s Overview page, select Restore.

Fill out the Restore form with the required information:

  • Restore point: Select the point-in-time we want to restore to.
  • Target server: Provide a name for the new server.
  • Location: This cannot be changed for a point-in-time restore. It is the same as the source server.
  • Pricing tier: This cannot be changed for a point-in-time restore. It is the same as the source server.

Click OK to restore the server to the selected point in time.

  • Once the restore finishes, locate the new server that is created to verify the data was restored as expected.
    • The new server created by point-in-time restore has the same server admin login name and password that were valid for the existing server at the chosen point in time. We can change the password from the new server’s Overview page.
    • The new server created during a restore does not have the firewall rules or VNet service endpoints that existed on the original server. These rules need to be set up separately for this new server.
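The same restore can also be started from the Azure CLI with az postgres server restore, and firewall rules can be re-created with az postgres server firewall-rule create. The sketch below only builds the argument lists; the resource group, server names, rule name, and IP range are placeholders, and the restore point is assumed to be a timezone-aware datetime.

```python
from datetime import datetime, timezone
from typing import List


def pitr_restore_command(resource_group: str, source_server: str,
                         new_server: str, restore_point: datetime) -> List[str]:
    """Build an 'az postgres server restore' command for a point-in-time restore."""
    # The CLI expects the restore point as a UTC timestamp in ISO 8601 form.
    stamp = restore_point.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return ["az", "postgres", "server", "restore",
            "--resource-group", resource_group,
            "--name", new_server,
            "--source-server", source_server,
            "--restore-point-in-time", stamp]


def firewall_rule_command(resource_group: str, server: str, rule_name: str,
                          start_ip: str, end_ip: str) -> List[str]:
    """Build the command that re-creates one firewall rule on the new server."""
    return ["az", "postgres", "server", "firewall-rule", "create",
            "--resource-group", resource_group,
            "--server-name", server,
            "--name", rule_name,
            "--start-ip-address", start_ip,
            "--end-ip-address", end_ip]
```

The lists can be passed to subprocess.run once reviewed. Listing the original server's rules first is a reasonable way to decide which ones to re-create.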

How to initiate a Geo restore from Azure Portal

If you configured your server for geographically redundant backups, a new server can be created from the backup of that existing server. This new server can be created in any region where Azure Database for PostgreSQL is available.

Select the Create a resource button (+) in the upper-left corner of the portal. Select Databases > Azure Database for PostgreSQL.

Select the Single server deployment option.

  • Provide the subscription, resource group, and name of the new server.
  • Select Backup as the Data source. This action loads a dropdown that provides a list of servers that have geo redundant backups enabled.
    • Select the Backup dropdown
    • Select the source server to restore from.
    • The server will default to values for number of vCores, Backup Retention Period, Backup Redundancy Option, Engine version, and Admin credentials. Select Continue.

Fill out the rest of the form with your preferences. You can select any Location.

Note: As of 13 May 2021, when this instruction was created, the secondary region UAE Central was not visible in the Location drop-down in the Azure portal.

  • After selecting the location, you can select Configure server to update the Compute Generation (if available in the region you have chosen), number of vCores, Backup Retention Period, and Backup Redundancy Option. Changing Pricing Tier (Basic, General Purpose, or Memory Optimized) or Storage size during restore is not supported.
    • Select Review + create to review your selections.

Select Create to provision the server. This operation may take a few minutes.

The new server created by geo restore has the same server admin login name and password that was valid for the existing server at the time the restore was initiated. The password can be changed from the new server’s Overview page.

The new server created during a restore does not have the firewall rules or VNet service endpoints that existed on the original server. These rules need to be set up separately for this new server.
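A geo-restore can also be requested from the Azure CLI with az postgres server georestore, which takes the target region by name rather than from the portal drop-down. The sketch below only builds the argument list. The resource group, server names, and SKU value are placeholders, "uaecentral" is the CLI name of the UAE Central region, and whether a given subscription is allowed to deploy there is not something this sketch can confirm.

```python
from typing import List, Optional


def georestore_command(resource_group: str, source_server: str, new_server: str,
                       location: str, sku_name: Optional[str] = None) -> List[str]:
    """Build an 'az postgres server georestore' command."""
    cmd = ["az", "postgres", "server", "georestore",
           "--resource-group", resource_group,
           "--name", new_server,
           "--source-server", source_server,
           "--location", location]
    if sku_name is not None:
        # Example form: GP_Gen5_8 (tier, compute generation, vCores).
        # The pricing tier part must match the source server.
        cmd += ["--sku-name", sku_name]
    return cmd


if __name__ == "__main__":
    print(" ".join(georestore_command("rg-data", "pg-prod", "pg-prod-geo", "uaecentral")))
```

If the source server is in a different resource group from the new server, the CLI expects the source server's full resource ID instead of its name.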

 

 
