Hybrid Manager disaster recovery Innovation Release
The Disaster Recovery (DR) procedure is defined as the series of manual steps that you need to take to recover your HM installation and your HM-managed Postgres clusters.
Warning
You must regularly test and update your organization's DR procedure for it to remain valid.
Before you start
Before starting the DR procedure, ensure you are familiar with the following prerequisites and required tools.
Prerequisites
A new HM instance deployed and running. It must be running the same version as the old instance that failed or became unavailable.
The container images used to build the clusters in the old, unavailable HM instance are available to the new one.
Required tools
Ensure the following tools are available on your workstation environment or Bastion host:
- `jq` (command-line JSON processor)
- `yq` (command-line YAML processor)
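Before proceeding, you can quickly confirm that both tools are on your `PATH`. This is a minimal sketch, not part of the official procedure:

```shell
# Sanity check: report any missing required tool before starting the DR steps.
for tool in jq yq; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "found: $tool"
  else
    echo "missing: $tool"
  fi
done
```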
1. Make backups available in the new HM instance
The first step makes the backups of the unavailable HM instance ("old backups") reachable from the new HM instance by copying them to the linked storage (the new bucket) of the new HM instance.
Obtain the bucket names, the backup ID, and the region of the new bucket. Store these as environment variables to be used in the commands throughout this guide:
Important
These variables are session-specific. If you open a new terminal tab or your session times out, the variables are lost and subsequent commands will fail. To avoid re-typing these values, you can save the export commands in a small shell script (for example, `dr-env.sh`) and source that file in any new terminal window to reload your environment.

```shell
export OLD_BUCKET=<old-bucket-name>
export OLD_BACKUP_ID=<old-environment-internal-backup-id>
export NEW_BUCKET=<new-bucket-name>
export NEW_REGION=<region-of-the-new-bucket>
```
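As a sketch of the approach suggested above, you could persist the variables in a `dr-env.sh` file and source it in each new terminal. The values below are taken from the EKS example later in this guide; substitute your own:

```shell
# Write the session variables to a reusable script (example values shown).
cat > dr-env.sh <<'EOF'
export OLD_BUCKET=eks-1105143903-2511-edb-postgres
export OLD_BACKUP_ID=a7462dbc7106
export NEW_BUCKET=eks-1105155418-2511-edb-postgres
export NEW_REGION=eu-west-3
EOF

# In any new terminal, reload the environment:
. ./dr-env.sh
echo "$NEW_REGION"
```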
How do I obtain the old bucket values?
To obtain the old bucket values:
- Go to the console/dashboard of your CSP > buckets.
- Find and select the bucket linked to the backups of your old HM instance.
- Browse to the `edb-internal-backups` folder. Inside that folder you will find a subfolder named with the backup ID, for example `4be7a1c8c9f0`.
EKS Example
This is an example for setting the environment variables for an HM instance deployed on EKS:
```shell
export OLD_BUCKET=eks-1105143903-2511-edb-postgres
export OLD_BACKUP_ID=a7462dbc7106
export NEW_BUCKET=eks-1105155418-2511-edb-postgres
export NEW_REGION=eu-west-3
```
To copy the data from the old bucket to the new bucket, you first need to locate and note the names of the source and target folders. You need to copy the following folders and their content:

- Internal EDB backups folder: the internal backups folder in the old bucket, `edb-internal-backups/<random-string>`. It is different in the new HM instance, which has a different `<random-string>`.
- Postgres cluster backups folder: `customer-pg-backups`.
- Folders corresponding to any defined custom storage locations: if you use Managed Storage Locations in the HM console (for example, for offloading Postgres queries), ensure the corresponding folders are copied from the old S3-compatible bucket to the new one. While the definitions are restored via Velero, the actual data inside those custom folders must be migrated manually to the new target bucket.
Copy the old backups to the new bucket using your preferred tools. Here are some examples using cloud service provider CLIs to move data between buckets:
```shell
aws s3 cp --recursive s3://${OLD_BUCKET}/edb-internal-backups/${OLD_BACKUP_ID} s3://${NEW_BUCKET}/edb-internal-backups
aws s3 cp --recursive s3://${OLD_BUCKET}/customer-pg-backups s3://${NEW_BUCKET}/customer-pg-backups
```
If you have configured additional Managed Storage Locations, use the same method to copy those folders.
```shell
gcloud storage cp gs://${OLD_BUCKET}/edb-internal-backups/${OLD_BACKUP_ID} gs://${NEW_BUCKET}/edb-internal-backups --recursive
gcloud storage cp gs://${OLD_BUCKET}/customer-pg-backups gs://${NEW_BUCKET}/customer-pg-backups --recursive
```
If you have configured additional Managed Storage Locations, use the same method to copy those folders.
Load the backups you just copied to your new HM instance by creating a new custom resource definition and applying it to the new HM instance:
```shell
kubectl apply -f - <<EOF
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  annotations:
    appliance.enterprisedb.com/s3-prefixes: edb-internal-backups/velero
  labels:
    appliance.enterprisedb.com/storage-credentials: bound
  name: recovery
  namespace: velero
spec:
  accessMode: ReadOnly
  config:
    insecureSkipTLSVerify: "false"
    region: ${NEW_REGION}
    s3ForcePathStyle: "true"
  default: false
  objectStorage:
    bucket: ${NEW_BUCKET}
    prefix: edb-internal-backups/velero
  provider: aws
EOF
```
```shell
kubectl apply -f - <<EOF
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  annotations:
    appliance.enterprisedb.com/s3-prefixes: edb-internal-backups/velero
  labels:
    appliance.enterprisedb.com/storage-credentials: bound
  name: recovery
  namespace: velero
spec:
  accessMode: ReadOnly
  credential:
    key: gcp
    name: gcs-credentials
  default: false
  objectStorage:
    bucket: ${NEW_BUCKET}
    prefix: edb-internal-backups/velero
  provider: gcp
EOF
```
Confirm that the new storage location is available:
velero get backup-locations
If the status is not `Available`, check the Velero pod logs for permission errors on the S3 bucket.

Confirm that the backups are available as well:
velero get backups --selector velero.io/storage-location=recovery
Choose the backup you want to restore from. You can have multiple backups available, so choose the one that best suits your needs, e.g. the most recent backup before the disaster happened. Note the Velero backup name, as well as the date and time (UTC), as both are required for a restore, for example:
```
NAME                                     STATUS     ERRORS  WARNINGS  CREATED                        EXPIRES  STORAGE LOCATION  SELECTOR
velero-backup-kube-state-20241216154403  Completed  0       0         2024-12-16 16:44:03 +0100 CET  5d       recovery          <none>
```
Note
The timestamp value is referred to as the recovery date in the instructions that follow.
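If you want to avoid transcription errors, the ISO recovery date can be derived from the backup name suffix. This is a sketch that assumes the 14-digit suffix encodes the UTC creation time, as in the example above:

```shell
# Convert the 14-digit suffix of the Velero backup name into the
# YYYY-MM-DDTHH:MM:SSZ format used later for the recovery date.
BACKUP_NAME=velero-backup-kube-state-20241216154403   # example name
TS=${BACKUP_NAME##*-}
BACKUP_TIMESTAMP=$(echo "$TS" | sed -E 's/^([0-9]{4})([0-9]{2})([0-9]{2})([0-9]{2})([0-9]{2})([0-9]{2})$/\1-\2-\3T\4:\5:\6Z/')
echo "$BACKUP_TIMESTAMP"   # 2024-12-16T15:44:03Z
```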
(Optional) If you were using HM to manage AI workloads, e.g. with the GenAI Builder, also copy the object store files and CORS configuration from the old bucket to the new one:
```shell
export OLD_BUCKET_DATALAKE=<old-bucket-name>
export NEW_BUCKET_DATALAKE=<new-bucket-name>
```
```shell
# Copy data lake objects from old bucket to new bucket
aws s3 cp --recursive s3://${OLD_BUCKET_DATALAKE}/ s3://${NEW_BUCKET_DATALAKE}/

# Copy CORS configuration from old bucket to new bucket
aws s3api get-bucket-cors --bucket ${OLD_BUCKET_DATALAKE} --output json > cors-config.json
aws s3api put-bucket-cors --bucket ${NEW_BUCKET_DATALAKE} --cors-configuration file://cors-config.json
```
```shell
# Copy data lake objects from old bucket to new bucket
gcloud storage cp "gs://${OLD_BUCKET_DATALAKE}/**" gs://${NEW_BUCKET_DATALAKE}/ --recursive

# Copy CORS configuration from old bucket to new bucket
gcloud storage buckets describe gs://${OLD_BUCKET_DATALAKE} --format="json" | jq .cors_config > cors-config.json
gcloud storage buckets update gs://${NEW_BUCKET_DATALAKE} --cors-file=cors-config.json
```
2. Recovery steps
Restore HM-internal databases
After the old backups are available in the new bucket, you can restore the HM-internal databases. These are back-end services used by HM and are required to fully restore the HM instance. Depending on the HM version you are using and on the installation scenario you have deployed, the list of databases may vary.
To simplify this process, run the following script with your kubeconfig pointing to your new HM installation:
Script details
This patch script takes care of saving the HM-internal database cluster manifests to YAML files, generating two directories:

- One directory, `old-cluster-configs`, with the current state of the database clusters in the new HM installation (the default configuration after installation).
- Another, `new-cluster-configs`, with the same files, where the script performs the patches required so the HM-internal databases start using the data from the backups.
Suspend the reconciliation of all HM-internal database clusters, so that you can safely remove the old Custom Resource Definitions (CRDs) of the database clusters without having the operator recreating them by default:
```shell
HCP_CR=$(kubectl get hybridcontrolplanes.edbpgai.edb.com -A -o json | jq -rc '.items[0] | .metadata.name')
for CLUSTER in $(kubectl get clusters.postgresql.k8s.enterprisedb.io -A -o json | jq -rc '.items[].metadata | select((.name | test("^p-") | not) and (.name != "stats-collector-db")) | {namespace: .namespace}' | uniq)
do
  NAMESPACE=$(echo "${CLUSTER}" | jq -rc '.namespace')
  if [[ "${NAMESPACE}" == "upm-system-db" ]]
  then
    COMPONENT="upm-app-db"
  else
    COMPONENT="${NAMESPACE}"
  fi
  INDEX=$(kubectl get hybridcontrolplane ${HCP_CR} -o json | jq '.status.components | to_entries[] | select(.value.name=='\"${COMPONENT}\"') | .key')
  kubectl patch hybridcontrolplane ${HCP_CR} --subresource=status --type=json -p "[{\"op\": \"replace\", \"path\": \"/status/components/$INDEX/suspended\", \"value\": true}]"
done
```
Verify that the components have been suspended correctly:
```shell
HCP_CR=$(kubectl get hybridcontrolplanes.edbpgai.edb.com -A -o json | jq -rc '.items[0] | .metadata.name')
kubectl hcp status -n edbpgai-bootstrap "${HCP_CR}"
```
Delete the HM-internal database clusters that were created during installation of the new HM instance to make room for the HM-internal database clusters that will be recovered from the backup:
```shell
for CONFIG in $(find new-cluster-configs -type f)
do
  kubectl delete -f $CONFIG
done
```
Clean the backup area that was created during the installation of the new HM instance to avoid confusion with the old backups that you want to restore:
for CONFIG in $(find new-cluster-configs -type f) do NAME=$(yq '.metadata.name' $CONFIG) NAMESPACE=$(yq '.metadata.namespace' $CONFIG) # Try to get PREFIX from cluster config first, fallback to ObjectStore if it fails PREFIX=$(yq '.spec.backup.barmanObjectStore.destinationPath | downcase' $CONFIG 2>/dev/null) if [ -z "$PREFIX" ] || [ "$PREFIX" = "null" ]; then # destinationPath doesn't exist in the cluster config (after CNPG-I barman plugin migration) # Get it from the ObjectStore resource instead PREFIX=$(kubectl -n $NAMESPACE get objectstores.barmancloud.cnpg.io $NAME -o yaml | yq '.spec.configuration.destinationPath | downcase') fi aws s3 rm --recursive ${PREFIX}/${NAME} done
for CONFIG in $(find new-cluster-configs -type f) do NAME=$(yq '.metadata.name' $CONFIG) NAMESPACE=$(yq '.metadata.namespace' $CONFIG) # Try to get PREFIX from cluster config first, fallback to ObjectStore if it fails PREFIX=$(yq '.spec.backup.barmanObjectStore.destinationPath | downcase' $CONFIG 2>/dev/null) if [ -z "$PREFIX" ] || [ "$PREFIX" = "null" ]; then # destinationPath doesn't exist in the cluster config (after CNPG-I barman plugin migration) # Get it from the ObjectStore resource instead PREFIX=$(kubectl -n $NAMESPACE get objectstores.barmancloud.cnpg.io $NAME -o yaml | yq '.spec.configuration.destinationPath | downcase') fi gsutil -m rm -r ${PREFIX}/${NAME} done
Replace the `hm-portal-bootstrap` secret with the version from the backup.

Note
This step applies only to HM versions `2026.2` and later. Run `kubectl get secret hm-portal-bootstrap -n default` against your new HM instance; if the secret doesn't exist, you can skip this step, because you are using a version of HM from before `2026.2`.

The `hm-portal-bootstrap` secret on the new HM instance needs to be replaced with the version that was on the old HM instance. This secret contains a Fernet key used for encrypting sensitive configuration data, and the static passwords used for initial bootstrap access. Without the correct Fernet key from the old environment, HM can't decrypt some of the data restored in the next step.

This step involves restoring a single secret found in the `default` namespace. Velero doesn't support this well, as it is typically used to restore all resources of a given type for a given namespace. Therefore, the only way to perform this restoration is to download the entire Velero backup and pull out a specific file.

To select a specific file, you first need to select a backup to use. Start by listing all the available backups with the following command:
```shell
velero get backups -o json --selector velero.io/storage-location=recovery \
  | jq -rc '(["Name", "Timestamp"]), (.items // [.] | .[] | [.metadata.name, .metadata.creationTimestamp]) | @tsv' \
  | column -t -s "$(printf '\t')"
```

Select a backup from this list, typically the latest, and export the following variable with the selected name:

```shell
export BACKUP_NAME=<selected-name>
```
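Alternatively, since `jq` is already required for this procedure, you could pick the most recent backup automatically. This is a sketch that assumes the JSON shape returned by `velero get backups -o json`, as used above:

```shell
# jq filter that sorts backups by creation time and returns the latest name.
LATEST_FILTER='(.items // [.]) | sort_by(.metadata.creationTimestamp) | last | .metadata.name'

# Against a live cluster you would run:
#   export BACKUP_NAME=$(velero get backups -o json \
#     --selector velero.io/storage-location=recovery | jq -r "$LATEST_FILTER")

# Demonstration against a small sample document:
echo '{"items":[
  {"metadata":{"name":"velero-backup-kube-state-20241215100000","creationTimestamp":"2024-12-15T10:00:00Z"}},
  {"metadata":{"name":"velero-backup-kube-state-20241216154403","creationTimestamp":"2024-12-16T15:44:03Z"}}]}' \
  | jq -r "$LATEST_FILTER"
```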
Now download and extract the full backup:
```shell
mkdir "${BACKUP_NAME}"
velero backup download "${BACKUP_NAME}" -o "${BACKUP_NAME}/backup.tar.gz"
tar -xzf "${BACKUP_NAME}/backup.tar.gz" -C "${BACKUP_NAME}"
```
Confirm that the file `${BACKUP_NAME}/resources/secrets/namespaces/default/hm-portal-bootstrap.json` exists. Then delete the existing `hm-portal-bootstrap` secret and replace it with the one from the backup you just downloaded:

```shell
kubectl delete secret hm-portal-bootstrap -n default
kubectl apply -f "${BACKUP_NAME}/resources/secrets/namespaces/default/hm-portal-bootstrap.json"
```
Confirm the secret is there using the following command:

```shell
kubectl get secret hm-portal-bootstrap -n default
```

You can now remove the local copy of the backup you downloaded, as you won't need these files again:

```shell
rm -rf "${BACKUP_NAME}"
```
Apply the YAML file for all the HM-internal database clusters to be re-created with the backup data:
```shell
for CONFIG in $(find new-cluster-configs -type f)
do
  kubectl apply -f $CONFIG
done
```
You can monitor the restore progress using `kubectl get clusters -A`.
Restart HM services
After all HM-internal database clusters are successfully restored and reporting a healthy state, perform this one-time restart of the management server to refresh the HM console:
```shell
kubectl delete pods $(kubectl get pods -n upm-beaco-ff-base | grep '^accm-server' | awk '{print $1}') -n upm-beaco-ff-base
```
Wait for the new pod to reach the Running state. At this point, the HM console is available, though it won't yet show your HM-managed Postgres clusters.
Configure the Velero plugin
The Velero plugin handles the transformation of Kubernetes resources during the restore. Most importantly, it ensures Postgres clusters are restored in a state that allows you to manually trigger their data recovery.
List the available backups and note the Name and Timestamp of your preferred recovery point:
```shell
velero get backups -o json --selector velero.io/storage-location=recovery \
  | jq -rc '(["Name", "Timestamp"]), (.items // [.] | .[] | [.metadata.name, .metadata.creationTimestamp]) | @tsv' \
  | column -t -s "$(printf '\t')"
```
Export the environment variables:
```shell
export BACKUP_TIMESTAMP=<recovery-date in YYYY-MM-DDTHH:MM:SSZ format>

# The BACKUP_NAME may already be available in your terminal from an earlier step.
export BACKUP_NAME=<selected-name>

# These environment variables should already be available in your terminal:
export OLD_BUCKET=<old-bucket-name>
export NEW_BUCKET=<new-bucket-name>
```
Important
The `BACKUP_TIMESTAMP` must be the exact ISO timestamp found in the previous step, for example `2024-12-16T15:44:03Z`.

Create and apply a `ConfigMap` to configure the Velero plugin:

```shell
kubectl apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: velero-plugin-for-edbpgai
  namespace: velero
  labels:
    velero.io/plugin-config: ""
    enterprisedb.io/edbpgai-plugin: RestoreItemAction
data:
  # configure disaster recovery mode, so restored items are transformed as needed
  drMode: "true"
  # configure a date corresponding to the velero backup date. Note the format!
  drDate: "${BACKUP_TIMESTAMP}"
  # old and new buckets for internal custom storage locations
  oldBucket: ${OLD_BUCKET}
  newBucket: ${NEW_BUCKET}
EOF
```
Restore resources
Restore Managed Storage Locations by applying the following Velero restore. This includes the default `managed-devspatcher` location as well as any additional custom-defined locations.

```shell
kubectl apply -f - <<EOF
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: restore-1-storagelocations
  namespace: velero
spec:
  backupName: "${BACKUP_NAME}"
  includedResources:
    - storagelocations.biganimal.enterprisedb.com
  includeClusterResources: true
  labelSelector:
    matchLabels:
      biganimal.enterprisedb.io/reserved-by-biganimal: "false"
EOF
```
Configure and apply the following Velero restore resource manifest to restore the cluster wrappers:
```shell
kubectl apply -f - <<EOF
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: restore-2-clusterwrappers
  namespace: velero
spec:
  backupName: "${BACKUP_NAME}"
  includedResources:
    - clusterwrappers.beacon.enterprisedb.com
  restoreStatus:
    includedResources:
      - clusterwrappers.beacon.enterprisedb.com
EOF
```
Monitor the restore progress. You must wait until `clusterwrappers` is restored first, because the following custom resources (CRs) depend on it. If the corresponding `clusterwrapper` isn't found, HM could delete the other CRs.

```shell
velero get restore restore-2-clusterwrappers
```
After the cluster wrappers are restored, configure and apply the following Velero resource manifest to restore the backup wrappers:
```shell
kubectl apply -f - <<EOF
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: restore-3-backupwrappers
  namespace: velero
spec:
  backupName: "${BACKUP_NAME}"
  includedResources:
    - backupwrappers.beacon.enterprisedb.com
  restoreStatus:
    includedResources:
      - backupwrappers.beacon.enterprisedb.com
EOF
```
Configure and apply the following Velero resource manifest to restore the Griptape, Lakekeeper, and Dex secrets:

```shell
kubectl apply -f - <<EOF
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: restore-4-required-secrets
  namespace: velero
spec:
  backupName: "${BACKUP_NAME}"
  includedNamespaces:
    - upm-griptape
    - upm-lakekeeper
    - upm-dex
  includedResources:
    - secrets
  includeClusterResources: false
EOF
```
(Optional) If you are running AI workloads, configure and apply the following Velero restore resource manifest to restore kserve resources:
```shell
kubectl apply -f - <<EOF
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: restore-5-kservecrs
  namespace: velero
spec:
  backupName: "${BACKUP_NAME}"
  includedResources:
    - clusterservingruntimes.serving.kserve.io
    - inferenceservices.serving.kserve.io
EOF
```
Monitor all restores and wait for them to be completed:
velero get restores
3. Restore Postgres clusters
The cluster metadata has been restored, but the HM-managed Postgres clusters must be manually re-provisioned to link back to your data.
In the HM console, navigate to the databases section. You will see your original clusters listed with a status of Deleted.
Select the desired cluster and locate the Restore button. Follow the prompts to create a new cluster. During this process, the system will use your previous backups to populate the new instance.
After provisioning is complete, verify that the data matches your original state.
You can apply the same procedure to restore any Postgres clusters you had configured on a secondary location.
Note
AI components (such as the GenAI Builder UI in the Launchpad section) will automatically reappear in the HM console once the restore is initiated. Due to the large size of container images and profiles, synchronization may take some time.
4. Validate the restore
The restoration procedure is now complete. To ensure a successful recovery, we recommend checking for data integrity. Log in to the newly provisioned Postgres cluster and run a few test queries to confirm your data is current and accessible.
Tip
If you are performing this as part of a DR drill, internally document the total "Time to Restore" (TTR) for both the database and AI layers to help refine your recovery objectives (RTO).