Hybrid Manager disaster recovery Innovation Release
The Disaster Recovery (DR) procedure is defined as the series of manual steps that you need to take to recover your HM installation and your HM-managed Postgres clusters.
Warning
You must regularly test and update your organization's DR procedure for it to remain valid.
Before you start
Before starting the DR procedure, ensure you are familiar with the following prerequisites and required tools.
Prerequisites
A new HM instance deployed and running. It must be running the same version as the old instance that failed or became unavailable.
The container images used to build the clusters in the old, unavailable HM instance are available to the new one.
Required tools
Ensure the following tools are available on your workstation environment or Bastion host:
- `jq` (command-line JSON processor)
- `yq` (command-line YAML processor)
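Before proceeding, you can quickly confirm that both tools are on your `PATH`. This is a minimal sketch, not part of the official procedure:

```shell
# Sanity check: report any missing required tool before starting the DR steps.
for tool in jq yq; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "found: $tool"
  else
    echo "missing: $tool"
  fi
done
```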
1. Make backups available in the new HM instance
The first step makes the backups of the unavailable HM instance ("old backups") reachable from the new HM instance by copying them to the linked storage (the new bucket) of the new HM instance.
Obtain the bucket names, the backup ID, and the region of the new bucket. Store these as environment variables to be used in the commands throughout this guide:
Important
These variables are session-specific. If you open a new terminal tab or your session times out, the variables are lost and subsequent commands will fail. To avoid re-typing these values, you can save the export commands in a small shell script (for example, `dr-env.sh`) and source that file in any new terminal window to reload your environment.

```shell
export OLD_BUCKET=<old-bucket-name>
export OLD_BACKUP_ID=<old-environment-internal-backup-id>
export NEW_BUCKET=<new-bucket-name>
export NEW_REGION=<region-of-the-new-bucket>
```
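As a sketch of the approach suggested above, you could persist the variables in a `dr-env.sh` file and source it in each new terminal. The values below are taken from the EKS example later in this guide; substitute your own:

```shell
# Write the session variables to a reusable script (example values shown).
cat > dr-env.sh <<'EOF'
export OLD_BUCKET=eks-1105143903-2511-edb-postgres
export OLD_BACKUP_ID=a7462dbc7106
export NEW_BUCKET=eks-1105155418-2511-edb-postgres
export NEW_REGION=eu-west-3
EOF

# In any new terminal, reload the environment:
. ./dr-env.sh
echo "$NEW_REGION"
```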
How do I obtain the old bucket values?
To obtain the old bucket values:
- Go to the console/dashboard of your CSP > buckets.
- Find and select the bucket linked to the backups of your old HM instance.
- Browse to the `edb-internal-backups` folder. Inside that folder you will find a subfolder named with the backup ID, for example `4be7a1c8c9f0`.
EKS Example
This is an example for setting the environment variables for an HM instance deployed on EKS:
```shell
export OLD_BUCKET=eks-1105143903-2511-edb-postgres
export OLD_BACKUP_ID=a7462dbc7106
export NEW_BUCKET=eks-1105155418-2511-edb-postgres
export NEW_REGION=eu-west-3
```
To copy the data from the old bucket to the new bucket, you first need to locate and note the names of the source and target folders. You need to copy the following folders and their content:

- Internal EDB backups folder: the internal backups folder in the old bucket, `edb-internal-backups/<random-string>`. It is different in the new HM instance, which has a different `<random-string>`.
- Postgres cluster backups folder: `customer-pg-backups`.
- Folders corresponding to any defined custom storage locations: if you use Managed Storage Locations in the HM console (for example, for offloading Postgres queries), ensure the corresponding folders are copied from the old S3-compatible bucket to the new one. While the definitions are restored via Velero, the actual data inside those custom folders must be migrated manually to the new target bucket.
Copy the old backups to the new bucket using your preferred tools. Here are some examples using cloud service provider CLIs to move data between buckets:
```shell
aws s3 cp --recursive s3://${OLD_BUCKET}/edb-internal-backups/${OLD_BACKUP_ID} s3://${NEW_BUCKET}/edb-internal-backups
aws s3 cp --recursive s3://${OLD_BUCKET}/customer-pg-backups s3://${NEW_BUCKET}/customer-pg-backups
```
If you have configured additional Managed Storage Locations, use the same method to copy those folders.
```shell
gcloud storage cp gs://${OLD_BUCKET}/edb-internal-backups/${OLD_BACKUP_ID} gs://${NEW_BUCKET}/edb-internal-backups --recursive
gcloud storage cp gs://${OLD_BUCKET}/customer-pg-backups gs://${NEW_BUCKET}/customer-pg-backups --recursive
```
If you have configured additional Managed Storage Locations, use the same method to copy those folders.
Load the backups you just copied to your new HM instance by creating a new custom resource definition and applying it to the new HM instance:
```shell
kubectl apply -f - <<EOF
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  annotations:
    appliance.enterprisedb.com/s3-prefixes: edb-internal-backups/velero
  labels:
    appliance.enterprisedb.com/storage-credentials: bound
  name: recovery
  namespace: velero
spec:
  accessMode: ReadOnly
  config:
    insecureSkipTLSVerify: "false"
    region: ${NEW_REGION}
    s3ForcePathStyle: "true"
  default: false
  objectStorage:
    bucket: ${NEW_BUCKET}
    prefix: edb-internal-backups/velero
  provider: aws
EOF
```
```shell
kubectl apply -f - <<EOF
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  annotations:
    appliance.enterprisedb.com/s3-prefixes: edb-internal-backups/velero
  labels:
    appliance.enterprisedb.com/storage-credentials: bound
  name: recovery
  namespace: velero
spec:
  accessMode: ReadOnly
  credential:
    key: gcp
    name: gcs-credentials
  default: false
  objectStorage:
    bucket: ${NEW_BUCKET}
    prefix: edb-internal-backups/velero
  provider: gcp
EOF
```
Confirm that the new storage location is available:
velero get backup-locations
If the status is not `Available`, check the Velero pod logs for permission errors on the S3 bucket.

Confirm that the backups are available as well:
velero get backups --selector velero.io/storage-location=recovery
Choose the backup you want to restore from. You can have multiple backups available, so choose the one that best suits your needs, e.g. the most recent backup before the disaster happened. Note the Velero backup name, as well as the date and time (UTC), as both are required for a restore, for example:
```
NAME                                     STATUS     ERRORS  WARNINGS  CREATED                        EXPIRES  STORAGE LOCATION  SELECTOR
velero-backup-kube-state-20241216154403  Completed  0       0         2024-12-16 16:44:03 +0100 CET  5d       recovery          <none>
```
Note
The timestamp value is referred to as the recovery date in the instructions that follow.
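If you want to avoid transcription errors, the ISO recovery date can be derived from the backup name suffix. This is a sketch that assumes the 14-digit suffix encodes the UTC creation time, as in the example above:

```shell
# Convert the 14-digit suffix of the Velero backup name into the
# YYYY-MM-DDTHH:MM:SSZ format used later for the recovery date.
BACKUP_NAME=velero-backup-kube-state-20241216154403   # example name
TS=${BACKUP_NAME##*-}
BACKUP_TIMESTAMP=$(echo "$TS" | sed -E 's/^([0-9]{4})([0-9]{2})([0-9]{2})([0-9]{2})([0-9]{2})([0-9]{2})$/\1-\2-\3T\4:\5:\6Z/')
echo "$BACKUP_TIMESTAMP"   # 2024-12-16T15:44:03Z
```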
(Optional) If you were using HM to manage AI workloads, e.g. with the GenAI Builder, also copy the object store files and CORS configuration from the old bucket to the new one:
```shell
export OLD_BUCKET_DATALAKE=<old-bucket-name>
export NEW_BUCKET_DATALAKE=<new-bucket-name>
```
```shell
# Copy data lake objects from old bucket to new bucket
aws s3 cp --recursive s3://${OLD_BUCKET_DATALAKE}/ s3://${NEW_BUCKET_DATALAKE}/

# Copy CORS configuration from old bucket to new bucket
aws s3api get-bucket-cors --bucket ${OLD_BUCKET_DATALAKE} --output json > cors-config.json
aws s3api put-bucket-cors --bucket ${NEW_BUCKET_DATALAKE} --cors-configuration file://cors-config.json
```
```shell
# Copy data lake objects from old bucket to new bucket
gcloud storage cp "gs://${OLD_BUCKET_DATALAKE}/**" gs://${NEW_BUCKET_DATALAKE}/ --recursive

# Copy CORS configuration from old bucket to new bucket
gcloud storage buckets describe gs://${OLD_BUCKET_DATALAKE} --format="json" | jq .cors_config > cors-config.json
gcloud storage buckets update gs://${NEW_BUCKET_DATALAKE} --cors-file=cors-config.json
```
2. Recovery steps
Restore HM-internal databases
After the old backups are available in the new bucket, you can restore the HM-internal databases. These are back-end services used by HM and are required to fully restore the HM instance. Depending on the HM version you are using and on the installation scenario you have deployed, the list of databases may vary.
To simplify this process, run the following script with your kubeconfig pointing to your new HM installation:
Script details
This patch script takes care of saving the HM-internal database cluster manifests to YAML files, generating two directories:

- One directory, `old-cluster-configs`, with the current state of the database clusters in the new HM installation (the default configuration after installation).
- Another, `new-cluster-configs`, with the same files, where the script performs the patches required so the HM-internal databases start using the data from the backups.
Suspend the reconciliation of all HM-internal database clusters, so that you can safely remove the old Custom Resource Definitions (CRDs) of the database clusters without having the operator recreating them by default:
```shell
HCP_CR=$(kubectl get hybridcontrolplanes.edbpgai.edb.com -A -o json | jq -rc '.items[0] | .metadata.name')
for CLUSTER in $(kubectl get clusters.postgresql.k8s.enterprisedb.io -A -o json | jq -rc '.items[].metadata | select((.name | test("^p-") | not) and (.name != "stats-collector-db")) | {namespace: .namespace}' | uniq)
do
  NAMESPACE=$(echo "${CLUSTER}" | jq -rc '.namespace')
  if [[ "${NAMESPACE}" == "upm-system-db" ]]
  then
    COMPONENT="upm-app-db"
  else
    COMPONENT="${NAMESPACE}"
  fi
  INDEX=$(kubectl get hybridcontrolplane ${HCP_CR} -o json | jq '.status.components | to_entries[] | select(.value.name=='\"${COMPONENT}\"') | .key')
  kubectl patch hybridcontrolplane ${HCP_CR} --subresource=status --type=json -p "[{\"op\": \"replace\", \"path\": \"/status/components/$INDEX/suspended\", \"value\": true}]"
done
```
Verify that the components have been suspended correctly:
```shell
HCP_CR=$(kubectl get hybridcontrolplanes.edbpgai.edb.com -A -o json | jq -rc '.items[0] | .metadata.name')
kubectl hcp status -n edbpgai-bootstrap "${HCP_CR}"
```
Delete the HM-internal database clusters that were created during installation of the new HM instance to make room for the HM-internal database clusters that will be recovered from the backup:
```shell
for CONFIG in $(find new-cluster-configs -type f)
do
  kubectl delete -f $CONFIG
done
```
Clean the backup area that was created during the installation of the new HM instance to avoid confusion with the old backups that you want to restore:
for CONFIG in $(find new-cluster-configs -type f) do NAME=$(yq '.metadata.name' $CONFIG) NAMESPACE=$(yq '.metadata.namespace' $CONFIG) # Try to get PREFIX from cluster config first, fallback to ObjectStore if it fails PREFIX=$(yq '.spec.backup.barmanObjectStore.destinationPath | downcase' $CONFIG 2>/dev/null) if [ -z "$PREFIX" ] || [ "$PREFIX" = "null" ]; then # destinationPath doesn't exist in the cluster config (after CNPG-I barman plugin migration) # Get it from the ObjectStore resource instead PREFIX=$(kubectl -n $NAMESPACE get objectstores.barmancloud.cnpg.io $NAME -o yaml | yq '.spec.configuration.destinationPath | downcase') fi aws s3 rm --recursive ${PREFIX}/${NAME} done
for CONFIG in $(find new-cluster-configs -type f) do NAME=$(yq '.metadata.name' $CONFIG) NAMESPACE=$(yq '.metadata.namespace' $CONFIG) # Try to get PREFIX from cluster config first, fallback to ObjectStore if it fails PREFIX=$(yq '.spec.backup.barmanObjectStore.destinationPath | downcase' $CONFIG 2>/dev/null) if [ -z "$PREFIX" ] || [ "$PREFIX" = "null" ]; then # destinationPath doesn't exist in the cluster config (after CNPG-I barman plugin migration) # Get it from the ObjectStore resource instead PREFIX=$(kubectl -n $NAMESPACE get objectstores.barmancloud.cnpg.io $NAME -o yaml | yq '.spec.configuration.destinationPath | downcase') fi gsutil -m rm -r ${PREFIX}/${NAME} done
Replace the `hm-portal-bootstrap` secret with the version from the backup.

Note
This step applies only to HM versions `2026.2` and later. Run `kubectl get secret hm-portal-bootstrap -n default` against your new HM instance; if the secret doesn't exist, you can skip this step, because you are using a version of HM from before `2026.2`.

The `hm-portal-bootstrap` secret on the new HM instance needs to be replaced with the version that was on the old HM instance. This secret contains a Fernet key used for encrypting sensitive configuration data, and the static passwords used for initial bootstrap access. Without the correct Fernet key from the old environment, HM can't decrypt some of the data restored in the next step.

This step involves restoring a single secret found in the `default` namespace. Velero doesn't support this well, as it is typically used to restore all resources of a given type for a given namespace. Therefore, the only way to perform this restoration is to download the entire Velero backup and pull out a specific file.

To select a specific file, you first need to select a backup to use. Start by listing all the available backups with the following command:
```shell
velero get backups -o json --selector velero.io/storage-location=recovery \
  | jq -rc '(["Name", "Timestamp"]), (.items // [.] | .[] | [.metadata.name, .metadata.creationTimestamp]) | @tsv' \
  | column -t -s "$(printf '\t')"
```

Select a backup from this list, typically the latest, and export the following variable with the selected name:

```shell
export BACKUP_NAME=<selected-name>
```
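Alternatively, since `jq` is already required for this procedure, you could pick the most recent backup automatically. This is a sketch that assumes the JSON shape returned by `velero get backups -o json`, as used above:

```shell
# jq filter that sorts backups by creation time and returns the latest name.
LATEST_FILTER='(.items // [.]) | sort_by(.metadata.creationTimestamp) | last | .metadata.name'

# Against a live cluster you would run:
#   export BACKUP_NAME=$(velero get backups -o json \
#     --selector velero.io/storage-location=recovery | jq -r "$LATEST_FILTER")

# Demonstration against a small sample document:
echo '{"items":[
  {"metadata":{"name":"velero-backup-kube-state-20241215100000","creationTimestamp":"2024-12-15T10:00:00Z"}},
  {"metadata":{"name":"velero-backup-kube-state-20241216154403","creationTimestamp":"2024-12-16T15:44:03Z"}}]}' \
  | jq -r "$LATEST_FILTER"
```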
Now download and extract the full backup:
```shell
mkdir "${BACKUP_NAME}"
velero backup download "${BACKUP_NAME}" -o "${BACKUP_NAME}/backup.tar.gz"
tar -xzf "${BACKUP_NAME}/backup.tar.gz" -C "${BACKUP_NAME}"
```
Confirm that the file `${BACKUP_NAME}/resources/secrets/namespaces/default/hm-portal-bootstrap.json` exists. Then delete the existing `hm-portal-bootstrap` secret and replace it with the one from the backup you just downloaded:

```shell
kubectl delete secret hm-portal-bootstrap -n default
kubectl apply -f "${BACKUP_NAME}/resources/secrets/namespaces/default/hm-portal-bootstrap.json"
```
Confirm the secret is there using the following command:

```shell
kubectl get secret hm-portal-bootstrap -n default
```

You can now remove the local copy of the backup you downloaded, as you won't need these files again:

```shell
rm -rf "${BACKUP_NAME}"
```
Apply the YAML file for all the HM-internal database clusters to be re-created with the backup data:
```shell
for CONFIG in $(find new-cluster-configs -type f)
do
  kubectl apply -f $CONFIG
done
```
You can monitor the restore progress using `kubectl get clusters -A`.
Restart HM services
After all HM-internal database clusters are successfully restored and reporting a healthy state, perform this one-time restart of the management server to refresh the HM console:
```shell
kubectl delete pods $(kubectl get pods -n upm-beaco-ff-base | grep '^accm-server' | awk '{print $1}') -n upm-beaco-ff-base
```
Wait for the new pod to reach the Running state. At this point, the HM console is available, though it won't yet show your HM-managed Postgres clusters.
Configure the Velero plugin
The Velero plugin handles the transformation of Kubernetes resources during the restore. Most importantly, it ensures Postgres clusters are restored in a state that allows you to manually trigger their data recovery.
List the available backups and note the Name and Timestamp of your preferred recovery point:
```shell
velero get backups -o json --selector velero.io/storage-location=recovery \
  | jq -rc '(["Name", "Timestamp"]), (.items // [.] | .[] | [.metadata.name, .metadata.creationTimestamp]) | @tsv' \
  | column -t -s "$(printf '\t')"
```
Export the environment variables:
```shell
export BACKUP_TIMESTAMP=<recovery-date in YYYY-MM-DDTHH:MM:SSZ format>

# The BACKUP_NAME may already be available in your terminal from an earlier step.
export BACKUP_NAME=<selected-name>

# These environment variables should already be available in your terminal:
export OLD_BUCKET=<old-bucket-name>
export NEW_BUCKET=<new-bucket-name>
```
Important
The `BACKUP_TIMESTAMP` must be the exact ISO timestamp found in the previous step, for example `2024-12-16T15:44:03Z`.

Create and apply a `ConfigMap` to configure the Velero plugin:

```shell
kubectl apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: velero-plugin-for-edbpgai
  namespace: velero
  labels:
    velero.io/plugin-config: ""
    enterprisedb.io/edbpgai-plugin: RestoreItemAction
data:
  # configure disaster recovery mode, so restored items are transformed as needed
  drMode: "true"
  # configure a date corresponding to the velero backup date. Note the format!
  drDate: "${BACKUP_TIMESTAMP}"
  # old and new buckets for internal custom storage locations
  oldBucket: ${OLD_BUCKET}
  newBucket: ${NEW_BUCKET}
EOF
```
Restore resources
Restore Managed Storage Locations by applying the following Velero restore. This includes the default `managed-devspatcher` location as well as any additional custom-defined locations.

```shell
kubectl apply -f - <<EOF
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: restore-1-storagelocations
  namespace: velero
spec:
  backupName: "${BACKUP_NAME}"
  includedResources:
    - storagelocations.biganimal.enterprisedb.com
  includeClusterResources: true
  labelSelector:
    matchLabels:
      biganimal.enterprisedb.io/reserved-by-biganimal: "false"
EOF
```
Configure and apply the following Velero restore resource manifest to restore the cluster wrappers:
```shell
kubectl apply -f - <<EOF
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: restore-2-clusterwrappers
  namespace: velero
spec:
  backupName: "${BACKUP_NAME}"
  includedResources:
    - clusterwrappers.beacon.enterprisedb.com
  restoreStatus:
    includedResources:
      - clusterwrappers.beacon.enterprisedb.com
EOF
```
Monitor the restore progress. You must wait until `clusterwrappers` is restored first, because the following custom resources (CRs) depend on it. If the corresponding `clusterwrapper` isn't found, HM could delete the other CRs.

```shell
velero get restore restore-2-clusterwrappers
```
After the cluster wrappers are restored, configure and apply the following Velero resource manifest to restore the backup wrappers:
```shell
kubectl apply -f - <<EOF
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: restore-3-backupwrappers
  namespace: velero
spec:
  backupName: "${BACKUP_NAME}"
  includedResources:
    - backupwrappers.beacon.enterprisedb.com
  restoreStatus:
    includedResources:
      - backupwrappers.beacon.enterprisedb.com
EOF
```
Configure and apply the following Velero resource manifest to restore the Griptape, Lakekeeper, and Dex secrets:

```shell
kubectl apply -f - <<EOF
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: restore-4-required-secrets
  namespace: velero
spec:
  backupName: "${BACKUP_NAME}"
  includedNamespaces:
    - upm-griptape
    - upm-lakekeeper
    - upm-dex
  includedResources:
    - secrets
  includeClusterResources: false
EOF
```
(Optional) If you are running AI workloads, configure and apply the following Velero restore resource manifest to restore kserve resources:
```shell
kubectl apply -f - <<EOF
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: restore-5-kservecrs
  namespace: velero
spec:
  backupName: "${BACKUP_NAME}"
  includedResources:
    - clusterservingruntimes.serving.kserve.io
    - inferenceservices.serving.kserve.io
EOF
```
Monitor all restores and wait for them to be completed:
velero get restores
3. Restore Postgres clusters
The cluster metadata has been restored, but the HM-managed Postgres clusters must be manually re-provisioned to link back to your data.
In the HM console, navigate to the databases section. You will see your original clusters listed with a status of Deleted.
Select the desired cluster and locate the Restore button. Follow the prompts to create a new cluster. During this process, the system will use your previous backups to populate the new instance.
After provisioning is complete, verify that the data matches your original state.
You can apply the same procedure to restore any Postgres clusters you had configured on a secondary location.
Note
AI components (such as the GenAI Builder UI in the Launchpad section) will automatically reappear in the HM console once the restore is initiated. Due to the large size of container images and profiles, synchronization may take some time.
4. Validate the restore
The restoration procedure is now complete. To ensure a successful recovery, we recommend checking for data integrity. Log in to the newly provisioned Postgres cluster and run a few test queries to confirm your data is current and accessible.
Tip
If you are performing this as part of a DR drill, internally document the total "Time to Restore" (TTR) for both the database and AI layers to help refine your recovery objectives (RTO).