Disaster Recovery#
Backup#
The state of a Hopsworks cluster is split between data and metadata and distributed across multiple services. This section explains how to take consistent backups for the offline and online feature stores as well as cluster metadata.
In Hopsworks, a consistent backup should back up the following services:
- RonDB: cluster metadata and the online feature store data.
- HopsFS: offline feature store data plus checkpoints and logs for feature engineering applications.
- Opensearch: search metadata, logs, dashboards, and user embeddings.
- Kubernetes objects: cluster credentials, backup metadata, serving metadata, and project namespaces with service accounts, roles, secrets, and configmaps.
- Python environments: custom project environments are stored in your configured container registry. Back up the registry separately. If a project and its environment are deleted, you must recreate the environment after restore.
Besides the services above, Hopsworks also uses Apache Kafka, which carries in-flight data on its way to the online feature store. This in-flight data is not captured by the backup: in the event of a total cluster loss, running jobs with in-flight data must be replayed.
Prerequisites#
When you enable backups in Hopsworks, cron jobs are configured for RonDB and Opensearch. For HopsFS, backups rely on versioning in the object store. For Kubernetes objects, Hopsworks uses Velero to snapshot the required resources. Before enabling backups:
- Enable versioning on the S3-compatible bucket used for HopsFS (see the example after this list).
- Install and configure Velero with the AWS plugin (S3).
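For the first prerequisite, versioning can be enabled on an AWS S3 bucket with the AWS CLI. A minimal sketch, assuming BUCKET_NAME is the bucket backing HopsFS (other S3-compatible stores expose an equivalent API):

aws s3api put-bucket-versioning \
  --bucket BUCKET_NAME \
  --versioning-configuration Status=Enabled

# Confirm that versioning is now reported as Enabled
aws s3api get-bucket-versioning --bucket BUCKET_NAME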
Install Velero#
Velero provides backup and restore for Kubernetes resources. Install it with either the Velero CLI or Helm (Velero docs: Velero basic install guide).
- Using the Velero CLI, set up the CRDs and deployment:
velero install \
--image velero/velero:v1.17.1 \
--plugins velero/velero-plugin-for-aws:v1.13.0 \
--no-default-backup-location \
--no-secret \
--use-volume-snapshots=false \
--wait
- Using the Velero Helm chart:
helm repo add vmware-tanzu https://vmware-tanzu.github.io/helm-charts
helm repo update
helm install velero vmware-tanzu/velero \
--namespace velero \
--version 11.2.0 \
--create-namespace \
--set "initContainers[0].name=velero-plugin-for-aws" \
--set "initContainers[0].image=velero/velero-plugin-for-aws:v1.13.0" \
--set "initContainers[0].volumeMounts[0].mountPath=/target" \
--set "initContainers[0].volumeMounts[0].name=plugins" \
--set-json configuration.backupStorageLocation='[]' \
--set "credentials.useSecret=false" \
--set "snapshotsEnabled=false" \
--wait
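Whichever method you use, Velero should end up running in the velero namespace with its CRDs registered. A quick sanity check, assuming the default velero namespace:

# The velero deployment should be Ready
kubectl get deployment velero -n velero

# The Velero CRDs used later during restore should be present
kubectl get crd backups.velero.io backupstoragelocations.velero.io restores.velero.io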
Configuring Backup#
Note
Backup is only supported for clusters that use S3-compatible object storage.
You can enable backups during installation or a later upgrade. Set the schedule with a cron expression in the values file:
global:
  _hopsworks:
    backups:
      enabled: true
      schedule: "@weekly"
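To apply the change to an existing installation, upgrade the Helm release with the updated values file. A minimal sketch, assuming the release and namespace are both named hopsworks and the chart reference is hopsworks/hopsworks (adjust these to your deployment):

helm upgrade hopsworks hopsworks/hopsworks \
  --namespace hopsworks \
  -f values.yaml \
  --wait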
After configuring backups, go to the cluster settings and open the Backup tab. If everything is configured correctly, the status shows as enabled at the top level and for each service.
If any service is misconfigured, the backup status shows as partial; for example, Velero is reported as disabled if it was not configured correctly. Fix partial backups before relying on them for recovery.
Cleanup#
Use the backup time-to-live (ttl) flag to automatically prune backups older than the configured duration.
global:
  _hopsworks:
    backups:
      enabled: true
      schedule: "@weekly"
      ttl: 60d
For S3 object storage, you can also configure a bucket lifecycle policy to expire old object versions. Example for AWS S3:
{
  "Rules": [
    {
      "ID": "HopsFSBlocksRetentionPolicy",
      "Status": "Enabled",
      "Filter": {},
      "Expiration": {
        "ExpiredObjectDeleteMarker": true
      },
      "NoncurrentVersionExpiration": {
        "NoncurrentDays": 60
      }
    }
  ]
}
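Assuming the policy above is saved as lifecycle.json, it can be applied with the AWS CLI (BUCKET_NAME is a placeholder for the HopsFS bucket):

aws s3api put-bucket-lifecycle-configuration \
  --bucket BUCKET_NAME \
  --lifecycle-configuration file://lifecycle.json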
Restore#
Note
Restore is only supported in a newly created cluster; in-place restore is not supported. Use the exact Hopsworks version that was used to create the backup.
The restore process has two phases:
- Restore Kubernetes objects required for the cluster restore.
- Install the cluster with Helm using the correct backup IDs.
Restore Kubernetes objects#
Restore the Kubernetes objects that were backed up using Velero.
- Ensure that Velero is installed and configured with the AWS plugin as described in the prerequisites.
- Set up a Velero backup storage location that points to the S3 bucket.
  - If you are using AWS S3 and access is controlled by an IAM role:

kubectl apply -f - <<EOF
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: hopsworks-bsl
  namespace: velero
spec:
  provider: aws
  config:
    region: REGION
  objectStorage:
    bucket: BUCKET_NAME
    prefix: k8s_backup
EOF

  - If you are using an S3-compatible object storage, provide credentials and an endpoint:

cat << EOF > hopsworks-bsl-credentials
[default]
aws_access_key_id=YOUR_ACCESS_KEY
aws_secret_access_key=YOUR_SECRET_KEY
EOF

kubectl create secret generic -n velero hopsworks-bsl-credentials --from-file=cloud=hopsworks-bsl-credentials

kubectl apply -f - <<EOF
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: hopsworks-bsl
  namespace: velero
spec:
  provider: aws
  config:
    region: REGION
    s3Url: ENDPOINT
  credential:
    key: cloud
    name: hopsworks-bsl-credentials
  objectStorage:
    bucket: BUCKET_NAME
    prefix: k8s_backup
EOF

- After the backup storage location becomes available, restore the backups. The following script restores the latest available backup. To restore a specific backup, set backupName instead of scheduleName.
echo "=== Waiting for Velero BackupStorageLocation hopsworks-bsl to become Available ==="
until [ "$(kubectl get backupstoragelocations hopsworks-bsl -n velero -o jsonpath='{.status.phase}' 2>/dev/null)" = "Available" ]; do
echo "Still waiting..."; sleep 5;
done
echo "=== Waiting for Velero to sync the backups from hopsworks-bsl ==="
until [ "$(kubectl get backups -n velero -ojson | jq -r '[.items[] | select(.spec.storageLocation == "hopsworks-bsl")] | length' 2>/dev/null)" != "0" ]; do
echo "Still waiting..."; sleep 5;
done
# Restores the latest backup; to restore a specific backup, set backupName instead of scheduleName
echo "=== Creating Velero Restore object for k8s-backups-main ==="
RESTORE_SUFFIX=$(date +%s)
kubectl apply -f - <<EOF
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: k8s-backups-main-restore-$RESTORE_SUFFIX
  namespace: velero
spec:
  scheduleName: k8s-backups-main
EOF
echo "=== Waiting for Velero restore to finish ==="
until [ "$(kubectl get restore k8s-backups-main-restore-$RESTORE_SUFFIX -n velero -o jsonpath='{.status.phase}' 2>/dev/null)" = "Completed" ]; do
echo "Still waiting..."; sleep 5;
done
# Restores the latest backup; to restore a specific backup, set backupName instead of scheduleName
echo "=== Creating Velero Restore object for k8s-backups-users-resources ==="
kubectl apply -f - <<EOF
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: k8s-backups-users-resources-restore-$RESTORE_SUFFIX
  namespace: velero
spec:
  scheduleName: k8s-backups-users-resources
EOF
echo "=== Waiting for Velero restore to finish ==="
until [ "$(kubectl get restore k8s-backups-users-resources-restore-$RESTORE_SUFFIX -n velero -o jsonpath='{.status.phase}' 2>/dev/null)" = "Completed" ]; do
echo "Still waiting..."; sleep 5;
done
After the restore completes, verify the restored resources in Kubernetes. RonDB and Opensearch store their backup metadata in the rondb-backups-metadata and opensearch-backups-metadata configmaps. Use the commands below to list successful backup IDs (newest first) that can be referenced during cluster installation.
kubectl get configmap rondb-backups-metadata -n hopsworks -o json \
| jq -r '.data | to_entries[] | select(.value | fromjson | .state == "SUCCESS") | .key' \
| sort -nr
kubectl get configmap opensearch-backups-metadata -n hopsworks -o json \
| jq -r '.data | to_entries[] | select(.value | fromjson | .state == "SUCCESS") | .key' \
| sort -nr
Restore on Cluster installation#
To restore a cluster during installation, configure the backup ID in the values YAML file:
global:
  _hopsworks:
    backups:
      enabled: true
      schedule: "@weekly"
    restoreFromBackup:
      backupId: "254811200"
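The restore then runs as part of the installation. A minimal sketch of the install command, assuming the release and namespace are both named hopsworks, the chart reference is hopsworks/hopsworks, and the values above are stored in values.yaml (adjust these to your deployment):

helm install hopsworks hopsworks/hopsworks \
  --namespace hopsworks \
  --create-namespace \
  -f values.yaml \
  --wait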
Customizations#
Warning
Even if you override the backup IDs for RonDB and Opensearch, you must still set .global._hopsworks.restoreFromBackup.backupId to ensure HopsFS is restored.
To restore a different backup ID for RonDB:
global:
  _hopsworks:
    backups:
      enabled: true
      schedule: "@weekly"
    restoreFromBackup:
      backupId: "254811200"
rondb:
  rondb:
    restoreFromBackup:
      backupId: "254811140"
To restore a different backup for Opensearch:
global:
  _hopsworks:
    backups:
      enabled: true
      schedule: "@weekly"
    restoreFromBackup:
      backupId: "254811200"
olk:
  opensearch:
    restore:
      repositories:
        default:
          snapshots:
            default:
              snapshot_name: "254811140"
You can also customize the Opensearch restore process to skip specific indices:
global:
  _hopsworks:
    backups:
      enabled: true
      schedule: "@weekly"
    restoreFromBackup:
      backupId: "254811200"
olk:
  opensearch:
    restore:
      repositories:
        default:
          snapshots:
            default:
              snapshot_name: "254811140"
              payload:
                indices: "-myindex"