Disaster Recovery#
Backup#
The state of a Hopsworks cluster is split between data and metadata and distributed across multiple services. This section explains how to take consistent backups for the offline and online feature stores as well as cluster metadata.
In Hopsworks, a consistent backup should back up the following services:
- RonDB: cluster metadata and the online feature store data.
- HopsFS: offline feature store data plus checkpoints and logs for feature engineering applications.
- Opensearch: search metadata, logs, dashboards, and user embeddings.
- Kubernetes objects: cluster credentials, backup metadata, serving metadata, and project namespaces with service accounts, roles, secrets, and configmaps.
- Python environments: custom project environments are stored in your configured container registry. Back up the registry separately. If a project and its environment are deleted, you must recreate the environment after restore.
Besides the services above, Hopsworks also uses Apache Kafka, which carries in-flight data on its way to the online feature store. This in-flight data is not captured by the backup: in the event of a total cluster loss, running jobs with in-flight data must be replayed.
Prerequisites#
When you enable backups in Hopsworks, cron jobs are configured for RonDB and Opensearch. For HopsFS, backups rely on versioning in the object store. For Kubernetes objects, Hopsworks uses Velero to snapshot the required resources. Before enabling backups:
- Enable versioning on the S3-compatible bucket used for HopsFS (see the example after this list).
- Install and configure Velero with the AWS plugin (S3).
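For the first prerequisite, versioning can be enabled on an AWS S3 bucket with the AWS CLI. A minimal sketch, assuming BUCKET_NAME is the bucket backing HopsFS (other S3-compatible stores expose an equivalent API):

aws s3api put-bucket-versioning \
  --bucket BUCKET_NAME \
  --versioning-configuration Status=Enabled

# Confirm that versioning is now reported as Enabled
aws s3api get-bucket-versioning --bucket BUCKET_NAME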
Install Velero#
Velero provides backup and restore for Kubernetes resources. Install it with either the Velero CLI or Helm (Velero docs: Velero basic install guide).
- Using the Velero CLI, set up the CRDs and deployment:
velero install \
--image velero/velero:v1.17.1 \
--plugins velero/velero-plugin-for-aws:v1.13.0 \
--no-default-backup-location \
--no-secret \
--use-volume-snapshots=false \
--wait
- Using the Velero Helm chart:
helm repo add vmware-tanzu https://vmware-tanzu.github.io/helm-charts
helm repo update
helm install velero vmware-tanzu/velero \
--namespace velero \
--version 11.2.0 \
--create-namespace \
--set "initContainers[0].name=velero-plugin-for-aws" \
--set "initContainers[0].image=velero/velero-plugin-for-aws:v1.13.0" \
--set "initContainers[0].volumeMounts[0].mountPath=/target" \
--set "initContainers[0].volumeMounts[0].name=plugins" \
--set-json configuration.backupStorageLocation='[]' \
--set "credentials.useSecret=false" \
--set "snapshotsEnabled=false" \
--wait
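Whichever method you use, Velero should end up running in the velero namespace with its CRDs registered. A quick sanity check, assuming the default velero namespace:

# The velero deployment should be Ready
kubectl get deployment velero -n velero

# The Velero CRDs used later during restore should be present
kubectl get crd backups.velero.io backupstoragelocations.velero.io restores.velero.io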
Configuring Backup#
Note
Backup is only supported for clusters that use S3-compatible object storage.
You can enable backups during installation or a later upgrade. Set the schedule with a cron expression in the values file:
global:
  _hopsworks:
    backups:
      enabled: true
      schedule: "@weekly"
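To apply the change to an existing installation, upgrade the Helm release with the updated values file. A minimal sketch, assuming the release and namespace are both named hopsworks and the chart reference is hopsworks/hopsworks (adjust these to your deployment):

helm upgrade hopsworks hopsworks/hopsworks \
  --namespace hopsworks \
  -f values.yaml \
  --wait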
After configuring backups, go to the cluster settings and open the Backup tab. If everything is configured correctly, the status shows as enabled at the top level and for each service.
If any service is misconfigured, the backup status shows as partial; for example, Velero is reported as disabled if it was not configured correctly. Fix partial backups before relying on them for recovery.
Cleanup#
Use the backup time-to-live (ttl) flag to automatically prune backups older than the configured duration.
global:
  _hopsworks:
    backups:
      enabled: true
      schedule: "@weekly"
      ttl: 60d
For S3 object storage, you can also configure a bucket lifecycle policy to expire old object versions. Example for AWS S3:
{
  "Rules": [
    {
      "ID": "HopsFSBlocksRetentionPolicy",
      "Status": "Enabled",
      "Filter": {},
      "Expiration": {
        "ExpiredObjectDeleteMarker": true
      },
      "NoncurrentVersionExpiration": {
        "NoncurrentDays": 60
      }
    }
  ]
}
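Assuming the policy above is saved as lifecycle.json, it can be applied with the AWS CLI (BUCKET_NAME is a placeholder for the HopsFS bucket):

aws s3api put-bucket-lifecycle-configuration \
  --bucket BUCKET_NAME \
  --lifecycle-configuration file://lifecycle.json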
Restore#
Note
Restore is only supported in a newly created cluster; in-place restore is not supported. Use the exact Hopsworks version that was used to create the backup.
The restore process has two phases:
- Restore Kubernetes objects required for the cluster restore.
- Install the cluster with Helm using the correct backup IDs.
Restore Kubernetes objects#
Restore the Kubernetes objects that were backed up using Velero.
- Ensure that Velero is installed and configured with the AWS plugin as described in the prerequisites.
- Set up a Velero backup storage location that points to the S3 bucket.
  - If you are using AWS S3 and access is controlled by an IAM role:

kubectl apply -f - <<EOF
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: hopsworks-bsl
  namespace: velero
spec:
  provider: aws
  config:
    region: REGION
  objectStorage:
    bucket: BUCKET_NAME
    prefix: k8s_backup
EOF

  - If you are using an S3-compatible object storage, provide credentials and an endpoint:

cat << EOF > hopsworks-bsl-credentials
[default]
aws_access_key_id=YOUR_ACCESS_KEY
aws_secret_access_key=YOUR_SECRET_KEY
EOF

kubectl create secret generic -n velero hopsworks-bsl-credentials --from-file=cloud=hopsworks-bsl-credentials

kubectl apply -f - <<EOF
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: hopsworks-bsl
  namespace: velero
spec:
  provider: aws
  config:
    region: REGION
    s3Url: ENDPOINT
  credential:
    key: cloud
    name: hopsworks-bsl-credentials
  objectStorage:
    bucket: BUCKET_NAME
    prefix: k8s_backup
EOF

- After the backup storage location becomes available, restore the backups. The following script restores the latest available backup. To restore a specific backup, set backupName instead of scheduleName.
echo "=== Waiting for Velero BackupStorageLocation hopsworks-bsl to become Available ==="
until [ "$(kubectl get backupstoragelocations hopsworks-bsl -n velero -o jsonpath='{.status.phase}' 2>/dev/null)" = "Available" ]; do
echo "Still waiting..."; sleep 5;
done
echo "=== Waiting for Velero to sync the backups from hopsworks-bsl ==="
until [ "$(kubectl get backups -n velero -ojson | jq -r '[.items[] | select(.spec.storageLocation == "hopsworks-bsl")] | length' 2>/dev/null)" != "0" ]; do
echo "Still waiting..."; sleep 5;
done
# Restores the latest backup; to restore a specific backup, set backupName instead of scheduleName
echo "=== Creating Velero Restore object for k8s-backups-main ==="
RESTORE_SUFFIX=$(date +%s)
kubectl apply -f - <<EOF
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: k8s-backups-main-restore-$RESTORE_SUFFIX
  namespace: velero
spec:
  scheduleName: k8s-backups-main
EOF
echo "=== Waiting for Velero restore to finish ==="
until [ "$(kubectl get restore k8s-backups-main-restore-$RESTORE_SUFFIX -n velero -o jsonpath='{.status.phase}' 2>/dev/null)" = "Completed" ]; do
echo "Still waiting..."; sleep 5;
done
# Restores the latest backup; to restore a specific backup, set backupName instead of scheduleName
echo "=== Creating Velero Restore object for k8s-backups-users-resources ==="
kubectl apply -f - <<EOF
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: k8s-backups-users-resources-restore-$RESTORE_SUFFIX
  namespace: velero
spec:
  scheduleName: k8s-backups-users-resources
EOF
echo "=== Waiting for Velero restore to finish ==="
until [ "$(kubectl get restore k8s-backups-users-resources-restore-$RESTORE_SUFFIX -n velero -o jsonpath='{.status.phase}' 2>/dev/null)" = "Completed" ]; do
echo "Still waiting..."; sleep 5;
done
After the restore completes, verify the restored resources in Kubernetes. RonDB and Opensearch store their backup metadata in the rondb-backups-metadata and opensearch-backups-metadata configmaps. Use the commands below to list successful backup IDs (newest first) that can be referenced during cluster installation.
kubectl get configmap rondb-backups-metadata -n hopsworks -o json \
| jq -r '.data | to_entries[] | select(.value | fromjson | .state == "SUCCESS") | .key' \
| sort -nr
kubectl get configmap opensearch-backups-metadata -n hopsworks -o json \
| jq -r '.data | to_entries[] | select(.value | fromjson | .state == "SUCCESS") | .key' \
| sort -nr
Restore on Cluster installation#
To restore a cluster during installation, configure the backup ID in the values YAML file:
global:
  _hopsworks:
    backups:
      enabled: true
      schedule: "@weekly"
    restoreFromBackup:
      backupId: "254811200"
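The restore then runs as part of the installation. A minimal sketch of the install command, assuming the release and namespace are both named hopsworks, the chart reference is hopsworks/hopsworks, and the values above are stored in values.yaml (adjust these to your deployment):

helm install hopsworks hopsworks/hopsworks \
  --namespace hopsworks \
  --create-namespace \
  -f values.yaml \
  --wait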
Customizations#
Warning
Even if you override the backup IDs for RonDB and Opensearch, you must still set .global._hopsworks.restoreFromBackup.backupId to ensure HopsFS is restored.
To restore a different backup ID for RonDB:
global:
  _hopsworks:
    backups:
      enabled: true
      schedule: "@weekly"
    restoreFromBackup:
      backupId: "254811200"
rondb:
  rondb:
    restoreFromBackup:
      backupId: "254811140"
To restore a different backup for Opensearch:
global:
  _hopsworks:
    backups:
      enabled: true
      schedule: "@weekly"
    restoreFromBackup:
      backupId: "254811200"
olk:
  opensearch:
    restore:
      repositories:
        default:
          snapshots:
            default:
              snapshot_name: "254811140"
You can also customize the Opensearch restore process to skip specific indices:
global:
  _hopsworks:
    backups:
      enabled: true
      schedule: "@weekly"
    restoreFromBackup:
      backupId: "254811200"
olk:
  opensearch:
    restore:
      repositories:
        default:
          snapshots:
            default:
              snapshot_name: "254811140"
              payload:
                indices: "-myindex"