
How To Configure Scaling For A Deployment#

Introduction#

This guide explains how to set up autoscaling for model deployments using either the web UI or the Python API.

Deployments use the Knative Pod Autoscaler (KPA) to automatically scale the number of replicas based on traffic. Autoscaling lets a deployment use resources more efficiently by growing and shrinking its allocated replicas according to actual, real-time load.

See Scale metrics and Scaling parameters for details on the available scaling options.

Web UI#

Step 1: Create new deployment#

If you have at least one model already trained and saved in the Model Registry, navigate to the deployments page by clicking on the Deployments tab on the navigation menu on the left.

Deployments navigation tab

Once on the deployments page, you can create a new deployment by clicking either New deployment (if there are no existing deployments) or Create new deployment in the top-right corner. Both options open the deployment creation form.

Step 2: Go to advanced options#

A simplified creation form appears, showing the most common deployment fields across all available configurations. Autoscaling is part of a deployment's advanced options. To open the advanced creation form, click Advanced options.

Advanced options. Go to advanced deployment creation form

Step 3: Configure autoscaling#

In the Autoscaling section of the advanced form, you can configure the scaling parameters for the predictor and/or the transformer (if available). You can set the scale metric, target value, minimum and maximum instances, as well as the panic and stable window parameters.

Autoscaling configuration for the predictor and transformer

Once you are done with the changes, click on Create new deployment at the bottom of the page to create the deployment for your model.

Code#

Step 1: Connect to Hopsworks#

```python
import hopsworks

project = hopsworks.login()

# get Hopsworks Model Registry handle
mr = project.get_model_registry()

# get Hopsworks Model Serving handle
ms = project.get_model_serving()
```

Step 2: Define the predictor scaling configuration#

You can use the PredictorScalingConfig class to configure the scaling options according to your preferences. Default values for the scale metrics and parameters are listed in the Scale metrics and Scaling parameters sections below.

```python
from hsml.scaling_config import PredictorScalingConfig

predictor_scaling = PredictorScalingConfig(
    min_instances=1, max_instances=5, scale_metric="RPS", target=100
)
```

Step 3 (Optional): Define the transformer scaling configuration#

If a transformer script is also provided, you can use the TransformerScalingConfig class to configure the scaling options according to your preferences. Default values for the scale metrics and parameters are listed in the Scale metrics and Scaling parameters sections below.

```python
from hsml.scaling_config import TransformerScalingConfig

transformer_scaling = TransformerScalingConfig(
    min_instances=1, max_instances=3, scale_metric="CONCURRENCY", target=50
)
```

Step 4: Create a deployment with the scaling configuration#

```python
my_model = mr.get_model("my_model", version=1)

# optional: transformer with its own scaling configuration
my_transformer = ms.create_transformer(
    script_file="Resources/my_transformer.py",
    scaling_configuration=transformer_scaling,
)

my_deployment = my_model.deploy(
    scaling_configuration=predictor_scaling,
    # optional:
    transformer=my_transformer,
)
```

API Reference#

PredictorScalingConfig

TransformerScalingConfig

Scale metrics#

The autoscaler supports two metrics to determine when to scale. See Knative autoscaling metrics for more details.

| Scale Metric | Default Target | Description |
| --- | --- | --- |
| RPS | 200 | Requests per second per replica |
| CONCURRENCY | 100 | Concurrent requests per replica |
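As a rough sketch (not Knative's exact algorithm, which also averages the metric over the stable and panic windows), the autoscaler's sizing decision can be thought of as dividing the total observed metric by the per-replica target and clamping the result to the configured bounds:

```python
import math

def desired_replicas(total_rps: float, target_per_replica: float,
                     min_instances: int, max_instances: int) -> int:
    # Size the deployment so each replica handles at most the target load,
    # then clamp to the configured minimum and maximum instance counts.
    wanted = math.ceil(total_rps / target_per_replica)
    return max(min_instances, min(wanted, max_instances))

# 450 total RPS with the default target of 200 RPS per replica
print(desired_replicas(450, 200, min_instances=1, max_instances=5))  # 3
```

With a CONCURRENCY metric the same arithmetic applies, using in-flight requests instead of requests per second.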

Scaling parameters#

The following parameters can be used to fine-tune the autoscaling behavior. See scale bounds, autoscaling concepts and scale-to-zero in the Knative documentation for more details.

| Parameter | Default | Range | Description |
| --- | --- | --- | --- |
| minInstances | | ≥ 0 | Minimum replicas (0 enables scale-to-zero) |
| maxInstances | | ≥ 1 | Maximum replicas (cannot be less than min) |
| panicWindowPercentage | 10.0 | 1–100 | Panic window as a percentage of the stable window |
| stableWindowSeconds | 60 | 6–3600 | Stable window duration in seconds |
| panicThresholdPercentage | 200.0 | > 0 | Traffic threshold to trigger panic mode |
| scaleToZeroRetentionSeconds | 0 | ≥ 0 | Time to retain pods before scaling to zero |
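For intuition, the panic window is derived from the stable window: with the defaults above (a 60-second stable window and a 10% panic window percentage), panic decisions are evaluated over a 6-second window. A small sketch of that arithmetic:

```python
def panic_window_seconds(stable_window_seconds: float,
                         panic_window_percentage: float) -> float:
    # The panic window is expressed as a percentage of the stable window.
    return stable_window_seconds * panic_window_percentage / 100.0

print(panic_window_seconds(60, 10.0))  # 6.0 seconds with the defaults
```

A shorter panic window makes the autoscaler react faster to sudden traffic bursts, while the longer stable window smooths out normal scaling decisions.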

Cluster-level constraints

Administrators can set cluster-wide limits on the maximum and minimum number of instances. When the minimum is set to 0, scale-to-zero is enforced for all deployments.