How To Configure Scaling For A Deployment#

Introduction#

This guide explains how to set up autoscaling for model deployments using either the web UI or the Python API.

Deployments use the Knative Pod Autoscaler (KPA) to automatically scale the number of replicas based on traffic. Autoscaling lets a deployment use resources more efficiently by growing and shrinking its allocated resources according to its actual, real-time usage.

See Scale metrics and Scaling parameters for details on the available scaling options.

Web UI#

Step 1: Create new deployment#

If you have at least one model already trained and saved in the Model Registry, navigate to the deployments page by clicking on the Deployments tab in the navigation menu on the left.

Once on the deployments page, you can create a new deployment by clicking either on New deployment (if there are no existing deployments) or on Create new deployment in the top-right corner. Both options open the deployment creation form.

Step 2: Go to advanced options#

A simplified creation form will appear including the most common deployment fields from all available configurations. Autoscaling is part of the advanced options of a deployment. To navigate to the advanced creation form, click on Advanced options.

Figure: Advanced options. Go to the advanced deployment creation form.

Step 3: Configure autoscaling#

In the Autoscaling section of the advanced form, you can configure the scaling parameters for the predictor and/or the transformer (if available). You can set the scale metric, target value, minimum and maximum instances, as well as the panic and stable window parameters.

Figure: Autoscaling configuration for the predictor and transformer components.

Once you are done with the changes, click on Create new deployment at the bottom of the page to create the deployment for your model.

Code#

Step 1: Connect to Hopsworks#

Python

import hopsworks


project = hopsworks.login()

# get Hopsworks Model Registry handle
mr = project.get_model_registry()

# get Hopsworks Model Serving handle
ms = project.get_model_serving()

Step 2: Define the predictor scaling configuration#

You can use the PredictorScalingConfig class to configure the scaling options according to your preferences. Default values for scaling metrics and parameters are listed in the Scale metrics and Scaling parameters sections below.

Python

from hsml.scaling_config import PredictorScalingConfig


predictor_scaling = PredictorScalingConfig(
    min_instances=1, max_instances=5, scale_metric="RPS", target=100
)

Step 3 (Optional): Define the transformer scaling configuration#

If a transformer script is also provided, you can use the TransformerScalingConfig class to configure its scaling options according to your preferences. Default values for scaling metrics and parameters are listed in the Scale metrics and Scaling parameters sections below.

Python

from hsml.scaling_config import TransformerScalingConfig


transformer_scaling = TransformerScalingConfig(
    min_instances=1, max_instances=3, scale_metric="CONCURRENCY", target=50
)

Step 4: Create a deployment with the scaling configuration#

Python

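# retrieve the trained model from the Model Registry
my_model = mr.get_model("my_model", version=1)

# optional: create a transformer with its own scaling configuration
my_transformer = ms.create_transformer(
    script_file="Resources/my_transformer.py",
    scaling_configuration=transformer_scaling,
)

# deploy the model with the predictor scaling configuration
my_deployment = my_model.deploy(
    scaling_configuration=predictor_scaling,
    # optional:
    transformer=my_transformer,
)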

API Reference#

PredictorScalingConfig

TransformerScalingConfig

Scale metrics#

The autoscaler supports two metrics to determine when to scale. See Knative autoscaling metrics for more details.

Scale Metric	Default Target	Description
RPS	200	Requests per second per replica
CONCURRENCY	100	Concurrent requests per replica
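
To build intuition for how a target value relates to replica count, the sketch below estimates how many replicas are needed so that each one handles at most the target load, bounded by the minimum and maximum instances. It is a simplified illustration only: the function `desired_replicas` and its argument names are hypothetical, and the actual Knative autoscaler averages the metric over time windows rather than using a single observation.

```python
import math


def desired_replicas(observed_load, target_per_replica, min_instances, max_instances):
    """Estimate a replica count for a given load (simplified, illustrative model).

    observed_load: total requests per second (RPS) or total in-flight requests
        (CONCURRENCY) across the deployment.
    target_per_replica: the scale metric target, e.g. 200 for RPS.
    """
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    if max_instances < min_instances:
        raise ValueError("max_instances cannot be less than min_instances")
    # Replicas needed so that no replica exceeds the target load.
    needed = math.ceil(observed_load / target_per_replica)
    # Keep the result within the configured scale bounds.
    return max(min_instances, min(max_instances, needed))


# 450 requests/s against an RPS target of 200, bounded to 1..5 replicas.
print(desired_replicas(450, 200, min_instances=1, max_instances=5))  # prints 3
```

With no traffic and `min_instances=0`, the same estimate returns 0, which corresponds to scale-to-zero.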

Scaling parameters#

The following parameters can be used to fine-tune the autoscaling behavior. See scale bounds, autoscaling concepts and scale-to-zero in the Knative documentation for more details.

Parameter	Default	Range	Description
`minInstances`	—	≥ 0	Minimum replicas (0 enables scale-to-zero)
`maxInstances`	—	≥ 1	Maximum replicas (cannot be less than min)
`panicWindowPercentage`	10.0	1–100	Panic window as percentage of stable window
`stableWindowSeconds`	60	6–3600	Stable window duration in seconds
`panicThresholdPercentage`	200.0	> 0	Traffic threshold to trigger panic mode
`scaleToZeroRetentionSeconds`	0	≥ 0	Time to retain pods before scaling to zero
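
As a worked example of how the window parameters relate, the sketch below derives the panic window duration from the stable window and checks whether a load level would cross the panic threshold. The helper names `panic_window_seconds` and `exceeds_panic_threshold` are hypothetical, and the threshold check assumes that panic mode is triggered when the load reaches the given percentage of what the current replicas can handle at their target; treat it as an approximation of the behavior described in the Knative documentation, not as its implementation.

```python
def panic_window_seconds(stable_window_seconds=60, panic_window_percentage=10.0):
    """Panic window duration, expressed as a percentage of the stable window."""
    if not 6 <= stable_window_seconds <= 3600:
        raise ValueError("stable_window_seconds must be between 6 and 3600")
    if not 1 <= panic_window_percentage <= 100:
        raise ValueError("panic_window_percentage must be between 1 and 100")
    return stable_window_seconds * panic_window_percentage / 100.0


def exceeds_panic_threshold(observed_load, replicas, target_per_replica,
                            panic_threshold_percentage=200.0):
    """Assumed rule: panic when load reaches the threshold percentage of capacity."""
    if panic_threshold_percentage <= 0:
        raise ValueError("panic_threshold_percentage must be greater than 0")
    capacity = replicas * target_per_replica
    return observed_load >= capacity * panic_threshold_percentage / 100.0


# With the defaults, the panic window is 10% of 60 seconds.
print(panic_window_seconds())  # prints 6.0

# 2 replicas with a target of 200 handle 400; 200% of that is 800.
print(exceeds_panic_threshold(900, replicas=2, target_per_replica=200))  # prints True
```

A shorter panic window makes the autoscaler react faster to bursts, at the cost of being more sensitive to short-lived spikes.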

Cluster-level constraints

Administrators can set cluster-wide limits on the minimum and maximum number of instances. When the cluster-wide minimum is set to 0, scale-to-zero is enforced for all deployments.