Model Serving

In Hopsworks, you can easily deploy models from the model registry using KServe, the standard open-source framework for model serving on Kubernetes. You can deploy models programmatically using Model.deploy or via the UI. A KServe model deployment can include the following components:
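As a sketch of the programmatic path, the snippet below fetches a model from the registry and deploys it. The API usage follows the Hopsworks Python client, and the model name is a hypothetical example; it requires cluster access to actually run.

```python
def deploy_registered_model(model_name: str, version: int = 1):
    """Sketch: fetch a registered model and deploy it on KServe.

    Assumes the `hopsworks` Python package and access to a Hopsworks
    cluster; the model name passed in is illustrative.
    """
    import hopsworks  # imported here so the sketch stays self-contained

    project = hopsworks.login()           # authenticate against the cluster
    mr = project.get_model_registry()
    model = mr.get_model(model_name, version=version)

    deployment = model.deploy()           # creates the KServe deployment
    deployment.start()                    # starts the predictor pod(s)
    return deployment
```

Calling `deploy_registered_model("fraud_detector")` would create and start a deployment for version 1 of that model.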

Predictor (KServe component)

A predictor runs a model server (Python, TensorFlow Serving, or vLLM) that loads a trained model, handles inference requests, and returns predictions.
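For a Python model server, the predictor is typically a script defining a class that loads the model once and serves predictions. The class name and request shape below follow the common convention for custom Python predictors; the model-loading line is commented out and purely illustrative, with placeholder logic in its place so the sketch is self-contained.

```python
class Predict(object):
    """Sketch of a custom Python predictor script (names illustrative)."""

    def __init__(self):
        # In a real deployment the trained model would be loaded here, e.g.:
        # self.model = joblib.load("/path/to/model/artifact/model.pkl")
        # (path and loading library are illustrative)
        self.model = None

    def predict(self, inputs):
        # `inputs` is the list of instances from the inference request body.
        if self.model is None:
            # placeholder so the sketch runs without a real model artifact
            return [sum(row) for row in inputs]
        return self.model.predict(inputs).tolist()
```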

Transformer (KServe component)

A pre-processing and post-processing component that can transform model inputs before predictions are made, and transform predictions before they are returned to the client. Not available for vLLM deployments.
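A transformer script follows the preprocess/postprocess hook convention shown below. The payload keys (`instances`, `predictions`) follow the KServe v1 protocol; the scaling and labeling logic is a made-up example.

```python
class Transformer(object):
    """Sketch of a transformer script (hook names follow the KServe
    transformer convention; the transformations are illustrative)."""

    def preprocess(self, inputs):
        # Runs before the predictor: e.g. coerce raw payload values to floats.
        inputs["instances"] = [
            [float(x) for x in row] for row in inputs["instances"]
        ]
        return inputs

    def postprocess(self, outputs):
        # Runs after the predictor: e.g. map raw scores to labels.
        outputs["predictions"] = [
            "fraud" if p > 0.5 else "ok" for p in outputs["predictions"]
        ]
        return outputs
```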

Inference Logger

Hopsworks logs inputs and outputs of transformers and predictors to a Kafka topic that is part of the same project as the model. Not available for vLLM deployments.
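Logging can be configured when the deployment is created. The sketch below uses the `InferenceLogger` class from the hsml client; treat the class location and the `mode` values as assumptions to verify against your client version.

```python
def deploy_with_logging(model):
    """Sketch: enable request/response logging for a deployment.

    `InferenceLogger` and its `mode` parameter follow the hsml client
    and should be treated as illustrative.
    """
    from hsml.inference_logger import InferenceLogger

    # "ALL" logs both inputs and predictions; narrower modes can restrict
    # logging to one or the other.
    logger = InferenceLogger(mode="ALL")
    return model.deploy(inference_logger=logger)
```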

Inference Batcher

Inference requests can be batched to improve throughput (at the cost of slightly higher latency).
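Batching is likewise configured at deployment time. The `InferenceBatcher` class and parameter names below follow the hsml client, and the values are illustrative; larger batches raise throughput while the latency budget bounds how long a request waits in the queue.

```python
def deploy_with_batching(model):
    """Sketch: batch inference requests to trade latency for throughput.

    Class and parameter names follow the hsml client; values illustrative.
    """
    from hsml.inference_batcher import InferenceBatcher

    batcher = InferenceBatcher(
        enabled=True,
        max_batch_size=32,   # max requests grouped into one batch
        max_latency=500,     # how long to wait while filling a batch
        timeout=60,          # how long a queued request may wait overall
    )
    return model.deploy(inference_batcher=batcher)
```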

Istio Model Endpoint

You can publish a model over REST (HTTP) or gRPC using a Hopsworks API key, accessible via path-based routing through Istio. API keys are scoped to enforce least-privilege access to the resources managed by Hopsworks. For more details on path-based routing of requests through Istio, see the REST API Guide.
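A REST inference request can be assembled as below. The predict path follows the KServe v1 protocol and the `Authorization: ApiKey …` header follows the Hopsworks API key scheme, but the exact host and path layout on your cluster should be checked against the REST API Guide.

```python
import json


def build_inference_request(istio_host, deployment_name, api_key, instances):
    """Sketch: assemble a KServe v1 REST predict request for a deployment.

    URL layout is illustrative; verify the path-based routing scheme for
    your cluster in the REST API Guide.
    """
    url = f"https://{istio_host}/v1/models/{deployment_name}:predict"
    headers = {
        "Authorization": f"ApiKey {api_key}",  # Hopsworks API key scheme
        "Content-Type": "application/json",
    }
    body = json.dumps({"instances": instances})
    return url, headers, body
```

The returned triple can be passed to any HTTP client, e.g. `requests.post(url, headers=headers, data=body)`.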

Host-based routing

The Istio Model Endpoint supports host-based routing for inference requests; however, this approach is considered legacy. Path-based routing is recommended for new deployments.
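With host-based routing, the target deployment is selected via the HTTP `Host` header rather than the URL path. The hostname layout in this sketch is a hypothetical example of that pattern, shown only to contrast the two routing styles.

```python
def legacy_host_header(deployment_name, project_name, domain):
    """Sketch: legacy host-based routing selects the deployment via the
    Host header. The hostname layout below is illustrative only."""
    return {"Host": f"{deployment_name}.{project_name}.{domain}"}
```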

Models deployed on KServe in Hopsworks can be easily integrated with the Hopsworks Feature Store using either a Transformer or a Predictor Python script that builds the predictor's input feature vector from the application input and pre-computed features retrieved from the Feature Store.
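The transformer variant of this pattern can be sketched as follows. The feature view name (`fraud_features`) and entity key (`cc_num`) are hypothetical, and the feature-view calls follow the Hopsworks feature store client; running it requires cluster access.

```python
class Transformer(object):
    """Sketch: enrich requests with pre-computed features from the
    Feature Store (names and API usage illustrative)."""

    def __init__(self):
        import hopsworks  # requires the `hopsworks` package and cluster access

        project = hopsworks.login()
        fs = project.get_feature_store()
        # hypothetical feature view serving the model's input features
        self.fv = fs.get_feature_view("fraud_features", version=1)
        self.fv.init_serving()

    def preprocess(self, inputs):
        # The application sends only the entity key; the remaining input
        # features are fetched from the online feature store.
        inputs["instances"] = [
            self.fv.get_feature_vector({"cc_num": i["cc_num"]})
            for i in inputs["instances"]
        ]
        return inputs

    def postprocess(self, outputs):
        return outputs
```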

Model Serving Guide

More information can be found in the Model Serving guide.

Python deployments

For deploying Python scripts without a model artifact, see the Python Deployments page.