Model Serving
In Hopsworks, you can easily deploy models from the model registry using KServe, the standard open-source framework for model serving on Kubernetes. You can deploy models programmatically using Model.deploy or via the UI. A KServe model deployment can include the following components:
Predictor (KServe component)-
A predictor runs a model server (Python, TensorFlow Serving, or vLLM) that loads a trained model, handles inference requests and returns predictions.
Transformer (KServe component)-
A pre-processing and post-processing component that transforms model inputs before predictions are made, and transforms predictions before they are returned to the client. Not available for vLLM deployments.
Inference Logger-
Hopsworks logs inputs and outputs of transformers and predictors to a Kafka topic that is part of the same project as the model. Not available for vLLM deployments.
Inference Batcher-
Inference requests can be batched to improve throughput (at the cost of slightly higher latency).
Istio Model Endpoint-
You can publish a model over REST (HTTP) or gRPC, authenticated with a Hopsworks API key and reachable via path-based routing through Istio. API keys are scoped, so each key can be limited to the least privilege it needs over resources managed by Hopsworks. For more details on path-based routing of requests through Istio, see the REST API Guide.
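As a rough illustration of the programmatic route mentioned above, the sketch below looks up a model in the registry and calls Model.deploy on it. The helper name `deploy_from_registry` and the model and deployment names are hypothetical. The `name` keyword passed to `deploy` and the `hopsworks.login()` flow shown in the comments are assumptions about the client library that should be checked against the Model Serving guide.

```python
def deploy_from_registry(project, model_name, version, deployment_name=None):
    """Fetch a registered model and create a KServe deployment for it.

    `project` is expected to be a Hopsworks project handle, e.g. the
    object returned by `hopsworks.login()` (assumption, not verified here).
    """
    registry = project.get_model_registry()
    model = registry.get_model(model_name, version=version)

    # Only pass `name` when the caller supplied one, so the library's
    # default deployment name is used otherwise. The `name` keyword is
    # an assumption about the Model.deploy signature.
    kwargs = {} if deployment_name is None else {"name": deployment_name}
    return model.deploy(**kwargs)


# Hypothetical usage against a real cluster (not executed here):
#
#   import hopsworks
#   project = hopsworks.login()
#   deployment = deploy_from_registry(project, "fraudmodel", 1)
#   deployment.start()
```

Optional components such as the transformer, inference logger, and inference batcher would be configured on the deployment itself; this sketch leaves them at their defaults.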
Host-based routing
The Istio Model Endpoint supports host-based routing for inference requests; however, this approach is considered legacy. Path-based routing is recommended for new deployments.
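To make the path-based REST flow more concrete, here is a minimal sketch that builds (but does not send) an inference request using only the Python standard library. The `/v1/models/<name>:predict` path and the `{"instances": [...]}` body follow the KServe V1 inference protocol. The base URL, the model name, and the `Authorization: ApiKey <key>` header format are assumptions for illustration; the exact path prefix used by a given Hopsworks cluster is described in the REST API Guide.

```python
import json
import urllib.request


def build_predict_request(base_url, model_name, api_key, instances):
    """Build a KServe V1-style predict request for a deployed model.

    base_url   -- hypothetical Istio ingress URL for model serving
    model_name -- name of the deployment to call
    api_key    -- Hopsworks API key with a scope that allows serving
    instances  -- list of input rows for the model
    """
    url = f"{base_url.rstrip('/')}/v1/models/{model_name}:predict"
    body = json.dumps({"instances": instances}).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        # Assumed header format for Hopsworks API keys.
        "Authorization": f"ApiKey {api_key}",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


# Hypothetical usage (not executed here):
#
#   req = build_predict_request("https://serving.example.com", "fraudmodel",
#                               "<api-key>", [[10.0, 20.0, 3]])
#   with urllib.request.urlopen(req) as resp:
#       print(json.load(resp))
```

Because the model is addressed through the URL path rather than the `Host` header, the same ingress address can serve every deployment in the project.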
Models deployed on KServe in Hopsworks can be integrated with the Hopsworks Feature Store using either a Transformer or a Predictor Python script that builds the predictor's input feature vector from the application input and pre-computed features from the Feature Store.
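The following sketch shows one way such a Transformer script could combine the application input with pre-computed features. The `preprocess` and `postprocess` method names follow the usual shape of a transformer script. The `feature_lookup` argument, the `customer_id` key, and the feature names are hypothetical stand-ins for a real Feature Store lookup, so the example runs without a cluster.

```python
class Transformer:
    """Builds the predictor's feature vector from request data plus
    pre-computed features fetched by entity key."""

    # Order in which pre-computed features are appended (hypothetical names).
    PRECOMPUTED = ("avg_amount_7d", "txn_count_7d")

    def __init__(self, feature_lookup=None):
        # In a real script this is where a Feature Store connection would be
        # opened and a feature view retrieved. Here an injected callable
        # (entity key -> dict of features) stands in for that lookup.
        self._lookup = feature_lookup or (lambda key: {})

    def preprocess(self, inputs):
        vectors = []
        for instance in inputs["instances"]:
            stored = self._lookup(instance["customer_id"])
            # Application input first, then pre-computed features.
            vector = [instance["amount"]]
            vector += [stored.get(name) for name in self.PRECOMPUTED]
            vectors.append(vector)
        return {"instances": vectors}

    def postprocess(self, outputs):
        # Predictions are passed through unchanged in this sketch.
        return outputs
```

With a lookup that returns `{"avg_amount_7d": 20.0, "txn_count_7d": 3}` for customer 42, a request instance `{"customer_id": 42, "amount": 10.0}` becomes the vector `[10.0, 20.0, 3]`. The client therefore only sends the entity key and the values it knows at request time.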
Model Serving Guide
More information can be found in the Model Serving guide.
Python deployments
For deploying Python scripts without a model artifact, see the Python Deployments page.