Hopsworks Model Serving REST API#

Introduction#

Hopsworks provides model serving capabilities by leveraging KServe as the model serving platform and Istio as the ingress gateway to the model deployments.

This document explains how to interact with a model deployment via REST API.

Tutorials

End-to-end examples are available in the hopsworks-tutorials repository.

Sending Inference Requests through Istio Ingress#

The full inference URL is constructed by combining a base path with a model server-specific suffix. See URL Paths for the complete URL format and examples.

Authentication#

All requests must include an API Key for authentication. You can create an API key by following this guide.

Include the key in the authorization header:

authorization: ApiKey <API_KEY_VALUE>

Headers#

Header	Description	Example Value
`authorization`	API key for authentication.	`ApiKey <your_api_key>`
`content-type`	Request payload type (always JSON).	`application/json`
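The two headers in the table can be assembled once and reused for every request. The snippet below is a minimal sketch; `build_headers` is a hypothetical helper name (not part of the Hopsworks API) and the key value is a placeholder.

```python
def build_headers(api_key: str) -> dict[str, str]:
    """Return the authorization and content-type headers described above."""
    if not api_key:
        raise ValueError("api_key must be a non-empty string")
    return {
        # The scheme is the literal word "ApiKey", followed by a space and the key.
        "authorization": f"ApiKey {api_key}",
        # Request payloads are always JSON.
        "content-type": "application/json",
    }


# Placeholder key for illustration only; use a key created in Hopsworks.
headers = build_headers("my-secret-key")
print(headers)
```

The resulting dictionary can be passed as the `headers` argument of `requests.post`.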

URL Paths#

Deployed models are accessible through the Istio ingress gateway using path-based routing. The full URL is constructed by combining the base path with a model server-specific suffix. This URL is also provided on the model deployment page in the Hopsworks UI.

<base_url>/<server-specific_suffix>

Here, <server-specific_suffix> depends on the model server type (see ML Inference Paths or OpenAI-compatible Paths).

Base URL#

The base URL is composed of the Istio ingress gateway IP, the project name, and the deployment name.

https://<ISTIO_GATEWAY_IP>/v1/<project_name>/<deployment_name>
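As a sketch of how the three components combine, the function below builds the base URL from its parts. `build_base_url` and the example values are hypothetical and only illustrate the format shown above; percent-encoding the path segments is an added precaution, not something the source prescribes.

```python
from urllib.parse import quote


def build_base_url(gateway_host: str, project_name: str, deployment_name: str) -> str:
    """Compose https://<ISTIO_GATEWAY_IP>/v1/<project_name>/<deployment_name>."""
    # Percent-encode each path segment so unexpected characters cannot
    # alter the path structure.
    project = quote(project_name, safe="")
    deployment = quote(deployment_name, safe="")
    return f"https://{gateway_host}/v1/{project}/{deployment}"


# Example values are placeholders.
print(build_base_url("10.0.0.1", "my_project", "fraud"))
```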

Host-based routing (legacy)

Before path-based routing was introduced, requests were routed using a Host header matching the model deployment hostname, with https://<ISTIO_GATEWAY_IP> as the base URL.

Host: <deployment-name>.<project-name>.<knative-domain-name>

Each model deployment gets its own Knative-generated hostname, and routing depends on the Host header matching Istio ingress gateway rules.

Path-based routing (described above) is the preferred method for external access.
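For clusters that still rely on host-based routing, the request differs in two places: the URL carries no project or deployment segment, and a Host header selects the deployment. The following is a rough sketch under those assumptions; `build_legacy_request` is a hypothetical helper, all values are placeholders, and the `:predict` suffix assumes a KServe V1 model server.

```python
def build_legacy_request(
    gateway_host: str,
    deployment_name: str,
    project_name: str,
    knative_domain: str,
    model_name: str,
) -> tuple[str, dict[str, str]]:
    """Return (url, extra_headers) for a host-routed predict request."""
    # With host-based routing the base URL is just the gateway address.
    url = f"https://{gateway_host}/v1/models/{model_name}:predict"
    # Host: <deployment-name>.<project-name>.<knative-domain-name>
    headers = {"Host": f"{deployment_name}.{project_name}.{knative_domain}"}
    return url, headers


url, extra_headers = build_legacy_request(
    "10.0.0.1", "fraud", "myproject", "example.com", "fraud"
)
print(url, extra_headers)
```

The returned headers would be merged with the authorization and content-type headers before sending the request.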

ML Inference#

For model deployments using Python, KServe sklearnserver, or TensorFlow Serving, the URL follows the KServe V1 inference protocol.

Supported verbs and path format

Model Server	Supported Verbs	Path Format
Python	`predict`	`<base_url>/v1/models/<name>:<verb>`
KServe sklearnserver	`predict`	`<base_url>/v1/models/<name>:<verb>`
TensorFlow Serving	`predict`, `classify`, `regress`	`<base_url>/v1/models/<name>:<verb>`
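The table can be read as a small lookup from model server to allowed verbs. The sketch below encodes it that way and rejects unsupported combinations before a request is sent. The dictionary keys (`python`, `sklearnserver`, `tensorflow`) and the function name are illustrative labels chosen here, not identifiers from Hopsworks.

```python
# Verbs per model server, copied from the table above.
SUPPORTED_VERBS = {
    "python": {"predict"},
    "sklearnserver": {"predict"},
    "tensorflow": {"predict", "classify", "regress"},
}


def build_inference_url(
    base_url: str, model_name: str, verb: str = "predict", model_server: str = "python"
) -> str:
    """Compose <base_url>/v1/models/<name>:<verb> after validating the verb."""
    allowed = SUPPORTED_VERBS[model_server]
    if verb not in allowed:
        raise ValueError(
            f"verb {verb!r} is not supported by {model_server!r}; "
            f"expected one of {sorted(allowed)}"
        )
    return f"{base_url.rstrip('/')}/v1/models/{model_name}:{verb}"


print(build_inference_url("https://10.0.0.1/v1/my_project/fraud", "fraud"))
```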

Hopsworks Python API

ML inference URLs can be retrieved using the Deployment class.

# Returns: https://<istio-host>/v1/<project>/<deployment>/v1/models/<name>:predict
inference_url = deployment.get_inference_url()

OpenAI-compatible#

vLLM deployments provide an OpenAI API-compatible endpoint at <base_url>/v1/, allowing you to send any standard OpenAI API request to the vLLM server.

For example, the Chat Completions endpoint:

<base_url>/v1/chat/completions

Refer to the official vLLM OpenAI-compatible server documentation for details about the available APIs.

Hopsworks Python API

OpenAI-compatible URLs can be retrieved using the Deployment class.

# Returns: https://<istio-host>/v1/<project>/<deployment>/v1
# Append /chat/completions or /completions for specific endpoints
openai_url = deployment.get_openai_url()
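Appending an endpoint to the returned URL is plain string concatenation, but stray slashes are easy to get wrong. A small sketch, where `openai_endpoint` is a hypothetical helper and the URL is a placeholder standing in for the value returned by `get_openai_url()`:

```python
def openai_endpoint(openai_url: str, endpoint: str) -> str:
    """Join the OpenAI-compatible base URL and an endpoint with exactly one slash."""
    return f"{openai_url.rstrip('/')}/{endpoint.lstrip('/')}"


# Placeholder standing in for deployment.get_openai_url().
openai_url = "https://10.0.0.1/v1/my_project/my-llm/v1"
print(openai_endpoint(openai_url, "chat/completions"))
print(openai_endpoint(openai_url, "/completions"))
```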

Request Format#

The request format depends on the model server being used.

For predictive inference (TensorFlow Serving, KServe sklearnserver, or the Python model server), the request body must be a JSON object containing either an inputs or an instances field. See more information on the request format.
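A minimal sketch of building such a body, assuming the model expects a batch of rows (a list of lists). The helper name `build_payload` is hypothetical, and the actual shape of each row depends on the deployed model.

```python
import json


def build_payload(rows, key: str = "inputs") -> dict:
    """Wrap a batch of input rows under the "inputs" or "instances" field."""
    if key not in ("inputs", "instances"):
        raise ValueError('key must be "inputs" or "instances"')
    # Convert each row to a plain list so the result is JSON-serializable
    # even when rows are tuples.
    return {key: [list(row) for row in rows]}


payload = build_payload([(4641025220953719, 4920355418495856)])
print(json.dumps(payload))
```

The dictionary can be passed directly as the `json` argument of `requests.post`.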

REST API example for predictive inference (TensorFlow Serving, KServe sklearnserver, or Python model server)

Python, followed by the equivalent curl command:

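import requests

# Model inputs; the expected shape depends on the deployed model.
data = {"inputs": [[4641025220953719, 4920355418495856]]}

headers = {"authorization": "ApiKey <your_api_key>", "content-type": "application/json"}

# POST to the predict endpoint of the "fraud" deployment in "my_project".
response = requests.post(
    "https://<ISTIO_GATEWAY_IP>/v1/my_project/fraud/v1/models/fraud:predict",
    headers=headers,
    json=data,
)
response.raise_for_status()  # fail early on HTTP errors such as 401 or 404
print(response.json())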

curl -X POST "https://<ISTIO_GATEWAY_IP>/v1/my_project/fraud/v1/models/fraud:predict" \
  -H "authorization: ApiKey <your_api_key>" \
  -H "content-type: application/json" \
  -d '{
        "inputs": [
          [4641025220953719, 4920355418495856]
        ]
      }'

For generative inference (vLLM), the request follows the OpenAI specification supported by the vLLM OpenAI-compatible server.

vLLM chat completions

Python, followed by the equivalent curl command:

import requests


data = {
    "model": "my-llm",
    "messages": [{"role": "user", "content": "Hello, how are you?"}],
}

headers = {"authorization": "ApiKey <your_api_key>", "content-type": "application/json"}

response = requests.post(
    "https://<ISTIO_GATEWAY_IP>/v1/my_project/my-llm/v1/chat/completions",
    headers=headers,
    json=data,
)
print(response.json())

curl -X POST "https://<ISTIO_GATEWAY_IP>/v1/my_project/my-llm/v1/chat/completions" \
  -H "authorization: ApiKey <your_api_key>" \
  -H "content-type: application/json" \
  -d '{
        "model": "my-llm",
        "messages": [
          {"role": "user", "content": "Hello, how are you?"}
        ]
      }'

CORS#

The Istio EnvoyFilter handles CORS preflight (OPTIONS) requests automatically. Allowed origins can be configured via istio.envoyFilter.corsAllowedOrigins in the Helm chart configuration.
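Browsers send the preflight request on their own, but it can be useful to reproduce one by hand when checking the allowed-origins configuration. The sketch below only constructs such a request with the standard library and does not send it. The function name, URL, and origin are placeholders; the header names are the standard CORS preflight headers.

```python
from urllib.request import Request


def build_preflight(url: str, origin: str) -> Request:
    """Build (but do not send) a CORS preflight request for a JSON POST."""
    return Request(
        url,
        method="OPTIONS",
        headers={
            # The origin of the web page that would issue the real request.
            "Origin": origin,
            # Method and headers the real request intends to use.
            "Access-Control-Request-Method": "POST",
            "Access-Control-Request-Headers": "authorization,content-type",
        },
    )


req = build_preflight(
    "https://10.0.0.1/v1/my_project/fraud/v1/models/fraud:predict",
    "https://app.example.com",
)
print(req.get_method(), req.full_url)
```

Sending it with `urllib.request.urlopen(req)` and inspecting the `Access-Control-Allow-Origin` response header would show whether the origin is accepted.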

Response#

The model deployment returns predictions as a JSON object. The exact response schema depends on the model server implementation.
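As an illustration only, the helpers below read the two response shapes most likely to be encountered: a top-level `predictions` field, as defined by the KServe V1 protocol, and the `choices` list of an OpenAI-style chat completion. Both shapes are assumptions about a given deployment; a custom Python model server may return a different structure. The function names are hypothetical.

```python
def extract_predictions(body: dict):
    """Return the "predictions" field of a KServe V1-style response."""
    if "predictions" not in body:
        raise KeyError(f"no 'predictions' field; got keys {sorted(body)}")
    return body["predictions"]


def extract_chat_text(body: dict) -> str:
    """Return the first assistant message of an OpenAI-style chat completion."""
    return body["choices"][0]["message"]["content"]


# Hand-written sample bodies, not real model output.
print(extract_predictions({"predictions": [[0.1, 0.9]]}))
print(extract_chat_text({"choices": [{"message": {"role": "assistant", "content": "Hi!"}}]}))
```

Either helper would be applied to the result of `response.json()`.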