AWS SageMaker Integration#
Connecting to the Feature Store from SageMaker requires setting up a Feature Store API key for SageMaker and installing the HSFS library on SageMaker. This guide explains, step by step, how to connect to the Feature Store from SageMaker.
Generate an API key#
For instructions on how to generate an API key, follow this user guide. For the SageMaker integration to work, make sure you add the following scopes to your API key:
- featurestore
- project
- job
- kafka
Quickstart API key Argument#
API key as Argument
To get started quickly, without saving the Hopsworks API key in a secret store, you can simply supply it as an argument when instantiating a connection:
import hsfs
conn = hsfs.connection(
    host='my_instance',             # DNS of your Feature Store instance
    port=443,                       # Port to reach your Hopsworks instance, defaults to 443
    project='my_project',           # Name of your Hopsworks Feature Store project
    api_key_value='apikey',         # The API key to authenticate with Hopsworks
    hostname_verification=True      # Disable for self-signed certificates
)
fs = conn.get_feature_store() # Get the project's default feature store
Store the API key on AWS#
The API key now needs to be stored on AWS, so it can be retrieved from within SageMaker notebooks.
Identify your SageMaker role#
You need to know the IAM role used by your SageMaker instance to set up the API key for it. You can find it in the overview of your SageMaker notebook instance in the AWS Management Console.
In this example, the name of the role is AmazonSageMaker-ExecutionRole-20190511T072435.
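If you prefer to look up the role from inside a notebook, the following is a minimal sketch. It assumes the SageMaker Python SDK is available in the kernel and that its `get_execution_role()` call returns the role ARN. The helper names `role_name_from_arn` and `current_sagemaker_role_name` are hypothetical and only used for illustration.

```python
def role_name_from_arn(role_arn: str) -> str:
    """Return the role name, i.e. the last path segment of an IAM role ARN."""
    # An IAM role ARN looks like arn:aws:iam::<account>:role/<optional/path/><name>
    return role_arn.rsplit("/", 1)[-1]


def current_sagemaker_role_name() -> str:
    """Look up the execution role name of the running SageMaker notebook."""
    # Imported lazily: the SDK is only expected to exist inside SageMaker.
    import sagemaker

    return role_name_from_arn(sagemaker.get_execution_role())
```

Calling `current_sagemaker_role_name()` in a notebook cell should give the value to substitute for [MY_SAGEMAKER_ROLE] in the steps that follow.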
Store the API key#
You have two options to make your API key accessible from SageMaker:
Option 1: Using the AWS Systems Manager Parameter Store#
Store the API key in the AWS Systems Manager Parameter Store#
- In the AWS Management Console, ensure that your active region is the region you use for SageMaker.
- Go to the AWS Systems Manager, choose Parameter Store in the left navigation bar and select Create Parameter.
- As name, enter /hopsworks/role/[MY_SAGEMAKER_ROLE]/type/api-key, replacing [MY_SAGEMAKER_ROLE] with the AWS role used by the SageMaker instance that should access the Feature Store.
- Select Secure String as type, paste the API key generated earlier as the value and create the parameter.
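The same parameter can also be created programmatically. This is a minimal sketch assuming boto3 is installed and the caller has permission to write to the Parameter Store. The function names are hypothetical, and `Overwrite=True` is a choice made here so the call can be repeated.

```python
def api_key_parameter_name(role_name: str) -> str:
    """Build the parameter name expected for a given SageMaker role."""
    return f"/hopsworks/role/{role_name}/type/api-key"


def store_api_key_parameter(role_name: str, api_key: str, region_name: str) -> None:
    """Store the API key as a SecureString parameter in the given region."""
    import boto3  # imported lazily so the name helper works without boto3

    ssm = boto3.client("ssm", region_name=region_name)
    ssm.put_parameter(
        Name=api_key_parameter_name(role_name),
        Value=api_key,
        Type="SecureString",
        Overwrite=True,  # assumption: replace an existing value if present
    )
```

Use the region you use for SageMaker as `region_name`, as in the console steps.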
Grant access to the Parameter Store from the SageMaker notebook role#
- In the AWS Management Console, go to IAM, select Roles and then the role that is used when creating SageMaker notebook instances.
- Select Add inline policy.
- Choose Systems Manager as service, expand the Read access level and check GetParameter.
- Expand Resources and select Add ARN.
- Enter the region of the Systems Manager as well as the name of the parameter WITHOUT the leading slash, e.g. hopsworks/role/[MY_SAGEMAKER_ROLE]/type/api-key, and click Add.
- Click on Review, give the policy a name and click on Create policy.
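For reference, the inline policy produced by these console steps can be sketched as a JSON document. The example below is an illustration under assumptions: the account ID is a placeholder, the policy name passed to `attach_inline_policy` is hypothetical, and attaching it requires boto3 and IAM permissions.

```python
import json


def parameter_read_policy(region: str, account_id: str, role_name: str) -> dict:
    """Build an IAM policy allowing GetParameter on the API key parameter."""
    # The ARN uses the parameter name without its leading slash.
    resource = (
        f"arn:aws:ssm:{region}:{account_id}:parameter/"
        f"hopsworks/role/{role_name}/type/api-key"
    )
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": "ssm:GetParameter", "Resource": resource}
        ],
    }


def attach_inline_policy(role_name: str, policy_name: str, policy: dict) -> None:
    """Attach the policy to the role as an inline policy."""
    import boto3  # imported lazily; only needed when actually attaching

    iam = boto3.client("iam")
    iam.put_role_policy(
        RoleName=role_name,
        PolicyName=policy_name,
        PolicyDocument=json.dumps(policy),
    )
```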
Option 2: Using the AWS Secrets Manager#
Store the API key in the AWS Secrets Manager#
- In the AWS Management Console, ensure that your active region is the region you use for SageMaker.
- Go to the AWS Secrets Manager and select Store new secret.
- Select Other type of secrets, add api-key as the key and paste the API key created earlier as the value.
- Click next.
- As secret name, enter hopsworks/role/[MY_SAGEMAKER_ROLE], replacing [MY_SAGEMAKER_ROLE] with the AWS role used by the SageMaker instance that should access the Feature Store.
- Select next twice and finally store the secret.
- Then click on the secret in the secrets list and take note of the Secret ARN.
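The secret can alternatively be created with boto3. The sketch below assumes boto3 is installed and the caller may create secrets. The function names are hypothetical. It stores a JSON object with the single key api-key, mirroring the key/value pair entered in the console.

```python
import json


def secret_name(role_name: str) -> str:
    """Build the secret name expected for a given SageMaker role."""
    return f"hopsworks/role/{role_name}"


def secret_string(api_key: str) -> str:
    """Serialize the API key as a JSON object with the key 'api-key'."""
    return json.dumps({"api-key": api_key})


def store_api_key_secret(role_name: str, api_key: str, region_name: str) -> str:
    """Create the secret and return its ARN."""
    import boto3  # imported lazily so the helpers above work without boto3

    client = boto3.client("secretsmanager", region_name=region_name)
    response = client.create_secret(
        Name=secret_name(role_name),
        SecretString=secret_string(api_key),
    )
    return response["ARN"]
```

The returned ARN is the Secret ARN needed when granting access to the role.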
Grant access to the Secrets Manager from the SageMaker notebook role#
- In the AWS Management Console, go to IAM, select Roles and then the role that is used when creating SageMaker notebook instances.
- Select Add inline policy.
- Choose Secrets Manager as service, expand the Read access level and check GetSecretValue.
- Expand Resources and select Add ARN.
- Paste the ARN of the secret created in the previous step.
- Click on Review, give the policy a name and click on Create policy.
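To check from a SageMaker notebook that the role can actually read the secret, something like the following sketch may help. It assumes boto3 is available in the kernel and that the secret holds a JSON object with the key api-key, as created above. The function names are hypothetical, and this is only a manual check, not a description of how HSFS retrieves the key.

```python
import json


def api_key_from_secret_string(secret_string: str) -> str:
    """Extract the API key from the JSON stored in the secret."""
    return json.loads(secret_string)["api-key"]


def read_api_key_secret(secret_id: str, region_name: str) -> str:
    """Fetch the secret by name or ARN and return the API key it contains."""
    import boto3  # imported lazily; only needed for the actual AWS call

    client = boto3.client("secretsmanager", region_name=region_name)
    response = client.get_secret_value(SecretId=secret_id)
    return api_key_from_secret_string(response["SecretString"])
```

An access-denied error from `read_api_key_secret` would suggest that the inline policy is missing or points at a different ARN.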
Install HSFS#
To be able to access the Hopsworks Feature Store, the HSFS Python library needs to be installed. One way of achieving this is by opening a Python notebook in SageMaker and installing HSFS with a magic command and pip:
!pip install hsfs[python]~=[HOPSWORKS_VERSION]
Hive Dependencies
By default, HSFS assumes Spark/EMR is used as execution engine and therefore Hive dependencies are not installed. Hence, on AWS SageMaker, if you are planning to use a regular Python Kernel without Spark/EMR, make sure to install the "python" extra dependencies (hsfs[python]).
Matching Hopsworks version
The major version of HSFS needs to match the major version of Hopsworks.
Note that the library will not be persistent. For information on how to permanently install a library on SageMaker, see Install External Libraries and Kernels in Notebook Instances.
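As a quick sanity check of the version requirement, the sketch below compares major versions. It assumes version strings of the form MAJOR.MINOR.PATCH. The helper names are hypothetical, and the Hopsworks version has to be supplied by you since it is not looked up here.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def major_version(version_string: str) -> int:
    """Return the leading numeric component of a dotted version string."""
    return int(version_string.split(".")[0])


def same_major(hsfs_version: str, hopsworks_version: str) -> bool:
    """True if both version strings share the same major version."""
    return major_version(hsfs_version) == major_version(hopsworks_version)


def installed_hsfs_version() -> Optional[str]:
    """Return the installed HSFS version, or None if it is not installed."""
    try:
        return version("hsfs")
    except PackageNotFoundError:
        return None
```

Since the library is not persistent, `installed_hsfs_version()` returning `None` after a kernel or instance restart indicates that the pip install needs to be run again.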
Connect to the Feature Store#
You are now ready to connect to the Hopsworks Feature Store from SageMaker:
import hsfs
conn = hsfs.connection(
    'my_instance',                      # DNS of your Feature Store instance
    443,                                # Port to reach your Hopsworks instance, defaults to 443
    'my_project',                       # Name of your Hopsworks Feature Store project
    secrets_store='secretsmanager',     # Either parameterstore or secretsmanager
    hostname_verification=True,         # Disable for self-signed certificates
    engine='python'                     # Choose Python as engine if you haven't set up AWS EMR
)
fs = conn.get_feature_store() # Get the project's default feature store
Engine
HSFS uses either Apache Spark or Pandas/Polars on Python as an execution engine to perform queries against the feature store. Most AWS SageMaker kernels have PySpark installed but are not connected to AWS EMR by default, so the engine option of the connection lets you override the default behaviour. By default, HSFS will try to use Spark as engine if PySpark is available. If Spark/EMR is not configured, you will have to set the engine manually to "python". Please refer to the EMR integration guide to set up EMR with the Hopsworks Feature Store.
Ports
If you have trouble connecting, please ensure that the Security Group of your Hopsworks instance on AWS is configured to allow incoming traffic from your SageMaker instance on ports 443, 9083 and 9085. See VPC Security Groups for more information. If your SageMaker instances are not in the same VPC as your Hopsworks instance and the Hopsworks instance is not accessible from the internet, you will need to configure VPC Peering on AWS.
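To narrow down connectivity problems from within a notebook, a plain TCP check against the three ports can be sketched as follows. It uses only the Python standard library. The function names and the 3-second timeout are choices made for this illustration, and a successful TCP connection only shows that the port is reachable, not that authentication will succeed.

```python
import socket

# Ports named in this guide for traffic from SageMaker to Hopsworks.
HOPSWORKS_PORTS = (443, 9083, 9085)


def port_is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts and DNS failures
        return False


def check_ports(host: str, ports=HOPSWORKS_PORTS, timeout: float = 3.0) -> dict:
    """Map each port to whether it is reachable on the given host."""
    return {port: port_is_reachable(host, port, timeout) for port in ports}
```

For example, `check_ports('my_instance')` returns a dictionary keyed by port. Any `False` entry points at the Security Group or VPC configuration described above.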
Next Steps#
For more information about how to use the Feature Store, see the Quickstart Guide.