
Configure EMR for the Hopsworks Feature Store#

To enable EMR to access the Hopsworks Feature Store, you need to set up a Hopsworks API key and add a bootstrap action and configurations to your EMR cluster.

Info

Ensure Networking is set up correctly before proceeding with this guide.

Step 1: Set up a Hopsworks API key#

For EMR clusters to communicate with the Hopsworks Feature Store, the clients running on EMR need access to a Hopsworks API key.

Generate an API key#

In Hopsworks, click on your username in the top-right corner and select Account Settings to open the user settings. Select API. Give the key a name and select the project scope before creating the key. Copy the key into your clipboard for the next step.

Scopes

The API key should contain at least the following scopes:

  1. project

Generating an API key on Hopsworks
API keys can be created in the User Settings on Hopsworks

Info

You are only able to retrieve the API key once. If you did not manage to copy it to your clipboard, delete it and create a new one.

Store the API key in the AWS Secrets Manager#

In the AWS Management Console, ensure that your active region is the region you use for EMR. Go to the AWS Secrets Manager and select Store new secret. Select Other type of secrets, add api-key as the key, and paste the API key created in the previous step as the value. Click Next.

Store a Hopsworks API key in the Secrets Manager

As a secret name, enter hopsworks/featurestore. Select next twice and finally store the secret. Then click on the secret in the secrets list and take note of the Secret ARN.

Name the secret
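
The console steps above can also be scripted with the AWS CLI. The following is a minimal sketch, assuming the AWS CLI is installed and configured with permission to create secrets. The function names are illustrative and not part of Hopsworks, and the sketch assumes the API key contains no double quotes or backslashes, so it can be embedded in JSON without escaping.

```shell
#!/bin/bash
# Build the JSON payload that the bootstrap action later reads:
# a single "api-key" field holding the Hopsworks API key.
# Assumption: the key contains no characters that need JSON escaping.
build_secret_string() {
    printf '{"api-key":"%s"}' "$1"
}

# Hypothetical helper: create the secret and print its ARN.
# Usage: store_hopsworks_secret <api-key> <region>
store_hopsworks_secret() {
    local api_key=$1 region=$2
    aws secretsmanager create-secret \
        --region "$region" \
        --name hopsworks/featurestore \
        --secret-string "$(build_secret_string "$api_key")" \
        --query ARN --output text
}

# Example (not run here; the region must match your EMR region):
# store_hopsworks_secret "$HOPSWORKS_API_KEY" "$AWS_REGION"
```

The printed ARN is the value you need when granting the EMR EC2 instance profile access to the secret.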

Grant access to the secret to the EMR EC2 instance profile#

Identify your EMR EC2 instance profile in the EMR cluster summary:

Identify your EMR EC2 instance profile

In the AWS Management Console, go to IAM, select Roles and then the EC2 instance profile used by your EMR cluster. Select Add inline policy. Choose Secrets Manager as the service, expand the Read access level and check GetSecretValue. Expand Resources and select Add ARN. Paste the ARN of the secret created in the previous step. Click on Review, give the policy a name and click on Create policy.

Configure the access policy for the Secrets Manager
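
For reference, the inline policy created in the console corresponds roughly to the document printed below. This is a sketch, assuming the AWS CLI is configured with permission to edit the role. The function names and the policy name hopsworks-featurestore-secret are hypothetical.

```shell
#!/bin/bash
# Print an IAM policy document that allows reading exactly one secret.
# Usage: render_secret_policy <secret-arn>
render_secret_policy() {
    cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "secretsmanager:GetSecretValue",
      "Resource": "$1"
    }
  ]
}
EOF
}

# Hypothetical helper: attach the policy inline to the role behind the
# EMR EC2 instance profile.
# Usage: grant_secret_access <role-name> <secret-arn>
grant_secret_access() {
    aws iam put-role-policy \
        --role-name "$1" \
        --policy-name hopsworks-featurestore-secret \
        --policy-document "$(render_secret_policy "$2")"
}
```

Restricting Resource to the single secret ARN, rather than a wildcard, keeps the instance profile from reading other secrets in the account.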

Step 2: Configure your EMR cluster#

Add the Hopsworks Feature Store configuration to your EMR cluster#

For EMR to communicate with the Feature Store, you need to update the Hadoop and Spark configurations. Copy the configuration below and replace ip-XXX-XX-XX-XXX.XX-XXXX-X.compute.internal with the private DNS name of your Hopsworks master node.

[
  {
    "Classification": "hadoop-env",
    "Properties": {

    },
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "HADOOP_CLASSPATH": "$HADOOP_CLASSPATH:/usr/lib/hopsworks/client/*"
        },
        "Configurations": [

        ]
      }
    ]
  },
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.hadoop.hops.ipc.server.ssl.enabled": "true",
      "spark.hadoop.fs.hopsfs.impl": "io.hops.hopsfs.client.HopsFileSystem",
      "spark.hadoop.client.rpc.ssl.enabled.protocol": "TLSv1.2",
      "spark.hadoop.hops.ssl.hostname.verifier": "ALLOW_ALL",
      "spark.hadoop.hops.rpc.socket.factory.class.default": "io.hops.hadoop.shaded.org.apache.hadoop.net.HopsSSLSocketFactory",
      "spark.hadoop.hops.ssl.keystores.passwd.name": "/usr/lib/hopsworks/material_passwd",
      "spark.hadoop.hops.ssl.keystore.name": "/usr/lib/hopsworks/keyStore.jks",
      "spark.hadoop.hops.ssl.trustore.name": "/usr/lib/hopsworks/trustStore.jks",
      "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
      "spark.executor.extraClassPath": "/usr/lib/hopsworks/client/*",
      "spark.driver.extraClassPath": "/usr/lib/hopsworks/client/*",
      "spark.sql.hive.metastore.jars": "/usr/lib/hopsworks/apache-hive-bin/lib/*",
      "spark.hadoop.hive.metastore.uris": "thrift://ip-XXX-XX-XX-XXX.XX-XXXX-X.compute.internal:9083"
    }
  }
]

When you create your EMR cluster, add the configuration:

Note

Don't forget to replace ip-XXX-XX-XX-XXX.XX-XXXX-X.compute.internal with the private DNS name of your Hopsworks master node.

Configure EMR to access the Feature Store
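
If you keep the configuration in a file, the placeholder can be substituted with a small helper rather than by hand. This is a sketch only: the function name and file names are illustrative, and it assumes the DNS name contains only letters, digits, dots and hyphens, so it needs no escaping in the sed replacement.

```shell
#!/bin/bash
# Print a copy of the configuration with the placeholder host replaced.
# Usage: render_emr_config <template-file> <hopsworks-private-dns>
render_emr_config() {
    local template=$1 host=$2
    # Dots are escaped so that the placeholder is matched literally.
    sed "s/ip-XXX-XX-XX-XXX\.XX-XXXX-X\.compute\.internal/${host}/g" "$template"
}

# Example (file names are illustrative):
# render_emr_config hopsworks-emr-template.json "$HOPSWORKS_PRIVATE_DNS" > hopsworks-emr-config.json
```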

Add the Bootstrap Action to your EMR cluster#

EMR requires Hopsworks connectors to communicate with the Hopsworks Feature Store. These connectors can be installed with the bootstrap action shown below. Copy the content into a file named hopsworks.sh. Copy that file into any S3 bucket that is readable by your EMR clusters and take note of the S3 URI of that file, e.g. s3://my-emr-init/hopsworks.sh.

#!/bin/bash
# Bootstrap action: installs the Hopsworks client, the Hive libraries and the
# project credentials under /usr/lib/hopsworks on each EMR node.
set -e

if [ "$#" -ne 3 ]; then
    echo "Usage: hopsworks.sh HOPSWORKS_API_KEY_SECRET HOPSWORKS_HOST PROJECT_NAME"
    exit 1
fi

SECRET_NAME=$1
HOST=$2
PROJECT=$3

# Read the API key from the "api-key" field of the secret in AWS Secrets Manager.
API_KEY=$(aws secretsmanager get-secret-value --secret-id "$SECRET_NAME" | jq -r .SecretString | jq -r '.["api-key"]')

# Resolve the project name to its project ID.
PROJECT_ID=$(curl -H "Authorization: ApiKey ${API_KEY}" "https://$HOST/hopsworks-api/api/project/getProjectInfo/$PROJECT" | jq -r .projectId)

# Install the Python development headers; a failure here is tolerated.
sudo yum -y install python3-devel.x86_64 || true

sudo mkdir -p /usr/lib/hopsworks
sudo chown hadoop:hadoop /usr/lib/hopsworks
cd /usr/lib/hopsworks

# Download and unpack the Hopsworks client libraries for the project.
curl -o client.tar.gz -H "Authorization: ApiKey ${API_KEY}" "https://$HOST/hopsworks-api/api/project/$PROJECT_ID/client"

tar -xvf client.tar.gz
tar -xzf client/apache-hive-*-bin.tar.gz || true
mv apache-hive-*-bin apache-hive-bin
rm client.tar.gz
rm client/apache-hive-*-bin.tar.gz

# Fetch the project key store, trust store and their password.
curl -H "Authorization: ApiKey ${API_KEY}" "https://$HOST/hopsworks-api/api/project/$PROJECT_ID/credentials" | jq -r .kStore | base64 -d > keyStore.jks

curl -H "Authorization: ApiKey ${API_KEY}" "https://$HOST/hopsworks-api/api/project/$PROJECT_ID/credentials" | jq -r .tStore | base64 -d > trustStore.jks

echo -n "$(curl -H "Authorization: ApiKey ${API_KEY}" "https://$HOST/hopsworks-api/api/project/$PROJECT_ID/credentials" | jq -r .password)" > material_passwd

# Keep the credentials unreadable for other users.
chmod -R o-rwx /usr/lib/hopsworks

sudo pip3 install --upgrade hsfs~=X.X.0

Note

Don't forget to replace X.X.0 with the major and minor version of your Hopsworks deployment.

HSFS version needs to match the major version of Hopsworks
To find your Hopsworks version, enter any of your projects and go to the settings tab inside your project.

Add the bootstrap action when configuring your EMR cluster. Provide three arguments to the bootstrap action: the name of the API key secret, e.g. hopsworks/featurestore; the public DNS name of your Hopsworks cluster, e.g. ad005770-33b5-11eb-b5a7-bfabd757769f.cloud.hopsworks.ai; and the name of your Hopsworks project, e.g. demo_fs_meb10179.

Set the bootstrap action for EMR
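
If you create clusters from the command line instead of the console, the same bootstrap action and configuration can be passed to aws emr create-cluster. The sketch below is not an official Hopsworks procedure. The function names, the cluster name, the instance defaults and the EMR_* environment variables are all assumptions, and the configuration is assumed to be saved locally as hopsworks-emr-config.json.

```shell
#!/bin/bash
# Build the shorthand value for --bootstrap-actions from the script URI and
# its three arguments (secret name, Hopsworks host, project name).
# Usage: bootstrap_action_arg <s3-uri> <secret-name> <host> <project>
bootstrap_action_arg() {
    local script_uri=$1 secret=$2 host=$3 project=$4
    printf 'Path=%s,Args=[%s,%s,%s]' "$script_uri" "$secret" "$host" "$project"
}

# Hypothetical wrapper around the cluster creation call. The release label,
# service role and instance profile must be supplied through the environment;
# instance type and count fall back to illustrative defaults.
# Usage: create_hopsworks_emr_cluster <s3-uri> <secret-name> <host> <project>
create_hopsworks_emr_cluster() {
    aws emr create-cluster \
        --name hopsworks-featurestore \
        --release-label "${EMR_RELEASE_LABEL:?set EMR_RELEASE_LABEL}" \
        --applications Name=Spark \
        --instance-type "${EMR_INSTANCE_TYPE:-m5.xlarge}" \
        --instance-count "${EMR_INSTANCE_COUNT:-3}" \
        --service-role "${EMR_SERVICE_ROLE:?set EMR_SERVICE_ROLE}" \
        --ec2-attributes "InstanceProfile=${EMR_INSTANCE_PROFILE:?set EMR_INSTANCE_PROFILE}" \
        --configurations file://hopsworks-emr-config.json \
        --bootstrap-actions "$(bootstrap_action_arg "$@")"
}
```

The instance profile passed here should be the one that was granted GetSecretValue on the secret, otherwise the bootstrap action cannot read the API key.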

Your EMR cluster will now be able to access your Hopsworks Feature Store.
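
To check that the bootstrap action completed, you can look for the files it is expected to leave behind on a cluster node. The helper below is a hypothetical sketch based only on the paths used in hopsworks.sh. Run it as the hadoop user or with sudo, since the bootstrap action removes access for other users.

```shell
#!/bin/bash
# Check that the bootstrap action left the expected files in place.
# Returns 0 when everything is present, 1 otherwise.
# Usage: verify_hopsworks_material [directory]   (default: /usr/lib/hopsworks)
verify_hopsworks_material() {
    local dir=${1:-/usr/lib/hopsworks} missing=0 path
    for path in client apache-hive-bin keyStore.jks trustStore.jks material_passwd; do
        if [ ! -e "$dir/$path" ]; then
            echo "missing: $dir/$path" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example (on a cluster node): verify_hopsworks_material && echo "Hopsworks client material found"
```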

Next Steps#

Use the Connection API to connect to the Hopsworks Feature Store. For more information about how to use the Feature Store, see the Quickstart Guide.