Networking
For Spark on EMR to communicate with the Hopsworks Feature Store, networking must be set up correctly. Either deploy the Hopsworks Feature Store in the same VPC as the EMR cluster, or enable VPC peering between the VPC of the EMR cluster and the VPC of the Hopsworks Feature Store.
Step 1: Ensure network connectivity
The DataFrame API needs to connect directly to the IP address on which the Feature Store is listening. If you deploy the Feature Store on AWS, you therefore need to either deploy it in the same VPC as your EMR cluster or set up VPC peering between your EMR VPC and the Feature Store VPC.
Option 1: Deploy the Feature Store in the EMR VPC
When deploying the Hopsworks Feature Store, select the VPC and Availability Zone of your EMR cluster as the VPC and Availability Zone of the Feature Store. You can find your EMR VPC in the Summary of your EMR cluster.
Option 2: Set up VPC peering
Follow the VPC Peering guide to set up VPC peering between the Feature Store VPC and the EMR VPC. You can find your Feature Store VPC ID and CIDR by searching for the Feature Store VPC in the AWS Management Console.
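AWS does not allow a VPC peering connection between VPCs whose IPv4 CIDR blocks match or overlap, so it can help to compare the two CIDRs before starting. The following is a minimal sketch using only the Python standard library; the CIDR values are hypothetical placeholders, not values taken from this guide, and should be replaced with the ones shown in the AWS Management Console.

```python
import ipaddress


def cidrs_overlap(cidr_a: str, cidr_b: str) -> bool:
    """Return True if the two CIDR blocks share at least one address."""
    net_a = ipaddress.ip_network(cidr_a)
    net_b = ipaddress.ip_network(cidr_b)
    return net_a.overlaps(net_b)


# Hypothetical example values; substitute the CIDRs of your own VPCs.
emr_vpc_cidr = "10.0.0.0/16"
feature_store_vpc_cidr = "172.31.0.0/16"

if cidrs_overlap(emr_vpc_cidr, feature_store_vpc_cidr):
    print("CIDR blocks overlap: these VPCs cannot be peered as they are")
else:
    print("CIDR blocks do not overlap")
```

If the blocks overlap, peering is not an option for this pair of VPCs, and deploying the Feature Store in the EMR VPC (Option 1) may be the simpler route.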
Step 2: Configure the Security Group
The Feature Store Security Group must allow inbound traffic from your EMR clusters so that they can connect to the Feature Store.
Open your Feature Store instance under EC2 in the AWS Management Console and ensure that ports 443, 3306, 8020, 9083, 9085 and 30010 are reachable from the EMR Security Group.
To allow connectivity from EMR, open the Feature Store Security Group, add an Inbound rule for each port, and set the EMR master and core security groups as the source.
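The same inbound rules can also be created programmatically instead of through the console. The sketch below builds the rule list in plain Python and passes it to the boto3 EC2 client call `authorize_security_group_ingress`. It assumes boto3 is installed and AWS credentials are configured; the helper names and the security group IDs in the usage note are hypothetical.

```python
from typing import Dict, List

# Ports listed in this guide for the Feature Store.
FEATURE_STORE_PORTS = [443, 3306, 8020, 9083, 9085, 30010]


def build_ingress_permissions(ports: List[int], source_group_ids: List[str]) -> List[Dict]:
    """Build one TCP inbound rule per port, each allowing all given source groups."""
    return [
        {
            "IpProtocol": "tcp",
            "FromPort": port,
            "ToPort": port,
            "UserIdGroupPairs": [{"GroupId": gid} for gid in source_group_ids],
        }
        for port in ports
    ]


def authorize_emr_access(feature_store_sg_id: str, emr_sg_ids: List[str]) -> None:
    """Add the inbound rules to the Feature Store security group (needs boto3)."""
    import boto3  # third-party; imported here so the builder works without it

    ec2 = boto3.client("ec2")
    ec2.authorize_security_group_ingress(
        GroupId=feature_store_sg_id,
        IpPermissions=build_ingress_permissions(FEATURE_STORE_PORTS, emr_sg_ids),
    )
```

A call might look like `authorize_emr_access("sg-0aaaaaaaaaaaaaaaa", ["sg-0bbbbbbbbbbbbbbbb", "sg-0cccccccccccccccc"])`, where the first ID stands for the Feature Store security group and the list holds the EMR master and core security groups. All three IDs are placeholders.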
You can find your EMR security groups in the EMR cluster summary.
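Once the rules are in place, one way to confirm them is to attempt a TCP connection to each port from a node of the EMR cluster. The following is a rough sketch using only the Python standard library. The function name is hypothetical, and a successful connection only shows that the port is reachable, not that the service behind it is healthy.

```python
import socket
from typing import Dict, Iterable

# Ports listed in this guide for the Feature Store.
FEATURE_STORE_PORTS = (443, 3306, 8020, 9083, 9085, 30010)


def check_ports(host: str, ports: Iterable[int], timeout: float = 3.0) -> Dict[int, bool]:
    """Try a TCP connection to each port and record whether it succeeded."""
    results = {}
    for port in ports:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            # Covers refused connections, timeouts and name resolution errors.
            results[port] = False
    return results


def print_report(host: str) -> None:
    """Print one line per Feature Store port for the given host."""
    for port, reachable in check_ports(host, FEATURE_STORE_PORTS).items():
        status = "reachable" if reachable else "NOT reachable"
        print(f"{host}:{port} {status}")
```

Running `print_report("<feature-store-ip>")` on the EMR master node, with the placeholder replaced by the private IP of your Feature Store instance, prints one line per port. A port reported as not reachable usually points to a missing inbound rule or a routing problem between the two VPCs.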
Next Steps
Continue with the Configure EMR for the Hopsworks Feature Store guide to start using the Hopsworks Feature Store from EMR.