How-To set up a ADLS Storage Connector#
Introduction#
Azure Data Lake Storage (ADLS) Gen2 is a HDFS-compatible filesystem on Azure for data analytics. The ADLS Gen2 filesystem stores its data in Azure Blob storage, ensuring low-cost storage, high availability, and disaster recovery. In Hopsworks, you can access ADLS Gen2 by defining a Storage Connector and creating and granting permissions to a service principal.
In this guide, you will configure a Storage Connector in Hopsworks to save all the authentication information needed in order to set up a connection to your Azure ADLS filesystem. When you're finished, you'll be able to read files using Spark through HSFS APIs. You can also use the connector to write out training data from the Feature Store, in order to make it accessible by third parties.
Note
Currently, it is only possible to create storage connectors in the Hopsworks UI. You cannot create a storage connector programmatically.
Prerequisites#
Before you begin this guide you'll need to retrieve the following information from your Azure ADLS account:
- Data Lake Storage Gen2 Account: Create an Azure Data Lake Storage Gen2 account and initialize a filesystem, enabling the hierarchical namespace. Note that your storage account must belong to an Azure resource group.
- Azure AD application and service principal: Create an Azure AD application and service principal that can access your ADLS storage account and its resource group.
- Service Principal Registration: Register the service principal, granting it a role assignment such as Storage Blob Data Contributor, on the Azure Data Lake Storage Gen2 account.
Info
When you specify the 'container name' in the ADLS storage connector, you need to have previously created that container - the Hopsworks Feature Store will not create that storage container for you.
Creation in the UI#
Step 1: Set up new storage connector#
Head to the Storage Connector View on Hopsworks (1) and set up a new storage connector (2).
Step 2: Enter ADLS Information#
Enter the details for your ADLS connector. Start by giving it a name and an optional description.
Step 3: Azure Create an ADLS Resource#
When programmatically signing in, you need to pass the tenant ID with your authentication request and the application ID. You also need a certificate or an authentication key (described in the following section). To get those values, use the following steps:
- Select Azure Active Directory.
- From App registrations in Azure AD, select your application.
-
Copy the Directory (tenant) ID and store it in your application code.
-
Copy the Application ID and store it in your application code.
-
Create an Application Secret and copy it into the Service Credential field.
Common Problems#
If you get a permission denied error when writing or reading to/from a ADLS container, it is often because the storage principal (app) does not have the correct permissions. Have you added the "Storage Blob Data Owner" or "Storage Blob Data Contributor" role to the resource group for your storage account (or the subscription for your storage group, if you apply roles at the subscription level)? Go to your resource group, then in "Access Control (IAM)", click the "Add" button to add a "role assignment".
If you get an error "StatusCode=404 StatusDescription=The specified filesystem does not exist.", then maybe you have not created the storage account or the storage container.
References#
Next Steps#
Move on to the usage guide for storage connectors to see how you can use your newly created ADLS connector.