Compute Engines#
To execute a feature pipeline that writes to the Feature Store, or to retrieve data from it, you need a compute engine. The Hopsworks Feature Store APIs are built around dataframes: feature data is inserted into the Feature Store from a dataframe, and data read from the Feature Store is returned as a dataframe (a minimal sketch follows the list of engines below).
As such, Hopsworks supports three computational engines:
- Apache Spark: Spark Dataframes and Spark Structured Streaming Dataframes are supported, both from Python environments (PySpark) and from Scala environments.
- Pandas: For pure Python environments without dependencies on Spark, Hopsworks supports Pandas Dataframes.
- Apache Flink (experimental): Flink DataStreams are currently supported as an experimental feature; reach out to Hopsworks for guidance.
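Regardless of engine, writing and reading is the same dataframe round trip. The following is a minimal sketch using the Python client; the feature group name and columns are illustrative assumptions, not part of this guide.

```python
import hopsworks
import pandas as pd

# Assumes an API key is already configured for login (see the Python section below).
project = hopsworks.login()
fs = project.get_feature_store()

# Hypothetical feature group and columns, purely for illustration.
df = pd.DataFrame({"id": [1, 2], "amount": [10.5, 20.0]})
fg = fs.get_or_create_feature_group(
    name="transactions",
    version=1,
    primary_key=["id"],
)
fg.insert(df)          # write: a dataframe goes in
features = fg.read()   # read: a dataframe comes back
```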
Hopsworks supports running compute on the platform itself, in the form of Jobs or Jupyter notebooks. Alternatively, you can connect to Hopsworks using Python or Spark from external environments, provided there is network connectivity.
Functionality Support#
Hopsworks aims to provide functional parity between the computational engines; however, certain functionalities are exclusive to one of the engines.
Functionality | Method | Spark | Python | Comment |
---|---|---|---|---|
Training Dataset creation from dataframes | `TrainingDataset.save()` | ✔️ | - | Functionality was deprecated in version 3.0. |
Data validation using Great Expectations for streaming dataframes | `FeatureGroup.validate()` `FeatureGroup.insert_stream()` | - | - | `insert_stream` does not perform any data validation, even when an expectation suite is attached. |
Stream ingestion | `FeatureGroup.insert_stream()` | ✔️ | - | Python/Pandas currently has no notion of streaming. |
Reading from Streaming Storage Connectors | `KafkaConnector.read_stream()` | ✔️ | - | Python/Pandas currently has no notion of streaming. |
Reading training data from external storage other than S3 | `FeatureView.get_training_data()` | ✔️ | - | Training data written to external storage using a Storage Connector other than S3 cannot currently be read using HSFS APIs; use the storage's native client instead. |
Reading External Feature Groups into a dataframe | `ExternalFeatureGroup.read()` | ✔️ | - | Reading an External Feature Group directly into a Pandas DataFrame is not supported; however, you can use the Query API to create Feature Views/Training Data containing External Feature Groups. |
Reading Queries containing External Feature Groups into a dataframe | `Query.read()` | ✔️ | - | Reading a Query containing an External Feature Group directly into a Pandas DataFrame is not supported; however, you can use the Query to create Feature Views/Training Data, write the data to a Storage Connector, and read it from there into a Pandas DataFrame. |
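As the last two rows note, a pure Python client cannot read External Feature Groups, or Queries containing them, directly; it can, however, build a Feature View from such a Query and materialize training data. A hedged sketch, with hypothetical feature group names and columns:

```python
# fs is the feature store handle from a prior login (see the sketch above).
# Build a Query joining a regular and an External Feature Group.
fg = fs.get_feature_group("transactions", version=1)
ext_fg = fs.get_external_feature_group("exchange_rates", version=1)

query = fg.select_all().join(ext_fg.select(["rate"]))

# The Feature View can then materialize training data readable from Python.
fv = fs.get_or_create_feature_view(
    name="transactions_view",
    version=1,
    query=query,
)
features, labels = fv.training_data()
```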
Python#
Inside Hopsworks#
If you are using Python within Hopsworks, there is no further configuration required. Head over to the Getting Started Guide.
Outside Hopsworks#
Connecting to the Feature Store from any Python environment, such as your local environment or Google Colab, requires setting up an API Key and installing the HSFS Python client library. The Python integration guide explains step by step how to connect to the Feature Store from any Python environment.
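For reference, a connection from an external Python environment typically looks like the following sketch; the host, project name, and key are placeholders:

```python
# pip install hopsworks
import hopsworks

# Placeholder values; generate the API key in the Hopsworks UI first.
project = hopsworks.login(
    host="my-cluster.hopsworks.ai",
    project="my_project",
    api_key_value="MY_API_KEY",
)
fs = project.get_feature_store()
```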
Spark#
Inside Hopsworks#
If you are using Spark within Hopsworks, there is no further configuration required. Head over to the Getting Started Guide.
Outside Hopsworks#
Connecting to the Feature Store from an external Spark cluster, such as Cloudera or Databricks, requires configuring it with the Hopsworks client jars, configuration and certificates. The Spark integration guide explains step by step how to connect to the Feature Store from an external Spark cluster.
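Once the cluster itself is configured, the connection is established through the HSFS library. A sketch with placeholder values, assuming the jars and certificates from the guide are already in place:

```python
import hsfs

# Placeholder values; requires the Hopsworks client jars and certificates
# to be configured on the cluster as described in the Spark integration guide.
connection = hsfs.connection(
    host="my-cluster.hopsworks.ai",
    project="my_project",
    engine="spark",
    api_key_value="MY_API_KEY",
)
fs = connection.get_feature_store()
```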