Feature Store#
Retrieval#
get_feature_store#
Connection.get_feature_store(name=None)
Get a reference to a feature store to perform operations on.
Defaults to the project's default feature store. Shared feature stores can be
retrieved by passing the name argument.
Arguments
- name (str): The name of the feature store, defaults to None.
Returns
FeatureStore: A feature store handle object to perform operations on.
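As a minimal sketch of the two retrieval modes described above, the helper below takes an already-open connection and returns the project's default feature store and, optionally, a shared one. The helper name open_feature_stores and its shared_name parameter are hypothetical; only get_feature_store(name=None) comes from this page.

```python
def open_feature_stores(connection, shared_name=None):
    """Return (default_store, shared_store) for an open connection.

    `connection` is assumed to expose get_feature_store(name=None) as
    documented above; `shared_name` is a hypothetical shared store name.
    """
    # No argument: the project's default feature store.
    default_fs = connection.get_feature_store()
    # Passing a name retrieves a feature store shared with the project.
    shared_fs = None
    if shared_name is not None:
        shared_fs = connection.get_feature_store(name=shared_name)
    return default_fs, shared_fs
```

The second element of the returned tuple is None when no shared feature store name is given.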
Properties#
description#
Description of the feature store.
hive_endpoint#
Hive endpoint for the offline feature store.
id#
Id of the feature store.
mysql_server_endpoint#
MySQL server endpoint for the online feature store.
name#
Name of the feature store.
offline_featurestore_name#
Name of the offline feature store database.
online_enabled#
Indicator whether online feature store is enabled.
online_featurestore_name#
Name of the online feature store database.
project_id#
Id of the project in which the feature store is located.
project_name#
Name of the project in which the feature store is located.
Methods#
create_feature_group#
FeatureStore.create_feature_group(
name,
version=None,
description="",
online_enabled=False,
time_travel_format="HUDI",
partition_key=[],
primary_key=[],
hudi_precombine_key=None,
features=[],
statistics_config=None,
)
Create a feature group metadata object.
Lazy
This method is lazy and does not persist any metadata or feature data in the
feature store on its own. To persist the feature group and save feature data
along with the metadata in the feature store, call the save() method with a
DataFrame.
Arguments
- name (str): Name of the feature group to create.
- version (Optional[int]): Version of the feature group to create, defaults to None and will create the feature group with an incremented version from the last version in the feature store.
- description (Optional[str]): A string describing the contents of the feature group to improve discoverability for Data Scientists, defaults to empty string "".
- online_enabled (Optional[bool]): Define whether the feature group should also be made available in the online feature store for low latency access, defaults to False.
- time_travel_format (Optional[str]): Format used for time travel, defaults to "HUDI".
- partition_key (Optional[List[str]]): A list of feature names to be used as partition key when writing the feature data to the offline storage, defaults to empty list [].
- primary_key (Optional[List[str]]): A list of feature names to be used as primary key for the feature group. This primary key can be a composite key of multiple features and will be used as joining key, if not specified otherwise. Defaults to empty list [], and the first column of the DataFrame will be used as primary key.
- hudi_precombine_key (Optional[str]): A feature name to be used as a precombine key for the "HUDI" feature group. Defaults to None. If the feature group has time travel format "HUDI" and no hudi precombine key is specified, the first primary key of the feature group will be used as hudi precombine key.
- features (Optional[List[hsfs.feature.Feature]]): Optionally, define the schema of the feature group manually as a list of Feature objects. Defaults to empty list [] and will use the schema information of the DataFrame provided in the save method.
- statistics_config (Optional[Union[hsfs.StatisticsConfig, bool, dict]]): A configuration object, or a dictionary with keys "enabled" to generally enable descriptive statistics computation for this feature group, "correlations" to turn on feature correlation computation and "histograms" to compute feature value frequencies. The values should be booleans indicating the setting. To fully turn off statistics computation pass statistics_config=False. Defaults to None and will compute only descriptive statistics.
Returns
FeatureGroup: The feature group metadata object.
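The following is a hedged sketch of how the arguments above might be put together, not the library's own code. The helper names (feature_group_kwargs, expected_precombine_key, create_and_save) are hypothetical, fs is assumed to be a FeatureStore handle and df a DataFrame. expected_precombine_key only restates the fallback rule from the argument list.

```python
def feature_group_kwargs(name, primary_key, partition_key=(),
                         online_enabled=False, histograms=False,
                         correlations=False):
    """Assemble keyword arguments for FeatureStore.create_feature_group."""
    return {
        "name": name,
        "version": None,  # let the feature store pick the next version
        "online_enabled": online_enabled,
        "time_travel_format": "HUDI",
        "primary_key": list(primary_key),
        "partition_key": list(partition_key),
        # Dictionary form of statistics_config with the documented keys.
        "statistics_config": {
            "enabled": True,
            "correlations": correlations,
            "histograms": histograms,
        },
    }


def expected_precombine_key(primary_key, hudi_precombine_key=None,
                            time_travel_format="HUDI"):
    """Restate the documented fallback for hudi_precombine_key.

    Illustrative only: this mirrors the rule in the argument list and is
    not taken from the library's implementation.
    """
    if hudi_precombine_key is not None:
        return hudi_precombine_key
    if time_travel_format == "HUDI" and primary_key:
        return primary_key[0]
    return None  # case not covered by the documented rule


def create_and_save(fs, df, **kwargs):
    """Create the lazy metadata object, then persist it with save()."""
    fg = fs.create_feature_group(**kwargs)
    fg.save(df)  # nothing is written to the feature store before this call
    return fg
```

A call such as create_and_save(fs, df, **feature_group_kwargs("sales", ["store_id"])) would then create and persist the feature group in one step; the feature group and column names are placeholders.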
create_on_demand_feature_group#
FeatureStore.create_on_demand_feature_group(
name,
storage_connector,
query=None,
data_format=None,
path="",
options={},
version=None,
description="",
features=[],
statistics_config=None,
)
Create an on-demand feature group metadata object.
Lazy
This method is lazy and does not persist any metadata or feature data in the
feature store on its own. To persist the feature group and save feature data
along with the metadata in the feature store, call the save() method.
Arguments
- name (str): Name of the on-demand feature group to create.
- query (Optional[str]): A string containing a SQL query valid for the target data source. The query will be used to pull data from the data source when the feature group is used.
- data_format (Optional[str]): If the on-demand feature group refers to a directory with data, the data format to use when reading it.
- path (Optional[str]): The location within the scope of the storage connector from where to read the data for the on-demand feature group.
- storage_connector (hsfs.StorageConnector): The storage connector to use to establish connectivity with the data source.
- version (Optional[int]): Version of the on-demand feature group to create, defaults to None and will create the feature group with an incremented version from the last version in the feature store.
- description (Optional[str]): A string describing the contents of the on-demand feature group to improve discoverability for Data Scientists, defaults to empty string "".
- features (Optional[List[hsfs.feature.Feature]]): Optionally, define the schema of the on-demand feature group manually as a list of Feature objects. Defaults to empty list [] and will use the schema information of the DataFrame resulting from executing the provided query against the data source.
- statistics_config (Optional[Union[hsfs.StatisticsConfig, bool, dict]]): A configuration object, or a dictionary with keys "enabled" to generally enable descriptive statistics computation for this on-demand feature group, "correlations" to turn on feature correlation computation and "histograms" to compute feature value frequencies. The values should be booleans indicating the setting. To fully turn off statistics computation pass statistics_config=False. Defaults to None and will compute only descriptive statistics.
Returns
OnDemandFeatureGroup: The on-demand feature group metadata object.
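The arguments above suggest two ways to describe the source: a SQL query, or a path together with a data format. The sketch below assembles keyword arguments for either case. The helper name on_demand_kwargs and its validation are assumptions made for illustration and are not a documented constraint of the library; the "parquet" value in the usage note is likewise a placeholder.

```python
def on_demand_kwargs(name, storage_connector, query=None, path="",
                     data_format=None, options=None):
    """Assemble keyword arguments for create_on_demand_feature_group.

    Assumption: the source must be described by a query, or by a path
    plus a data_format. This check is ours, not the library's.
    """
    if query is None and not path:
        raise ValueError("provide a SQL query or a path")
    if query is None and data_format is None:
        raise ValueError("a path-based source needs a data_format")
    return {
        "name": name,
        "storage_connector": storage_connector,
        "query": query,
        "data_format": data_format,
        "path": path,
        # Copy so the caller's dictionary is never mutated.
        "options": dict(options or {}),
    }
```

For example, on_demand_kwargs("events", sc, path="events/", data_format="parquet") describes a file-backed source, where sc is a storage connector obtained elsewhere. The result can be unpacked into fs.create_on_demand_feature_group(**kwargs), followed by save() on the returned object.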
create_training_dataset#
FeatureStore.create_training_dataset(
name,
version=None,
description="",
data_format="tfrecords",
storage_connector=None,
splits={},
location="",
seed=None,
statistics_config=None,
label=[],
)
Create a training dataset metadata object.
Lazy
This method is lazy and does not persist any metadata or feature data in the
feature store on its own. To materialize the training dataset and save
feature data along with the metadata in the feature store, call the save()
method with a DataFrame or Query.
Data Formats
The feature store currently supports the following data formats for training datasets:
- tfrecord
- csv
- tsv
- parquet
- avro
- orc
The petastorm, hdf5 and npy file formats are currently not supported.
Arguments
- name (str): Name of the training dataset to create.
- version (Optional[int]): Version of the training dataset to create, defaults to None and will create the training dataset with an incremented version from the last version in the feature store.
- description (Optional[str]): A string describing the contents of the training dataset to improve discoverability for Data Scientists, defaults to empty string "".
- data_format (Optional[str]): The data format used to save the training dataset, defaults to "tfrecords".
- storage_connector (Optional[hsfs.StorageConnector]): Storage connector defining the sink location for the training dataset, defaults to None, and materializes the training dataset on HopsFS.
- splits (Optional[Dict[str, float]]): A dictionary defining training dataset splits to be created. Keys in the dictionary define the name of the split as str, values represent the percentage of samples in the split as float. Currently, only random splits are supported. Defaults to empty dict {}, creating only a single training dataset without splits.
- location (Optional[str]): Path to complement the sink storage connector with, e.g., if the storage connector points to an S3 bucket, this path can be used to define a sub-directory inside the bucket to place the training dataset. Defaults to "", saving the training dataset at the root defined by the storage connector.
- seed (Optional[int]): Optionally, define a seed to create the random splits with, in order to guarantee reproducibility, defaults to None.
- statistics_config (Optional[Union[hsfs.StatisticsConfig, bool, dict]]): A configuration object, or a dictionary with keys "enabled" to generally enable descriptive statistics computation for this training dataset, "correlations" to turn on feature correlation computation and "histograms" to compute feature value frequencies. The values should be booleans indicating the setting. To fully turn off statistics computation pass statistics_config=False. Defaults to None and will compute only descriptive statistics.
- label (Optional[List[str]]): A list of feature names constituting the prediction label/feature of the training dataset. When replaying a Query during model inference, the label features can be omitted from the feature vector retrieval. Defaults to [], no label.
Returns
TrainingDataset: The training dataset metadata object.
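To illustrate the splits, seed and label arguments, here is a small sketch that builds a splits dictionary and the keyword arguments for create_training_dataset. The helper names are hypothetical, and the check that split values are fractions adding up to 1 is an assumption about how the floats are interpreted; the text above only says they represent the share of samples per split.

```python
import math


def make_splits(**fractions):
    """Build a splits dictionary mapping split name to sample fraction.

    Assumption: values are fractions in (0, 1] that add up to 1.
    """
    if not fractions:
        return {}  # documented default: a single dataset without splits
    for split_name, fraction in fractions.items():
        if not 0.0 < fraction <= 1.0:
            raise ValueError(f"fraction for {split_name!r} must be in (0, 1]")
    if not math.isclose(sum(fractions.values()), 1.0):
        raise ValueError("split fractions should add up to 1")
    return {split_name: float(f) for split_name, f in fractions.items()}


def training_dataset_kwargs(name, splits, seed=None, label=(),
                            data_format="tfrecords"):
    """Assemble keyword arguments for FeatureStore.create_training_dataset."""
    return {
        "name": name,
        "data_format": data_format,
        "splits": splits,
        "seed": seed,  # a fixed seed makes the random splits reproducible
        "label": list(label),
    }
```

The resulting dictionary can be unpacked into fs.create_training_dataset(**kwargs), after which save() is called on the returned object with a DataFrame or Query to materialize the data.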
get_feature_group#
FeatureStore.get_feature_group(name, version=None)
Get a feature group entity from the feature store.
Getting a feature group from the Feature Store means getting its metadata handle
so you can subsequently read the data into a Spark or Pandas DataFrame or use
the Query API to perform joins between feature groups.
Arguments
- name (str): Name of the feature group to get.
- version (Optional[int]): Version of the feature group to retrieve, defaults to None and will return version=1.
Returns
FeatureGroup: The feature group metadata object.
Raises
RestAPIError: If unable to retrieve the feature group from the feature store.
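Because version defaults to None and then resolves to version=1, it can be clearer to pin the version explicitly. The sketch below does that and turns a failed lookup into None. The helper name try_get_feature_group and its error_type parameter are hypothetical; in real use one would pass the library's RestAPIError class as error_type.

```python
def try_get_feature_group(fs, name, version=1, error_type=Exception):
    """Return the feature group handle, or None if retrieval fails.

    `fs` is assumed to be a FeatureStore handle. `error_type` is the
    exception class to treat as "not retrievable"; it is a parameter here
    so the sketch does not depend on the library's import path.
    """
    try:
        # Always pass the version explicitly instead of relying on the default.
        return fs.get_feature_group(name, version=version)
    except error_type:
        return None
```

Catching the error is only appropriate when a missing feature group is an expected situation; otherwise letting the exception propagate gives a clearer failure.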
get_on_demand_feature_group#
FeatureStore.get_on_demand_feature_group(name, version=None)
Get an on-demand feature group entity from the feature store.
Getting an on-demand feature group from the Feature Store means getting its
metadata handle so you can subsequently read the data into a Spark or
Pandas DataFrame or use the Query API to perform joins between feature groups.
Arguments
- name (str): Name of the on-demand feature group to get.
- version (Optional[int]): Version of the on-demand feature group to retrieve, defaults to None and will return version=1.
Returns
OnDemandFeatureGroup: The on-demand feature group metadata object.
Raises
RestAPIError: If unable to retrieve the on-demand feature group from the feature store.
get_online_storage_connector#
FeatureStore.get_online_storage_connector()
Get the storage connector for the Online Feature Store of the respective project's feature store.
The returned storage connector depends on the project that you are connected to.
Returns
StorageConnector: JDBC storage connector to the Online Feature Store.
get_storage_connector#
FeatureStore.get_storage_connector(name)
Get a previously created storage connector from the feature store.
Storage connectors encapsulate all information needed for the execution engine to read from and write to a specific storage. This storage can be S3, a JDBC compliant database or the distributed filesystem HopsFS.
If you want to connect to the online feature store, see the
get_online_storage_connector method to get the JDBC connector for the Online
Feature Store.
Getting a Storage Connector
sc = fs.get_storage_connector("demo_fs_meb10000_Training_Datasets")
td = fs.create_training_dataset(..., storage_connector=sc, ...)
Arguments
- name (str): Name of the storage connector to retrieve.
Returns
StorageConnector: Storage connector object.
get_training_dataset#
FeatureStore.get_training_dataset(name, version=None)
Get a training dataset entity from the feature store.
Getting a training dataset from the Feature Store means getting its metadata handle so you can subsequently read the data into a Spark or Pandas DataFrame.
Arguments
- name (str): Name of the training dataset to get.
- version (Optional[int]): Version of the training dataset to retrieve, defaults to None and will return version=1.
Returns
TrainingDataset: The training dataset metadata object.
Raises
RestAPIError: If unable to retrieve the training dataset from the feature store.
sql#
FeatureStore.sql(query, dataframe_type="default", online=False)