Training Dataset#
TrainingDataset#
hsfs.training_dataset.TrainingDataset(
name,
version,
data_format,
featurestore_id,
location="",
event_start_time=None,
event_end_time=None,
coalesce=False,
description=None,
storage_connector=None,
splits=None,
validation_size=None,
test_size=None,
train_start=None,
train_end=None,
validation_start=None,
validation_end=None,
test_start=None,
test_end=None,
seed=None,
created=None,
creator=None,
features=None,
statistics_config=None,
featurestore_name=None,
id=None,
inode_id=None,
training_dataset_type=None,
from_query=None,
querydto=None,
label=None,
transformation_functions=None,
train_split=None,
time_split_size=None,
extra_filter=None,
**kwargs
)
Creation#
create_training_dataset#
FeatureStore.create_training_dataset(
name,
version=None,
description="",
data_format="tfrecords",
coalesce=False,
storage_connector=None,
splits=None,
location="",
seed=None,
statistics_config=None,
label=None,
transformation_functions=None,
train_split=None,
)
Create a training dataset metadata object.
Deprecated
TrainingDataset
is deprecated, use FeatureView
instead. From version 3.0, training datasets created with this API are no longer visible in the API.
Lazy
This method is lazy and does not persist any metadata or feature data in the feature store on its own. To materialize the training dataset and save feature data along with the metadata in the feature store, call the save()
method with a DataFrame
or Query
.
Data Formats
The feature store currently supports the following data formats for training datasets:
- tfrecord
- csv
- tsv
- parquet
- avro
- orc
The petastorm, hdf5 and npy file formats are currently not supported.
Arguments
- name str: Name of the training dataset to create.
- version int | None: Version of the training dataset to create, defaults to None, which creates the training dataset with the version incremented from the latest version in the feature store.
- description str | None: A string describing the contents of the training dataset to improve discoverability for Data Scientists, defaults to empty string "".
- data_format str | None: The data format used to save the training dataset, defaults to the "tfrecords" format.
- coalesce bool | None: If true the training dataset data will be coalesced into a single partition before writing. The resulting training dataset will be a single file per split. Defaults to False.
- storage_connector hsfs.StorageConnector | None: Storage connector defining the sink location for the training dataset, defaults to None, materializing the training dataset on HopsFS.
- splits Dict[str, float] | None: A dictionary defining training dataset splits to be created. Keys in the dictionary define the name of the split as str, values represent the percentage of samples in the split as float. Currently, only random splits are supported. Defaults to empty dict {}, creating only a single training dataset without splits.
- location str | None: Path to complement the sink storage connector with, e.g. if the storage connector points to an S3 bucket, this path can be used to define a sub-directory inside the bucket to place the training dataset. Defaults to "", saving the training dataset at the root defined by the storage connector.
- seed int | None: Optionally, define a seed to create the random splits with, in order to guarantee reproducibility, defaults to None.
- statistics_config hsfs.StatisticsConfig | bool | dict | None: A configuration object, or a dictionary with keys "enabled" to generally enable descriptive statistics computation for this feature group, "correlations" to turn on feature correlation computation and "histograms" to compute feature value frequencies. The values should be booleans indicating the setting. To fully turn off statistics computation pass statistics_config=False. Defaults to None and will compute only descriptive statistics.
- label List[str] | None: A list of feature names constituting the prediction label/feature of the training dataset. When replaying a Query during model inference, the label features can be omitted from the feature vector retrieval. Defaults to [], no label.
- transformation_functions Dict[str, hsfs.transformation_function.TransformationFunction] | None: A dictionary mapping transformation functions to the features they should be applied to before writing out the training data and at inference time. Defaults to {}, no transformations.
- train_split str | None: If splits is set, provide the name of the split that is going to be used for training. The statistics of this split will be used for transformation functions if necessary. Defaults to None.
Returns
TrainingDataset
: The training dataset metadata object.
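Example
A minimal sketch (hypothetical names; assumes a running Hopsworks cluster and a query object selecting the features, e.g. built with feature_group.select_all()):

import hsfs

connection = hsfs.connection()  # connection arguments are environment-specific
fs = connection.get_feature_store()

td = fs.create_training_dataset(
    name="churn_model_data",  # hypothetical name
    version=1,
    data_format="csv",
    splits={"train": 0.8, "test": 0.2},
    train_split="train",
    seed=42,
    label=["churned"],  # hypothetical label feature
)
td.save(query)  # materialize the feature data; `query` is a hsfs Query object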
Retrieval#
get_training_dataset#
FeatureStore.get_training_dataset(name, version=None)
Get a training dataset entity from the feature store.
Deprecated
TrainingDataset
is deprecated, use FeatureView
instead. You can still retrieve old training datasets using this method, but after upgrading the old training datasets will also be available under a Feature View with the same name and version.
It is recommended to use this method only for old training datasets that have been created directly from Dataframes and not with Query objects.
Getting a training dataset from the Feature Store means getting its metadata handle so you can subsequently read the data into a Spark or Pandas DataFrame.
Arguments
- name str: Name of the training dataset to get.
- version int | None: Version of the training dataset to retrieve, defaults to None and will return version=1.
Returns
TrainingDataset
: The training dataset metadata object.
Raises
hsfs.client.exceptions.RestAPIError
: If unable to retrieve training dataset from the feature store.
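Example
A short sketch, continuing the hypothetical names from the creation example:

td = fs.get_training_dataset("churn_model_data", version=1)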
Properties#
coalesce#
If true the training dataset data will be coalesced into a single partition before writing. The resulting training dataset will be a single file per split.
data_format#
File format of the training dataset.
description#
event_end_time#
event_start_time#
extra_filter#
feature_store_id#
feature_store_name#
Name of the feature store in which the training dataset is located.
id#
Training dataset id.
label#
The label/prediction feature of the training dataset.
Can be a composite of multiple features.
location#
Path to the training dataset location. Can be an empty string if e.g. the training dataset is in-memory.
name#
Name of the training dataset.
query#
Query to generate this training dataset from online feature store.
schema#
Training dataset schema.
seed#
Seed used to perform the random split, ensuring reproducibility of the split at a later date.
serving_keys#
Set of primary key names used as keys in the input dict object for the get_serving_vector method.
splits#
Training dataset splits. train, test or eval and corresponding percentages.
statistics#
Get computed statistics for the training dataset.
Returns
Statistics
. Object with statistics information.
statistics_config#
Statistics configuration object defining the settings for statistics computation of the training dataset.
storage_connector#
Storage connector.
test_end#
test_size#
test_start#
train_end#
train_split#
Name of the training dataset split that is used for training.
train_start#
training_dataset_type#
transformation_functions#
Transformation functions applied to the features of the training dataset.
validation_end#
validation_size#
validation_start#
version#
Version number of the training dataset.
write_options#
User provided options to write training dataset.
Methods#
add_tag#
TrainingDataset.add_tag(name, value)
Attach a tag to a training dataset.
A tag consists of a name-value pair.
Arguments
- name str: Name of the tag to be added.
- value: Value of the tag to be added.
Raises
hsfs.client.exceptions.RestAPIError
in case the backend fails to add the tag.
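Example
A one-line sketch with a hypothetical tag name and value:

td.add_tag("owner", "data-science-team")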
compute_statistics#
TrainingDataset.compute_statistics()
Compute the statistics for the training dataset and save them to the feature store.
delete#
TrainingDataset.delete()
Delete training dataset and all associated metadata.
Drops only HopsFS data
Note that this operation drops only files which were materialized in HopsFS. If you used a Storage Connector for a cloud storage such as S3, the data will not be deleted, but you will not be able to track it anymore from the Feature Store.
Potentially dangerous operation
This operation drops all metadata associated with this version of the training dataset and the materialized data in HopsFS.
Raises
hsfs.client.exceptions.RestAPIError
.
delete_tag#
TrainingDataset.delete_tag(name)
Delete a tag attached to a training dataset.
Arguments
- name str: Name of the tag to be removed.
Raises
hsfs.client.exceptions.RestAPIError
in case the backend fails to delete the tag.
from_response_json#
TrainingDataset.from_response_json(json_dict)
from_response_json_single#
TrainingDataset.from_response_json_single(json_dict)
get_query#
TrainingDataset.get_query(online=True, with_label=False)
Returns the query used to generate this training dataset.
Arguments
- online bool: Boolean, optional. Return the query for the online storage, else for the offline storage, defaults to True (online storage).
- with_label bool: Indicator whether the query should contain features which were marked as prediction label/feature when the training dataset was created, defaults to False.
Returns
str
. Query string for the chosen storage used to generate this training dataset.
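Example
A short sketch:

sql = td.get_query(online=False, with_label=True)  # offline query, including label features
print(sql)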
get_serving_vector#
TrainingDataset.get_serving_vector(entry, external=None)
Returns assembled serving vector from online feature store.
Arguments
- entry Dict[str, Any]: Dictionary of training dataset feature group primary key names as keys and values provided by the serving application.
- external bool | None: Boolean, optional. If set to True, the connection to the online feature store is established using the same host as for the host parameter in the hsfs.connection() method. If set to False, the online feature store storage connector is used, which relies on the private IP. Defaults to True if the connection to Hopsworks is established from an external environment (e.g. AWS SageMaker or Google Colab), otherwise to False.
Returns
list. List of feature values related to the provided primary keys, ordered according to the positions of these features in the training dataset query.
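Example
A sketch with a hypothetical primary key name:

vector = td.get_serving_vector({"customer_id": 42})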
get_serving_vectors#
TrainingDataset.get_serving_vectors(entry, external=None)
Returns assembled serving vectors in batches from online feature store.
Arguments
- entry Dict[str, List[Any]]: Dictionary of feature group primary key names as keys and lists of primary key values provided by the serving application.
- external bool | None: Boolean, optional. If set to True, the connection to the online feature store is established using the same host as for the host parameter in the hsfs.connection() method. If set to False, the online feature store storage connector is used, which relies on the private IP. Defaults to True if the connection to Hopsworks is established from an external environment (e.g. AWS SageMaker or Google Colab), otherwise to False.
Returns
List[list]. List of lists of feature values related to the provided primary keys, ordered according to the positions of these features in the training dataset query.
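Example
The batched counterpart, again with a hypothetical key:

vectors = td.get_serving_vectors({"customer_id": [42, 43, 44]})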
get_tag#
TrainingDataset.get_tag(name)
Get a single tag attached to a training dataset by name.
Arguments
- name: Name of the tag to get.
Returns
tag value
Raises
hsfs.client.exceptions.RestAPIError
in case the backend fails to retrieve the tag.
get_tags#
TrainingDataset.get_tags()
Returns all tags attached to a training dataset.
Returns
Dict[str, obj]
of tags.
Raises
hsfs.client.exceptions.RestAPIError
in case the backend fails to retrieve the tags.
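Example
A short sketch using the hypothetical tag from add_tag:

owner = td.get_tag("owner")  # value of a single tag
tags = td.get_tags()  # dict of all tags attached to the training dataset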
init_prepared_statement#
TrainingDataset.init_prepared_statement(batch=None, external=None)
Initialise and cache parametrized prepared statement to retrieve feature vector from online feature store.
Arguments
- batch bool | None: Boolean, optional. If set to True, prepared statements will be initialised for retrieving serving vectors as a batch.
- external bool | None: Boolean, optional. If set to True, the connection to the online feature store is established using the same host as for the host parameter in the hsfs.connection() method. If set to False, the online feature store storage connector is used, which relies on the private IP. Defaults to True if the connection to Hopsworks is established from an external environment (e.g. AWS SageMaker or Google Colab), otherwise to False.
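Example
A sketch preparing statements for batched retrieval:

td.init_prepared_statement(batch=True)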
insert#
TrainingDataset.insert(features, overwrite, write_options=None)
Insert additional feature data into the training dataset.
Deprecated
insert
method is deprecated.
This method appends data to the training dataset either from a Feature Store Query
, a Spark or Pandas DataFrame
, a Spark RDD, two-dimensional Python lists or Numpy ndarrays. The schemas must match for this operation.
This can also be used to overwrite all data in an existing training dataset.
Arguments
- features hsfs.constructor.query.Query | pandas.DataFrame | pyspark.sql.DataFrame | pyspark.RDD | numpy.ndarray | List[list]: Feature data to be materialized.
- overwrite bool: Whether to overwrite the entire data in the training dataset.
- write_options Dict[Any, Any] | None: Additional write options as key-value pairs, defaults to {}. When using the python engine, write_options can contain the following entries:
  - key spark and value an object of type hsfs.core.job_configuration.JobConfiguration to configure the Hopsworks Job used to compute the training dataset.
  - key wait_for_job and value True or False to configure whether or not the insert call should return only after the Hopsworks Job has finished. By default it waits.
Returns
Job
: When using the python
engine, it returns the Hopsworks Job that was launched to create the training dataset.
Raises
hsfs.client.exceptions.RestAPIError
: Unable to create training dataset metadata.
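Example
A sketch appending new rows (df_new is a hypothetical dataframe matching the training dataset schema):

job = td.insert(df_new, overwrite=False, write_options={"wait_for_job": True})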
json#
TrainingDataset.json()
read#
TrainingDataset.read(split=None, read_options=None)
Read the training dataset into a dataframe.
It is also possible to read only a specific split.
Arguments
- split: Name of the split to read, defaults to None, reading the entire training dataset. If the training dataset has splits, the split parameter is mandatory.
- read_options: Additional read options as key/value pairs, defaults to {}.
Returns
DataFrame
: The Spark dataframe containing the feature data of the training dataset.
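Example
A short sketch:

train_df = td.read(split="train")  # the split argument is mandatory if the dataset has splits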
save#
TrainingDataset.save(features, write_options=None)
Materialize the training dataset to storage.
This method materializes the training dataset either from a Feature Store Query
, a Spark or Pandas DataFrame
, a Spark RDD, two-dimensional Python lists or Numpy ndarrays. From v2.5 onward, filters are saved along with the Query
.
Engine Support
Creating Training Datasets from Dataframes is only supported using Spark as Engine.
Arguments
- features hsfs.constructor.query.Query | pandas.DataFrame | pyspark.sql.DataFrame | pyspark.RDD | numpy.ndarray | List[list]: Feature data to be materialized.
- write_options Dict[Any, Any] | None: Additional write options as key-value pairs, defaults to {}. When using the python engine, write_options can contain the following entries:
  - key spark and value an object of type hsfs.core.job_configuration.JobConfiguration to configure the Hopsworks Job used to compute the training dataset.
  - key wait_for_job and value True or False to configure whether or not the save call should return only after the Hopsworks Job has finished. By default it waits.
Returns
Job
: When using the python
engine, it returns the Hopsworks Job that was launched to create the training dataset.
Raises
hsfs.client.exceptions.RestAPIError
: Unable to create training dataset metadata.
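Example
A sketch materializing from a Query (Spark engine, per the note above):

job = td.save(query, write_options={"wait_for_job": False})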
show#
TrainingDataset.show(n, split=None)
Show the first n
rows of the training dataset.
You can specify a split from which to retrieve the rows.
Arguments
- n int: Number of rows to show.
- split str | None: Name of the split to show, defaults to None, showing the first rows when taking all splits together.
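Example
A short sketch:

td.show(5, split="train")  # preview the first 5 rows of the train split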
to_dict#
TrainingDataset.to_dict()
update_from_response_json#
TrainingDataset.update_from_response_json(json_dict)
update_statistics_config#
TrainingDataset.update_statistics_config()
Update the statistics configuration of the training dataset.
Change the statistics_config
object and persist the changes by calling this method.
Returns
TrainingDataset
. The updated metadata object of the training dataset.
Raises
hsfs.client.exceptions.RestAPIError
.
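Example
A sketch enabling histogram computation and persisting the change:

td.statistics_config.histograms = True
td = td.update_statistics_config()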