Training Dataset#

[source]

TrainingDataset#

hsfs.training_dataset.TrainingDataset(
    name,
    version,
    data_format,
    featurestore_id,
    location="",
    event_start_time=None,
    event_end_time=None,
    coalesce=False,
    description=None,
    storage_connector=None,
    splits=None,
    validation_size=None,
    test_size=None,
    train_start=None,
    train_end=None,
    validation_start=None,
    validation_end=None,
    test_start=None,
    test_end=None,
    seed=None,
    created=None,
    creator=None,
    features=None,
    statistics_config=None,
    featurestore_name=None,
    id=None,
    inode_id=None,
    training_dataset_type=None,
    from_query=None,
    querydto=None,
    label=None,
    transformation_functions=None,
    train_split=None,
    time_split_size=None,
    extra_filter=None,
    **kwargs
)

Creation#

[source]

create_training_dataset#

FeatureStore.create_training_dataset(
    name,
    version=None,
    description="",
    data_format="tfrecords",
    coalesce=False,
    storage_connector=None,
    splits=None,
    location="",
    seed=None,
    statistics_config=None,
    label=None,
    transformation_functions=None,
    train_split=None,
)

Create a training dataset metadata object.

Deprecated

TrainingDataset is deprecated, use FeatureView instead. From version 3.0 onward, training datasets created with this API are no longer visible in the API.

Lazy

This method is lazy and does not persist any metadata or feature data in the feature store on its own. To materialize the training dataset and save the feature data along with the metadata in the feature store, call the save() method with a DataFrame or Query.

Data Formats

The feature store currently supports the following data formats for training datasets:

  1. tfrecord
  2. csv
  3. tsv
  4. parquet
  5. avro
  6. orc

The petastorm, hdf5 and npy file formats are currently not supported.

Arguments

  • name str: Name of the training dataset to create.
  • version int | None: Version of the training dataset to create, defaults to None and will create the training dataset with an incremented version from the last version in the feature store.
  • description str | None: A string describing the contents of the training dataset to improve discoverability for Data Scientists, defaults to empty string "".
  • data_format str | None: The data format used to save the training dataset, defaults to "tfrecords"-format.
  • coalesce bool | None: If true the training dataset data will be coalesced into a single partition before writing. The resulting training dataset will be a single file per split. Default False.
  • storage_connector hsfs.StorageConnector | None: Storage connector defining the sink location for the training dataset, defaults to None, and materializes training dataset on HopsFS.
  • splits Dict[str, float] | None: A dictionary defining training dataset splits to be created. Keys in the dictionary define the name of the split as str, values represent the percentage of samples in the split as float. Currently, only random splits are supported. Defaults to an empty dict {}, creating only a single training dataset without splits.
  • location str | None: Path to complement the sink storage connector with, e.g. if the storage connector points to an S3 bucket, this path can be used to define a sub-directory inside the bucket to place the training dataset. Defaults to "", saving the training dataset at the root defined by the storage connector.
  • seed int | None: Optionally, define a seed to create the random splits with, in order to guarantee reproducibility, defaults to None.
  • statistics_config hsfs.StatisticsConfig | bool | dict | None: A configuration object, or a dictionary with keys "enabled" to generally enable descriptive statistics computation for this training dataset, "correlations" to turn on feature correlation computation and "histograms" to compute feature value frequencies. The values should be booleans indicating the setting. To fully turn off statistics computation pass statistics_config=False. Defaults to None and will compute only descriptive statistics.
  • label List[str] | None: A list of feature names constituting the prediction label/feature of the training dataset. When replaying a Query during model inference, the label features can be omitted from the feature vector retrieval. Defaults to [], no label.
  • transformation_functions Dict[str, hsfs.transformation_function.TransformationFunction] | None: A dictionary mapping transformation functions to the features they should be applied to before writing out the training data and at inference time. Defaults to {}, no transformations.
  • train_split str | None: If splits is set, provide the name of the split that is going to be used for training. The statistics of this split will be used for transformation functions if necessary. Defaults to None.

Returns:

TrainingDataset: The training dataset metadata object.
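Putting the arguments above together, a minimal creation sketch might look as follows. The feature store handle `fs`, the `query` object, and the dataset name are assumptions for illustration; the hsfs calls are commented out because they require a live Hopsworks cluster, so only the split-percentage check runs standalone.

```python
# Hypothetical example: create a training dataset with random splits.
# Split percentages are floats that together should cover the whole dataset.
splits = {"train": 0.7, "validation": 0.2, "test": 0.1}
assert abs(sum(splits.values()) - 1.0) < 1e-9

# td = fs.create_training_dataset(
#     name="sales_model_td",   # hypothetical name
#     version=1,
#     data_format="tfrecords",
#     splits=splits,
#     train_split="train",     # split whose statistics drive transformations
#     seed=42,                 # fixed seed for reproducible splits
# )
# td.save(query)  # create_training_dataset() is lazy; save() materializes the data
```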


Retrieval#

[source]

get_training_dataset#

FeatureStore.get_training_dataset(name, version=None)

Get a training dataset entity from the feature store.

Deprecated

TrainingDataset is deprecated, use FeatureView instead. You can still retrieve old training datasets using this method, but after upgrading, the old training datasets will also be available under a Feature View with the same name and version.

It is recommended to use this method only for old training datasets that were created directly from DataFrames and not with Query objects.

Getting a training dataset from the Feature Store means getting its metadata handle so you can subsequently read the data into a Spark or Pandas DataFrame.

Arguments

  • name str: Name of the training dataset to get.
  • version int | None: Version of the training dataset to retrieve, defaults to None, returning version 1.

Returns

TrainingDataset: The training dataset metadata object.

Raises

  • hsfs.client.exceptions.RestAPIError: If unable to retrieve training dataset from the feature store.

Properties#

[source]

coalesce#

If true the training dataset data will be coalesced into a single partition before writing. The resulting training dataset will be a single file per split.


[source]

data_format#

File format of the training dataset.


[source]

description#


[source]

event_end_time#


[source]

event_start_time#


[source]

extra_filter#


[source]

feature_store_id#


[source]

feature_store_name#

Name of the feature store in which the training dataset is located.


[source]

id#

Training dataset id.


[source]

label#

The label/prediction feature of the training dataset.

Can be a composite of multiple features.


[source]

location#

Path to the training dataset location. Can be an empty string if e.g. the training dataset is in-memory.


[source]

name#

Name of the training dataset.


[source]

query#

Query to generate this training dataset from online feature store.


[source]

schema#

Training dataset schema.


[source]

seed#

Seed used to perform the random split, ensuring reproducibility of the split at a later date.


[source]

serving_keys#

Set of primary key names used as keys in the input dict for the get_serving_vector method.


[source]

splits#

Training dataset splits: the split names, e.g. train, test or eval, and their corresponding percentages.


[source]

statistics#

Get computed statistics for the training dataset.

Returns

Statistics. Object with statistics information.


[source]

statistics_config#

Statistics configuration object defining the settings for statistics computation of the training dataset.


[source]

storage_connector#

Storage connector.


[source]

test_end#


[source]

test_size#


[source]

test_start#


[source]

train_end#


[source]

train_split#

Name of the training dataset split that is used for training.


[source]

train_start#


[source]

training_dataset_type#


[source]

transformation_functions#

Set transformation functions.


[source]

validation_end#


[source]

validation_size#


[source]

validation_start#


[source]

version#

Version number of the training dataset.


[source]

write_options#

User provided options to write training dataset.


Methods#

[source]

add_tag#

TrainingDataset.add_tag(name, value)

Attach a tag to a training dataset.

A tag consists of a name/value pair. Tag names are unique identifiers across the whole cluster. The value of a tag can be any valid JSON - primitives, arrays or JSON objects.

Arguments

  • name str: Name of the tag to be added.
  • value: Value of the tag to be added.

Raises

hsfs.client.exceptions.RestAPIError in case the backend fails to add the tag.
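Since tag values can be any valid JSON, a quick serializability check before attaching can catch bad values early. The tag name and value below are hypothetical, and the add_tag call is commented out because it needs a live backend.

```python
import json

# Hypothetical tag value: any JSON-serializable structure works.
tag_value = {"owner": "data-science", "pii": False, "columns": ["age", "income"]}
json.dumps(tag_value)  # raises TypeError if the value is not valid JSON

# td.add_tag("governance", tag_value)
```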


[source]

compute_statistics#

TrainingDataset.compute_statistics()

Compute the statistics for the training dataset and save them to the feature store.


[source]

delete#

TrainingDataset.delete()

Delete training dataset and all associated metadata.

Drops only HopsFS data

Note that this operation drops only files which were materialized in HopsFS. If you used a Storage Connector for a cloud storage such as S3, the data will not be deleted, but you will not be able to track it anymore from the Feature Store.

Potentially dangerous operation

This operation drops all metadata associated with this version of the training dataset and the materialized data in HopsFS.

Raises

hsfs.client.exceptions.RestAPIError.


[source]

delete_tag#

TrainingDataset.delete_tag(name)

Delete a tag attached to a training dataset.

Arguments

  • name str: Name of the tag to be removed.

Raises

hsfs.client.exceptions.RestAPIError in case the backend fails to delete the tag.


[source]

from_response_json#

TrainingDataset.from_response_json(json_dict)

[source]

from_response_json_single#

TrainingDataset.from_response_json_single(json_dict)

[source]

get_query#

TrainingDataset.get_query(online=True, with_label=False)

Returns the query used to generate this training dataset.

Arguments

  • online bool: Whether to return the query for the online storage; if False, the query for the offline storage is returned. Defaults to True.
  • with_label bool: Indicator whether the query should contain features which were marked as prediction label/feature when the training dataset was created, defaults to False.

Returns

str. Query string for the chosen storage used to generate this training dataset.


[source]

get_serving_vector#

TrainingDataset.get_serving_vector(entry, external=None)

Returns assembled serving vector from online feature store.

Arguments

  • entry Dict[str, Any]: Dictionary with the training dataset's feature group primary key names as keys and the corresponding key values provided by the serving application.
  • external bool | None: boolean, optional. If set to True, the connection to the online feature store is established using the same host as the host parameter of the hsfs.connection() method. If set to False, the online feature store storage connector is used, which relies on the private IP. Defaults to True if the connection to Hopsworks is established from an external environment (e.g. AWS SageMaker or Google Colab), otherwise to False.

Returns

list: List of feature values related to the provided primary keys, ordered according to the positions of these features in the training dataset query.


[source]

get_serving_vectors#

TrainingDataset.get_serving_vectors(entry, external=None)

Returns assembled serving vectors in batches from online feature store.

Arguments

  • entry Dict[str, List[Any]]: Dictionary with feature group primary key names as keys and lists of primary key values provided by the serving application as values.
  • external bool | None: boolean, optional. If set to True, the connection to the online feature store is established using the same host as the host parameter of the hsfs.connection() method. If set to False, the online feature store storage connector is used, which relies on the private IP. Defaults to True if the connection to Hopsworks is established from an external environment (e.g. AWS SageMaker or Google Colab), otherwise to False.

Returns

List[list]: List of lists of feature values related to the provided primary keys, ordered according to the positions of these features in the training dataset query.
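The shape of the entry dictionaries differs between the two methods: get_serving_vector takes one value per primary key, while get_serving_vectors takes a list of values per key. The key names below are hypothetical, and the actual calls are commented out since they need an online feature store connection.

```python
# Single vector: one value per feature group primary key (hypothetical keys).
entry = {"customer_id": 42, "store_id": 7}
# vector = td.get_serving_vector(entry)

# Batch retrieval: a list of values per key; the lists are expected to be
# of equal length so they pair up into individual lookups.
entry_batch = {"customer_id": [42, 43, 44], "store_id": [7, 7, 9]}
# vectors = td.get_serving_vectors(entry_batch)
assert len({len(v) for v in entry_batch.values()}) == 1
```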


[source]

get_tag#

TrainingDataset.get_tag(name)

Get a tag attached to a training dataset.

Arguments

  • name: Name of the tag to get.

Returns

tag value

Raises

hsfs.client.exceptions.RestAPIError in case the backend fails to retrieve the tag.


[source]

get_tags#

TrainingDataset.get_tags()

Returns all tags attached to a training dataset.

Returns

Dict[str, obj] of tags.

Raises

hsfs.client.exceptions.RestAPIError in case the backend fails to retrieve the tags.


[source]

init_prepared_statement#

TrainingDataset.init_prepared_statement(batch=None, external=None)

Initialise and cache a parametrized prepared statement to retrieve feature vectors from the online feature store.

Arguments

  • batch bool | None: boolean, optional. If set to True, prepared statements will be initialised for retrieving serving vectors as a batch.
  • external bool | None: boolean, optional. If set to True, the connection to the online feature store is established using the same host as the host parameter of the hsfs.connection() method. If set to False, the online feature store storage connector is used, which relies on the private IP. Defaults to True if the connection to Hopsworks is established from an external environment (e.g. AWS SageMaker or Google Colab), otherwise to False.

[source]

insert#

TrainingDataset.insert(features, overwrite, write_options=None)

Insert additional feature data into the training dataset.

Deprecated

The insert method is deprecated.

This method appends data to the training dataset either from a Feature Store Query, a Spark or Pandas DataFrame, a Spark RDD, two-dimensional Python lists or Numpy ndarrays. The schemas must match for this operation.

This can also be used to overwrite all data in an existing training dataset.

Arguments

  • features hsfs.constructor.query.Query | pandas.DataFrame | hsfs.training_dataset.pyspark.sql.DataFrame | hsfs.training_dataset.pyspark.RDD | numpy.ndarray | List[list]: Feature data to be materialized.
  • overwrite bool: Whether to overwrite the entire data in the training dataset.
  • write_options Dict[Any, Any] | None: Additional write options as key-value pairs, defaults to {}. When using the python engine, write_options can contain the following entries:
    • key spark and value an object of type hsfs.core.job_configuration.JobConfiguration to configure the Hopsworks Job used to compute the training dataset.
    • key wait_for_job and value True or False to configure whether the insert call should return only after the Hopsworks Job has finished. By default it waits.

Returns

Job: When using the python engine, it returns the Hopsworks Job that was launched to create the training dataset.

Raises

  • hsfs.client.exceptions.RestAPIError: Unable to create training dataset metadata.

[source]

json#

TrainingDataset.json()

[source]

read#

TrainingDataset.read(split=None, read_options=None)

Read the training dataset into a dataframe.

It is also possible to read only a specific split.

Arguments

  • split: Name of the split to read, defaults to None, reading the entire training dataset. If the training dataset has splits, the split parameter is mandatory.
  • read_options: Additional read options as key/value pairs, defaults to {}.

Returns

DataFrame: The Spark DataFrame containing the feature data of the training dataset.
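The split rule above can be illustrated with a small stand-in class (not the real TrainingDataset, which returns a DataFrame): when a training dataset was created with splits, read() requires a split name.

```python
# Minimal stand-in illustrating the read() contract; the real
# TrainingDataset.read() returns a Spark or Pandas DataFrame.
class TrainingDatasetStub:
    def __init__(self, splits):
        self.splits = splits  # e.g. ["train", "test"], or [] for no splits

    def read(self, split=None):
        if self.splits and split is None:
            raise ValueError("training dataset has splits; the split parameter is mandatory")
        return f"rows({split or 'all'})"

td_stub = TrainingDatasetStub(splits=["train", "test"])
print(td_stub.read(split="train"))  # rows(train)

no_splits = TrainingDatasetStub(splits=[])
print(no_splits.read())  # rows(all)
```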


[source]

save#

TrainingDataset.save(features, write_options=None)

Materialize the training dataset to storage.

This method materializes the training dataset either from a Feature Store Query, a Spark or Pandas DataFrame, a Spark RDD, two-dimensional Python lists or Numpy ndarrays. From v2.5 onward, filters are saved along with the Query.

Engine Support

Creating training datasets from DataFrames is only supported using Spark as the engine.

Arguments

  • features hsfs.constructor.query.Query | pandas.DataFrame | hsfs.training_dataset.pyspark.sql.DataFrame | hsfs.training_dataset.pyspark.RDD | numpy.ndarray | List[list]: Feature data to be materialized.
  • write_options Dict[Any, Any] | None: Additional write options as key-value pairs, defaults to {}. When using the python engine, write_options can contain the following entries:
    • key spark and value an object of type hsfs.core.job_configuration.JobConfiguration to configure the Hopsworks Job used to compute the training dataset.
    • key wait_for_job and value True or False to configure whether the save call should return only after the Hopsworks Job has finished. By default it waits.

Returns

Job: When using the python engine, it returns the Hopsworks Job that was launched to create the training dataset.

Raises

  • hsfs.client.exceptions.RestAPIError: Unable to create training dataset metadata.
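With the python engine, write_options is a plain dictionary; the sketch below shows the two documented keys. The JobConfiguration value is only hinted at in a comment, and the save call itself is commented out because it needs a live cluster.

```python
# Hypothetical write options for save() with the python engine.
write_options = {
    "wait_for_job": False,  # return as soon as the Hopsworks Job is launched
    # "spark": JobConfiguration(...),  # optional hsfs.core.job_configuration.JobConfiguration
}
# job = td.save(query, write_options=write_options)
# With wait_for_job=False, the returned Job can be monitored from the Hopsworks UI.
```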

[source]

show#

TrainingDataset.show(n, split=None)

Show the first n rows of the training dataset.

You can specify a split from which to retrieve the rows.

Arguments

  • n int: Number of rows to show.
  • split str | None: Name of the split to show, defaults to None, showing the first rows when taking all splits together.

[source]

to_dict#

TrainingDataset.to_dict()

[source]

update_from_response_json#

TrainingDataset.update_from_response_json(json_dict)

[source]

update_statistics_config#

TrainingDataset.update_statistics_config()

Update the statistics configuration of the training dataset.

Change the statistics_config object and persist the changes by calling this method.

Returns

TrainingDataset. The updated metadata object of the training dataset.

Raises

hsfs.client.exceptions.RestAPIError.