Training Dataset#

The training dataset abstraction in Hopsworks Feature Store allows users to group a set of features (potentially from multiple different feature groups) with labels for training a model to do a particular prediction task. The training dataset is a versioned and managed dataset and is stored in HopsFS as tfrecords, parquet, csv, or tsv files.

Versioning#

Training datasets can be versioned. Data scientists should use the version to tie a training dataset to the model it was created for, as well as to the schema and the feature engineering logic of the features associated with that training dataset.

Creation#

To create a training dataset, the user supplies a Pandas, NumPy or Spark DataFrame containing features and labels, together with metadata. Once the training dataset has been created, it is discoverable in the feature registry and users can use it to train models.

create_training_dataset#

FeatureStore.create_training_dataset(
    name,
    version=None,
    description="",
    data_format="tfrecords",
    storage_connector=None,
    splits={},
    location="",
    seed=None,
    statistics_config=None,
    label=[],
)

Create a training dataset metadata object.

Lazy

This method is lazy and does not persist any metadata or feature data in the feature store on its own. To materialize the training dataset and save the feature data along with the metadata in the feature store, call the save() method with a DataFrame or Query.
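
The two-step flow described above can be sketched as follows. This is a minimal sketch, assuming `fs` is a feature store handle and `features` is a DataFrame or Query; the helper name `create_and_save`, the description text and the dataset name are hypothetical and only serve the illustration.

```python
def create_and_save(fs, features, name="sample_model"):
    """Create the metadata object lazily, then materialize it.

    `fs` is assumed to be a feature store handle and `features` a
    DataFrame or Query. This helper is illustrative, not part of hsfs.
    """
    # Step 1: only builds the metadata object, nothing is persisted yet.
    td = fs.create_training_dataset(
        name=name,
        description="Illustrative training dataset",
        data_format="csv",
    )
    # Step 2: save() writes the feature data together with the metadata.
    td.save(features)
    return td
```

Until `save()` is called, the object returned by `create_training_dataset` exists only on the client side.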

Data Formats

The feature store currently supports the following data formats for training datasets:

  1. tfrecord
  2. csv
  3. tsv
  4. parquet
  5. avro
  6. orc

The petastorm, hdf5 and npy file formats are currently not supported.

Arguments

  • name str: Name of the training dataset to create.
  • version Optional[int]: Version of the training dataset to create, defaults to None, in which case the training dataset is created with the version incremented from the last version in the feature store.
  • description Optional[str]: A string describing the contents of the training dataset to improve discoverability for Data Scientists, defaults to empty string "".
  • data_format Optional[str]: The data format used to save the training dataset, defaults to "tfrecords".
  • storage_connector Optional[hsfs.StorageConnector]: Storage connector defining the sink location for the training dataset, defaults to None, and materializes training dataset on HopsFS.
  • splits Optional[Dict[str, float]]: A dictionary defining training dataset splits to be created. Keys in the dictionary define the name of the split as str, values represent the percentage of samples in the split as float. Currently, only random splits are supported. Defaults to an empty dict {}, creating only a single training dataset without splits.
  • location Optional[str]: Path to complement the sink storage connector with, e.g. if the storage connector points to an S3 bucket, this path can be used to define a sub-directory inside the bucket to place the training dataset. Defaults to "", saving the training dataset at the root defined by the storage connector.
  • seed Optional[int]: Optionally, define a seed to create the random splits with, in order to guarantee reproducibility, defaults to None.
  • statistics_config Optional[Union[hsfs.StatisticsConfig, bool, dict]]: A configuration object, or a dictionary with keys "enabled" to generally enable descriptive statistics computation for this training dataset, "correlations" to turn on feature correlation computation and "histograms" to compute feature value frequencies. The values should be booleans indicating the setting. To fully turn off statistics computation pass statistics_config=False. Defaults to None and will compute only descriptive statistics.
  • label Optional[List[str]]: A list of feature names constituting the prediction label/feature of the training dataset. When replaying a Query during model inference, the label features can be omitted from the feature vector retrieval. Defaults to [], no label.

Returns

TrainingDataset: The training dataset metadata object.
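
To make the interaction between the `splits` and `seed` arguments concrete, here is a minimal sketch of seeded random splitting in plain Python. It is not the feature store's own implementation; the function name `assign_splits` is hypothetical, and the check that the values sum to 1.0 is an assumption made for this illustration.

```python
import random


def assign_splits(n_rows, splits, seed=None):
    """Assign each of n_rows rows to a named split at random.

    Illustrative only: it mimics the documented `splits` and `seed`
    arguments and is not the feature store's own implementation.
    """
    total = sum(splits.values())
    # Assumption for this sketch: the split values are fractions summing to 1.0.
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"split fractions must sum to 1.0, got {total}")
    rng = random.Random(seed)  # the same seed yields the same assignment
    names = list(splits)
    weights = [splits[name] for name in names]
    return [rng.choices(names, weights=weights)[0] for _ in range(n_rows)]


splits = {"train": 0.7, "test": 0.3}
first = assign_splits(1000, splits, seed=42)
second = assign_splits(1000, splits, seed=42)
# first and second are identical because both runs use the same seed.
```

Because the assignment is random, the split sizes only approximate the requested percentages, while a fixed seed makes the result repeatable.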


Tagging Training Datasets#

The feature store enables users to attach tags to training datasets in order to make them discoverable across feature stores. A tag is a simple {key: value} association that provides additional information about the data, for example its geographic origin. This is useful in an organization because it makes data easier for data scientists to discover and reduces duplicated work, for example in data preparation. The tagging feature is only available in the enterprise version.

Define tags that can be attached#

The first step is to define a set of tags that can be attached, for example “Country” to tag data as being from a certain geographic location and “Sport” to further associate a type of sport with the data.

Attach tags using the UI#

Tags can then be attached using the feature store UI or programmatically using the API.

Retrieval#

get_training_dataset#

FeatureStore.get_training_dataset(name, version=None)

Get a training dataset entity from the feature store.

Getting a training dataset from the Feature Store means getting its metadata handle so you can subsequently read the data into a Spark or Pandas DataFrame.

Arguments

  • name str: Name of the training dataset to get.
  • version Optional[int]: Version of the training dataset to retrieve, defaults to None, in which case version 1 is returned.

Returns

TrainingDataset: The training dataset metadata object.

Raises

  • RestAPIError: If unable to retrieve the training dataset from the feature store.

Properties#

data_format#

File format of the training dataset.


description#


id#

Training dataset id.


label#

The label/prediction feature of the training dataset.

Can be a composite of multiple features.


location#

Path to the training dataset location.


name#

Name of the training dataset.


query#

Query to generate this training dataset from the online feature store.


schema#

Training dataset schema.


seed#

Seed used to create the random splits.


splits#

Training dataset splits, for example train, test or eval, with their corresponding percentages.


statistics#

Get the latest computed statistics for the training dataset.


statistics_config#

Statistics configuration object defining the settings for statistics computation of the training dataset.


storage_connector#

Storage connector defining the sink location of the training dataset.


version#

Version number of the training dataset.


write_options#

User provided options to write training dataset.


Methods#

add_tag#

TrainingDataset.add_tag(name, value=None)

Attach a name/value tag to a training dataset.

A tag can consist of a name only or a name/value pair. Tag names are unique identifiers.

Arguments

  • name str: Name of the tag to be added.
  • value Optional[str]: Value of the tag to be added, defaults to None.

compute_statistics#

TrainingDataset.compute_statistics()

Recompute the statistics for the training dataset and save them to the feature store.


delete_tag#

TrainingDataset.delete_tag(name)

Delete a tag from a training dataset.

Tag names are unique identifiers.

Arguments

  • name str: Name of the tag to be removed.

get_query#

TrainingDataset.get_query(online=True, with_label=False)

Returns the query used to generate this training dataset.

Arguments

  • online bool: If True, return the query for the online storage, otherwise return the query for the offline storage. Defaults to True.
  • with_label bool: Indicator whether the query should contain features which were marked as prediction label/feature when the training dataset was created, defaults to False.

Returns

str. Query string for the chosen storage used to generate this training dataset.


get_statistics#

TrainingDataset.get_statistics(commit_time=None)

Returns the statistics for this training dataset at a specific time.

If commit_time is None, the most recent statistics are returned.

Arguments

  • commit_time Optional[str]: Commit time in the format YYYYMMDDhhmmss, defaults to None.

Returns

Statistics. Object with statistics information.
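
The commit_time string can be built from a Python datetime with the standard library. A minimal sketch, assuming only the YYYYMMDDhhmmss format stated above; the helper name `to_commit_time` is hypothetical, and time zone handling is not specified by this page and is left out.

```python
from datetime import datetime


def to_commit_time(moment):
    """Format a datetime as a YYYYMMDDhhmmss string."""
    return moment.strftime("%Y%m%d%H%M%S")


commit_time = to_commit_time(datetime(2021, 3, 4, 5, 6, 7))
# commit_time is now "20210304050607" and could be passed as
# td.get_statistics(commit_time=commit_time) on a training dataset td.
```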


get_tag#

TrainingDataset.get_tag(name=None)

Get the tags of a training dataset.

Tag names are unique identifiers. Returns all tags if no tag name is specified.

Arguments

  • name: Name of the tag to get, defaults to None.

Returns

List[Tag]. List of tags as name/value pairs.
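
The tag methods documented above (add_tag, delete_tag and get_tag) can be combined as in the following sketch. It assumes `td` is a training dataset handle in a deployment where tagging is available; the helper names `attach_tags` and `rename_tag` are hypothetical and not part of the API.

```python
def attach_tags(td, tags):
    """Attach every {name: value} pair in `tags` to the training dataset."""
    for name, value in tags.items():
        td.add_tag(name, value)
    # With no name given, get_tag() returns all tags.
    return td.get_tag()


def rename_tag(td, old_name, new_name, value=None):
    """Replace one tag with another. Tag names are unique identifiers,
    so a rename is a delete followed by an add."""
    td.delete_tag(old_name)
    td.add_tag(new_name, value)
```

For example, `attach_tags(td, {"Country": "Sweden", "Sport": "Football"})` would attach the two tags used as examples earlier on this page (the values are made up).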


insert#

TrainingDataset.insert(features, overwrite, write_options={})

Insert additional feature data into the training dataset.

This method appends data to the training dataset either from a Feature Store Query, a Spark or Pandas DataFrame, a Spark RDD, two-dimensional Python lists or Numpy ndarrays. The schemas must match for this operation.

This can also be used to overwrite all data in an existing training dataset.

Arguments

  • features Union[hsfs.Query, pandas.DataFrame, pyspark.sql.DataFrame, pyspark.RDD, numpy.ndarray, List[list]]: Feature data to be materialized.
  • overwrite bool: Whether to overwrite the entire data in the training dataset.
  • write_options Optional[Dict[Any, Any]]: Additional write options as key/value pairs. Defaults to {}.

Returns

TrainingDataset: The updated training dataset metadata object. The TrainingDataset object on which you call insert is also updated.

Raises

  • RestAPIError: Unable to create training dataset metadata.
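
The two modes of insert, appending and overwriting, differ only in the overwrite flag. A minimal sketch, assuming `td` is a training dataset handle and the new data matches the existing schema; the wrapper names are hypothetical.

```python
def append_rows(td, new_features):
    """Append feature data to the existing training dataset."""
    return td.insert(new_features, overwrite=False)


def replace_all_rows(td, features):
    """Overwrite all data in the existing training dataset."""
    return td.insert(features, overwrite=True)
```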

read#

TrainingDataset.read(split=None, read_options={})

Read the training dataset into a dataframe.

It is also possible to read only a specific split.

Arguments

  • split: Name of the split to read, defaults to None, reading the entire training dataset.
  • read_options: Additional read options as key/value pairs, defaults to {}.

Returns

DataFrame: The spark dataframe containing the feature data of the training dataset.
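
Reading each split separately can be sketched as follows. This assumes `td` is a training dataset handle that was created with splits named "train" and "test"; the helper name `read_splits` and the split names are hypothetical.

```python
def read_splits(td, split_names=("train", "test")):
    """Read each named split into its own DataFrame, keyed by split name."""
    return {name: td.read(split=name) for name in split_names}
```

Calling `td.read()` without a split name reads the entire training dataset instead.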


save#

TrainingDataset.save(features, write_options={})

Materialize the training dataset to storage.

This method materializes the training dataset either from a Feature Store Query, a Spark or Pandas DataFrame, a Spark RDD, two-dimensional Python lists or Numpy ndarrays.

Arguments

  • features Union[hsfs.Query, pandas.DataFrame, pyspark.sql.DataFrame, pyspark.RDD, numpy.ndarray, List[list]]: Feature data to be materialized.
  • write_options Optional[Dict[Any, Any]]: Additional write options as key/value pairs. Defaults to {}.

Returns

TrainingDataset: The updated training dataset metadata object. The TrainingDataset object on which you call save is also updated.

Raises

  • RestAPIError: Unable to create training dataset metadata.

show#

TrainingDataset.show(n, split=None)

Show the first n rows of the training dataset.

You can specify a split from which to retrieve the rows.

Arguments

  • n int: Number of rows to show.
  • split Optional[str]: Name of the split to show, defaults to None, showing the first rows when taking all splits together.

tf_data#

TrainingDataset.tf_data(
    target_name,
    split=None,
    feature_names=None,
    var_len_features=[],
    is_training=True,
    cycle_length=2,
)

Returns an object with utility methods to read the training dataset as a tf.data.Dataset object and handle it for further processing.

Arguments

  • target_name str: Name of the target variable.
  • split Optional[str]: Name of training dataset split. For example, "train", "test" or "val", defaults to None, returning the full training dataset.
  • feature_names Optional[list]: Names of training variables, defaults to None.
  • var_len_features Optional[list]: Feature names that have variable length and need to be returned as tf.io.VarLenFeature, defaults to [].
  • is_training Optional[bool]: Whether the dataset is used for training (True) or for testing or validation (False). Defaults to True.
  • cycle_length Optional[int]: Number of files to be read and deserialized in parallel, defaults to 2.

Returns

TFDataEngine. An object with utility methods to generate and handle tf.data.Dataset object.


TFData engine#

tf_record_dataset#

TFDataEngine.tf_record_dataset(
    batch_size=None,
    num_epochs=None,
    one_hot_encode_labels=False,
    num_classes=None,
    process=False,
    serialized_ndarray_fname=[],
)

Reads tfrecord files and returns a ParallelMapDataset or PrefetchDataset object, depending on whether process is set to False or True, respectively.

If process is set to False, the returned ParallelMapDataset object can be further processed by the user, for example by applying custom transformations to features, batching or caching. Setting process=True returns a PrefetchDataset object that contains tuples of feature vector and label, already batched and ready to be fed into model training.

Example of using tf_record_dataset:

import hsfs

connection = hsfs.connection()
fs = connection.get_feature_store()
td = fs.get_training_dataset("sample_model", 3)
# process=True returns a batched PrefetchDataset of (feature vector, label) tuples
dataset = td.tf_data(target_name="id").tf_record_dataset(batch_size=1, num_epochs=1, process=True)

Arguments

  • batch_size Optional[int]: Size of batch, defaults to None.
  • num_epochs Optional[int]: Number of epochs to train, defaults to None.
  • one_hot_encode_labels Optional[bool]: If set to True, the labels are one-hot encoded, defaults to False.
  • num_classes Optional[int]: Number of target classes, to be provided if one_hot_encode_labels is True, defaults to None.
  • process Optional[bool]: If set to True, the API will optimise the tf data read operation and return feature vectors for a model with a single input, defaults to False.
  • serialized_ndarray_fname Optional[list]: Names of features that contain serialised multi dimensional arrays, defaults to [].

Returns

PrefetchDataset. If process is set to True.
ParallelMapDataset. If process is set to False.
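
To illustrate why num_classes is needed when one_hot_encode_labels is True, here is a plain Python sketch of what one-hot encoding does to a single label. It is not the TFDataEngine implementation; the function name `one_hot` is hypothetical, and the labels are assumed to be integer class indices starting at 0.

```python
def one_hot(label, num_classes):
    """Return a vector of length num_classes with a 1.0 at index `label`.

    Assumes integer class indices in the range [0, num_classes).
    """
    if not 0 <= label < num_classes:
        raise ValueError(f"label {label} is outside [0, {num_classes})")
    return [1.0 if index == label else 0.0 for index in range(num_classes)]


encoded = one_hot(2, 4)
# encoded is [0.0, 0.0, 1.0, 0.0]
```

The length of the encoded vector is fixed by the number of classes, which cannot be inferred reliably from a single batch of labels.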


tf_csv_dataset#

TFDataEngine.tf_csv_dataset(
    batch_size=None, num_epochs=None, one_hot_encode_labels=False, num_classes=None, process=False
)

Reads csv files and returns a CsvDatasetV2 or PrefetchDataset object, depending on whether process is set to False or True, respectively.

If process is set to False, the returned CsvDatasetV2 object can be further processed by the user, for example by applying custom transformations to features, batching or caching. Setting process=True returns a PrefetchDataset object that contains tuples of feature vector and label, already batched and ready to be fed into model training.

Example of using tf_csv_dataset:

import hsfs

connection = hsfs.connection()
fs = connection.get_feature_store()
td = fs.get_training_dataset("sample_model", 1)
# process=True returns a batched PrefetchDataset of (feature vector, label) tuples
dataset = td.tf_data(target_name="id").tf_csv_dataset(batch_size=1, num_epochs=1, process=True)

Arguments

  • batch_size Optional[int]: Size of batch, defaults to None.
  • num_epochs Optional[int]: Number of epochs to train, defaults to None.
  • one_hot_encode_labels Optional[bool]: If set to True, the labels are one-hot encoded, defaults to False.
  • num_classes Optional[int]: Number of target classes, to be provided if one_hot_encode_labels is True, defaults to None.
  • process Optional[bool]: If set to True, the API will optimise the tf data read operation and return feature vectors for a model with a single input, defaults to False.
  • serialized_ndarray_fname: Names of features that contain serialised multi dimensional arrays, defaults to [].

Returns

PrefetchDataset. If process is set to True.
CsvDatasetV2. If process is set to False.