hsfs.training_dataset #

TrainingDataset #

Bases: TrainingDatasetBase

id property writable #

Training dataset id.

write_options property writable #

User provided options to write training dataset.

schema property writable #

Training dataset schema.

statistics property #

statistics: Statistics

Get computed statistics for the training dataset.

query property #

Query used to generate this training dataset from the online feature store.

label property writable #

label: str | list[str]

The label/prediction feature of the training dataset.

Can be a composite of multiple features.

feature_store_id property #

feature_store_id: int

ID of the feature store to which this training dataset belongs.

feature_store_name property #

feature_store_name: str

Name of the feature store in which the training dataset is located.

serving_keys property #

serving_keys: set[str]

Set of primary key names used as keys in the input dict for the get_serving_vector method.

save #

save(
    features: query_module.Query
    | pd.DataFrame
    | TypeVar("pyspark.sql.DataFrame")
    | TypeVar("pyspark.RDD")
    | np.ndarray
    | list[list],
    write_options: dict[Any, Any] | None = None,
) -> Job

Materialize the training dataset to storage.

This method materializes the training dataset either from a Feature Store Query, a Spark or Pandas DataFrame, a Spark RDD, two-dimensional Python lists or Numpy ndarrays. From v2.5 onward, filters are saved along with the Query.

Engine Support

Creating training datasets from dataframes is only supported using Spark as the engine.

PARAMETER DESCRIPTION
features

Feature data to be materialized.

TYPE: query_module.Query | pd.DataFrame | TypeVar('pyspark.sql.DataFrame') | TypeVar('pyspark.RDD') | np.ndarray | list[list]

write_options

Additional write options as key-value pairs, defaults to {}. When using the python engine, write_options can contain the following entries:

* Key spark with an object of type hsfs.core.job_configuration.JobConfiguration as value, to configure the Hopsworks Job used to compute the training dataset.
* Key wait_for_job with value True or False, to configure whether the save call should return only after the Hopsworks Job has finished. By default it waits.

TYPE: dict[Any, Any] | None DEFAULT: None

RETURNS DESCRIPTION
Job

When using the python engine, it returns the Hopsworks Job that was launched to create the training dataset.

RAISES DESCRIPTION
hopsworks.client.exceptions.RestAPIError

Unable to create training dataset metadata.
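As an illustrative sketch (not taken from this page), a `save` call might look as follows. The names `fs`, `query`, and `"sales_td"` are hypothetical and assume an existing Hopsworks connection; only the `write_options` mapping is plain Python:

```python
# Hypothetical sketch: `fs` and `query` are assumed to come from an
# existing Hopsworks connection (e.g. fs = project.get_feature_store()).
#
# td = fs.create_training_dataset(name="sales_td", version=1)
# job = td.save(query, write_options=write_options)

# The write_options mapping itself is an ordinary dict; with the python
# engine it can carry the entries described above:
write_options = {
    # "spark": JobConfiguration(...),  # optional Hopsworks job config
    "wait_for_job": False,  # return as soon as the job is launched
}
```

Setting `wait_for_job` to False is useful when launching several materialization jobs and polling the returned Job objects afterwards.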

insert #

insert(
    features: query_module.Query
    | pd.DataFrame
    | TypeVar("pyspark.sql.DataFrame")
    | TypeVar("pyspark.RDD")
    | np.ndarray
    | list[list],
    overwrite: bool,
    write_options: dict[Any, Any] | None = None,
) -> Job

Insert additional feature data into the training dataset.

Deprecated

The insert method is deprecated.

This method appends data to the training dataset either from a Feature Store Query, a Spark or Pandas DataFrame, a Spark RDD, two-dimensional Python lists or Numpy ndarrays. The schemas must match for this operation.

This can also be used to overwrite all data in an existing training dataset.

PARAMETER DESCRIPTION
features

Feature data to be materialized.

TYPE: query_module.Query | pd.DataFrame | TypeVar('pyspark.sql.DataFrame') | TypeVar('pyspark.RDD') | np.ndarray | list[list]

overwrite

Whether to overwrite the entire data in the training dataset.

TYPE: bool

write_options

Additional write options as key-value pairs, defaults to {}. When using the python engine, write_options can contain the following entries:

* Key spark with an object of type hsfs.core.job_configuration.JobConfiguration as value, to configure the Hopsworks Job used to compute the training dataset.
* Key wait_for_job with value True or False, to configure whether the insert call should return only after the Hopsworks Job has finished. By default it waits.

TYPE: dict[Any, Any] | None DEFAULT: None

RETURNS DESCRIPTION
Job

When using the python engine, it returns the Hopsworks Job that was launched to create the training dataset.

RAISES DESCRIPTION
hopsworks.client.exceptions.RestAPIError

Unable to create training dataset metadata.

read #

read(
    split: str | None = None,
    read_options: dict | None = None,
) -> TypeVar("pyspark.sql.DataFrame")

Read the training dataset into a dataframe.

It is also possible to read only a specific split.

PARAMETER DESCRIPTION
split

Name of the split to read; by default reads the entire training dataset. If the training dataset has splits, the split parameter is mandatory.

TYPE: str | None DEFAULT: None

read_options

Additional read options as key/value pairs, defaults to {}.

TYPE: dict | None DEFAULT: None

RETURNS DESCRIPTION
TypeVar('pyspark.sql.DataFrame')

The spark dataframe containing the feature data of the training dataset.
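A hypothetical usage sketch: `td` and the split name "train" are assumptions, not from this page, and the API calls are left as comments because they require a live feature store:

```python
# Hypothetical sketch: `td` is assumed to be a TrainingDataset fetched
# from the feature store, e.g. td = fs.get_training_dataset("sales_td", version=1).
#
# df_all = td.read()                  # the entire training dataset
# df_train = td.read(split="train")   # mandatory if the dataset has splits

# read_options is a plain key/value mapping passed through to the reader;
# the "header" entry below is illustrative only:
read_options = {"header": "true"}
```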

compute_statistics #

compute_statistics()

Compute the statistics for the training dataset and save them to the feature store.

show #

show(n: int, split: str | None = None)

Show the first n rows of the training dataset.

You can specify a split from which to retrieve the rows.

PARAMETER DESCRIPTION
n

Number of rows to show.

TYPE: int

split

Name of the split to show; defaults to None, which shows the first rows across all splits.

TYPE: str | None DEFAULT: None

add_tag #

add_tag(name: str, value: Any)

Attach a tag to a training dataset.

A tag consists of a name-value pair. Tag names are unique identifiers across the whole cluster. The value of a tag can be any valid JSON - primitives, arrays or JSON objects.

PARAMETER DESCRIPTION
name

Name of the tag to be added.

TYPE: str

value

Value of the tag to be added.

TYPE: Any

RAISES DESCRIPTION
hopsworks.client.exceptions.RestAPIError

in case the backend fails to add the tag.

delete_tag #

delete_tag(name: str)

Delete a tag attached to a training dataset.

PARAMETER DESCRIPTION
name

Name of the tag to be removed.

TYPE: str

RAISES DESCRIPTION
hopsworks.client.exceptions.RestAPIError

in case the backend fails to delete the tag.

get_tag #

get_tag(name: str) -> Any

Get a tag attached to a training dataset.

PARAMETER DESCRIPTION
name

Name of the tag to get.

TYPE: str

RETURNS DESCRIPTION
Any

The tag value.

RAISES DESCRIPTION
hopsworks.client.exceptions.RestAPIError

in case the backend fails to retrieve the tag.

get_tags #

get_tags() -> dict[str, tag.Tag]

Returns all tags attached to a training dataset.

RETURNS DESCRIPTION
dict[str, tag.Tag]

Dictionary of tags.

RAISES DESCRIPTION
hopsworks.client.exceptions.RestAPIError

in case the backend fails to retrieve the tags.
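The four tag methods above combine into a simple lifecycle. A hedged sketch, in which `td` is an existing TrainingDataset and the tag name "governance" and its value are made up for illustration:

```python
# Hypothetical sketch: `td` is an existing TrainingDataset object.
# Tag values can be any valid JSON: primitives, arrays or objects.
tag_value = {"owner": "data-team", "contains_pii": False}

# td.add_tag("governance", tag_value)     # attach the tag
# stored = td.get_tag("governance")       # read one tag value back
# all_tags = td.get_tags()                # dict of name -> Tag
# td.delete_tag("governance")             # remove it again
```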

update_statistics_config #

update_statistics_config() -> TrainingDataset

Update the statistics configuration of the training dataset.

Change the statistics_config object and persist the changes by calling this method.

RETURNS DESCRIPTION
TrainingDataset

The updated metadata object of the training dataset.

RAISES DESCRIPTION
hopsworks.client.exceptions.RestAPIError

in case the backend encounters an issue.
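A sketch of the intended flow: mutate the statistics_config object, then persist it. The attribute names `histograms` and `correlations` are assumptions based on the hsfs StatisticsConfig object, not taken from this page:

```python
# Hypothetical sketch: adjust the statistics configuration of an
# existing TrainingDataset `td` and persist the change.
#
# td.statistics_config.histograms = True    # assumed attribute name
# td.statistics_config.correlations = True  # assumed attribute name
# td = td.update_statistics_config()        # returns updated metadata

# The desired settings as plain data, for illustration:
desired_config = {"histograms": True, "correlations": True}
```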

delete #

delete()

Delete training dataset and all associated metadata.

Drops only HopsFS data

Note that this operation drops only files which were materialized in HopsFS. If you used a Storage Connector for a cloud storage such as S3, the data will not be deleted, but you will not be able to track it anymore from the Feature Store.

Potentially dangerous operation

This operation drops all metadata associated with this version of the training dataset and the materialized data in HopsFS.

RAISES DESCRIPTION
hopsworks.client.exceptions.RestAPIError

In case of a server error.

get_query #

get_query(
    online: bool = True, with_label: bool = False
) -> str | None

Returns the query used to generate this training dataset.

PARAMETER DESCRIPTION
online

Whether to return the query for the online storage (True) or the offline storage (False); defaults to True.

TYPE: bool DEFAULT: True

with_label

Whether the query should include features that were marked as prediction label/feature when the training dataset was created; defaults to False.

TYPE: bool DEFAULT: False

RETURNS DESCRIPTION
str | None

Query string for the chosen storage used to generate this training dataset.
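A hedged sketch of retrieving both query variants; `td` is an assumed existing TrainingDataset, and the calls stay commented since they need a connection:

```python
# Hypothetical sketch: `td` is an existing TrainingDataset.
#
# online_sql = td.get_query()                               # online storage (default)
# offline_sql = td.get_query(online=False, with_label=True) # offline, with label columns

# The two flags are plain booleans:
query_args = {"online": False, "with_label": True}
```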

init_prepared_statement #

init_prepared_statement(
    batch: bool | None = None, external: bool | None = None
)

Initialise and cache parametrized prepared statements to retrieve feature vectors from the online feature store.

PARAMETER DESCRIPTION
batch

If set to True, prepared statements will be initialised for retrieving serving vectors as a batch.

TYPE: bool | None DEFAULT: None

external

If set to True, the connection to the online feature store is established using the same host as the host parameter of the hopsworks.login method. If set to False, the online feature store storage connector is used, which relies on the private IP. Defaults to True if the connection to Hopsworks is established from an external environment (e.g. AWS SageMaker or Google Colab), otherwise to False.

TYPE: bool | None DEFAULT: None

get_serving_vector #

get_serving_vector(
    entry: dict[str, Any], external: bool | None = None
) -> list

Returns assembled serving vector from online feature store.

PARAMETER DESCRIPTION
entry

Dictionary with the primary key names of the training dataset's feature groups as keys, and the values provided by the serving application.

TYPE: dict[str, Any]

external

If set to True, the connection to the online feature store is established using the same host as the host parameter of the hopsworks.login method. If set to False, the online feature store storage connector is used, which relies on the private IP. Defaults to True if the connection to Hopsworks is established from an external environment (e.g. AWS SageMaker or Google Colab), otherwise to False.

TYPE: bool | None DEFAULT: None

RETURNS DESCRIPTION
list

List of feature values related to the provided primary keys, ordered according to the positions of these features in the training dataset query.

get_serving_vectors #

get_serving_vectors(
    entry: dict[str, list[Any]],
    external: bool | None = None,
) -> list[list]

Returns assembled serving vectors in batches from online feature store.

PARAMETER DESCRIPTION
entry

Dictionary with feature group primary key names as keys, and lists of primary key values provided by the serving application as values.

TYPE: dict[str, list[Any]]

external

If set to True, the connection to the online feature store is established using the same host as the host parameter of the hopsworks.login method. If set to False, the online feature store storage connector is used, which relies on the private IP. Defaults to True if the connection to Hopsworks is established from an external environment (e.g. AWS SageMaker or Google Colab), otherwise to False.

TYPE: bool | None DEFAULT: None

RETURNS DESCRIPTION
list[list]

List of lists of feature values related to the provided primary keys, ordered according to the positions of these features in the training dataset query.
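The single and batch lookups share the same entry structure, with scalar versus list values. A hedged sketch; `td` and the primary key name `customer_id` are assumptions for illustration:

```python
# Hypothetical sketch: `td` is an existing TrainingDataset whose feature
# groups share the primary key `customer_id` (an assumed name).
single_entry = {"customer_id": 42}
batch_entry = {"customer_id": [42, 43, 44]}

# vector = td.get_serving_vector(single_entry)    # one feature vector (list)
# vectors = td.get_serving_vectors(batch_entry)   # list of feature vectors
```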