Feature View#
FeatureView#
hsfs.feature_view.FeatureView(
name,
query,
featurestore_id,
id=None,
version=None,
description="",
labels=[],
transformation_functions={},
)
Creation#
create_feature_view#
FeatureStore.create_feature_view(
name, query, version=None, description="", labels=[], transformation_functions={}
)
Create a feature view metadata object and save it to Hopsworks.
Arguments
- name `str`: Name of the feature view to create.
- query `hsfs.constructor.query.Query`: Feature store `Query`.
- version `Optional[int]`: Version of the feature view to create, defaults to `None` and will create the feature view with a version incremented from the last version in the feature store.
- description `Optional[str]`: A string describing the contents of the feature view to improve discoverability for Data Scientists, defaults to empty string `""`.
- labels `Optional[List[str]]`: A list of feature names constituting the prediction label/feature of the feature view. When replaying a `Query` during model inference, the label features can be omitted from the feature vector retrieval. Defaults to `[]`, no label.
- transformation_functions `Optional[Dict[str, hsfs.transformation_function.TransformationFunction]]`: A dictionary mapping transformation functions to the features they should be applied to before writing out the vector and at inference time. Defaults to `{}`, no transformations.
Returns
FeatureView
: The feature view metadata object.
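Example
A minimal usage sketch, assuming an established `hsfs.connection()` and a hypothetical feature group named "transactions" with a "fraud_label" feature (all names here are illustrative):
fs = conn.get_feature_store()  # conn = hsfs.connection()
trans_fg = fs.get_feature_group("transactions", version=1)  # hypothetical feature group
feature_view = fs.create_feature_view(
    name="transactions_view",
    query=trans_fg.select_all(),
    version=1,
    description="Transaction features for fraud prediction",
    labels=["fraud_label"],
)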
Retrieval#
get_feature_view#
FeatureStore.get_feature_view(name, version=None)
Get a feature view entity from the feature store.
Getting a feature view from the Feature Store means getting its metadata.
Arguments
- name `str`: Name of the feature view to get.
- version `Optional[int]`: Version of the feature view to retrieve, defaults to `None` and will return `version=1`.
Returns
FeatureView
: The feature view metadata object.
Raises
RestAPIError
: If unable to retrieve feature view from the feature store.
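Example
A usage sketch, assuming the hypothetical feature view created above:
feature_view = fs.get_feature_view("transactions_view", version=1)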
get_feature_views#
FeatureStore.get_feature_views(name)
Get a list of all versions of a feature view entity from the feature store.
Getting a feature view from the Feature Store means getting its metadata.
Arguments
- name: Name of the feature view to get.
Returns
`List[FeatureView]`: List of feature view metadata objects.
Raises
RestAPIError
: If unable to retrieve feature view from the feature store.
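Example
A usage sketch listing every version of a hypothetical feature view:
feature_views = fs.get_feature_views("transactions_view")
for fv in feature_views:
    print(fv.name, fv.version)  # properties documented below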
Properties#
description#
featurestore_id#
Feature store id.
id#
Feature view id.
labels#
The labels/prediction feature of the feature view.
Can be a composite of multiple features.
name#
Name of the feature view.
query#
schema#
Feature view schema.
transformation_functions#
Transformation functions of the feature view.
version#
Version number of the feature view.
Methods#
add_tag#
FeatureView.add_tag(name, value)
add_training_dataset_tag#
FeatureView.add_training_dataset_tag(training_dataset_version, name, value)
clean#
FeatureView.clean(feature_store_id, feature_view_name, feature_view_version)
Delete the feature view and all associated metadata.
Potentially dangerous operation
This operation drops all metadata associated with this version of the feature view, as well as related training datasets and materialized data in HopsFS.
Arguments
- feature_store_id `int`: Id of the feature store.
- feature_view_name `str`: Name of the feature view.
- feature_view_version `str`: Version of the feature view.
Raises
`RestAPIError`.
create_train_test_split#
FeatureView.create_train_test_split(
test_size=None,
train_start="",
train_end="",
test_start="",
test_end="",
storage_connector=None,
location="",
description="",
data_format="csv",
coalesce=False,
seed=None,
statistics_config=None,
write_options={},
)
Create a training dataset and save data into `location`.
Data Formats
The feature store currently supports the following data formats for training datasets:
- tfrecord
- csv
- tsv
- parquet
- avro
- orc
The petastorm, hdf5 and npy file formats are currently not supported.
Arguments
- test_size `Optional[float]`: Size of the test set.
- train_start `Optional[Union[str, int, datetime.datetime, datetime.date]]`: datetime.datetime, datetime.date, unix timestamp in seconds (int), or string. The string should be formatted in one of the following formats: `%Y%m%d`, `%Y%m%d%H`, `%Y%m%d%H%M`, `%Y%m%d%H%M%S`, or `%Y%m%d%H%M%S%f`.
- train_end `Optional[Union[str, int, datetime.datetime, datetime.date]]`: datetime.datetime, datetime.date, unix timestamp in seconds (int), or string. The string should be formatted in one of the following formats: `%Y%m%d`, `%Y%m%d%H`, `%Y%m%d%H%M`, `%Y%m%d%H%M%S`, or `%Y%m%d%H%M%S%f`.
- test_start `Optional[Union[str, int, datetime.datetime, datetime.date]]`: datetime.datetime, datetime.date, unix timestamp in seconds (int), or string. The string should be formatted in one of the following formats: `%Y%m%d`, `%Y%m%d%H`, `%Y%m%d%H%M`, `%Y%m%d%H%M%S`, or `%Y%m%d%H%M%S%f`.
- test_end `Optional[Union[str, int, datetime.datetime, datetime.date]]`: datetime.datetime, datetime.date, unix timestamp in seconds (int), or string. The string should be formatted in one of the following formats: `%Y%m%d`, `%Y%m%d%H`, `%Y%m%d%H%M`, `%Y%m%d%H%M%S`, or `%Y%m%d%H%M%S%f`.
- storage_connector `Optional[hsfs.StorageConnector]`: Storage connector defining the sink location for the training dataset, defaults to `None`, and materializes the training dataset on HopsFS.
- location `Optional[str]`: Path to complement the sink storage connector with, e.g. if the storage connector points to an S3 bucket, this path can be used to define a sub-directory inside the bucket to place the training dataset. Defaults to `""`, saving the training dataset at the root defined by the storage connector.
- description `Optional[str]`: A string describing the contents of the training dataset to improve discoverability for Data Scientists, defaults to empty string `""`.
- data_format `Optional[str]`: The data format used to save the training dataset, defaults to `"csv"` format.
- coalesce `Optional[bool]`: If true the training dataset data will be coalesced into a single partition before writing. The resulting training dataset will be a single file per split. Defaults to False.
- seed `Optional[int]`: Optionally, define a seed to create the random splits with, in order to guarantee reproducibility, defaults to `None`.
- statistics_config `Optional[Union[hsfs.StatisticsConfig, bool, dict]]`: A configuration object, or a dictionary with keys `"enabled"` to generally enable descriptive statistics computation for this feature group, `"correlations"` to turn on feature correlation computation and `"histograms"` to compute feature value frequencies. The values should be booleans indicating the setting. To fully turn off statistics computation pass `statistics_config=False`. Defaults to `None` and will compute only descriptive statistics.
- write_options `Optional[Dict[Any, Any]]`: Additional write options as key-value pairs, defaults to `{}`. When using the `python` engine, write_options can contain the following entries:
  - key `spark` and value an object of type `hsfs.core.job_configuration.JobConfiguration` to configure the Hopsworks Job used to compute the training dataset.
  - key `wait_for_job` and value `True` or `False` to configure whether the save call should return only after the Hopsworks Job has finished. By default it waits.
Returns
(td_version, `Job`): Tuple of training dataset version and job. When using the `python` engine, it returns the Hopsworks `Job` that was launched to create the training dataset.
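Example
A sketch of a time-based split, assuming a `feature_view` object and event-time bounds given as date strings in one of the formats listed above:
td_version, job = feature_view.create_train_test_split(
    train_start="20220101",
    train_end="20220301",
    test_start="20220301",
    test_end="20220401",
    data_format="parquet",
    description="Q1 2022 train/test split",
)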
create_train_validation_test_split#
FeatureView.create_train_validation_test_split(
validation_size=None,
test_size=None,
train_start="",
train_end="",
validation_start="",
validation_end="",
test_start="",
test_end="",
storage_connector=None,
location="",
description="",
data_format="csv",
coalesce=False,
seed=None,
statistics_config=None,
write_options={},
)
Create a training dataset and save data into `location`.
Data Formats
The feature store currently supports the following data formats for training datasets:
- tfrecord
- csv
- tsv
- parquet
- avro
- orc
The petastorm, hdf5 and npy file formats are currently not supported.
Arguments
- validation_size `Optional[float]`: Size of the validation set.
- test_size `Optional[float]`: Size of the test set.
- train_start `Optional[Union[str, int, datetime.datetime, datetime.date]]`: datetime.datetime, datetime.date, unix timestamp in seconds (int), or string. The string should be formatted in one of the following formats: `%Y%m%d`, `%Y%m%d%H`, `%Y%m%d%H%M`, `%Y%m%d%H%M%S`, or `%Y%m%d%H%M%S%f`.
- train_end `Optional[Union[str, int, datetime.datetime, datetime.date]]`: datetime.datetime, datetime.date, unix timestamp in seconds (int), or string. The string should be formatted in one of the following formats: `%Y%m%d`, `%Y%m%d%H`, `%Y%m%d%H%M`, `%Y%m%d%H%M%S`, or `%Y%m%d%H%M%S%f`.
- validation_start `Optional[Union[str, int, datetime.datetime, datetime.date]]`: datetime.datetime, datetime.date, unix timestamp in seconds (int), or string. The string should be formatted in one of the following formats: `%Y%m%d`, `%Y%m%d%H`, `%Y%m%d%H%M`, `%Y%m%d%H%M%S`, or `%Y%m%d%H%M%S%f`.
- validation_end `Optional[Union[str, int, datetime.datetime, datetime.date]]`: datetime.datetime, datetime.date, unix timestamp in seconds (int), or string. The string should be formatted in one of the following formats: `%Y%m%d`, `%Y%m%d%H`, `%Y%m%d%H%M`, `%Y%m%d%H%M%S`, or `%Y%m%d%H%M%S%f`.
- test_start `Optional[Union[str, int, datetime.datetime, datetime.date]]`: datetime.datetime, datetime.date, unix timestamp in seconds (int), or string. The string should be formatted in one of the following formats: `%Y%m%d`, `%Y%m%d%H`, `%Y%m%d%H%M`, `%Y%m%d%H%M%S`, or `%Y%m%d%H%M%S%f`.
- test_end `Optional[Union[str, int, datetime.datetime, datetime.date]]`: datetime.datetime, datetime.date, unix timestamp in seconds (int), or string. The string should be formatted in one of the following formats: `%Y%m%d`, `%Y%m%d%H`, `%Y%m%d%H%M`, `%Y%m%d%H%M%S`, or `%Y%m%d%H%M%S%f`.
- storage_connector `Optional[hsfs.StorageConnector]`: Storage connector defining the sink location for the training dataset, defaults to `None`, and materializes the training dataset on HopsFS.
- location `Optional[str]`: Path to complement the sink storage connector with, e.g. if the storage connector points to an S3 bucket, this path can be used to define a sub-directory inside the bucket to place the training dataset. Defaults to `""`, saving the training dataset at the root defined by the storage connector.
- description `Optional[str]`: A string describing the contents of the training dataset to improve discoverability for Data Scientists, defaults to empty string `""`.
- data_format `Optional[str]`: The data format used to save the training dataset, defaults to `"csv"` format.
- coalesce `Optional[bool]`: If true the training dataset data will be coalesced into a single partition before writing. The resulting training dataset will be a single file per split. Defaults to False.
- seed `Optional[int]`: Optionally, define a seed to create the random splits with, in order to guarantee reproducibility, defaults to `None`.
- statistics_config `Optional[Union[hsfs.StatisticsConfig, bool, dict]]`: A configuration object, or a dictionary with keys `"enabled"` to generally enable descriptive statistics computation for this feature group, `"correlations"` to turn on feature correlation computation and `"histograms"` to compute feature value frequencies. The values should be booleans indicating the setting. To fully turn off statistics computation pass `statistics_config=False`. Defaults to `None` and will compute only descriptive statistics.
- write_options `Optional[Dict[Any, Any]]`: Additional write options as key-value pairs, defaults to `{}`. When using the `python` engine, write_options can contain the following entries:
  - key `spark` and value an object of type `hsfs.core.job_configuration.JobConfiguration` to configure the Hopsworks Job used to compute the training dataset.
  - key `wait_for_job` and value `True` or `False` to configure whether the save call should return only after the Hopsworks Job has finished. By default it waits.
Returns
(td_version, `Job`): Tuple of training dataset version and job. When using the `python` engine, it returns the Hopsworks `Job` that was launched to create the training dataset.
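Example
A sketch of a random three-way split, assuming a `feature_view` object; sizes are fractions of the dataset:
td_version, job = feature_view.create_train_validation_test_split(
    validation_size=0.2,
    test_size=0.1,
    seed=42,
    data_format="tfrecord",
)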
create_training_data#
FeatureView.create_training_data(
start_time="",
end_time="",
storage_connector=None,
location="",
description="",
data_format="csv",
coalesce=False,
seed=None,
statistics_config=None,
write_options={},
)
Create a training dataset and save data into `location`.
Data Formats
The feature store currently supports the following data formats for training datasets:
- tfrecord
- csv
- tsv
- parquet
- avro
- orc
The petastorm, hdf5 and npy file formats are currently not supported.
Arguments
- start_time `Optional[Union[str, int, datetime.datetime, datetime.date]]`: datetime.datetime, datetime.date, unix timestamp in seconds (int), or string. The string should be formatted in one of the following formats: `%Y%m%d`, `%Y%m%d%H`, `%Y%m%d%H%M`, `%Y%m%d%H%M%S`, or `%Y%m%d%H%M%S%f`.
- end_time `Optional[Union[str, int, datetime.datetime, datetime.date]]`: datetime.datetime, datetime.date, unix timestamp in seconds (int), or string. The string should be formatted in one of the following formats: `%Y%m%d`, `%Y%m%d%H`, `%Y%m%d%H%M`, `%Y%m%d%H%M%S`, or `%Y%m%d%H%M%S%f`.
- storage_connector `Optional[hsfs.StorageConnector]`: Storage connector defining the sink location for the training dataset, defaults to `None`, and materializes the training dataset on HopsFS.
- location `Optional[str]`: Path to complement the sink storage connector with, e.g. if the storage connector points to an S3 bucket, this path can be used to define a sub-directory inside the bucket to place the training dataset. Defaults to `""`, saving the training dataset at the root defined by the storage connector.
- description `Optional[str]`: A string describing the contents of the training dataset to improve discoverability for Data Scientists, defaults to empty string `""`.
- data_format `Optional[str]`: The data format used to save the training dataset, defaults to `"csv"` format.
- coalesce `Optional[bool]`: If true the training dataset data will be coalesced into a single partition before writing. The resulting training dataset will be a single file per split. Defaults to False.
- seed `Optional[int]`: Optionally, define a seed to create the random splits with, in order to guarantee reproducibility, defaults to `None`.
- statistics_config `Optional[Union[hsfs.StatisticsConfig, bool, dict]]`: A configuration object, or a dictionary with keys `"enabled"` to generally enable descriptive statistics computation for this feature group, `"correlations"` to turn on feature correlation computation and `"histograms"` to compute feature value frequencies. The values should be booleans indicating the setting. To fully turn off statistics computation pass `statistics_config=False`. Defaults to `None` and will compute only descriptive statistics.
- write_options `Optional[Dict[Any, Any]]`: Additional write options as key-value pairs, defaults to `{}`. When using the `python` engine, write_options can contain the following entries:
  - key `spark` and value an object of type `hsfs.core.job_configuration.JobConfiguration` to configure the Hopsworks Job used to compute the training dataset.
  - key `wait_for_job` and value `True` or `False` to configure whether the save call should return only after the Hopsworks Job has finished. By default it waits.
Returns
(td_version, `Job`): Tuple of training dataset version and job. When using the `python` engine, it returns the Hopsworks `Job` that was launched to create the training dataset.
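Example
A sketch of materializing an unsplit training dataset over an event-time range, assuming a `feature_view` object:
td_version, job = feature_view.create_training_data(
    start_time="20220101",
    end_time="20220630",
    description="H1 2022 training data",
    data_format="csv",
)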
delete#
FeatureView.delete()
Delete the current feature view and all associated metadata.
Potentially dangerous operation
This operation drops all metadata associated with this version of the feature view, as well as related training datasets and materialized data in HopsFS.
Raises
`RestAPIError`.
delete_all_training_datasets#
FeatureView.delete_all_training_datasets()
delete_tag#
FeatureView.delete_tag(name)
delete_training_dataset#
FeatureView.delete_training_dataset(version)
delete_training_dataset_tag#
FeatureView.delete_training_dataset_tag(training_dataset_version, name)
from_response_json#
FeatureView.from_response_json(json_dict)
get_batch_data#
FeatureView.get_batch_data(start_time=None, end_time=None, read_options=None)
Get a batch of data from an event time interval.
Arguments
- start_time: datetime.datetime, datetime.date, unix timestamp in seconds (int), or string. The string should be formatted in one of the following formats: `%Y%m%d`, `%Y%m%d%H`, `%Y%m%d%H%M`, `%Y%m%d%H%M%S`, or `%Y%m%d%H%M%S%f`.
- end_time: datetime.datetime, datetime.date, unix timestamp in seconds (int), or string. The string should be formatted in one of the following formats: `%Y%m%d`, `%Y%m%d%H`, `%Y%m%d%H%M`, `%Y%m%d%H%M%S`, or `%Y%m%d%H%M%S%f`.
- read_options: User provided read options. Defaults to `{}`.
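Example
A sketch of reading a batch of inference data for April 2022, assuming a `feature_view` object:
df = feature_view.get_batch_data(start_time="20220401", end_time="20220501")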
get_batch_query#
FeatureView.get_batch_query(start_time=None, end_time=None)
Get the query string of the batch query.
Arguments
- start_time `Optional[Union[str, int, datetime.datetime, datetime.date]]`: Optional. Start time of the batch query. datetime.datetime, datetime.date, unix timestamp in seconds (int), or string. The string should be formatted in one of the following formats: `%Y%m%d`, `%Y%m%d%H`, `%Y%m%d%H%M`, `%Y%m%d%H%M%S`, or `%Y%m%d%H%M%S%f`.
- end_time `Optional[Union[str, int, datetime.datetime, datetime.date]]`: Optional. End time of the batch query. datetime.datetime, datetime.date, unix timestamp in seconds (int), or string. The string should be formatted in one of the following formats: `%Y%m%d`, `%Y%m%d%H`, `%Y%m%d%H%M`, `%Y%m%d%H%M%S`, or `%Y%m%d%H%M%S%f`.
Returns
`str`: The batch query.
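Example
A sketch of inspecting the generated batch query, assuming a `feature_view` object:
query_str = feature_view.get_batch_query(start_time="20220401", end_time="20220501")
print(query_str)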
get_feature_vector#
FeatureView.get_feature_vector(entry, passed_features={}, external=None)
Returns assembled serving vector from online feature store.
Arguments
- entry `List[Dict[str, Any]]`: dictionary of feature group primary key and values provided by the serving application.
- passed_features `Optional[Dict[str, Any]]`: dictionary of feature values provided by the application at runtime. They can replace feature values fetched from the feature store as well as provide feature values which are not available in the feature store.
- external `Optional[bool]`: boolean, optional. If set to True, the connection to the online feature store is established using the same host as for the `host` parameter in the `hsfs.connection()` method. If set to False, the online feature store storage connector is used, which relies on the private IP. Defaults to True if the connection to Hopsworks is established from an external environment (e.g. AWS Sagemaker or Google Colab), otherwise to False.
Returns
`list`: List of feature values related to the provided primary keys, ordered according to the positions of the features in the feature view query.
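Example
A single-vector lookup sketch, assuming serving has been initialised and `cc_num` is a hypothetical primary key of one of the underlying feature groups; per the argument description above, `entry` maps primary key names to values:
feature_view.init_serving(training_dataset_version=1)
vector = feature_view.get_feature_vector(
    entry={"cc_num": 4444333322221111},
    passed_features={"amount": 12.5},  # overrides the stored value at runtime
)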
get_feature_vectors#
FeatureView.get_feature_vectors(entry, passed_features={}, external=None)
Returns assembled serving vectors in batches from online feature store.
Arguments
- entry `List[Dict[str, Any]]`: a list of dictionaries of feature group primary keys and values provided by the serving application.
- passed_features `Optional[List[Dict[str, Any]]]`: a list of dictionaries of feature values provided by the application at runtime. They can replace feature values fetched from the feature store as well as provide feature values which are not available in the feature store.
- external `Optional[bool]`: boolean, optional. If set to True, the connection to the online feature store is established using the same host as for the `host` parameter in the `hsfs.connection()` method. If set to False, the online feature store storage connector is used, which relies on the private IP. Defaults to True if the connection to Hopsworks is established from an external environment (e.g. AWS Sagemaker or Google Colab), otherwise to False.
Returns
`List[list]`: List of lists of feature values related to the provided primary keys, ordered according to the positions of the features in the feature view query.
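Example
A batched lookup sketch using the same hypothetical `cc_num` key, with one entry per requested vector:
vectors = feature_view.get_feature_vectors(
    entry=[{"cc_num": 4444333322221111}, {"cc_num": 4444333322221112}],
)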
get_tag#
FeatureView.get_tag(name)
get_tags#
FeatureView.get_tags()
get_train_test_split#
FeatureView.get_train_test_split(training_dataset_version, read_options=None)
Get training data from storage or feature groups.
Info
If the materialized training data has been deleted, use `recreate_training_dataset()` to recreate it.
Arguments
- training_dataset_version: Training dataset version.
- read_options `Optional[Dict[Any, Any]]`: Additional read options as key-value pairs, defaults to `{}`. When using the `python` engine, read_options can contain the following entries:
  - key `spark` and value an object of type `hsfs.core.job_configuration.JobConfiguration` to configure the Hopsworks Job used to compute the training dataset.
Returns
(X_train, X_test, y_train, y_test): Tuple of dataframes of features and labels.
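Example
A sketch of reading back a previously materialized split, assuming training dataset version 1 exists:
X_train, X_test, y_train, y_test = feature_view.get_train_test_split(
    training_dataset_version=1
)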
get_train_validation_test_split#
FeatureView.get_train_validation_test_split(training_dataset_version, read_options=None)
Get training data from storage or feature groups.
Info
If the materialized training data has been deleted, use `recreate_training_dataset()` to recreate it.
Arguments
- training_dataset_version: Training dataset version.
- read_options `Optional[Dict[Any, Any]]`: Additional read options as key-value pairs, defaults to `{}`. When using the `python` engine, read_options can contain the following entries:
  - key `spark` and value an object of type `hsfs.core.job_configuration.JobConfiguration` to configure the Hopsworks Job used to compute the training dataset.
Returns
(X_train, X_val, X_test, y_train, y_val, y_test): Tuple of dataframes of features and labels.
get_training_data#
FeatureView.get_training_data(training_dataset_version, read_options=None)
Get training data from storage or feature groups.
Info
If the materialized training data has been deleted, use `recreate_training_dataset()` to recreate it.
External Storage Support
Training data that was written to external storage using a Storage Connector other than S3 cannot currently be read using HSFS APIs with Python as the engine; instead you will have to use the storage's native client.
Arguments
- training_dataset_version: Training dataset version.
- read_options `Optional[Dict[Any, Any]]`: Additional read options as key-value pairs, defaults to `{}`. When using the `python` engine, read_options can contain the following entries:
  - key `spark` and value an object of type `hsfs.core.job_configuration.JobConfiguration` to configure the Hopsworks Job used to compute the training dataset.
Returns
(X, y): Tuple of dataframes of features and labels.
get_training_dataset_tag#
FeatureView.get_training_dataset_tag(training_dataset_version, name)
get_training_dataset_tags#
FeatureView.get_training_dataset_tags(training_dataset_version)
init_batch_scoring#
FeatureView.init_batch_scoring(training_dataset_version=None)
Initialise and cache parametrized transformation functions.
Arguments
- training_dataset_version `Optional[int]`: int, optional. Defaults to 1. Transformation statistics are fetched from the training dataset and applied to the serving vector.
init_serving#
FeatureView.init_serving(training_dataset_version=None, external=None)
Initialise and cache parametrized prepared statements to retrieve feature vectors from the online feature store.
Arguments
- training_dataset_version `Optional[int]`: int, optional. Defaults to 1. Transformation statistics are fetched from the training dataset and applied to the serving vector.
- external `Optional[bool]`: boolean, optional. If set to True, the connection to the online feature store is established using the same host as for the `host` parameter in the `hsfs.connection()` method. If set to False, the online feature store storage connector is used, which relies on the private IP. Defaults to True if the connection to Hopsworks is established from an external environment (e.g. AWS Sagemaker or Google Colab), otherwise to False.
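Example
A sketch of warming up serving before vector retrieval, assuming the client runs outside the Hopsworks cluster:
feature_view.init_serving(training_dataset_version=1, external=True)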
json#
FeatureView.json()
preview_feature_vector#
FeatureView.preview_feature_vector(external=None)
Returns a sample of assembled serving vector from online feature store.
Arguments
- external `Optional[bool]`: boolean, optional. If set to True, the connection to the online feature store is established using the same host as for the `host` parameter in the `hsfs.connection()` method. If set to False, the online feature store storage connector is used, which relies on the private IP. Defaults to True if the connection to Hopsworks is established from an external environment (e.g. AWS Sagemaker or Google Colab), otherwise to False.
Returns
`list`: List of feature values, ordered according to the positions of the features in the training dataset query.
preview_feature_vectors#
FeatureView.preview_feature_vectors(n, external=None)
Returns n samples of assembled serving vectors in batches from online feature store.
Arguments
- n `int`: Number of feature vectors to return.
- external `Optional[bool]`: boolean, optional. If set to True, the connection to the online feature store is established using the same host as for the `host` parameter in the `hsfs.connection()` method. If set to False, the online feature store storage connector is used, which relies on the private IP. Defaults to True if the connection to Hopsworks is established from an external environment (e.g. AWS Sagemaker or Google Colab), otherwise to False.
Returns
`List[list]`: List of lists of feature values, ordered according to the positions of the features in the training dataset query.
purge_all_training_data#
FeatureView.purge_all_training_data()
purge_training_data#
FeatureView.purge_training_data(version)
recreate_training_dataset#
FeatureView.recreate_training_dataset(version, write_options=None)
Recreate a training dataset.
Info
If the materialized training data has been deleted, use `recreate_training_dataset()` to recreate it.
Arguments
- version `int`: Training dataset version.
- write_options: Additional write options as key-value pairs, defaults to `{}`. When using the `python` engine, write_options can contain the following entries:
  - key `spark` and value an object of type `hsfs.core.job_configuration.JobConfiguration` to configure the Hopsworks Job used to compute the training dataset.
Returns
`Job`: When using the `python` engine, it returns the Hopsworks `Job` that was launched to create the training dataset.
to_dict#
FeatureView.to_dict()
train_test_split#
FeatureView.train_test_split(
test_size=None,
train_start="",
train_end="",
test_start="",
test_end="",
description="",
statistics_config=None,
read_options=None,
)
Get training data from feature groups.
Info
If the materialized training data has been deleted, use `recreate_training_dataset()` to recreate it.
Arguments
- test_size `Optional[float]`: Size of the test set. Should be between 0 and 1.
- train_start `Optional[Union[str, int, datetime.datetime, datetime.date]]`: datetime.datetime, datetime.date, unix timestamp in seconds (int), or string. The string should be formatted in one of the following formats: `%Y%m%d`, `%Y%m%d%H`, `%Y%m%d%H%M`, `%Y%m%d%H%M%S`, or `%Y%m%d%H%M%S%f`.
- train_end `Optional[Union[str, int, datetime.datetime, datetime.date]]`: datetime.datetime, datetime.date, unix timestamp in seconds (int), or string. The string should be formatted in one of the following formats: `%Y%m%d`, `%Y%m%d%H`, `%Y%m%d%H%M`, `%Y%m%d%H%M%S`, or `%Y%m%d%H%M%S%f`.
- test_start `Optional[Union[str, int, datetime.datetime, datetime.date]]`: datetime.datetime, datetime.date, unix timestamp in seconds (int), or string. The string should be formatted in one of the following formats: `%Y%m%d`, `%Y%m%d%H`, `%Y%m%d%H%M`, `%Y%m%d%H%M%S`, or `%Y%m%d%H%M%S%f`.
- test_end `Optional[Union[str, int, datetime.datetime, datetime.date]]`: datetime.datetime, datetime.date, unix timestamp in seconds (int), or string. The string should be formatted in one of the following formats: `%Y%m%d`, `%Y%m%d%H`, `%Y%m%d%H%M`, `%Y%m%d%H%M%S`, or `%Y%m%d%H%M%S%f`.
- description `Optional[str]`: A string describing the contents of the training dataset to improve discoverability for Data Scientists, defaults to empty string `""`.
- statistics_config `Optional[Union[hsfs.StatisticsConfig, bool, dict]]`: A configuration object, or a dictionary with keys `"enabled"` to generally enable descriptive statistics computation for this feature group, `"correlations"` to turn on feature correlation computation and `"histograms"` to compute feature value frequencies. The values should be booleans indicating the setting. To fully turn off statistics computation pass `statistics_config=False`. Defaults to `None` and will compute only descriptive statistics.
- read_options `Optional[Dict[Any, Any]]`: Additional read options as key-value pairs, defaults to `{}`. When using the `python` engine, read_options can contain the following entries:
  - key `spark` and value an object of type `hsfs.core.job_configuration.JobConfiguration` to configure the Hopsworks Job used to compute the training dataset.
Returns
(X_train, X_test, y_train, y_test): Tuple of dataframes of features and labels.
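Example
A sketch of an in-memory random 80/20 split, assuming a `feature_view` object:
X_train, X_test, y_train, y_test = feature_view.train_test_split(test_size=0.2)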
train_validation_test_split#
FeatureView.train_validation_test_split(
validation_size=None,
test_size=None,
train_start="",
train_end="",
validation_start="",
validation_end="",
test_start="",
test_end="",
description="",
statistics_config=None,
read_options=None,
)
Get training data from feature groups.
Info
If the materialized training data has been deleted, use `recreate_training_dataset()` to recreate it.
Arguments
- validation_size `Optional[float]`: Size of the validation set. Should be between 0 and 1.
- test_size `Optional[float]`: Size of the test set. Should be between 0 and 1.
- train_start `Optional[Union[str, int, datetime.datetime, datetime.date]]`: datetime.datetime, datetime.date, unix timestamp in seconds (int), or string. The string should be formatted in one of the following formats: `%Y%m%d`, `%Y%m%d%H`, `%Y%m%d%H%M`, `%Y%m%d%H%M%S`, or `%Y%m%d%H%M%S%f`.
- train_end `Optional[Union[str, int, datetime.datetime, datetime.date]]`: datetime.datetime, datetime.date, unix timestamp in seconds (int), or string. The string should be formatted in one of the following formats: `%Y%m%d`, `%Y%m%d%H`, `%Y%m%d%H%M`, `%Y%m%d%H%M%S`, or `%Y%m%d%H%M%S%f`.
- validation_start `Optional[Union[str, int, datetime.datetime, datetime.date]]`: datetime.datetime, datetime.date, unix timestamp in seconds (int), or string. The string should be formatted in one of the following formats: `%Y%m%d`, `%Y%m%d%H`, `%Y%m%d%H%M`, `%Y%m%d%H%M%S`, or `%Y%m%d%H%M%S%f`.
- validation_end `Optional[Union[str, int, datetime.datetime, datetime.date]]`: datetime.datetime, datetime.date, unix timestamp in seconds (int), or string. The string should be formatted in one of the following formats: `%Y%m%d`, `%Y%m%d%H`, `%Y%m%d%H%M`, `%Y%m%d%H%M%S`, or `%Y%m%d%H%M%S%f`.
- test_start `Optional[Union[str, int, datetime.datetime, datetime.date]]`: datetime.datetime, datetime.date, unix timestamp in seconds (int), or string. The string should be formatted in one of the following formats: `%Y%m%d`, `%Y%m%d%H`, `%Y%m%d%H%M`, `%Y%m%d%H%M%S`, or `%Y%m%d%H%M%S%f`.
- test_end `Optional[Union[str, int, datetime.datetime, datetime.date]]`: datetime.datetime, datetime.date, unix timestamp in seconds (int), or string. The string should be formatted in one of the following formats: `%Y%m%d`, `%Y%m%d%H`, `%Y%m%d%H%M`, `%Y%m%d%H%M%S`, or `%Y%m%d%H%M%S%f`.
- description `Optional[str]`: A string describing the contents of the training dataset to improve discoverability for Data Scientists, defaults to empty string `""`.
- statistics_config `Optional[Union[hsfs.StatisticsConfig, bool, dict]]`: A configuration object, or a dictionary with keys `"enabled"` to generally enable descriptive statistics computation for this feature group, `"correlations"` to turn on feature correlation computation and `"histograms"` to compute feature value frequencies. The values should be booleans indicating the setting. To fully turn off statistics computation pass `statistics_config=False`. Defaults to `None` and will compute only descriptive statistics.
- read_options `Optional[Dict[Any, Any]]`: Additional read options as key-value pairs, defaults to `{}`. When using the `python` engine, read_options can contain the following entries:
  - key `spark` and value an object of type `hsfs.core.job_configuration.JobConfiguration` to configure the Hopsworks Job used to compute the training dataset.
Returns
(X_train, X_val, X_test, y_train, y_val, y_test): Tuple of dataframes of features and labels.
training_data#
FeatureView.training_data(
start_time=None, end_time=None, description="", statistics_config=None, read_options=None
)
Get training data from feature groups.
Info
If the materialized training data has been deleted, use `recreate_training_dataset()` to recreate it.
Arguments
- start_time `Optional[Union[str, int, datetime.datetime, datetime.date]]`: datetime.datetime, datetime.date, unix timestamp in seconds (int), or string. The string should be formatted in one of the following formats: `%Y%m%d`, `%Y%m%d%H`, `%Y%m%d%H%M`, `%Y%m%d%H%M%S`, or `%Y%m%d%H%M%S%f`.
- end_time `Optional[Union[str, int, datetime.datetime, datetime.date]]`: datetime.datetime, datetime.date, unix timestamp in seconds (int), or string. The string should be formatted in one of the following formats: `%Y%m%d`, `%Y%m%d%H`, `%Y%m%d%H%M`, `%Y%m%d%H%M%S`, or `%Y%m%d%H%M%S%f`.
- description `Optional[str]`: A string describing the contents of the training dataset to improve discoverability for Data Scientists, defaults to empty string `""`.
- statistics_config `Optional[Union[hsfs.StatisticsConfig, bool, dict]]`: A configuration object, or a dictionary with keys `"enabled"` to generally enable descriptive statistics computation for this feature group, `"correlations"` to turn on feature correlation computation and `"histograms"` to compute feature value frequencies. The values should be booleans indicating the setting. To fully turn off statistics computation pass `statistics_config=False`. Defaults to `None` and will compute only descriptive statistics.
- read_options `Optional[Dict[Any, Any]]`: Additional read options as key-value pairs, defaults to `{}`. When using the `python` engine, read_options can contain the following entries:
  - key `spark` and value an object of type `hsfs.core.job_configuration.JobConfiguration` to configure the Hopsworks Job used to compute the training dataset.
Returns
(X, y): Tuple of dataframes of features and labels. If there are no labels, y returns `None`.
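Example
A sketch of reading unsplit training data into memory for an event-time range, assuming a `feature_view` object:
X, y = feature_view.training_data(start_time="20220101", end_time="20220630")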
update#
FeatureView.update()
update_from_response_json#
FeatureView.update_from_response_json(json_dict)