Part 02: Training Data & Feature Views¶
This is the second part of the quick start series of tutorials about the Hopsworks Feature Store. This notebook explains how to read from a feature group and create a training dataset within the feature store.
🗒️ In this notebook we will see how to create a training dataset from the feature groups:¶
- Select the features we want to train our model on,
- Define how the features should be preprocessed,
- Create a dataset for training a fraud detection model.
!pip install -U hopsworks --quiet
import hopsworks
project = hopsworks.login()
fs = project.get_feature_store()
🔪 Feature Selection ¶
We start by selecting all the features we want to include for model training/inference.
# Load feature groups.
trans_fg = fs.get_feature_group('transactions_fraud_online_fg', version=1)
profile_online_fg = fs.get_feature_group('profile_fraud_online_fg', version=1)
# Select features for training data.
ds_query = trans_fg.select(["fraud_label", "loc_delta_t_plus_1", "loc_delta_t_minus_1","time_delta_t_plus_1",
"time_delta_t_minus_1", "country"]).\
join(profile_online_fg.select_all())
# uncomment this if you would like to view query results
#ds_query.show(5)
Recall that you computed the features in transactions_fraud_online_fg. If you had created multiple feature groups with an identical schema for different window lengths, and wanted to include them in the join, you would need to pass a prefix argument to the join to avoid feature name clashes, as sketched below. See the documentation for more details.
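A minimal sketch of such a join, assuming a hypothetical feature group trans_4h_fg with the same schema computed over a different window length:
# Hypothetical: trans_4h_fg has the same feature names as trans_fg, so the
# prefix argument exposes its features as 4h_<feature_name> in the joined query.
#query_with_prefix = trans_fg.select_all().join(
#    trans_4h_fg.select_all(),
#    prefix="4h_",
#)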
🤖 Transformation Functions ¶
We will preprocess our data using min-max scaling on numerical features and label encoding on categorical features. To do this we simply define a mapping between our features and transformation functions. Hopsworks fits transformation functions such as min-max scaling on the training data only (not on the validation/test data), which ensures that there is no data leakage.
# Load the transformation functions.
min_max_scaler = fs.get_transformation_function(name="min_max_scaler")
label_encoder = fs.get_transformation_function(name="label_encoder")
# Map features to transformation functions.
transformation_functions = {
"loc_delta_t_plus_1": min_max_scaler,
"loc_delta_t_minus_1": min_max_scaler,
"time_delta_t_plus_1": min_max_scaler,
"time_delta_t_minus_1": min_max_scaler,
"country": label_encoder,
"gender": label_encoder,
}
⚙️ Feature View Creation ¶
A feature view combines a schema (in the form of a query, optionally with filters), a model target feature/label, and additional transformation functions.
To create a feature view we use the fs.create_feature_view() method.
feature_view = fs.create_feature_view(
name='transactions_fraud_online_fv',
query=ds_query,
labels=["fraud_label"],
transformation_functions=transformation_functions
)
To view and explore data in the feature view we can retrieve batch data using the get_batch_data() method.
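For example, a minimal sketch (start_time and end_time are optional event-time filters on this method):
# Retrieve a dataframe of the feature data served by the feature view.
batch_df = feature_view.get_batch_data()
batch_df.head()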
🏋️ Training Dataset Creation¶
In Hopsworks, training data is a query whose projection (set of features) is determined by the parent feature view, with an optional snapshot on disk of the data returned by the query.
A training dataset may contain splits such as:
- Training set - the subset of training data used to train a model.
- Validation set - the subset of training data used to evaluate hyperparameters when training a model.
- Test set - the holdout subset of training data used to evaluate a model.
A training dataset is created using the feature_view.create_training_data() method.
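If you need the splits listed above, split-aware variants exist as well; a minimal sketch, assuming the create_train_validation_test_split() method of the hsfs feature view API:
# Materialize a training dataset with validation and test splits in one call.
#td_version, td_job = feature_view.create_train_validation_test_split(
#    validation_size=0.2,  # fraction used for hyperparameter evaluation
#    test_size=0.1,        # holdout fraction for final model evaluation
#    data_format="csv",
#    write_options={"wait_for_job": True},
#)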
From the feature view API we can also create training datasets based on event-time filters, specifying start_time and end_time.
from datetime import datetime
date_format = "%Y-%m-%d %H:%M:%S"
# Create training datasets based on event-time filters (epoch milliseconds).
start_time = int(datetime.strptime("2022-01-01 00:00:01", date_format).timestamp() * 1000)
end_time = int(datetime.strptime("2022-02-28 23:59:59", date_format).timestamp() * 1000)
td_jan_feb_version, td_job = feature_view.create_training_data(
start_time = start_time,
end_time = end_time,
description = 'transactions fraud online training dataset jan/feb',
data_format = "csv",
coalesce = True,
write_options = {'wait_for_job': True},
)
start_time = int(datetime.strptime("2022-03-01 00:00:01", date_format).timestamp() * 1000)
end_time = int(datetime.strptime("2022-03-31 23:59:59", date_format).timestamp() * 1000)
td_mar_version, td_job = feature_view.create_training_data(
start_time = start_time,
end_time = end_time,
description = 'transactions fraud online training dataset mar',
data_format = "csv",
coalesce = True,
write_options = {'wait_for_job': True},
)
🪝 Training Dataset Retrieval ¶
To retrieve training data from storage (already materialized) or directly from the feature groups, we can use the get_training_data method (or its split-aware variants) on the feature view. If a version is not provided, or the provided version does not exist yet, a new version of the training data is created according to the given arguments and returned as a dataframe. If the provided version already exists, the training data is read from storage or from the feature groups and returned as a dataframe. If a split is specified, only that split is read.
# Read the Jan/Feb dataset for training and the Mar dataset for testing.
train_jan_feb_x, train_jan_feb_y = feature_view.get_training_data(td_jan_feb_version)
test_mar_x, test_mar_y = feature_view.get_training_data(td_mar_version)
train_jan_feb_x
test_mar_x
The feature view and training datasets are now visible in the Hopsworks UI.
⛓️ Lineage ¶
For every feature group and feature view you can inspect the relations between these abstractions: which feature groups produced which training datasets, and which models those datasets were used to train. This gives a clear understanding of how each element of the pipeline relates to the others.
⏭️ **Next:** Part 03 ¶
In the following notebook, you will train a model on the training datasets you created here.