HSFS can compute statistics for training datasets and feature groups and save them, along with their other metadata, in the feature store. These statistics help data scientists perform exploratory data analysis and identify suitable features or training datasets for models.
Statistics are configured at the training dataset or feature group level using a StatisticsConfig object.
This object can be passed when the training dataset or feature group is created, or it can be updated later through the API.
hsfs.statistics_config.StatisticsConfig( enabled=True, correlations=False, histograms=False, exact_uniqueness=False, columns= )
For example, to enable all statistics (descriptive, histograms and correlations) for a training dataset:
from hsfs.statistics_config import StatisticsConfig

# Positional arguments: enabled, correlations, histograms
td = fs.create_training_dataset("rain_dataset",
                                version=1,
                                label="weekly_rain",
                                data_format="tfrecords",
                                statistics_config=StatisticsConfig(True, True, True))
val td = (fs.createTrainingDataset()
    .name("rain_dataset")
    .version(1)
    .label("weekly_rain")
    .dataFormat("tfrecords")
    .statisticsConfig(new StatisticsConfig(true, true, true))
    .build())
And similarly for feature groups.
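The positional call in the example above depends on the parameter order of the StatisticsConfig signature (enabled, correlations, histograms, exact_uniqueness, columns). The sketch below uses a hypothetical stand-in class, StatisticsConfigSketch, which is not part of HSFS, to show how the positional form maps to the more explicit keyword form. The empty-list default for columns and its "all columns" meaning are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StatisticsConfigSketch:
    """Hypothetical stand-in mirroring the parameter order shown above.

    This is NOT the hsfs class; it only illustrates how arguments bind.
    """
    enabled: bool = True
    correlations: bool = False
    histograms: bool = False
    exact_uniqueness: bool = False
    # Assumption: an empty list stands for "all columns".
    columns: List[str] = field(default_factory=list)


# Positional form, as used in the training dataset example.
positional = StatisticsConfigSketch(True, True, True)

# Equivalent keyword form, which makes the intent explicit.
keyword = StatisticsConfigSketch(enabled=True, correlations=True, histograms=True)

# Default configuration: only descriptive statistics are enabled.
default = StatisticsConfigSketch()

# Restrict statistics to a subset of columns (column name is illustrative).
subset = StatisticsConfigSketch(columns=["weekly_rain"])
```

Passing the flags by keyword is usually easier to read and does not break if the parameter order changes.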
By default, all training datasets and feature groups are configured so that only descriptive statistics are computed. However, you can also enable histograms, correlations, and exact uniqueness, or limit the features for which statistics are computed.
columns: Specify a subset of columns to compute statistics for.
correlations: Enable correlations as an additional statistic to be computed for each feature pair.
enabled: Enable statistics; by default, this computes only descriptive statistics.
exact_uniqueness: Enable exact uniqueness as an additional statistic to be computed for each feature.
histograms: Enable histograms as an additional statistic to be computed for each feature.
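To make the statistic types listed above more concrete, here is a minimal pure-Python sketch of what each one measures on a tiny example. It is an illustration only and not how HSFS computes statistics. The function names, the sample data, and the definition of uniqueness used here (the fraction of rows whose value occurs exactly once) are assumptions, and HSFS may define or compute these quantities differently.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def descriptive(values):
    """Basic descriptive statistics for one numeric feature."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "stddev": pstdev(values),  # population standard deviation
    }


def histogram(values):
    """Frequency of each distinct value of one feature."""
    return dict(Counter(values))


def exact_uniqueness(values):
    """Assumed definition: fraction of rows whose value occurs exactly once."""
    counts = Counter(values)
    return sum(1 for c in counts.values() if c == 1) / len(values)


def pearson(xs, ys):
    """Pearson correlation for one feature pair."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Illustrative data for two hypothetical features.
weekly_rain = [0.0, 2.0, 2.0, 4.0]
temperature = [20.0, 18.0, 18.0, 16.0]

desc = descriptive(weekly_rain)
hist = histogram(weekly_rain)
uniq = exact_uniqueness(weekly_rain)
corr = pearson(weekly_rain, temperature)
```

Descriptive statistics, histograms, and uniqueness are computed per feature, while correlations are computed per feature pair, so their cost grows quadratically with the number of features. Restricting the columns can help keep that cost down.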