hsfs.transformation_statistics #

FeatureTransformationStatistics #

Data class that contains all the statistics parameters that can be used for transformations inside a custom transformation function.

feature_name `property` #

feature_name: str

Name of the feature.

count `property` #

count: int | None

Number of values.

completeness `property` #

completeness: float | None

Fraction of non-null values in a column.
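
The definition above can be illustrated with a minimal sketch in plain Python. It does not use the library; the helper name `completeness` and the sample column are hypothetical, and returning `None` for an empty column is an assumption made here to mirror the `float | None` type.

```python
from typing import Optional, Sequence


def completeness(values: Sequence[Optional[float]]) -> Optional[float]:
    """Return the fraction of entries that are not None."""
    if not values:
        # Assumption: the statistic is undefined for an empty column.
        return None
    non_null = sum(1 for v in values if v is not None)
    return non_null / len(values)


# Hypothetical column with one null among four values.
column = [1.0, None, 3.0, 4.0]
result = completeness(column)
print(result)  # 0.75
```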

num_non_null_values `property` #

num_non_null_values: int | None

Number of non-null values.

num_null_values `property` #

num_null_values: int | None

Number of null values.

approx_num_distinct_values `property` #

approx_num_distinct_values: int | None

Approximate number of distinct values.

min `property` #

min: float | None

Minimum value.

max `property` #

max: float | None

Maximum value.

sum `property` #

sum: float | None

Sum of all feature values.

mean `property` #

mean: float | None

Mean value.

stddev `property` #

stddev: float | None

Standard deviation of the feature values.

percentiles `property` #

percentiles: Mapping[str, float] | None

Percentiles.

distinctness `property` #

distinctness: float | None

Fraction of distinct values of a feature over the number of all its values. Distinct values occur at least once.

Example

$[a, a, b]$ contains two distinct values, $a$ and $b$, so distinctness is $2/3$.
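
A minimal sketch that reproduces this example in plain Python. The function name `distinctness` is hypothetical and the code only illustrates the definition; it is not how the library computes the statistic.

```python
from typing import Hashable, Optional, Sequence


def distinctness(values: Sequence[Hashable]) -> Optional[float]:
    """Number of distinct values divided by the number of all values."""
    if not values:
        # Assumption: undefined for an empty column.
        return None
    return len(set(values)) / len(values)


result = distinctness(["a", "a", "b"])
print(result)  # 2 distinct values out of 3
```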

entropy `property` #

entropy: float | None

Entropy is a measure of the level of information contained in an event (feature value) when considering all possible events (all feature values).

Entropy is estimated using observed value counts as the negative sum of (value_count/total_count) * log(value_count/total_count).

Example

$[a, b, b, c, c]$ has three distinct values with counts $[1, 2, 2]$.

Entropy is then $-\frac{1}{5}\log\frac{1}{5} - \frac{2}{5}\log\frac{2}{5} - \frac{2}{5}\log\frac{2}{5} \approx 1.055$, where $\log$ is the natural logarithm.
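
The worked example can be checked with a short sketch in plain Python. The helper name `entropy` is hypothetical. The natural logarithm is used because it matches the value $1.055$ in the example; the code illustrates the formula only and is not the library's implementation.

```python
import math
from collections import Counter
from typing import Hashable, Optional, Sequence


def entropy(values: Sequence[Hashable]) -> Optional[float]:
    """Negative sum of p * log(p) over the observed value frequencies."""
    if not values:
        # Assumption: undefined for an empty column.
        return None
    total = len(values)
    counts = Counter(values)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


result = entropy(["a", "b", "b", "c", "c"])
print(round(result, 3))  # 1.055
```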

uniqueness `property` #

uniqueness: float | None

Fraction of unique values over the number of all values of a column. Unique values occur exactly once.

Example

$[a, a, b]$ contains one unique value, $b$, so uniqueness is $1/3$.
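
A minimal sketch of this definition in plain Python. Unlike distinctness, only values that occur exactly once are counted. The function name `uniqueness` is hypothetical and the code is illustrative only.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def uniqueness(values: Sequence[Hashable]) -> Optional[float]:
    """Number of values occurring exactly once, divided by all values."""
    if not values:
        # Assumption: undefined for an empty column.
        return None
    counts = Counter(values)
    exactly_once = sum(1 for c in counts.values() if c == 1)
    return exactly_once / len(values)


result = uniqueness(["a", "a", "b"])
print(result)  # only "b" occurs exactly once, so 1 out of 3
```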

exact_num_distinct_values `property` #

exact_num_distinct_values: int | None

Exact number of distinct values.

correlations `property` #

correlations: dict | None

Correlations of feature values.

histogram `property` #

histogram: dict | None

Histogram of feature values.

kll `property` #

kll: dict | None

KLL quantile sketch of the feature values.

unique_values `property` #

unique_values: dict | None

Number of unique values.

TransformationStatistics #

Class that stores feature transformation statistics of all features that require training dataset statistics in a transformation function.

All statistics for a feature are initialized with null values and are populated when the training dataset is created.

PARAMETER	DESCRIPTION
`*features`	The features for which training dataset statistics need to be computed. TYPE: `str` DEFAULT: `()`

Example

from hsfs.transformation_statistics import TransformationStatistics

# Define transformation statistics for the features that need them
transformation_statistics = TransformationStatistics("feature1", "feature2")

# Accessing feature transformation statistics for a specific feature
feature_transformation_statistics_feature1 = transformation_statistics.feature1
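
Because every statistic can be `None` until it has been populated, code that consumes these properties should guard against missing values. The following is a minimal sketch of that pattern using min-max scaling. `FeatureStatsStub` is a hypothetical stand-in that exposes only the `feature_name`, `min`, and `max` properties documented above, so the snippet runs without the library; `min_max_scale` is likewise a hypothetical helper and not part of the API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FeatureStatsStub:
    """Hypothetical stand-in for a per-feature statistics object."""

    feature_name: str
    min: Optional[float] = None
    max: Optional[float] = None


def min_max_scale(value: float, stats: FeatureStatsStub) -> float:
    """Scale value into [0, 1] using the feature's min and max."""
    if stats.min is None or stats.max is None:
        # Statistics have not been populated yet.
        raise ValueError(f"statistics for {stats.feature_name!r} are not populated")
    if stats.max == stats.min:
        # Assumption: map a constant feature to 0.0 to avoid dividing by zero.
        return 0.0
    return (value - stats.min) / (stats.max - stats.min)


stats = FeatureStatsStub("feature1", min=10.0, max=20.0)
scaled = min_max_scale(15.0, stats)
print(scaled)  # 0.5
```

Raising on unpopulated statistics is one possible design choice; returning the input unchanged would be another, depending on how the transformation is used.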
