Feature Transformation Statistics#

[source]

FeatureTransformationStatistics#

hsfs.transformation_statistics.FeatureTransformationStatistics(
    feature_name,
    count=None,
    completeness=None,
    num_non_null_values=None,
    num_null_values=None,
    approx_num_distinct_values=None,
    min=None,
    max=None,
    sum=None,
    mean=None,
    stddev=None,
    percentiles=None,
    distinctness=None,
    entropy=None,
    uniqueness=None,
    exact_num_distinct_values=None,
    extended_statistics=None,
    **kwargs
)

Data class that holds the statistics of a single feature. These statistics can be used as parameters inside a custom transformation function.
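As a rough illustration of how such statistics might feed a transformation, the sketch below scales and standardizes a value from stored `min`, `max`, `mean`, and `stddev`. `StatsStandIn`, `min_max_scale`, and `standardize` are hypothetical names invented for this example and are not part of hsfs. The stand-in mirrors only a few of the documented fields so the snippet runs without a feature store.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StatsStandIn:
    """Hypothetical stand-in exposing a few of the documented fields."""

    feature_name: str
    min: Optional[float] = None
    max: Optional[float] = None
    mean: Optional[float] = None
    stddev: Optional[float] = None


def min_max_scale(value: float, stats: StatsStandIn) -> float:
    """Scale a value to [0, 1] using the stored min and max."""
    if stats.min is None or stats.max is None:
        raise ValueError(f"min/max not available for {stats.feature_name}")
    span = stats.max - stats.min
    # A constant feature has zero span; map it to 0.0 instead of dividing by zero.
    return 0.0 if span == 0 else (value - stats.min) / span


def standardize(value: float, stats: StatsStandIn) -> float:
    """Center a value on the stored mean and scale by the stored stddev."""
    if stats.mean is None or not stats.stddev:
        raise ValueError(f"mean/stddev not usable for {stats.feature_name}")
    return (value - stats.mean) / stats.stddev


# Made-up statistics for an illustrative feature called "amount".
stats = StatsStandIn("amount", min=0.0, max=10.0, mean=4.0, stddev=2.0)
scaled = min_max_scale(5.0, stats)
z_score = standardize(8.0, stats)
print(scaled, z_score)
```

Every field defaults to `None`, as in the signature above, so a transformation that relies on a statistic should check that it was actually computed before using it.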


Properties#

[source]

approx_num_distinct_values#

Approximate number of distinct values.


[source]

completeness#

Fraction of non-null values in a column.
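A minimal sketch of this definition on a plain Python list, assuming that `None` is the only null marker (the stored statistic may treat other values, such as NaN, differently). The helper name `completeness` is used here for illustration only.

```python
def completeness(values):
    """Fraction of entries that are not None."""
    if not values:
        raise ValueError("values must not be empty")
    non_null = sum(1 for v in values if v is not None)
    return non_null / len(values)


# Three of the four entries are non-null.
print(completeness([1.0, None, 3.0, 4.0]))
```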


[source]

correlations#

Correlations of feature values.


[source]

count#

Number of values.


[source]

distinctness#

Number of distinct values of a feature divided by the number of all its values. A distinct value is any value that occurs at least once.

Example

[a, a, b] contains two distinct values a and b, so distinctness is 2/3.
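The same calculation can be sketched in plain Python. The helper below is an illustration of the definition, not the code the feature store uses to compute the statistic.

```python
def distinctness(values):
    """Number of distinct values divided by the number of all values."""
    if not values:
        raise ValueError("values must not be empty")
    return len(set(values)) / len(values)


# Two distinct values (a and b) out of three values in total.
print(distinctness(["a", "a", "b"]))
```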


[source]

entropy#

Entropy measures the amount of information carried by an event (a feature value) relative to all possible events (all feature values). It is estimated from the observed value counts as the negative sum, over all distinct values, of (value_count/total_count) * log(value_count/total_count).

Example

[a, b, b, c, c] has three distinct values with counts [1, 2, 2].

Entropy is then (-1/5*log(1/5)-2/5*log(2/5)-2/5*log(2/5)) = 1.055, where log is the natural logarithm.
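The estimate can be reproduced with a short sketch that applies the formula to the example list using the natural logarithm. The helper name `entropy` is chosen for illustration and is not an hsfs function.

```python
import math
from collections import Counter


def entropy(values):
    """Negative sum of p * ln(p) over the observed value frequencies."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


# Counts are a: 1, b: 2, c: 2 out of 5 values.
print(round(entropy(["a", "b", "b", "c", "c"]), 3))  # 1.055
```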


[source]

exact_num_distinct_values#

Exact number of distinct values.


[source]

feature_name#

Name of the feature.


[source]

histogram#

Histogram of feature values.


[source]

kll#

KLL sketch of the feature values, an approximate summary of their quantiles.


[source]

max#

Maximum value.


[source]

mean#

Mean value.


[source]

min#

Minimum value.


[source]

num_non_null_values#

Number of non-null values.


[source]

num_null_values#

Number of null values.


[source]

percentiles#

Percentiles.


[source]

stddev#

Standard deviation of the feature values.


[source]

sum#

Sum of all feature values.


[source]

unique_values#

Number of unique values.


[source]

uniqueness#

Number of unique values divided by the number of all values of a column. A unique value is one that occurs exactly once.

Example

[a, a, b] contains one unique value b, so uniqueness is 1/3.
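For comparison with distinctness, here is a minimal sketch of the uniqueness calculation in plain Python. The helper is illustrative only and is not taken from hsfs.

```python
from collections import Counter


def uniqueness(values):
    """Number of values that occur exactly once divided by the number of all values."""
    if not values:
        raise ValueError("values must not be empty")
    counts = Counter(values)
    unique = sum(1 for c in counts.values() if c == 1)
    return unique / len(values)


# Only b occurs exactly once, so one unique value out of three.
print(uniqueness(["a", "a", "b"]))
```

On the same input, distinctness counts both a and b, while uniqueness counts only b.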