hsfs.builtin_transformations #

equal_frequency_binner #

equal_frequency_binner(
    feature: pd.Series,
    statistics: TransformationStatistics = feature_statistics,
) -> pd.Series

Discretize numeric values into equal-frequency bins using training quartiles.

Uses Q1/Q2/Q3 percentiles as boundaries to form up to 4 bins. If quartiles have duplicates (constant regions), fewer bins are created. NaN inputs remain NaN.

PARAMETER DESCRIPTION
feature

Numeric feature to discretize.

TYPE: pd.Series

statistics

Training statistics providing quartile percentiles; populated automatically.

TYPE: TransformationStatistics DEFAULT: feature_statistics

RETURNS DESCRIPTION
pd.Series

Series of nullable integer bin indices from 0 to 3.
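
The quartile rule above can be illustrated with a small plain-Python sketch. It is not the library's implementation: `quartile_bins` is a hypothetical helper, it works on a list instead of a pd.Series, and placing a value that equals a boundary in the lower bin is an assumption.

```python
import math
from bisect import bisect_left


def quartile_bins(values, q1, q2, q3):
    """Assign a bin index to each value using training quartiles as boundaries."""
    # Collapse duplicate quartiles so constant regions produce fewer bins.
    edges = sorted({q1, q2, q3})
    out = []
    for v in values:
        if v is None or math.isnan(v):
            out.append(None)  # missing input stays missing
        else:
            # Assumption: a value equal to a boundary goes to the lower bin.
            out.append(bisect_left(edges, v))
    return out


bins = quartile_bins([1.0, 5.0, 7.5, 20.0, float("nan")], q1=2.0, q2=5.0, q3=10.0)
print(bins)  # [0, 1, 2, 3, None]

# Duplicate quartiles (q1 == q2) leave only two boundaries, so three bins.
fewer = quartile_bins([0.0, 6.0, 11.0], q1=5.0, q2=5.0, q3=10.0)
print(fewer)  # [0, 1, 2]
```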

equal_width_binner #

equal_width_binner(
    feature: pd.Series,
    statistics: TransformationStatistics = feature_statistics,
    context: dict | None = None,
) -> pd.Series

Discretize numeric values into equal-width bins using training min/max.

Default number of bins is 10, configurable via context["n_bins"]. Values below min are placed in the first bin; values above max in the last bin. NaN inputs remain NaN.

PARAMETER DESCRIPTION
feature

Numeric feature to discretize.

TYPE: pd.Series

statistics

Training statistics providing min and max; populated automatically.

TYPE: TransformationStatistics DEFAULT: feature_statistics

context

Optional configuration dict; supports "n_bins" to set the number of bins (default 10).

TYPE: dict | None DEFAULT: None

RETURNS DESCRIPTION
pd.Series

Series of nullable integer bin indices starting from 0.

Using 20 bins
tf = equal_width_binner("feature")
tf.transformation_context = {"n_bins": 20}
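
To make the clamping behaviour concrete, here is a minimal sketch of equal-width binning in plain Python. `equal_width_bins` is a hypothetical name, lists stand in for pd.Series, and the handling of a zero-width range is an assumption rather than documented behaviour.

```python
import math


def equal_width_bins(values, vmin, vmax, n_bins=10):
    """Map values to bin indices 0..n_bins-1 using the training min and max."""
    width = (vmax - vmin) / n_bins
    out = []
    for v in values:
        if v is None or math.isnan(v):
            out.append(None)  # missing input stays missing
            continue
        if width == 0:
            out.append(0)  # assumption: a constant feature falls into bin 0
            continue
        idx = int((v - vmin) // width)
        # Clamp: below-min values go to the first bin, above-max to the last.
        out.append(min(max(idx, 0), n_bins - 1))
    return out


bins = equal_width_bins(
    [-5.0, 0.0, 37.0, 100.0, 250.0, float("nan")], vmin=0.0, vmax=100.0
)
print(bins)  # [0, 0, 3, 9, 9, None]

bins_20 = equal_width_bins([37.0], vmin=0.0, vmax=100.0, n_bins=20)
print(bins_20)  # [7]
```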

impute_category #

impute_category(
    feature: pd.Series, context: dict | None = None
) -> pd.Series

Replace NaN values with a sentinel category string for categorical features.

The sentinel defaults to "__MISSING__" and can be overridden via context["value"].

Encoder chaining order matters

Downstream encoders (label_encoder, one_hot_encoder) trained before imputation will treat this sentinel as an unseen category and encode it as -1 (label_encoder) or all-False (one_hot_encoder). To get a dedicated encoding for the missing category, compute encoder statistics on already-imputed training data.

PARAMETER DESCRIPTION
feature

Categorical feature with NaN values to fill.

TYPE: pd.Series

context

Optional configuration dict; supports "value" to override the sentinel string (default "__MISSING__").

TYPE: dict | None DEFAULT: None

RETURNS DESCRIPTION
pd.Series

Series of strings with NaN replaced by the configured sentinel value.

Using a custom sentinel
tf = impute_category("country")
tf.transformation_context = {"value": "Unknown"}
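
The chaining caveat can be shown with a toy label encoder in plain Python. Everything here is hypothetical and only mirrors the behaviour described above: `fill_missing`, `fit_labels` and `encode` are illustrative helpers, and assigning codes in sorted category order is an assumption.

```python
import math

SENTINEL = "__MISSING__"


def fill_missing(values, sentinel=SENTINEL):
    """Replace None/NaN entries with the sentinel string."""
    def is_missing(v):
        return v is None or (isinstance(v, float) and math.isnan(v))
    return [sentinel if is_missing(v) else v for v in values]


def fit_labels(training_values):
    """Toy label encoder: one integer code per training category."""
    # Assumption: codes follow sorted category order.
    return {cat: i for i, cat in enumerate(sorted(set(training_values)))}


def encode(values, mapping):
    """Unseen categories fall back to -1."""
    return [mapping.get(v, -1) for v in values]


raw_train = ["SE", None, "DE", "SE"]
serving = fill_missing(["DE", None])

# Encoder statistics computed on raw data: the sentinel was never seen.
fitted_before = fit_labels([v for v in raw_train if v is not None])
# Encoder statistics computed on imputed data: the sentinel has its own code.
fitted_after = fit_labels(fill_missing(raw_train))

codes_before = encode(serving, fitted_before)
codes_after = encode(serving, fitted_after)
print(codes_before)  # [0, -1]
print(codes_after)   # [0, 2]
```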

impute_constant #

impute_constant(
    feature: pd.Series, context: dict | None = None
) -> pd.Series

Replace NaN values with a constant numeric fill value for numeric features.

The fill value is taken from context["value"] (default: 0.0).

PARAMETER DESCRIPTION
feature

Numeric feature with NaN values to fill.

TYPE: pd.Series

context

Optional configuration dict; supports "value" to set the fill value (default 0.0).

TYPE: dict | None DEFAULT: None

RETURNS DESCRIPTION
pd.Series

Series of float64 values with NaN replaced by the configured constant.

Using a custom fill value
tf = impute_constant("age")
tf.transformation_context = {"value": -1.0}

impute_mean #

impute_mean(
    feature: pd.Series,
    statistics: TransformationStatistics = feature_statistics,
) -> pd.Series

Replace NaN values with the training mean for numeric features.

If the training mean is itself NaN (no non-null training data), NaN values are left unchanged.

PARAMETER DESCRIPTION
feature

Numeric feature with NaN values to fill.

TYPE: pd.Series

statistics

Training statistics providing the mean; populated automatically.

TYPE: TransformationStatistics DEFAULT: feature_statistics

RETURNS DESCRIPTION
pd.Series

Series of float64 values with NaN replaced by the training mean.
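
A minimal sketch of the NaN-mean guard, assuming plain lists of floats rather than pd.Series; `fill_with_mean` is a hypothetical helper, not the library function.

```python
import math


def fill_with_mean(values, training_mean):
    """Fill NaN entries with the training mean, unless the mean itself is NaN."""
    if training_mean is None or math.isnan(training_mean):
        # No usable training mean: leave the input unchanged.
        return list(values)
    return [training_mean if math.isnan(v) else float(v) for v in values]


filled = fill_with_mean([1.0, float("nan"), 3.0], training_mean=2.5)
print(filled)  # [1.0, 2.5, 3.0]

untouched = fill_with_mean([1.0, float("nan")], training_mean=float("nan"))
print(math.isnan(untouched[1]))  # True
```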

impute_median #

impute_median(
    feature: pd.Series,
    statistics: TransformationStatistics = feature_statistics,
) -> pd.Series

Replace NaN values with the training median (50th percentile) for numeric features.

If the training median is NaN (no non-null training data), NaN values are left unchanged.

PARAMETER DESCRIPTION
feature

Numeric feature with NaN values to fill.

TYPE: pd.Series

statistics

Training statistics providing percentiles; populated automatically.

TYPE: TransformationStatistics DEFAULT: feature_statistics

RETURNS DESCRIPTION
pd.Series

Series of float64 values with NaN replaced by the training median.

impute_mode #

impute_mode(
    feature: pd.Series,
    statistics: TransformationStatistics = feature_statistics,
) -> pd.Series

Replace NaN values with the most frequent category from the training histogram for categorical features.

The mode is derived from the training-time histogram (the category with the highest count). Because the mode was seen during training, this chains safely into label_encoder and one_hot_encoder without producing unseen-category fallback values. If no histogram is available (statistics not computed), NaN values are left unchanged.

PARAMETER DESCRIPTION
feature

Categorical feature with NaN values to fill.

TYPE: pd.Series

statistics

Training statistics providing the histogram; populated automatically.

TYPE: TransformationStatistics DEFAULT: feature_statistics

RETURNS DESCRIPTION
pd.Series

Series of strings with NaN replaced by the most frequent training category.
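
As an illustration of deriving the mode from a histogram, the sketch below assumes the histogram is available as a dict mapping category to count. That shape, and the helper names, are assumptions made for the example; the real statistics object may expose the histogram differently.

```python
def mode_from_histogram(histogram):
    """Return the category with the highest count, or None if there is no histogram."""
    if not histogram:
        return None
    return max(histogram, key=histogram.get)


def fill_with_mode(values, histogram):
    """Replace missing (None) entries with the training mode."""
    mode = mode_from_histogram(histogram)
    if mode is None:
        return list(values)  # no histogram: leave missing values unchanged
    return [mode if v is None else v for v in values]


filled = fill_with_mode(["a", None, "b"], {"a": 3, "b": 7})
print(filled)  # ['a', 'b', 'b']

unchanged = fill_with_mode([None], {})
print(unchanged)  # [None]
```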

label_encoder #

label_encoder(
    feature: pd.Series,
    statistics: TransformationStatistics = feature_statistics,
) -> pd.Series

log_transform #

log_transform(feature: pd.Series) -> pd.Series

Apply natural logarithm to a numeric feature.

Only strictly positive values are transformed: y = ln(x) if x > 0 else nan. This is useful to reduce skewness or model exponential relationships.

PARAMETER DESCRIPTION
feature

Numeric feature with positive values to transform.

TYPE: pd.Series

RETURNS DESCRIPTION
pd.Series

Series of float64 values with natural log applied; non-positive inputs become NaN.
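
The rule y = ln(x) if x > 0 else nan can be written directly in plain Python. `safe_log` is a hypothetical name and the input is a list rather than a pd.Series.

```python
import math


def safe_log(values):
    """Natural log for strictly positive inputs; everything else becomes NaN."""
    # NaN inputs also fail the `v > 0` test and therefore stay NaN.
    return [math.log(v) if v > 0 else math.nan for v in values]


out = safe_log([math.e, 1.0, 0.0, -3.0])
print(out[1])              # 0.0
print(math.isnan(out[2]))  # True (zero is not strictly positive)
print(math.isnan(out[3]))  # True (negative input)
```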

min_max_scaler #

min_max_scaler(
    feature: pd.Series,
    statistics: TransformationStatistics = feature_statistics,
) -> pd.Series

one_hot_encoder #

one_hot_encoder(
    feature: pd.Series,
    statistics: TransformationStatistics = feature_statistics,
) -> pd.Series

Encode a categorical feature as a boolean one-hot DataFrame.

Creates one boolean column per category seen during training. Categories absent from training statistics are encoded as all-False. Output columns are sorted alphabetically for consistent ordering.

PARAMETER DESCRIPTION
feature

Categorical feature to encode.

TYPE: pd.Series

statistics

Training statistics providing the set of known categories; populated automatically.

TYPE: TransformationStatistics DEFAULT: feature_statistics

RETURNS DESCRIPTION
pd.Series

Boolean DataFrame with one column per category seen during training.
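
A small sketch of the encoding rule, using a list of dicts in place of a boolean DataFrame. `one_hot` is a hypothetical helper; it only mirrors the two properties stated above (alphabetical columns, all-False for unseen categories).

```python
def one_hot(values, known_categories):
    """One boolean entry per known category; unseen values get all-False rows."""
    columns = sorted(known_categories)  # alphabetical, for a stable column order
    return [{c: (v == c) for c in columns} for v in values]


rows = one_hot(["red", "blue", "purple"], {"red", "green", "blue"})
print(list(rows[0]))  # ['blue', 'green', 'red']
print(rows[0])        # {'blue': False, 'green': False, 'red': True}
print(rows[2])        # {'blue': False, 'green': False, 'red': False}
```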

quantile_binner #

quantile_binner(
    feature: pd.Series,
    statistics: TransformationStatistics = feature_statistics,
) -> pd.Series

Discretize numeric values using quantile-based boundaries from training statistics.

Default quantiles are quartiles (0%, 25%, 50%, 75%, 100%). Creates up to 4 bins based on Q1, Q2 (median), and Q3. NaN inputs remain NaN.

PARAMETER DESCRIPTION
feature

Numeric feature to discretize.

TYPE: pd.Series

statistics

Training statistics providing quartile percentiles; populated automatically.

TYPE: TransformationStatistics DEFAULT: feature_statistics

RETURNS DESCRIPTION
pd.Series

Series of nullable integer bin indices from 0 to 3.

quantile_transformer #

quantile_transformer(
    feature: pd.Series,
    statistics: TransformationStatistics = feature_statistics,
) -> pd.Series

Transform features using quantile information to map to a uniform [0, 1] distribution.

Maps the input feature to a uniform distribution using percentiles computed during training: each value is mapped to its quantile position in the training distribution. This is useful for normalizing non-Gaussian distributions, and outliers are mapped to the edges of the [0, 1] interval. The output range is [0, 1], where 0 = minimum, 0.5 = median, 1 = maximum.

PARAMETER DESCRIPTION
feature

Numeric feature to transform.

TYPE: pd.Series

statistics

Training statistics providing percentile values; populated automatically.

TYPE: TransformationStatistics DEFAULT: feature_statistics

RETURNS DESCRIPTION
pd.Series

Series of float64 values in [0, 1] representing the quantile position of each input.
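
One way to picture the mapping is linear interpolation between evenly spaced training percentiles. This is a sketch under assumptions, not the library's algorithm: `quantile_position` is hypothetical, and the percentiles are assumed to be a sorted list running from the training minimum to the maximum.

```python
import math
from bisect import bisect_right


def quantile_position(x, percentiles):
    """Map x to [0, 1] by interpolating within sorted training percentiles."""
    if math.isnan(x):
        return math.nan
    if x <= percentiles[0]:
        return 0.0  # at or below the training minimum
    if x >= percentiles[-1]:
        return 1.0  # at or above the training maximum
    n = len(percentiles) - 1
    hi = bisect_right(percentiles, x)  # first index whose value exceeds x
    lo = hi - 1
    frac = (x - percentiles[lo]) / (percentiles[hi] - percentiles[lo])
    return (lo + frac) / n


# Five evenly spaced percentiles: 0%, 25%, 50%, 75%, 100%.
pcts = [0.0, 10.0, 20.0, 30.0, 40.0]
print(quantile_position(20.0, pcts))  # 0.5 (the median)
print(quantile_position(15.0, pcts))  # 0.375
print(quantile_position(99.0, pcts))  # 1.0 (outlier mapped to the edge)
```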

rank_normalizer #

rank_normalizer(
    feature: pd.Series,
    statistics: TransformationStatistics = feature_statistics,
) -> pd.Series

Replace each value with its percentile rank in the training distribution.

Assigns each value a rank between 0 and 1 based on its position in the sorted training data distribution. The rank represents the fraction of training values that are less than or equal to the given value. Robust to outliers (outliers get ranks near 0 or 1) and preserves the relative ordering of values. Output range is [0, 1] where 0 = at or below minimum, 1 = at or above maximum.

PARAMETER DESCRIPTION
feature

Numeric feature to rank.

TYPE: pd.Series

statistics

Training statistics providing percentile values; populated automatically.

TYPE: TransformationStatistics DEFAULT: feature_statistics

RETURNS DESCRIPTION
pd.Series

Series of float64 values in [0, 1] representing the percentile rank of each input.

robust_scaler #

robust_scaler(
    feature: pd.Series,
    statistics: TransformationStatistics = feature_statistics,
) -> pd.Series

Robust scaling using median and IQR.

Scales a feature by removing the median and dividing by the interquartile range (IQR = Q3 - Q1). This makes the transformation robust to outliers. If the IQR is zero (constant feature), the function only centers the data on the median and skips scaling, which avoids division by zero.

PARAMETER DESCRIPTION
feature

Numeric feature to scale.

TYPE: pd.Series

statistics

Training statistics providing percentile values; populated automatically.

TYPE: TransformationStatistics DEFAULT: feature_statistics

RETURNS DESCRIPTION
pd.Series

Series of float64 values centered on the median and scaled by IQR.
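
The formula (x - median) / IQR and its zero-IQR fallback can be sketched as follows. `robust_scale` is a hypothetical helper operating on a list; the quartiles are passed in explicitly instead of being read from a statistics object.

```python
def robust_scale(values, q1, median, q3):
    """Center on the median and divide by the IQR; only center when the IQR is zero."""
    iqr = q3 - q1
    if iqr == 0:
        # Constant feature: skip scaling to avoid division by zero.
        return [v - median for v in values]
    return [(v - median) / iqr for v in values]


scaled = robust_scale([10.0, 20.0, 40.0], q1=10.0, median=20.0, q3=30.0)
print(scaled)  # [-0.5, 0.0, 1.0]

centered = robust_scale([5.0, 7.0], q1=5.0, median=5.0, q3=5.0)
print(centered)  # [0.0, 2.0]
```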

standard_scaler #

standard_scaler(
    feature: pd.Series,
    statistics: TransformationStatistics = feature_statistics,
) -> pd.Series

top_k_categorical_binner #

top_k_categorical_binner(
    feature: pd.Series,
    statistics: TransformationStatistics = feature_statistics,
    context: dict | None = None,
) -> pd.Series

Bin categorical features by grouping rare categories into an "Other" bucket.

Groups low-frequency categories together based on training data frequencies. Keeps the top_n most frequent categories and maps all others to a single label; categories unseen during training are treated as rare and grouped as well. Useful for high-cardinality categorical features, to reduce dimensionality and help prevent overfitting. NaN values are preserved as NaN.

Configure via context: "top_n" sets the number of categories to keep (default 10), and "other_label" sets the bucket label (default "Other").

PARAMETER DESCRIPTION
feature

Categorical feature to bin.

TYPE: pd.Series

statistics

Training statistics providing category frequencies; populated automatically.

TYPE: TransformationStatistics DEFAULT: feature_statistics

context

Optional configuration dict; supports "top_n" (default 10) and "other_label" (default "Other").

TYPE: dict | None DEFAULT: None

RETURNS DESCRIPTION
pd.Series

Series of strings where rare and unseen categories are replaced by other_label.

Keeping top 20 countries
tf = top_k_categorical_binner("country")
tf.transformation_context = {"top_n": 20, "other_label": "Rare"}
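
The grouping rule itself can be sketched in plain Python. `top_k_bin` is a hypothetical helper, the frequencies are assumed to be a dict of category to count, and breaking ties by category name is an assumption.

```python
def top_k_bin(values, frequencies, top_n=10, other_label="Other"):
    """Keep the top_n most frequent training categories; group the rest."""
    # Assumption: ties in frequency are broken alphabetically.
    ranked = sorted(frequencies, key=lambda c: (-frequencies[c], c))
    keep = set(ranked[:top_n])
    # Missing values (None) pass through; rare and unseen values are grouped.
    return [v if v is None or v in keep else other_label for v in values]


binned = top_k_bin(
    ["SE", "DE", "FR", "XX", None],
    {"SE": 50, "DE": 30, "FR": 2},
    top_n=2,
    other_label="Rare",
)
print(binned)  # ['SE', 'DE', 'Rare', 'Rare', None]
```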

winsorize #

winsorize(
    feature: pd.Series,
    statistics: TransformationStatistics = feature_statistics,
    context: dict | None = None,
) -> pd.Series

Winsorization (clipping) to limit extreme values and reduce outlier influence.

Outliers are replaced with percentile boundary values rather than having their rows removed. Defaults to the [1st, 99th] percentiles; override via context, for example {"p_low": 5, "p_high": 95}.

PARAMETER DESCRIPTION
feature

Numeric feature to clip.

TYPE: pd.Series

statistics

Training statistics providing percentile values; populated automatically.

TYPE: TransformationStatistics DEFAULT: feature_statistics

context

Optional configuration dict; supports "p_low" and "p_high" as percentile indices (defaults 1 and 99).

TYPE: dict | None DEFAULT: None

RETURNS DESCRIPTION
pd.Series

Series of float64 values with extremes clipped to the specified percentile boundaries.
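
Clipping to percentile boundaries can be sketched as below. The example assumes the training percentiles are a list of 101 values indexed 0 to 100, which is one reading of "percentile indices"; `winsorize_values` is a hypothetical helper, not the library function.

```python
def winsorize_values(values, percentiles, p_low=1, p_high=99):
    """Clip each value to the [p_low, p_high] training percentile boundaries."""
    low, high = percentiles[p_low], percentiles[p_high]
    return [min(max(v, low), high) for v in values]


# Toy training percentiles where the k-th percentile equals k.
pcts = [float(i) for i in range(101)]

clipped = winsorize_values([-10.0, 50.0, 500.0], pcts)
print(clipped)  # [1.0, 50.0, 99.0]

tighter = winsorize_values([-10.0, 50.0, 500.0], pcts, p_low=5, p_high=95)
print(tighter)  # [5.0, 50.0, 95.0]
```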