hsfs.builtin_transformations #

equal_frequency_binner #

equal_frequency_binner(
    feature: pd.Series,
    statistics: TransformationStatistics = feature_statistics,
) -> pd.Series

Discretize numeric values into equal-frequency bins using training quartiles.

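Uses the training Q1, Q2, and Q3 values (the 25th, 50th, and 75th percentiles) as boundaries to form up to 4 bins. If two or more quartiles coincide (for example, when a large share of the training data has the same value), fewer bins are created. NaN inputs remain NaN.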

PARAMETER	DESCRIPTION
`feature`	Numeric feature to discretize. TYPE: `pd.Series`
`statistics`	Training statistics providing quartile percentiles; populated automatically. TYPE: `TransformationStatistics` DEFAULT: `feature_statistics`

RETURNS	DESCRIPTION
`pd.Series`	Series of nullable integer bin indices from 0 to 3.
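
The binning rule can be sketched in plain pandas and NumPy. This is a minimal illustration, not the library's implementation: `quartile_bins_sketch` is a hypothetical helper, the quartiles are passed in by hand instead of coming from `TransformationStatistics`, and placing a value that equals a boundary in the lower bin is an assumption.

```python
import numpy as np
import pandas as pd


def quartile_bins_sketch(feature: pd.Series, q1: float, q2: float, q3: float) -> pd.Series:
    """Illustrative only: bin values by training quartiles, keeping NaN missing."""
    # Duplicate quartiles collapse into a single boundary, so fewer bins result.
    edges = np.unique([q1, q2, q3])
    values = feature.to_numpy(dtype="float64")
    # searchsorted(side="left") counts the boundaries strictly below each value.
    idx = np.searchsorted(edges, values, side="left")
    out = pd.Series(idx, index=feature.index, dtype="Int64")
    # NaN sorts past every boundary, so restore it as a missing value.
    return out.mask(feature.isna())


binned = quartile_bins_sketch(
    pd.Series([1.0, 2.5, 5.0, 9.0, np.nan]), q1=2.0, q2=4.0, q3=6.0
)
# With all three quartiles equal, only two bins (0 and 1) remain.
collapsed = quartile_bins_sketch(pd.Series([1.0, 3.0]), q1=2.0, q2=2.0, q3=2.0)
```

The nullable `Int64` dtype is what allows integer bin indices and missing values to share one Series.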

equal_width_binner #

equal_width_binner(
    feature: pd.Series,
    statistics: TransformationStatistics = feature_statistics,
    context: dict | None = None,
) -> pd.Series

Discretize numeric values into equal-width bins using training min/max.

Default number of bins is 10, configurable via context["n_bins"]. Values below min are placed in the first bin; values above max in the last bin. NaN inputs remain NaN.

PARAMETER	DESCRIPTION
`feature`	Numeric feature to discretize. TYPE: `pd.Series`
`statistics`	Training statistics providing min and max; populated automatically. TYPE: `TransformationStatistics` DEFAULT: `feature_statistics`
`context`	Optional configuration dict; supports `"n_bins"` to set the number of bins (default `10`). TYPE: `dict \| None` DEFAULT: `None`

RETURNS	DESCRIPTION
`pd.Series`	Series of nullable integer bin indices starting from 0.

Using 20 bins

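from hsfs.builtin_transformations import equal_width_binner
# Bind the transformation to the "feature" column, then override the default of 10 bins.
tf = equal_width_binner("feature")
tf.transformation_context = {"n_bins": 20}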
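
To show what the clamping of out-of-range values looks like, here is a rough sketch under stated assumptions: `equal_width_bins_sketch` is a hypothetical helper, the training min and max are passed in directly, and sending every value to bin 0 when min equals max is an assumption made only to avoid dividing by zero.

```python
import numpy as np
import pandas as pd


def equal_width_bins_sketch(
    feature: pd.Series, lo: float, hi: float, n_bins: int = 10
) -> pd.Series:
    """Illustrative only: equal-width bins over [lo, hi] with clamping."""
    width = (hi - lo) / n_bins
    values = feature.to_numpy(dtype="float64")
    valid = ~np.isnan(values)
    idx = np.zeros(len(values), dtype="int64")
    if width > 0:
        # Offset from the training minimum, in units of bin width, rounded down.
        raw = np.floor((values[valid] - lo) / width)
        # Values below lo land in the first bin, values at or above hi in the last.
        idx[valid] = np.clip(raw, 0, n_bins - 1).astype("int64")
    return pd.Series(idx, index=feature.index, dtype="Int64").mask(feature.isna())


binned20 = equal_width_bins_sketch(
    pd.Series([-3.0, 0.0, 12.5, 100.0, 250.0, np.nan]), lo=0.0, hi=100.0, n_bins=20
)
```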

impute_category #

impute_category(
    feature: pd.Series, context: dict | None = None
) -> pd.Series

Replace NaN values with a sentinel category string for categorical features.

The sentinel defaults to "__MISSING__" and can be overridden via context["value"].

Encoder chaining order matters

Downstream encoders (label_encoder, one_hot_encoder) whose statistics were computed before imputation treat this sentinel as an unseen category and encode it as -1 (label_encoder) or all-False (one_hot_encoder). To get a dedicated encoding for the missing category, compute encoder statistics on already-imputed training data.

PARAMETER	DESCRIPTION
`feature`	Categorical feature with NaN values to fill. TYPE: `pd.Series`
`context`	Optional configuration dict; supports `"value"` to override the sentinel string (default `"__MISSING__"`). TYPE: `dict \| None` DEFAULT: `None`

RETURNS	DESCRIPTION
`pd.Series`	Series of strings with NaN replaced by the configured sentinel value.

Using a custom sentinel

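from hsfs.builtin_transformations import impute_category
# Bind the transformation to the "country" column, then replace the default sentinel.
tf = impute_category("country")
tf.transformation_context = {"value": "Unknown"}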
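
The chaining caveat can be reproduced with plain pandas. In this sketch, `label_codes` and `encode` are hypothetical stand-ins for an encoder with an unseen-category fallback of -1; they are not the hsfs encoders, and the alphabetical code assignment is an assumption.

```python
import pandas as pd

SENTINEL = "__MISSING__"

train = pd.Series(["SE", "DE", None, "SE"])
imputed = train.fillna(SENTINEL)


def label_codes(categories):
    """Hypothetical encoder statistics: one integer code per known category."""
    return {cat: i for i, cat in enumerate(sorted(categories))}


def encode(series: pd.Series, mapping: dict) -> pd.Series:
    # Categories missing from the mapping fall back to -1.
    return series.map(mapping).fillna(-1).astype("int64")


# Statistics computed before imputation: the sentinel is unseen and becomes -1.
codes_before = encode(imputed, label_codes(train.dropna().unique()))
# Statistics computed on imputed data: the sentinel gets its own code.
codes_after = encode(imputed, label_codes(imputed.unique()))
```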

impute_constant #

impute_constant(
    feature: pd.Series, context: dict | None = None
) -> pd.Series

Replace NaN values with a constant numeric fill value for numeric features.

The fill value is taken from context["value"] (default: 0.0).

PARAMETER	DESCRIPTION
`feature`	Numeric feature with NaN values to fill. TYPE: `pd.Series`
`context`	Optional configuration dict; supports `"value"` to set the fill value (default `0.0`). TYPE: `dict \| None` DEFAULT: `None`

RETURNS	DESCRIPTION
`pd.Series`	Series of float64 values with NaN replaced by the configured constant.

Using a custom fill value

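from hsfs.builtin_transformations import impute_constant
# Bind the transformation to the "age" column, then replace the default fill value of 0.0.
tf = impute_constant("age")
tf.transformation_context = {"value": -1.0}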

impute_mean #

impute_mean(
    feature: pd.Series,
    statistics: TransformationStatistics = feature_statistics,
) -> pd.Series

Replace NaN values with the training mean for numeric features.

If the training mean is itself NaN (no non-null training data), NaN values are left unchanged.

PARAMETER	DESCRIPTION
`feature`	Numeric feature with NaN values to fill. TYPE: `pd.Series`
`statistics`	Training statistics providing the mean; populated automatically. TYPE: `TransformationStatistics` DEFAULT: `feature_statistics`

RETURNS	DESCRIPTION
`pd.Series`	Series of float64 values with NaN replaced by the training mean.
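
The NaN-guard described above can be sketched as follows. `impute_with_stat_sketch` is a hypothetical helper that takes the training statistic as a plain float; the same pattern would apply to a training median.

```python
import math

import pandas as pd


def impute_with_stat_sketch(feature: pd.Series, train_stat: float) -> pd.Series:
    """Illustrative only: fill NaN with a training statistic if it is usable."""
    values = feature.astype("float64")
    # A NaN statistic means there was no non-null training data: leave NaN alone.
    if math.isnan(train_stat):
        return values
    return values.fillna(train_stat)


filled = impute_with_stat_sketch(pd.Series([1.0, None, 3.0]), train_stat=2.0)
unchanged = impute_with_stat_sketch(pd.Series([1.0, None, 3.0]), train_stat=float("nan"))
```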

impute_median #

impute_median(
    feature: pd.Series,
    statistics: TransformationStatistics = feature_statistics,
) -> pd.Series

Replace NaN values with the training median (50th percentile) for numeric features.

If the training median is NaN (no non-null training data), NaN values are left unchanged.

PARAMETER	DESCRIPTION
`feature`	Numeric feature with NaN values to fill. TYPE: `pd.Series`
`statistics`	Training statistics providing percentiles; populated automatically. TYPE: `TransformationStatistics` DEFAULT: `feature_statistics`

RETURNS	DESCRIPTION
`pd.Series`	Series of float64 values with NaN replaced by the training median.

impute_mode #

impute_mode(
    feature: pd.Series,
    statistics: TransformationStatistics = feature_statistics,
) -> pd.Series

Replace NaN values with the most frequent category from the training histogram for categorical features.

The mode is derived from the training-time histogram (the category with the highest count). Because the mode was seen during training, this chains safely into label_encoder and one_hot_encoder without producing unseen-category fallback values. If no histogram is available (statistics not computed), NaN values are left unchanged.

PARAMETER	DESCRIPTION
`feature`	Categorical feature with NaN values to fill. TYPE: `pd.Series`
`statistics`	Training statistics providing the histogram; populated automatically. TYPE: `TransformationStatistics` DEFAULT: `feature_statistics`

RETURNS	DESCRIPTION
`pd.Series`	Series of strings with NaN replaced by the most frequent training category.
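
As a rough sketch of how a mode can be derived from a histogram, assume the histogram is available as a plain `dict` of category to count. That representation and the helper name `impute_mode_sketch` are assumptions for illustration, not the structure of `TransformationStatistics`.

```python
import pandas as pd


def impute_mode_sketch(feature: pd.Series, histogram: dict) -> pd.Series:
    """Illustrative only: fill NaN with the highest-count training category."""
    # No histogram available: leave NaN values unchanged.
    if not histogram:
        return feature
    mode = max(histogram, key=histogram.get)
    return feature.fillna(mode)


with_mode = impute_mode_sketch(pd.Series(["a", None, "b"]), {"a": 5, "b": 2})
no_stats = impute_mode_sketch(pd.Series(["a", None, "b"]), {})
```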

label_encoder #

label_encoder(
    feature: pd.Series,
    statistics: TransformationStatistics = feature_statistics,
) -> pd.Series

log_transform #

log_transform(feature: pd.Series) -> pd.Series

Apply natural logarithm to a numeric feature.

Only strictly positive values are transformed: y = ln(x) if x > 0 else nan. This is useful to reduce skewness or model exponential relationships.

PARAMETER	DESCRIPTION
`feature`	Numeric feature with positive values to transform. TYPE: `pd.Series`

RETURNS	DESCRIPTION
`pd.Series`	Series of float64 values with natural log applied; non-positive inputs become NaN.
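
The rule y = ln(x) if x > 0 else nan can be written in a few lines of pandas. This is a sketch of the rule as stated, with a hypothetical helper name, not the library source.

```python
import numpy as np
import pandas as pd


def log_transform_sketch(feature: pd.Series) -> pd.Series:
    """Illustrative only: natural log of positive values, NaN elsewhere."""
    values = feature.astype("float64")
    # where() blanks non-positive entries to NaN before the log is taken,
    # so np.log never sees zero or negative input.
    return np.log(values.where(values > 0))


logged = log_transform_sketch(pd.Series([1.0, np.e, 0.0, -2.0]))
```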

min_max_scaler #

min_max_scaler(
    feature: pd.Series,
    statistics: TransformationStatistics = feature_statistics,
) -> pd.Series

one_hot_encoder #

one_hot_encoder(
    feature: pd.Series,
    statistics: TransformationStatistics = feature_statistics,
) -> pd.Series

Encode a categorical feature as a boolean one-hot DataFrame.

Creates one boolean column per category seen during training. Categories absent from training statistics are encoded as all-False. Output columns are sorted alphabetically for consistent ordering.

PARAMETER	DESCRIPTION
`feature`	Categorical feature to encode. TYPE: `pd.Series`
`statistics`	Training statistics providing the set of known categories; populated automatically. TYPE: `TransformationStatistics` DEFAULT: `feature_statistics`

RETURNS	DESCRIPTION
`pd.Series`	Boolean DataFrame with one column per category seen during training.
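
A minimal sketch of the encoding behavior, assuming the known categories are passed in as a plain set and that output columns are named after the category itself. Both are assumptions for illustration, and `one_hot_sketch` is a hypothetical helper.

```python
import pandas as pd


def one_hot_sketch(feature: pd.Series, known_categories) -> pd.DataFrame:
    """Illustrative only: one boolean column per training category."""
    # Sorting the categories gives a stable, alphabetical column order.
    columns = sorted(known_categories)
    # A value outside known_categories matches no column, so its row is all False.
    return pd.DataFrame({cat: feature == cat for cat in columns}, index=feature.index)


encoded = one_hot_sketch(pd.Series(["b", "a", "z"]), known_categories={"b", "a"})
```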

quantile_binner #

quantile_binner(
    feature: pd.Series,
    statistics: TransformationStatistics = feature_statistics,
) -> pd.Series

Discretize numeric values using quantile-based boundaries from training statistics.

Default quantiles are quartiles (0%, 25%, 50%, 75%, 100%). Creates up to 4 bins based on Q1, Q2 (median), and Q3. NaN inputs remain NaN.

PARAMETER	DESCRIPTION
`feature`	Numeric feature to discretize. TYPE: `pd.Series`
`statistics`	Training statistics providing quartile percentiles; populated automatically. TYPE: `TransformationStatistics` DEFAULT: `feature_statistics`

RETURNS	DESCRIPTION
`pd.Series`	Series of nullable integer bin indices from 0 to 3.

quantile_transformer #

quantile_transformer(
    feature: pd.Series,
    statistics: TransformationStatistics = feature_statistics,
) -> pd.Series

Transform features using quantile information to map to a uniform [0, 1] distribution.

Maps each value to its quantile position in the training distribution, using percentiles computed during training. This is useful for normalizing non-Gaussian distributions. Outliers map to the edges of the [0, 1] interval. The output range is [0, 1], where 0 corresponds to the training minimum, 0.5 to the median, and 1 to the maximum.

PARAMETER	DESCRIPTION
`feature`	Numeric feature to transform. TYPE: `pd.Series`
`statistics`	Training statistics providing percentile values; populated automatically. TYPE: `TransformationStatistics` DEFAULT: `feature_statistics`

RETURNS	DESCRIPTION
`pd.Series`	Series of float64 values in [0, 1] representing the quantile position of each input.
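
One way to picture the mapping is linear interpolation between training percentiles. The sketch below assumes the percentiles are given as a list of values at evenly spaced quantiles from 0% to 100%; the interpolation scheme and the helper name `quantile_transform_sketch` are assumptions, not confirmed details of the implementation.

```python
import numpy as np
import pandas as pd


def quantile_transform_sketch(feature: pd.Series, percentiles) -> pd.Series:
    """Illustrative only: map values to their position among training percentiles."""
    p = np.asarray(percentiles, dtype="float64")
    # Quantile position of each percentile value: 0.0 for the first, 1.0 for the last.
    positions = np.linspace(0.0, 1.0, len(p))
    values = feature.to_numpy(dtype="float64")
    # np.interp clamps inputs outside [p[0], p[-1]] to the end positions 0.0 and 1.0.
    out = np.interp(values, p, positions)
    return pd.Series(out, index=feature.index).where(feature.notna())


uniform = quantile_transform_sketch(
    pd.Series([-5.0, 0.0, 20.0, 25.0, 40.0, 100.0, np.nan]),
    percentiles=[0.0, 10.0, 20.0, 30.0, 40.0],
)
```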

rank_normalizer #

rank_normalizer(
    feature: pd.Series,
    statistics: TransformationStatistics = feature_statistics,
) -> pd.Series

Replace each value with its percentile rank in the training distribution.

Assigns each value a rank between 0 and 1 based on its position in the sorted training data distribution. The rank represents the fraction of training values that are less than or equal to the given value. The transformation is robust to outliers (which get ranks near 0 or 1) and preserves the relative ordering of values. The output range is [0, 1], where 0 means at or below the training minimum and 1 means at or above the training maximum.

PARAMETER	DESCRIPTION
`feature`	Numeric feature to rank. TYPE: `pd.Series`
`statistics`	Training statistics providing percentile values; populated automatically. TYPE: `TransformationStatistics` DEFAULT: `feature_statistics`

RETURNS	DESCRIPTION
`pd.Series`	Series of float64 values in [0, 1] representing the percentile rank of each input.

robust_scaler #

robust_scaler(
    feature: pd.Series,
    statistics: TransformationStatistics = feature_statistics,
) -> pd.Series

Robust scaling using median and IQR.

Scales a feature by removing the median and dividing by the interquartile range (IQR = Q3 - Q1). This makes the transformation robust to outliers. If IQR is zero (constant feature), the function centers the data by the median without scaling to avoid division by zero.

PARAMETER	DESCRIPTION
`feature`	Numeric feature to scale. TYPE: `pd.Series`
`statistics`	Training statistics providing percentile values; populated automatically. TYPE: `TransformationStatistics` DEFAULT: `feature_statistics`

RETURNS	DESCRIPTION
`pd.Series`	Series of float64 values centered on the median and scaled by IQR.
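
The scaling rule, including the zero-IQR fallback, can be sketched as follows. The quartiles are passed in by hand and `robust_scale_sketch` is a hypothetical helper name.

```python
import pandas as pd


def robust_scale_sketch(feature: pd.Series, q1: float, median: float, q3: float) -> pd.Series:
    """Illustrative only: (x - median) / IQR, centering only when IQR is zero."""
    iqr = q3 - q1
    centered = feature.astype("float64") - median
    # A zero IQR would divide by zero, so fall back to centering only.
    return centered if iqr == 0 else centered / iqr


scaled = robust_scale_sketch(pd.Series([1.0, 5.0, 9.0]), q1=3.0, median=5.0, q3=7.0)
centered_only = robust_scale_sketch(pd.Series([1.0, 5.0, 9.0]), q1=5.0, median=5.0, q3=5.0)
```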

standard_scaler #

standard_scaler(
    feature: pd.Series,
    statistics: TransformationStatistics = feature_statistics,
) -> pd.Series

top_k_categorical_binner #

top_k_categorical_binner(
    feature: pd.Series,
    statistics: TransformationStatistics = feature_statistics,
    context: dict | None = None,
) -> pd.Series

Bin categorical features by grouping rare categories into an "Other" bucket.

Keeps the top N most frequent categories from the training data and maps all others, including categories unseen during training, to a single label. This is useful for high-cardinality categorical features, where it reduces dimensionality and helps prevent overfitting. NaN values are preserved as NaN. Configure via context: "top_n" sets the number of categories to keep (default 10) and "other_label" sets the bucket label (default "Other").

PARAMETER	DESCRIPTION
`feature`	Categorical feature to bin. TYPE: `pd.Series`
`statistics`	Training statistics providing category frequencies; populated automatically. TYPE: `TransformationStatistics` DEFAULT: `feature_statistics`
`context`	Optional configuration dict; supports `"top_n"` (default `10`) and `"other_label"` (default `"Other"`). TYPE: `dict \| None` DEFAULT: `None`

RETURNS	DESCRIPTION
`pd.Series`	Series of strings where rare and unseen categories are replaced by `other_label`.

Keeping top 20 countries

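from hsfs.builtin_transformations import top_k_categorical_binner
# Bind the transformation to the "country" column, keep 20 categories, and rename the bucket.
tf = top_k_categorical_binner("country")
tf.transformation_context = {"top_n": 20, "other_label": "Rare"}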
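
The grouping logic can be sketched in plain pandas, assuming the training frequencies are available as a `dict` of category to count. That representation and the helper name `top_k_sketch` are assumptions for illustration.

```python
import pandas as pd


def top_k_sketch(
    feature: pd.Series, counts: dict, top_n: int = 10, other_label: str = "Other"
) -> pd.Series:
    """Illustrative only: keep the top_n most frequent categories, bucket the rest."""
    keep = set(sorted(counts, key=counts.get, reverse=True)[:top_n])
    # Keep values that are in the top set or missing; everything else,
    # including categories never seen in training, becomes other_label.
    return feature.where(feature.isin(keep) | feature.isna(), other_label)


bucketed = top_k_sketch(
    pd.Series(["SE", "DE", "FR", "XX", None]),
    counts={"SE": 50, "DE": 30, "FR": 2},
    top_n=2,
    other_label="Rare",
)
```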

winsorize #

winsorize(
    feature: pd.Series,
    statistics: TransformationStatistics = feature_statistics,
    context: dict | None = None,
) -> pd.Series

Winsorization (clipping) to limit extreme values and reduce outlier influence.

Outliers are replaced with percentile boundary values rather than having their rows removed. The boundaries default to the 1st and 99th percentiles and can be overridden via context, for example {"p_low": 5, "p_high": 95}.

PARAMETER	DESCRIPTION
`feature`	Numeric feature to clip. TYPE: `pd.Series`
`statistics`	Training statistics providing percentile values; populated automatically. TYPE: `TransformationStatistics` DEFAULT: `feature_statistics`
`context`	Optional configuration dict; supports `"p_low"` and `"p_high"` as percentile indices (defaults `1` and `99`). TYPE: `dict \| None` DEFAULT: `None`

RETURNS	DESCRIPTION
`pd.Series`	Series of float64 values with extremes clipped to the specified percentile boundaries.
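
A minimal sketch of the clipping step, assuming the training percentiles are available as a 101-element list indexed by percentile (so index 1 holds the 1st percentile). That layout and the helper name `winsorize_sketch` are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def winsorize_sketch(
    feature: pd.Series, percentiles, p_low: int = 1, p_high: int = 99
) -> pd.Series:
    """Illustrative only: clip values to the chosen training percentiles."""
    lower, upper = percentiles[p_low], percentiles[p_high]
    # clip() replaces out-of-range values with the boundary and keeps NaN as NaN.
    return feature.astype("float64").clip(lower=lower, upper=upper)


# Toy statistics: the p-th percentile of the training data is simply p.
toy_percentiles = [float(p) for p in range(101)]
raw = pd.Series([-50.0, 0.5, 50.0, 500.0, np.nan])

default_clip = winsorize_sketch(raw, toy_percentiles)
tighter_clip = winsorize_sketch(raw, toy_percentiles, p_low=5, p_high=95)
```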