hsfs.builtin_transformations #
equal_frequency_binner #
equal_frequency_binner(
feature: pd.Series,
statistics: TransformationStatistics = feature_statistics,
) -> pd.Series
Discretize numeric values into equal-frequency bins using training quartiles.
Uses Q1/Q2/Q3 percentiles as boundaries to form up to 4 bins. If quartiles have duplicates (constant regions), fewer bins are created. NaN inputs remain NaN.
| PARAMETER | DESCRIPTION |
|---|---|
| feature | Numeric feature to discretize. |
| statistics | Training statistics providing quartile percentiles; populated automatically. |
| RETURNS | DESCRIPTION |
|---|---|
| pd.Series | Series of nullable integer bin indices from 0 to 3. |
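The quartile-based binning rule described above can be sketched on plain Python lists. This is a minimal illustration, not the library's implementation: `quartile_bin` is a hypothetical helper, the quartiles are passed in by hand rather than read from training statistics, and placing a value that equals a boundary into the upper bin is an assumption.

```python
import math
from bisect import bisect_right


def quartile_bin(values, q1, q2, q3):
    """Assign each value a bin index using deduplicated quartile boundaries."""
    # Duplicate quartiles collapse into a single boundary, giving fewer bins.
    boundaries = sorted({q1, q2, q3})
    bins = []
    for v in values:
        if v is None or math.isnan(v):
            bins.append(None)  # missing inputs stay missing
        else:
            # The number of boundaries <= v is the bin index (assumed convention).
            bins.append(bisect_right(boundaries, v))
    return bins


print(quartile_bin([1.0, 12.0, 25.0, 40.0, float("nan")], q1=10, q2=20, q3=30))
# [0, 1, 2, 3, None]

# Q1 == Q2 leaves only two distinct boundaries, so at most three bins remain.
print(quartile_bin([1.0, 12.0, 40.0], q1=10, q2=10, q3=30))
# [0, 1, 2]
```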
equal_width_binner #
equal_width_binner(
feature: pd.Series,
statistics: TransformationStatistics = feature_statistics,
context: dict | None = None,
) -> pd.Series
Discretize numeric values into equal-width bins using training min/max.
Default number of bins is 10, configurable via context["n_bins"]. Values below min are placed in the first bin; values above max in the last bin. NaN inputs remain NaN.
| PARAMETER | DESCRIPTION |
|---|---|
| feature | Numeric feature to discretize. |
| statistics | Training statistics providing min and max; populated automatically. |
| context | Optional configuration dict; supports "n_bins" (number of bins, default 10). |
| RETURNS | DESCRIPTION |
|---|---|
| pd.Series | Series of nullable integer bin indices starting from 0. |
Using 20 bins
tf = equal_width_binner("feature")
tf.transformation_context = {"n_bins": 20}
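To make the clamping behaviour concrete, here is a dependency-free sketch of equal-width binning. `equal_width_bin` is a hypothetical helper, not the library function; the training min and max are supplied by hand, and sending everything to bin 0 when min equals max is an assumption the documentation does not cover.

```python
import math


def equal_width_bin(values, lo, hi, n_bins=10):
    """Map values to bin indices 0..n_bins-1 over the training range [lo, hi]."""
    width = (hi - lo) / n_bins
    out = []
    for v in values:
        if v is None or math.isnan(v):
            out.append(None)  # missing inputs stay missing
            continue
        if width == 0:
            out.append(0)  # assumed handling of a constant training feature
            continue
        idx = math.floor((v - lo) / width)
        # Clamp: values below min go to the first bin, above max to the last.
        out.append(min(max(idx, 0), n_bins - 1))
    return out


values = [-5.0, 0.0, 37.0, 100.0, 250.0, float("nan")]
print(equal_width_bin(values, lo=0.0, hi=100.0))
# [0, 0, 3, 9, 9, None]
print(equal_width_bin(values, lo=0.0, hi=100.0, n_bins=20))
# [0, 0, 7, 19, 19, None]
```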
impute_category #
Replace NaN values with a sentinel category string for categorical features.
The sentinel defaults to "__MISSING__" and can be overridden via context["value"].
Encoder chaining order matters
Downstream encoders (label_encoder, one_hot_encoder) whose statistics were computed before imputation treat the sentinel as an unseen category and encode it as -1 (label_encoder) or all-False (one_hot_encoder). To get a dedicated encoding for the missing category, compute the encoder statistics on already-imputed training data.
| PARAMETER | DESCRIPTION |
|---|---|
| feature | Categorical feature with NaN values to fill. |
| context | Optional configuration dict; supports "value" (sentinel string, default "__MISSING__"). |
| RETURNS | DESCRIPTION |
|---|---|
| pd.Series | Series of strings with NaN replaced by the configured sentinel value. |
Using a custom sentinel
tf = impute_category("country")
tf.transformation_context = {"value": "Unknown"}
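The chaining caveat can be illustrated with a toy label encoder. Everything here is a hypothetical stand-in written for this sketch: `impute`, `fit_label_map`, and `encode` are not library functions, and assigning labels in sorted order is an assumption. Only the fallback of -1 for unseen categories is taken from the description above.

```python
SENTINEL = "__MISSING__"


def impute(values, sentinel=SENTINEL):
    """Replace missing entries (None here) with the sentinel string."""
    return [sentinel if v is None else v for v in values]


def fit_label_map(values):
    """Toy encoder statistics: one integer per distinct non-missing category."""
    categories = sorted({v for v in values if v is not None})
    return {cat: i for i, cat in enumerate(categories)}


def encode(values, label_map):
    """Unseen categories fall back to -1."""
    return [label_map.get(v, -1) for v in values]


train = ["DE", None, "SE"]

# Encoder fitted on raw data: the sentinel was never seen, so it maps to -1.
print(encode(impute(train), fit_label_map(train)))
# [0, -1, 1]

# Encoder fitted on already-imputed data: the sentinel gets its own label.
print(encode(impute(train), fit_label_map(impute(train))))
# [0, 2, 1]
```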
impute_constant #
Replace NaN values with a constant numeric fill value for numeric features.
The fill value is taken from context["value"] (default: 0.0).
| PARAMETER | DESCRIPTION |
|---|---|
| feature | Numeric feature with NaN values to fill. |
| context | Optional configuration dict; supports "value" (fill value, default 0.0). |
| RETURNS | DESCRIPTION |
|---|---|
| pd.Series | Series of float64 values with NaN replaced by the configured constant. |
Using a custom fill value
tf = impute_constant("age")
tf.transformation_context = {"value": -1.0}
impute_mean #
impute_mean(
feature: pd.Series,
statistics: TransformationStatistics = feature_statistics,
) -> pd.Series
Replace NaN values with the training mean for numeric features.
If the training mean is itself NaN (no non-null training data), NaN values are left unchanged.
| PARAMETER | DESCRIPTION |
|---|---|
| feature | Numeric feature with NaN values to fill. |
| statistics | Training statistics providing the mean; populated automatically. |
| RETURNS | DESCRIPTION |
|---|---|
| pd.Series | Series of float64 values with NaN replaced by the training mean. |
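The fallback for a NaN training mean is easy to miss, so here is a small sketch of the rule on plain floats. `impute_with` is a hypothetical helper, and the training mean is passed in directly rather than taken from statistics.

```python
import math


def impute_with(values, fill):
    """Replace NaN entries with fill; leave them unchanged if fill is NaN."""
    if math.isnan(fill):
        return list(values)  # no usable training statistic, nothing to fill
    return [fill if math.isnan(v) else v for v in values]


data = [1.0, float("nan"), 3.0]
print(impute_with(data, fill=2.0))
# [1.0, 2.0, 3.0]

# With a NaN training mean the missing value is passed through untouched.
unchanged = impute_with(data, fill=float("nan"))
print(math.isnan(unchanged[1]))
# True
```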
impute_median #
impute_median(
feature: pd.Series,
statistics: TransformationStatistics = feature_statistics,
) -> pd.Series
Replace NaN values with the training median (50th percentile) for numeric features.
If the training median is NaN (no non-null training data), NaN values are left unchanged.
| PARAMETER | DESCRIPTION |
|---|---|
| feature | Numeric feature with NaN values to fill. |
| statistics | Training statistics providing percentiles; populated automatically. |
| RETURNS | DESCRIPTION |
|---|---|
| pd.Series | Series of float64 values with NaN replaced by the training median. |
impute_mode #
impute_mode(
feature: pd.Series,
statistics: TransformationStatistics = feature_statistics,
) -> pd.Series
Replace NaN values with the most frequent category from the training histogram for categorical features.
The mode is derived from the training-time histogram (the category with the highest count). Because the mode was seen during training, this chains safely into label_encoder and one_hot_encoder without producing unseen-category fallback values. If no histogram is available (statistics not computed), NaN values are left unchanged.
| PARAMETER | DESCRIPTION |
|---|---|
| feature | Categorical feature with NaN values to fill. |
| statistics | Training statistics providing the histogram; populated automatically. |
| RETURNS | DESCRIPTION |
|---|---|
| pd.Series | Series of strings with NaN replaced by the most frequent training category. |
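As a rough sketch of how a mode can be read off a histogram, the example below models the training histogram as a plain dict of category counts. That representation and both helper names are assumptions made for illustration; how the library breaks ties between equally frequent categories is not stated in the description, and here the first one encountered wins.

```python
def mode_from_histogram(histogram):
    """Return the category with the highest count, or None if no histogram."""
    if not histogram:
        return None
    return max(histogram, key=histogram.get)


def impute_mode_sketch(values, histogram):
    """Fill missing entries (None here) with the training mode, if available."""
    mode = mode_from_histogram(histogram)
    if mode is None:
        return list(values)  # statistics not computed: leave values unchanged
    return [mode if v is None else v for v in values]


histogram = {"red": 5, "blue": 12, "green": 3}
print(impute_mode_sketch(["red", None, "green"], histogram))
# ['red', 'blue', 'green']
print(impute_mode_sketch(["red", None], {}))
# ['red', None]
```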
label_encoder #
label_encoder(
feature: pd.Series,
statistics: TransformationStatistics = feature_statistics,
) -> pd.Series
log_transform #
Apply natural logarithm to a numeric feature.
Only strictly positive values are transformed: y = ln(x) if x > 0 else nan. This is useful to reduce skewness or model exponential relationships.
| PARAMETER | DESCRIPTION |
|---|---|
| feature | Numeric feature with positive values to transform. |
| RETURNS | DESCRIPTION |
|---|---|
| pd.Series | Series of float64 values with natural log applied; non-positive inputs become NaN. |
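The rule y = ln(x) if x > 0 else nan can be written out directly. This is a list-based sketch with a hypothetical helper name, not the library function, which operates on a pd.Series.

```python
import math


def log_transform_sketch(values):
    """Natural log for strictly positive inputs, NaN for everything else."""
    # A NaN input also fails the x > 0 test, so it stays NaN.
    return [math.log(x) if x > 0 else float("nan") for x in values]


result = log_transform_sketch([1.0, 100.0, 0.0, -3.0])
print(result[0])              # 0.0, since ln(1) = 0
print(math.isnan(result[2]))  # True: zero is not strictly positive
print(math.isnan(result[3]))  # True: negative input
```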
min_max_scaler #
min_max_scaler(
feature: pd.Series,
statistics: TransformationStatistics = feature_statistics,
) -> pd.Series
one_hot_encoder #
one_hot_encoder(
feature: pd.Series,
statistics: TransformationStatistics = feature_statistics,
) -> pd.Series
Encode a categorical feature as a boolean one-hot DataFrame.
Creates one boolean column per category seen during training. Categories absent from training statistics are encoded as all-False. Output columns are sorted alphabetically for consistent ordering.
| PARAMETER | DESCRIPTION |
|---|---|
| feature | Categorical feature to encode. |
| statistics | Training statistics providing the set of known categories; populated automatically. |
| RETURNS | DESCRIPTION |
|---|---|
| pd.Series | Boolean DataFrame with one column per category seen during training. |
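The encoding rule can be sketched with each output row represented as a dict of booleans. The helper name and the row representation are assumptions made for this illustration; the library returns a DataFrame.

```python
def one_hot_sketch(values, known_categories):
    """One boolean per training category; columns in alphabetical order."""
    columns = sorted(known_categories)
    return [{col: (v == col) for col in columns} for v in values]


rows = one_hot_sketch(["b", "a", "z"], known_categories={"b", "a"})
for row in rows:
    print(row)
# {'a': False, 'b': True}
# {'a': True, 'b': False}
# {'a': False, 'b': False}   <- "z" was not seen in training
```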
quantile_binner #
quantile_binner(
feature: pd.Series,
statistics: TransformationStatistics = feature_statistics,
) -> pd.Series
Discretize numeric values using quantile-based boundaries from training statistics.
Default quantiles are quartiles (0%, 25%, 50%, 75%, 100%). Creates up to 4 bins based on Q1, Q2 (median), and Q3. NaN inputs remain NaN.
| PARAMETER | DESCRIPTION |
|---|---|
| feature | Numeric feature to discretize. |
| statistics | Training statistics providing quartile percentiles; populated automatically. |
| RETURNS | DESCRIPTION |
|---|---|
| pd.Series | Series of nullable integer bin indices from 0 to 3. |
quantile_transformer #
quantile_transformer(
feature: pd.Series,
statistics: TransformationStatistics = feature_statistics,
) -> pd.Series
Transform features using quantile information to map to a uniform [0, 1] distribution.
Maps each value to its quantile position in the training distribution, using percentiles computed during training. The output lies in [0, 1], where 0 = training minimum, 0.5 = median, and 1 = maximum; outliers beyond the training range map to the edges of the interval. Useful for normalizing non-Gaussian distributions.
| PARAMETER | DESCRIPTION |
|---|---|
| feature | Numeric feature to transform. |
| statistics | Training statistics providing percentile values; populated automatically. |
| RETURNS | DESCRIPTION |
|---|---|
| pd.Series | Series of float64 values in [0, 1] representing the quantile position of each input. |
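One plausible way to compute a quantile position from stored percentiles is piecewise-linear interpolation between them. The sketch below assumes that approach, along with evenly spaced quantile levels and a hypothetical helper name; the library's actual interpolation scheme is not described here and may differ.

```python
from bisect import bisect_right


def quantile_position(x, percentiles):
    """Interpolate x's position in [0, 1] from sorted training percentiles.

    percentiles[i] is assumed to be the training value at quantile
    i / (len(percentiles) - 1).
    """
    n = len(percentiles) - 1
    if x <= percentiles[0]:
        return 0.0  # at or below the training minimum
    if x >= percentiles[-1]:
        return 1.0  # at or above the training maximum
    # Find the segment with percentiles[i] <= x < percentiles[i + 1].
    i = bisect_right(percentiles, x) - 1
    lo, hi = percentiles[i], percentiles[i + 1]
    return (i + (x - lo) / (hi - lo)) / n


# Values at the 0%, 25%, 50%, 75%, and 100% quantiles of a skewed feature.
training = [0.0, 10.0, 20.0, 30.0, 100.0]
print([quantile_position(x, training) for x in [-50.0, 5.0, 20.0, 65.0, 1000.0]])
# [0.0, 0.125, 0.5, 0.875, 1.0]
```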
rank_normalizer #
rank_normalizer(
feature: pd.Series,
statistics: TransformationStatistics = feature_statistics,
) -> pd.Series
Replace each value with its percentile rank in the training distribution.
Assigns each value a rank between 0 and 1 based on its position in the sorted training distribution. The rank is the fraction of training values less than or equal to the given value. The transformation is robust to outliers (they receive ranks near 0 or 1) and preserves the relative ordering of values. Output range is [0, 1], where 0 = at or below the training minimum and 1 = at or above the training maximum.
| PARAMETER | DESCRIPTION |
|---|---|
| feature | Numeric feature to rank. |
| statistics | Training statistics providing percentile values; populated automatically. |
| RETURNS | DESCRIPTION |
|---|---|
| pd.Series | Series of float64 values in [0, 1] representing the percentile rank of each input. |
robust_scaler #
robust_scaler(
feature: pd.Series,
statistics: TransformationStatistics = feature_statistics,
) -> pd.Series
Robust scaling using median and IQR.
Scales a feature by removing the median and dividing by the interquartile range (IQR = Q3 - Q1). This makes the transformation robust to outliers. If IQR is zero (constant feature), the function centers the data by the median without scaling to avoid division by zero.
| PARAMETER | DESCRIPTION |
|---|---|
| feature | Numeric feature to scale. |
| statistics | Training statistics providing percentile values; populated automatically. |
| RETURNS | DESCRIPTION |
|---|---|
| pd.Series | Series of float64 values centered on the median and scaled by IQR. |
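The scaling rule, including the zero-IQR fallback, fits in a few lines. The helper name is hypothetical and the quartiles are supplied by hand instead of being read from training statistics.

```python
def robust_scale(values, q1, median, q3):
    """Center on the median and divide by IQR; skip the division if IQR is 0."""
    iqr = q3 - q1
    if iqr == 0:
        # Constant feature: centering only, avoids division by zero.
        return [v - median for v in values]
    return [(v - median) / iqr for v in values]


print(robust_scale([10.0, 20.0, 30.0, 500.0], q1=15.0, median=20.0, q3=25.0))
# [-1.0, 0.0, 1.0, 48.0]
print(robust_scale([7.0, 9.0], q1=7.0, median=7.0, q3=7.0))
# [0.0, 2.0]
```

The quartiles themselves are barely affected by the outlier at 500, which is why the bulk of the data still lands in a narrow range around 0.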
standard_scaler #
standard_scaler(
feature: pd.Series,
statistics: TransformationStatistics = feature_statistics,
) -> pd.Series
top_k_categorical_binner #
top_k_categorical_binner(
feature: pd.Series,
statistics: TransformationStatistics = feature_statistics,
context: dict | None = None,
) -> pd.Series
Bin categorical features by grouping rare categories into an "Other" bucket.
Groups low-frequency categories together based on training data frequencies: the top N most frequent categories are kept and all others, including categories unseen during training, are mapped to a single label. Useful for reducing the dimensionality of high-cardinality categorical features and limiting overfitting. NaN values are preserved as NaN. Configure via context: "top_n" sets the number of categories to keep (default 10), and "other_label" sets the bucket label (default "Other").
| PARAMETER | DESCRIPTION |
|---|---|
| feature | Categorical feature to bin. |
| statistics | Training statistics providing category frequencies; populated automatically. |
| context | Optional configuration dict; supports "top_n" (default 10) and "other_label" (default "Other"). |
| RETURNS | DESCRIPTION |
|---|---|
| pd.Series | Series of strings where rare and unseen categories are replaced by the configured other_label. |
Keeping top 20 countries
tf = top_k_categorical_binner("country")
tf.transformation_context = {"top_n": 20, "other_label": "Rare"}
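The grouping logic can be sketched with training frequencies held in a plain dict. The dict representation and the helper name are assumptions for illustration, and tie-breaking between equally frequent categories is left to the sort order here because the description does not specify it.

```python
def top_k_bin(values, frequencies, top_n=10, other_label="Other"):
    """Keep the top_n most frequent training categories; bucket the rest."""
    ranked = sorted(frequencies, key=frequencies.get, reverse=True)
    keep = set(ranked[:top_n])
    # Missing values (None here) pass through; rare and unseen values are bucketed.
    return [v if v is None or v in keep else other_label for v in values]


frequencies = {"US": 50, "DE": 30, "SE": 5, "NO": 2}
print(top_k_bin(["US", "SE", None, "JP", "DE"], frequencies, top_n=2))
# ['US', 'Other', None, 'Other', 'DE']
print(top_k_bin(["SE", "NO"], frequencies, top_n=3, other_label="Rare"))
# ['SE', 'Rare']
```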
winsorize #
winsorize(
feature: pd.Series,
statistics: TransformationStatistics = feature_statistics,
context: dict | None = None,
) -> pd.Series
Winsorization (clipping) to limit extreme values and reduce outlier influence.
Outliers are replaced with percentile boundary values rather than having their rows removed. Defaults to the [1st, 99th] percentiles; override via context, for example {"p_low": 5, "p_high": 95}.
| PARAMETER | DESCRIPTION |
|---|---|
| feature | Numeric feature to clip. |
| statistics | Training statistics providing percentile values; populated automatically. |
| context | Optional configuration dict; supports "p_low" and "p_high" (percentile bounds, defaults 1 and 99). |
| RETURNS | DESCRIPTION |
|---|---|
| pd.Series | Series of float64 values with extremes clipped to the specified percentile boundaries. |
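Clipping to percentile boundaries amounts to a clamp. In this sketch the helper name is hypothetical and the two boundary values stand in for the training percentiles selected by "p_low" and "p_high"; they are made-up numbers, not computed from data.

```python
def winsorize_sketch(values, low, high):
    """Clamp each value into [low, high]; no rows are dropped."""
    return [min(max(v, low), high) for v in values]


# Suppose the training 1st percentile is 1.0 and the 99th percentile is 90.0.
print(winsorize_sketch([-100.0, 3.0, 50.0, 999.0], low=1.0, high=90.0))
# [1.0, 3.0, 50.0, 90.0]
```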