Refactor SVD features preprocessing step #768

Open

bejaeger wants to merge 57 commits into main from ben/refactor-svd-transform

Conversation

Contributor

@bejaeger bejaeger commented Feb 1, 2026

Refactors SVD feature processing so that it is a separate step in the preprocessing pipeline.


Public API Changes

  • [ ] No Public API changes
  • [x] Yes, Public API changes (Details below)
    The "scaler" option for the preprocessing config global_transformer_name has been removed (see the sketch below).

@bejaeger bejaeger removed the request for review from klemens-floege February 3, 2026 15:14
@bejaeger bejaeger changed the base branch from ben/introduce-feature-modality-dict to main February 5, 2026 09:54
@bejaeger bejaeger changed the base branch from main to lg-fix-nans-categorical February 5, 2026 10:06
@bejaeger bejaeger changed the base branch from lg-fix-nans-categorical to main February 5, 2026 10:07
Copilot AI review requested due to automatic review settings February 5, 2026 10:10
Contributor

Copilot AI left a comment


Pull request overview

This pull request refactors SVD feature processing from being embedded in ReshapeFeatureDistributionsStep into a separate AddSVDFeaturesStep, improving the modularity and maintainability of the preprocessing pipeline. The refactoring also removes the "scaler" option from the global_transformer_name configuration.

Changes:

  • Introduces AddSVDFeaturesStep as a dedicated preprocessing step for SVD feature generation
  • Extracts utility functions (make_standard_scaler_safe, add_safe_standard_to_safe_power_without_standard) to a new steps/utils.py module
  • Removes SVD-related functionality from ReshapeFeatureDistributionsStep
  • Updates pipeline_factory.py to include AddSVDFeaturesStep in the preprocessing pipeline when a global transformer is configured (see the sketch after this list)
  • Removes "scaler" as a valid option for global_transformer_name in PreprocessorConfig

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| src/tabpfn/preprocessing/steps/add_svd_features_step.py | New step that adds SVD features to the data, extracted from ReshapeFeatureDistributionsStep |
| src/tabpfn/preprocessing/steps/utils.py | New utility module containing shared transformer utility functions |
| src/tabpfn/preprocessing/steps/reshape_feature_distribution_step.py | Removed SVD-related code and global transformer functionality; utility functions moved to utils.py |
| src/tabpfn/preprocessing/pipeline_factory.py | Updated to add AddSVDFeaturesStep to the pipeline when a global transformer is configured |
| src/tabpfn/preprocessing/configs.py | Removed "scaler" from valid global_transformer_name options |
| tests/test_preprocessing/test_add_svd_features_step.py | Comprehensive test suite for the new AddSVDFeaturesStep |
| tests/test_preprocessing/test_reshape_feature_distribution_step.py | Removed obsolete tests related to global transformers |
| tests/test_preprocessing/test_add_fingerprint_features_step.py | Added integration test for pipeline usage |
| changelog/768.deprecated.md | Documents removal of the "scaler" option |

Comments suppressed due to low confidence (2)

src/tabpfn/preprocessing/steps/reshape_feature_distribution_step.py:96

  • The docstring still mentions global_transformer_name and SVD features (lines 87-88, 91-95), which are no longer handled by this step after the refactoring. These references should be removed since SVD processing is now handled by the separate AddSVDFeaturesStep; a trimmed version is sketched after the excerpt below.
class ReshapeFeatureDistributionsStep(PreprocessingStep):
    """Reshape feature distributions using various transformations.

    This step should receive ALL columns (not modality-sliced) because it:
    1. Handles feature subsampling when too many features exist
    2. Applies different logic based on `apply_to_categorical` flag
    3. Can append transformed features to originals (`append_to_original`)

    # TODO(ben): Add separate PreprocessingStep's for all of the above
    # so that we can register this with modalities

    When using with PreprocessingPipeline, register as a bare step (no modalities):
        pipeline = PreprocessingPipeline(steps=[ReshapeFeatureDistributionsStep()])

    Configuration options:
        - transform_name: The transformation to apply (e.g., "squashing_scaler_default",
            "quantile_uni_coarse")
        - apply_to_categorical: Whether to transform categorical columns too
        - append_to_original: If True, keep original and append transformed as new
            columns
        - max_features_per_estimator: Subsample features if above this threshold
        - global_transformer_name: Optional global transform like "svd" that adds
            features

    Output column ordering:
        - With append_to_original=True: [original_cols, transformed_cols, (svd_cols)]
        - With append_to_original=False, apply_to_categorical=False:
            [categorical_passthrough, numerical_transformed, (svd_cols)]
        - With append_to_original=False, apply_to_categorical=True:
            [all_transformed, (svd_cols)]
    """

src/tabpfn/preprocessing/steps/reshape_feature_distribution_step.py:260

  • The comment mentions SVD features but SVD processing has been moved to AddSVDFeaturesStep. This comment should be removed as it's no longer accurate after the refactoring.
        # Build the new metadata with updated categorical indices
        # Non-categorical indices become numerical
        # SVD features are numerical and appended at the end


@bejaeger bejaeger requested review from alanprior February 5, 2026 11:39


class AddSVDFeaturesStep(PreprocessingStep):
    """Adds SVD features to the data."""
Contributor


@bejaeger can we elaborate the docstring a bit more? As I understand it, on top of the raw X we also add features that are just a compressed version of it. If so, that makes sense for numerical features, but it seems a bit odd for categorical ones and not applicable to text?
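
One possible direction for a fuller docstring (wording is only a suggestion, based on the behaviour described in this thread and in the old ReshapeFeatureDistributionsStep docstring):

```python
class AddSVDFeaturesStep(PreprocessingStep):
    """Append low-rank SVD projections of the input features as extra columns.

    A truncated SVD is fitted on the already-preprocessed feature matrix and the
    resulting components are appended to the original columns, so downstream
    steps see both the raw features and a compressed summary of them. The
    appended columns are treated as numerical features.
    """
```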

(
    "save_standard",
    make_standard_scaler_safe(
        ("standard", StandardScaler(with_mean=False)),
Contributor


@bejaeger at this step we expect many columns to already be normalized, right? So is this scaling there in general to learn a more balanced projection, rather than being tied to the original features?
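
For context, the usual pattern (a generic sklearn sketch, not this project's actual code) is to rescale columns before the truncated SVD so that a single high-variance column does not dominate the learned projection:

```python
# Generic illustration, not the project's code: variance-only scaling before a
# truncated SVD keeps high-variance columns from dominating the components.
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Five well-behaved columns plus one column with a much larger scale.
X = np.hstack([rng.normal(size=(100, 5)), 1000 * rng.normal(size=(100, 1))])

svd_features = Pipeline([
    ("standard", StandardScaler(with_mean=False)),  # with_mean=False also keeps sparse inputs sparse
    ("svd", TruncatedSVD(n_components=3, random_state=0)),
]).fit_transform(X)

print(svd_features.shape)  # (100, 3) compressed features to append to X
```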

n_samples,
n_features,
)
return next(
Contributor


@bejaeger I find this syntax hard to read. Why do we have next here? I'm confused
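
For reference, `next(...)` over a generator expression is a common way to return the first matching item with a fallback default; a generic sketch (not the project's actual code) and a more explicit equivalent:

```python
# Generic illustration of the pattern being discussed, not the project's code.
candidates = [10, 250, 600, 900]
n_features = 500

# next(...) over a generator: return the first candidate above the threshold,
# or None if nothing matches.
first_large = next((c for c in candidates if c > n_features), None)


# Equivalent, more explicit version.
def first_above(values, threshold):
    for value in values:
        if value > threshold:
            return value
    return None


assert first_large == first_above(candidates, n_features) == 600
```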

Contributor

@alanprior alanprior left a comment


GREAT!

max_features_per_estimator: int = 500
global_transformer_name: (
    Literal[
        "scaler",
Contributor


What did scaler mean here?
