Conversation
…eature-modality-dict
…PFN into ben/refactor-svd-transform
…PFN into ben/refactor-svd-transform
There was a problem hiding this comment.
Pull request overview
This pull request refactors SVD feature processing from being embedded in ReshapeFeatureDistributionsStep into a separate AddSVDFeaturesStep, improving the modularity and maintainability of the preprocessing pipeline. The refactoring also removes the "scaler" option from the global_transformer_name configuration.
Changes:
- Introduces
AddSVDFeaturesStepas a dedicated preprocessing step for SVD feature generation - Extracts utility functions (
make_standard_scaler_safe,add_safe_standard_to_safe_power_without_standard) to a newsteps/utils.pymodule - Removes SVD-related functionality from
ReshapeFeatureDistributionsStep - Updates
pipeline_factory.pyto includeAddSVDFeaturesStepin the preprocessing pipeline when a global transformer is configured - Removes "scaler" as a valid option for
global_transformer_nameinPreprocessorConfig
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
src/tabpfn/preprocessing/steps/add_svd_features_step.py |
New step that adds SVD features to the data, extracted from ReshapeFeatureDistributionsStep |
src/tabpfn/preprocessing/steps/utils.py |
New utility module containing shared transformer utility functions |
src/tabpfn/preprocessing/steps/reshape_feature_distribution_step.py |
Removed SVD-related code and global transformer functionality; utility functions moved to utils.py |
src/tabpfn/preprocessing/pipeline_factory.py |
Updated to add AddSVDFeaturesStep to the pipeline when global transformer is configured |
src/tabpfn/preprocessing/configs.py |
Removed "scaler" from valid global_transformer_name options |
tests/test_preprocessing/test_add_svd_features_step.py |
Comprehensive test suite for the new AddSVDFeaturesStep |
tests/test_preprocessing/test_reshape_feature_distribution_step.py |
Removed obsolete tests related to global transformers |
tests/test_preprocessing/test_add_fingerprint_features_step.py |
Added integration test for pipeline usage |
changelog/768.deprecated.md |
Documents removal of "scaler" option |
Comments suppressed due to low confidence (2)
src/tabpfn/preprocessing/steps/reshape_feature_distribution_step.py:96
- The docstring still mentions global_transformer_name and SVD features (lines 87-88, 91-95) which are no longer handled by this step after the refactoring. These references should be removed since SVD processing is now handled by the separate AddSVDFeaturesStep.
class ReshapeFeatureDistributionsStep(PreprocessingStep):
"""Reshape feature distributions using various transformations.
This step should receive ALL columns (not modality-sliced) because it:
1. Handles feature subsampling when too many features exist
2. Applies different logic based on `apply_to_categorical` flag
3. Can append transformed features to originals (`append_to_original`)
# TODO(ben): Add separate PreprocessingStep's for all of the above
# so that we can register this with modalities
When using with PreprocessingPipeline, register as a bare step (no modalities):
pipeline = PreprocessingPipeline(steps=[ReshapeFeatureDistributionsStep()])
Configuration options:
- transform_name: The transformation to apply (e.g., "squashing_scaler_default",
"quantile_uni_coarse")
- apply_to_categorical: Whether to transform categorical columns too
- append_to_original: If True, keep original and append transformed as new
columns
- max_features_per_estimator: Subsample features if above this threshold
- global_transformer_name: Optional global transform like "svd" that adds
features
Output column ordering:
- With append_to_original=True: [original_cols, transformed_cols, (svd_cols)]
- With append_to_original=False, apply_to_categorical=False:
[categorical_passthrough, numerical_transformed, (svd_cols)]
- With append_to_original=False, apply_to_categorical=True:
[all_transformed, (svd_cols)]
"""
src/tabpfn/preprocessing/steps/reshape_feature_distribution_step.py:260
- The comment mentions SVD features but SVD processing has been moved to AddSVDFeaturesStep. This comment should be removed as it's no longer accurate after the refactoring.
# Build the new metadata with updated categorical indices
# Non-categorical indices become numerical
# SVD features are numerical and appended at the end
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
src/tabpfn/preprocessing/steps/reshape_feature_distribution_step.py
Outdated
Show resolved
Hide resolved
|
|
||
|
|
||
| class AddSVDFeaturesStep(PreprocessingStep): | ||
| """Adds SVD features to the data.""" |
There was a problem hiding this comment.
@bejaeger can we elaborate slightly more the docstrings? What we do here is add, on top of the raw X, also more features that are just a compressed version of them? If so - I get it for numerical features, but a bit weird for categorical, and non-applicable for text?
| ( | ||
| "save_standard", | ||
| make_standard_scaler_safe( | ||
| ("standard", StandardScaler(with_mean=False)), |
There was a problem hiding this comment.
@bejaeger at this step we expect many columns to already be normalized, right? and this is general is to learn a more balanced projection, but unrelated to the original features?
| n_samples, | ||
| n_features, | ||
| ) | ||
| return next( |
There was a problem hiding this comment.
@bejaeger I find this syntax hard to read. Why do we have next here? I'm confused
| max_features_per_estimator: int = 500 | ||
| global_transformer_name: ( | ||
| Literal[ | ||
| "scaler", |
There was a problem hiding this comment.
What did scaler mean here?
Refactors SVD feature processing so that it is a separate step in the preprocessing pipeline.
Public API Changes
The "scaler" option for the preprocessing config
global_transformer_namehas been removed.