Conversation
There was a problem hiding this comment.
Pull Request Overview
This pull request fixes bug #138 by improving the handling of text columns with missing (NA) values without introducing new dependencies. Key changes include:
- Adding a new test (test_classifier_with_text_and_na) to verify proper handling of text and NA values.
- Introducing the _process_text_na_dataframe helper function in utils.py for consistent text/NA preprocessing.
- Updating both the regressor and classifier modules to use _process_text_na_dataframe for transforming input data.
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| tests/test_classifier_interface.py | Added a new test to verify handling of text columns with NA values. |
| src/tabpfn/utils.py | Introduced the _process_text_na_dataframe function for NA handling. |
| src/tabpfn/regressor.py | Replaced direct ordinal encoder transformation with the new helper. |
| src/tabpfn/constants.py | Added NA_PLACEHOLDER constant. |
| src/tabpfn/classifier.py | Updated fit and predict_proba to use the new text/NA processing. |
Comments suppressed due to low confidence (3)
tests/test_classifier_interface.py:402
- [nitpick] Consider adding additional assertions for edge cases (e.g., empty strings or columns with only NA values) to ensure the new text/NA handling is robust.
def test_classifier_with_text_and_na() -> None:
src/tabpfn/regressor.py:469
- [nitpick] Ensure that _process_text_na_dataframe correctly handles cases where there are no string columns (or when the DataFrame already contains only numeric types) to avoid any unintended type conversions.
X = _process_text_na_dataframe(X, ord_encoder=ord_encoder, fit_encoder=True) # type: ignore
src/tabpfn/classifier.py:546
- [nitpick] Confirm that using process_text_na_dataframe in place of self.preprocessor.transform(X) preserves the ordinal encoding appropriately, especially for inputs with mixed text and NA values.
X = _process_text_na_dataframe(X, ord_encoder=self.preprocessor_)
noahho
left a comment
There was a problem hiding this comment.
Well done, this is great code and fixes the issue!
|
Let's wait with the merge until we have the consistency check PR merged. |
|
Thanks @anuragg1209, LGTM! As it would also be solved by using Skrub (or probably autogluon.features), some thoughts on whether we want to: Option 1: Use SkrubPros:Fix other issues like #163 and probably some other edge cases. I think it also enable polars dataframe support (to check). Cons:
Option 2: Vendor SkrubWould fix the dependency issue, not sure how annoying this is. Probably increases maintenance burden. Option 3: Don't use Skrub, and do the few things we need ourselves.I think we mostly need for now: NaNs in string fix (this PR), support datetime and bool dtypes, (polar support?). WDYT? I think I'm leaning toward option 1 but not quite sure. In particular, I would like to understand better if the dependency will be an issue for some people. |
…2.5 checkpoint update. (#242) * Record copied public PR 604 * Update classifier consistency test values due to v2.5 checkpoint update. (#604) (cherry picked from commit 0ef85eb) --------- Co-authored-by: mirror-bot <mirror-bot@users.noreply.github.com> Co-authored-by: Benjamin Jaeger <jaeger.benjamin7@gmail.com> Co-authored-by: Oscar Key <oscar@priorlabs.ai>
text-na-bug-fix
Hi,
This Pull Request fixes bug #138 without requiring additional dependencies like
skruborautogluon.features.