Skip to content

text-na-bug-fix#242

Merged
noahho merged 1 commit intoPriorLabs:mainfrom
anuragg1209:main
Mar 18, 2025
Merged

text-na-bug-fix#242
noahho merged 1 commit intoPriorLabs:mainfrom
anuragg1209:main

Conversation

@anuragg1209
Copy link
Contributor

Hi,

This Pull Request fixes bug #138 without requiring additional dependencies like skrub or autogluon.features.

@noahho noahho requested a review from Copilot March 13, 2025 16:09
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This pull request fixes bug #138 by improving the handling of text columns with missing (NA) values without introducing new dependencies. Key changes include:

  • Adding a new test (test_classifier_with_text_and_na) to verify proper handling of text and NA values.
  • Introducing the _process_text_na_dataframe helper function in utils.py for consistent text/NA preprocessing.
  • Updating both the regressor and classifier modules to use _process_text_na_dataframe for transforming input data.

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

Show a summary per file
File Description
tests/test_classifier_interface.py Added a new test to verify handling of text columns with NA values.
src/tabpfn/utils.py Introduced the _process_text_na_dataframe function for NA handling.
src/tabpfn/regressor.py Replaced direct ordinal encoder transformation with the new helper.
src/tabpfn/constants.py Added NA_PLACEHOLDER constant.
src/tabpfn/classifier.py Updated fit and predict_proba to use the new text/NA processing.
Comments suppressed due to low confidence (3)

tests/test_classifier_interface.py:402

  • [nitpick] Consider adding additional assertions for edge cases (e.g., empty strings or columns with only NA values) to ensure the new text/NA handling is robust.
def test_classifier_with_text_and_na() -> None:

src/tabpfn/regressor.py:469

  • [nitpick] Ensure that _process_text_na_dataframe correctly handles cases where there are no string columns (or when the DataFrame already contains only numeric types) to avoid any unintended type conversions.
X = _process_text_na_dataframe(X, ord_encoder=ord_encoder, fit_encoder=True)  # type: ignore

src/tabpfn/classifier.py:546

  • [nitpick] Confirm that using process_text_na_dataframe in place of self.preprocessor.transform(X) preserves the ordinal encoding appropriately, especially for inputs with mixed text and NA values.
X = _process_text_na_dataframe(X, ord_encoder=self.preprocessor_)

Copy link
Collaborator

@noahho noahho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well done, this is great code and fixes the issue!

@noahho
Copy link
Collaborator

noahho commented Mar 13, 2025

Let's wait with the merge until we have the consistency check PR merged.

@LeoGrin
Copy link
Collaborator

LeoGrin commented Mar 14, 2025

Thanks @anuragg1209, LGTM! As it would also be solved by using Skrub (or probably autogluon.features), some thoughts on whether we want to:

Option 1: Use Skrub

Pros:

Fix other issues like #163 and probably some other edge cases. I think it also enable polars dataframe support (to check).

Cons:

  • Force us to bump the minimum version for some package (but nothing too recent)
  • One more dependency, and a dependency which is not so common, which can be an issue for people in companies where they only have access to a preinstalled list of packages (not sure, they do have to install tabpfn anyway so maybe it doesn't really make a difference?).

Option 2: Vendor Skrub

Would fix the dependency issue, not sure how annoying this is. Probably increases maintenance burden.

Option 3: Don't use Skrub, and do the few things we need ourselves.

I think we mostly need for now: NaNs in string fix (this PR), support datetime and bool dtypes, (polar support?).

WDYT? I think I'm leaning toward option 1 but not quite sure. In particular, I would like to understand better if the dependency will be an issue for some people.

@noahho noahho merged commit ad57da0 into PriorLabs:main Mar 18, 2025
8 checks passed
oscarkey added a commit that referenced this pull request Nov 12, 2025
…2.5 checkpoint update. (#242)

* Record copied public PR 604

* Update classifier consistency test values due to v2.5 checkpoint update. (#604)

(cherry picked from commit 0ef85eb)

---------

Co-authored-by: mirror-bot <mirror-bot@users.noreply.github.com>
Co-authored-by: Benjamin Jaeger <jaeger.benjamin7@gmail.com>
Co-authored-by: Oscar Key <oscar@priorlabs.ai>
liu-qingyuan pushed a commit to liu-qingyuan/TabPFN that referenced this pull request Nov 24, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants