text-na-bug-fix by anuragg1209 · Pull Request #242 · PriorLabs/TabPFN

anuragg1209 · 2025-03-13T03:10:33Z

Hi,

This Pull Request fixes bug #138 without requiring additional dependencies like skrub or autogluon.features.

Copilot

Pull Request Overview

This pull request fixes bug #138 by improving the handling of text columns with missing (NA) values without introducing new dependencies. Key changes include:

Adding a new test (test_classifier_with_text_and_na) to verify proper handling of text and NA values.
Introducing the _process_text_na_dataframe helper function in utils.py for consistent text/NA preprocessing.
Updating both the regressor and classifier modules to use _process_text_na_dataframe for transforming input data.
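
For readers without the diff at hand, the idea behind such a helper can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the function name `process_text_na_sketch`, the placeholder value, and the choice to return a DataFrame are all assumptions. Only the `ord_encoder` / `fit_encoder` call shape is taken from the snippets quoted further down.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_object_dtype, is_string_dtype
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical placeholder; the value of the PR's NA_PLACEHOLDER is not shown in this thread.
NA_PLACEHOLDER = "__MISSING__"


def process_text_na_sketch(X, ord_encoder, fit_encoder=False):
    """Illustrative stand-in, not the PR's helper: ordinal-encode text columns, keep NA as NaN."""
    out = X.copy()
    text_cols = [c for c in out.columns if is_object_dtype(out[c]) or is_string_dtype(out[c])]
    if not text_cols:
        return out  # nothing to encode; numeric columns pass through untouched

    values = out[text_cols].to_numpy(dtype=object, copy=True)
    na_mask = pd.isna(values)          # remember where values were missing
    values[na_mask] = NA_PLACEHOLDER   # the encoder then only sees strings

    codes = ord_encoder.fit_transform(values) if fit_encoder else ord_encoder.transform(values)
    codes = codes.astype(float)
    codes[na_mask] = np.nan            # restore missingness after encoding

    for i, col in enumerate(text_cols):
        out[col] = codes[:, i]
    return out


train = pd.DataFrame({"text": ["a", None, "b", "a"], "num": [1.0, 2.0, np.nan, 4.0]})
test = pd.DataFrame({"text": ["b", None, "zzz"], "num": [0.0, 1.0, 2.0]})

# unknown_value=-1 lets unseen strings at predict time map to a sentinel instead of raising.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoded_train = process_text_na_sketch(train, encoder, fit_encoder=True)
encoded_test = process_text_na_sketch(test, encoder)
```

The sketch assumes text columns contain only strings and missing values; columns mixing strings with other Python objects would need an extra cast before encoding.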

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

File	Description
tests/test_classifier_interface.py	Added a new test to verify handling of text columns with NA values.
src/tabpfn/utils.py	Introduced the _process_text_na_dataframe function for NA handling.
src/tabpfn/regressor.py	Replaced direct ordinal encoder transformation with the new helper.
src/tabpfn/constants.py	Added NA_PLACEHOLDER constant.
src/tabpfn/classifier.py	Updated fit and predict_proba to use the new text/NA processing.

Comments suppressed due to low confidence (3)

tests/test_classifier_interface.py:402

[nitpick] Consider adding additional assertions for edge cases (e.g., empty strings or columns with only NA values) to ensure the new text/NA handling is robust.

def test_classifier_with_text_and_na() -> None:

src/tabpfn/regressor.py:469

[nitpick] Ensure that _process_text_na_dataframe correctly handles cases where there are no string columns (or when the DataFrame already contains only numeric types) to avoid any unintended type conversions.

X = _process_text_na_dataframe(X, ord_encoder=ord_encoder, fit_encoder=True)  # type: ignore
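
One way to act on this nitpick is a small pass-through check for frames without text columns. The sketch below is an assumption-laden illustration: `text_columns` and `check_numeric_passthrough` are hypothetical names, and `transform` stands for any DataFrame-to-DataFrame callable (for example a wrapper around the PR's helper, whose return type is not shown in this thread).

```python
import pandas as pd
from pandas.api.types import is_object_dtype, is_string_dtype


def text_columns(X):
    """Return the columns that look like text (object or string dtype)."""
    return [c for c in X.columns if is_object_dtype(X[c]) or is_string_dtype(X[c])]


def check_numeric_passthrough(transform, X):
    """Return True if `transform` leaves a text-free frame's dtypes and values unchanged."""
    if text_columns(X):
        raise ValueError("this check is only meaningful for frames without text columns")
    out = transform(X.copy())
    # DataFrame.equals treats NaNs in the same position as equal.
    return bool(out.dtypes.equals(X.dtypes) and out.equals(X))


numeric_only = pd.DataFrame({"a": [1, 2, 3], "b": [0.5, float("nan"), 1.5]})

# Stand-ins for demonstration: an identity transform passes, a string cast does not.
passes = check_numeric_passthrough(lambda df: df, numeric_only)
fails = check_numeric_passthrough(lambda df: df.astype(str), numeric_only)
```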

src/tabpfn/classifier.py:546

[nitpick] Confirm that using _process_text_na_dataframe in place of self.preprocessor.transform(X) preserves the ordinal encoding appropriately, especially for inputs with mixed text and NA values.

X = _process_text_na_dataframe(X, ord_encoder=self.preprocessor_)

noahho

Well done, this is great code and fixes the issue!

noahho · 2025-03-13T16:22:02Z

Let's hold off on merging until the consistency check PR is merged.

LeoGrin · 2025-03-14T10:12:47Z

Thanks @anuragg1209, LGTM! As it would also be solved by using Skrub (or probably autogluon.features), some thoughts on whether we want to:

Option 1: Use Skrub

Pros:

Fixes other issues like #163 and probably some other edge cases. I think it also enables polars DataFrame support (to be checked).

Cons:

Forces us to bump the minimum version for some package (but nothing too recent).
Adds one more dependency, and a fairly uncommon one, which can be an issue for people at companies where they only have access to a preinstalled list of packages (not sure; they have to install tabpfn anyway, so maybe it doesn't really make a difference?).

Option 2: Vendor Skrub

Would fix the dependency issue, but I'm not sure how annoying this would be. It probably increases the maintenance burden.

Option 3: Don't use Skrub, and do the few things we need ourselves.

I think for now we mostly need: the fix for NaNs in string columns (this PR), support for datetime and bool dtypes, and possibly polars support.

WDYT? I think I'm leaning toward option 1 but not quite sure. In particular, I would like to understand better if the dependency will be an issue for some people.
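
To give a sense of the scope of Option 3, the datetime and bool part could look roughly like the sketch below. It is an illustration only: `coerce_bool_and_datetime` is a hypothetical name, and representing datetimes as seconds since the Unix epoch is an assumed choice, not something decided in this thread.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_bool_dtype, is_datetime64_any_dtype

EPOCH = pd.Timestamp("1970-01-01")


def coerce_bool_and_datetime(X):
    """Hypothetical helper: map bool columns to 0.0/1.0 and datetimes to epoch seconds."""
    out = X.copy()
    for col in out.columns:
        series = out[col]
        if is_bool_dtype(series):
            out[col] = series.astype(float)
        elif is_datetime64_any_dtype(series):
            if isinstance(series.dtype, pd.DatetimeTZDtype):
                # Normalise timezone-aware values to naive UTC before subtracting.
                series = series.dt.tz_convert("UTC").dt.tz_localize(None)
            # NaT propagates through the arithmetic and ends up as NaN.
            out[col] = (series - EPOCH) / pd.Timedelta(seconds=1)
    return out


frame = pd.DataFrame(
    {
        "flag": [True, False, True],
        "when": pd.to_datetime(["1970-01-01", None, "1970-01-02"]),
        "x": [1.0, 2.0, 3.0],
    }
)
numeric_frame = coerce_bool_and_datetime(frame)
```

A library such as skrub covers more cases than this (Option 1), which is the trade-off the comment above is weighing.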

text-na-bug-fix

a28e86f

noahho requested a review from Copilot March 13, 2025 16:09

Copilot AI reviewed Mar 13, 2025

noahho approved these changes Mar 13, 2025

noahho merged commit ad57da0 into PriorLabs:main Mar 18, 2025
8 checks passed

LeoGrin mentioned this pull request Mar 19, 2025

TabPFN fails on text with NA #138

Closed

noahho mentioned this pull request Apr 25, 2025

[Blocked] Use Scrub for data cleaning #218

Open

liu-qingyuan pushed a commit to liu-qingyuan/TabPFN that referenced this pull request Nov 24, 2025

Merge pull request PriorLabs#242 from anuragg1209/main

1e99d93

text-na-bug-fix

noahho merged 1 commit into PriorLabs:main from anuragg1209:main
