
Embeddings V2 #178

Merged
noahho merged 7 commits into PriorLabs:main from KulikDM:Embeddings-V2 on Feb 16, 2025

Conversation

@KulikDM (Contributor) commented Feb 10, 2025

Hi,

This commit addresses #111
I have made two PRs where the functionality differs slightly, but they both do essentially the same thing.
Please pick whichever you think is best for the package.
V2 adds get_embeddings to each estimator.

The benefit is a more discoverable API; the downside is more duplicated code.

Also added the appropriate tests.

Hope this helps!


embeddings: list[np.ndarray] = []

for output, config in self.executor_.iter_outputs(
Contributor review comment:

Perhaps I am missing something but I don't see how you are getting the train/test embeddings for X. Shouldn't the iter_outputs method be passed a new param that would invoke the model with only_return_standard_out=False?

@noahho (Collaborator) commented Feb 10, 2025

Thanks so much for looking into this! Unfortunately, the change does not actually retrieve the embeddings yet; see the review comments.

@KulikDM (Contributor, Author) commented Feb 11, 2025

Oh shoot!

Sorry, I saw this code block in transformer.py:

output_decoded["train_embeddings"] = train_encoder_out
output_decoded["test_embeddings"] = test_encoder_out

And traced it back to the regressor and classifier.

Would adding the arg only_return_standard_out=False do the trick?
If so, I can make the adjustment to this version as well as updating the tests.

@iivalchev (Contributor) commented:
> Would adding the arg only_return_standard_out=False do the trick? If so, I can make the adjustment to this version as well as updating the tests.

Looks that way, but you need to signal this through tabpfn.inference.InferenceEngine.iter_outputs with an additional parameter, say include_embeddings_in_output, which defaults to False and which only get_embeddings sets to True. One more thing: you should change how you handle what is then returned from the model's forward pass, as it would be a dict, not just a single tensor.
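The flag-and-dict behaviour described above can be sketched standalone. Everything here is hypothetical (MockEngine and all shapes are invented stand-ins for tabpfn.inference.InferenceEngine), just to illustrate the two return modes:

```python
import numpy as np

class MockEngine:
    """Hypothetical stand-in for tabpfn.inference.InferenceEngine."""
    def __init__(self, n_configs=2, n_samples=4, dim=3):
        self._configs = [f"config_{i}" for i in range(n_configs)]
        self._n, self._d = n_samples, dim

    def iter_outputs(self, X, *, only_return_standard_out=True):
        for config in self._configs:
            if only_return_standard_out:
                # Standard path: yield a single logits-like tensor.
                yield np.zeros((self._n, 2)), config
            else:
                # Embedding path: yield the model's full output dict.
                yield {
                    "standard": np.zeros((self._n, 2)),
                    "train_embeddings": np.ones((self._n, self._d)),
                    "test_embeddings": np.ones((self._n, self._d)),
                }, config

engine = MockEngine()
full = [out for out, _ in engine.iter_outputs(None, only_return_standard_out=False)]
standard = [out for out, _ in engine.iter_outputs(None)]
print(type(full[0]).__name__, type(standard[0]).__name__)  # dict ndarray
```

The caller therefore has to branch on the output type, which is exactly the dict-vs-tensor handling point raised here.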

@KulikDM (Contributor, Author) commented Feb 11, 2025

Hi @iivalchev

Thanks for all the feedback!

The latest commit tries to incorporate all the suggestions mentioned above.

  • Single function in utils that is called by the classifier and the regressor
  • Pass only_return_standard_out as a new arg to the inference.iter_outputs (default set to True like before, embedding passes False)
  • iter_outputs checks the output type and, if it is a dict, extracts the test_embeddings (not sure if this should be train; I was following Embeddings Functionality #111)
  • Tests have been updated

Everything was working as expected on my side; hope this change is acceptable.

# We'd rather just say unload from GPU, we already have it available on CPU.
model = model.cpu() # noqa: PLW2901

output = output['test_embeddings'] if isinstance(output, dict) else output
Collaborator review comment:

This change is a bit hacky for the application; ideally, we would retrieve the entire dict at this step and decide in the next step which entry to select. That would change the return type of this function.
Alternatively, we could pass a parameter return_field = 'test_embeddings' that selects which field to return and is linked to only_return_standard_out. The first option would be cleaner, though.
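The return_field alternative could look roughly like this; select_output is a hypothetical helper name for illustration, not part of the actual codebase:

```python
import numpy as np

def select_output(output, return_field="test_embeddings"):
    """Hypothetical helper: return the named field when the model
    emitted a dict; pass a plain tensor through unchanged."""
    if isinstance(output, dict):
        return output[return_field]
    return output

dict_out = {"test_embeddings": np.ones((4, 3)), "train_embeddings": np.zeros((4, 3))}
tensor_out = np.full((4, 2), 0.5)

print(select_output(dict_out).shape)    # (4, 3)
print(select_output(tensor_out).shape)  # (4, 2)
```

The dict-returning option avoids baking a field name into the iteration layer, which is why it reads as the cleaner of the two.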

@noahho (Collaborator) commented Feb 11, 2025

Great! I think this should be working already, I just had some comments to make it more maintainable.

@KulikDM (Contributor, Author) commented Feb 12, 2025

Hi @noahho

Should I make the changes to rather pass the dictionary to the _get_embeddings function and add an additional arg for the user to select which information they would like to retrieve?

@noahho (Collaborator) commented Feb 12, 2025

> Should I make the changes to rather pass the dictionary to the _get_embeddings function and add an additional arg for the user to select which information they would like to retrieve?

I would let iter_outputs return the whole dictionary when it is not returning the single standard output. Then let get_embeddings select the test or train embeddings, with the user specifying which they would like. Do you think that makes sense?

@KulikDM (Contributor, Author) commented Feb 12, 2025

Hi @noahho

Awesome, agreed!

I have added the additional arg data_source to either select the test (default) or train embeddings in the get_embeddings function.
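A minimal standalone sketch of this final design, assuming each ensemble member's output is a dict with train_embeddings/test_embeddings keys; get_embeddings_from_outputs is a hypothetical stand-in for the shared utility, and the shapes are synthetic:

```python
import numpy as np

def get_embeddings_from_outputs(outputs, data_source="test"):
    """Hypothetical stand-in for the shared embedding-selection utility."""
    if data_source not in ("train", "test"):
        raise ValueError(f"data_source must be 'train' or 'test', got {data_source!r}")
    key = f"{data_source}_embeddings"
    # One embedding matrix per ensemble config -> (n_configs, n_samples, dim).
    return np.stack([out[key] for out in outputs])

outputs = [
    {"train_embeddings": np.zeros((5, 8)), "test_embeddings": np.ones((5, 8))}
    for _ in range(3)
]
print(get_embeddings_from_outputs(outputs).shape)  # (3, 5, 8)
```

Validating data_source up front keeps a typo like "tset" from silently producing a KeyError deep inside the stacking loop.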

@noahho (Collaborator) commented Feb 12, 2025

Great! I just fixed the merge conflict and now CI should run.

@CLAassistant commented Feb 12, 2025

CLA assistant check
All committers have signed the CLA.

@KulikDM (Contributor, Author) commented Feb 12, 2025

Hi @noahho

Seems like it is failing for Python 3.9 but not 3.12.
The issue is an infinite value showing up.
Tested locally, it seems to fail due to passing a random_state value during regressor fit; removing the arg resolved the issue. Will add this change later today :)

@KulikDM KulikDM requested a review from noahho February 14, 2025 18:45
@hzhz2020 commented:
Dear Authors

Thanks for this great work! The performance in the paper is amazing, and I have heard many good things about it. I am eager to try it myself. Do you have an expected timeline for when this feature (obtaining embeddings) will be available?

Thanks!

@noahho (Collaborator) commented Feb 16, 2025

PR looks great, thank you for the contribution @KulikDM !

@noahho noahho merged commit b953e87 into PriorLabs:main Feb 16, 2025
6 checks passed
@LeoGrin LeoGrin mentioned this pull request Feb 17, 2025
@lincj1994 commented Mar 10, 2025

@KulikDM Hi. Thanks for the great work. I'm still confused about how to get embeddings from a fitted model, such as the clf in the code example below. Could you please provide some additional guidance?
Thanks.

from argparse import Namespace

import pandas as pd
import numpy as np
from tabpfn import TabPFNClassifier
from tabpfn_extensions.post_hoc_ensembles.sklearn_interface import AutoTabPFNClassifier
from sklearn.model_selection import train_test_split


df = pd.read_csv('df2.csv')
X_features = list(df.drop(['resp_crpr', 'Tumor_Sample_Barcode'], axis=1))
y_features = ['resp_crpr']
clf = AutoTabPFNClassifier(device='auto')


args = Namespace(
    seed=444444,
    test_size=0.3,
)
X_train, X_test, y_train, y_test = train_test_split(
    df[X_features].values.reshape(-1, 63),
    df[y_features],
    test_size=args.test_size,
    random_state=args.seed,
)
clf.fit(X_train, y_train)
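For what it's worth, here is a hedged sketch of how per-estimator embeddings might be used downstream. It assumes (an assumption, not confirmed in this thread) that get_embeddings returns an array shaped (n_estimators, n_samples, dim); the data below is purely synthetic:

```python
import numpy as np

# Synthetic stand-in for what clf.get_embeddings(X_test) is assumed to return:
# one embedding matrix per ensemble member, shaped (n_estimators, n_samples, dim).
rng = np.random.default_rng(0)
per_estimator = rng.normal(size=(4, 10, 16))

# Two common ways to reduce this to one vector per sample:
averaged = per_estimator.mean(axis=0)                                  # (10, 16)
concatenated = np.transpose(per_estimator, (1, 0, 2)).reshape(10, -1)  # (10, 64)

print(averaged.shape, concatenated.shape)
```

On the merged API, the call on a fitted base TabPFNClassifier would presumably be clf.get_embeddings(X_test, data_source="test"); note that the AutoTabPFNClassifier wrapper from tabpfn_extensions may not expose this method, so check the released documentation for the exact signature.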

oscarkey pushed a commit that referenced this pull request Nov 12, 2025
liu-qingyuan pushed a commit to liu-qingyuan/TabPFN that referenced this pull request Nov 24, 2025
6 participants