
Embeddings V2 #178

Merged
noahho merged 7 commits into PriorLabs:main from KulikDM:Embeddings-V2 on Feb 16, 2025

Conversation

@KulikDM (Contributor) commented Feb 10, 2025

Hi,

This commit addresses #111
I have made two PRs where the functionality differs slightly, but they both do essentially the same thing.
Please pick whichever you think is best for the package.
V2 adds get_embeddings to each estimator.

The benefit is a more discoverable API; the downside is more duplicated code.

Also added the appropriate tests.

Hope this helps!


embeddings: list[np.ndarray] = []

for output, config in self.executor_.iter_outputs(
Contributor review comment:

Perhaps I am missing something but I don't see how you are getting the train/test embeddings for X. Shouldn't the iter_outputs method be passed a new param that would invoke the model with only_return_standard_out=False?

@noahho (Collaborator) commented Feb 10, 2025

Thanks so much for looking into this! Unfortunately, the change does not actually retrieve the embeddings yet; see the review comments.

@KulikDM (Contributor, Author) commented Feb 11, 2025

Oh shoot!

Sorry, I saw this code block in transformer.py:

output_decoded["train_embeddings"] = train_encoder_out
output_decoded["test_embeddings"] = test_encoder_out

And traced it back to the regressor and classifier.

Would adding the arg only_return_standard_out=False do the trick?
If so, I can make the adjustment to this version as well as updating the tests.

@iivalchev (Contributor) commented:
> Would adding the arg only_return_standard_out=False do the trick? If so, I can make the adjustment to this version as well as updating the tests.

Looks that way, but you need to signal this through tabpfn.inference.InferenceEngine.iter_outputs with an additional parameter, say include_embeddings_in_output, which defaults to False and which only get_embeddings sets to True. One more thing: you should change how you handle what is then returned from the model's forward pass, as it would be a dict, not just a single tensor.
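The flag-and-dict behaviour described above can be sketched standalone. Everything here is hypothetical (MockEngine and all shapes are invented stand-ins for tabpfn.inference.InferenceEngine), just to illustrate the two return modes:

```python
import numpy as np

class MockEngine:
    """Hypothetical stand-in for tabpfn.inference.InferenceEngine."""
    def __init__(self, n_configs=2, n_samples=4, dim=3):
        self._configs = [f"config_{i}" for i in range(n_configs)]
        self._n, self._d = n_samples, dim

    def iter_outputs(self, X, *, only_return_standard_out=True):
        for config in self._configs:
            if only_return_standard_out:
                # Standard path: yield a single logits-like tensor.
                yield np.zeros((self._n, 2)), config
            else:
                # Embedding path: yield the model's full output dict.
                yield {
                    "standard": np.zeros((self._n, 2)),
                    "train_embeddings": np.ones((self._n, self._d)),
                    "test_embeddings": np.ones((self._n, self._d)),
                }, config

engine = MockEngine()
full = [out for out, _ in engine.iter_outputs(None, only_return_standard_out=False)]
standard = [out for out, _ in engine.iter_outputs(None)]
print(type(full[0]).__name__, type(standard[0]).__name__)  # dict ndarray
```

The caller therefore has to branch on the output type, which is exactly the dict-vs-tensor handling point raised here.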

@KulikDM (Contributor, Author) commented Feb 11, 2025

Hi @iivalchev

Thanks for all the feedback!

The latest commit tries to incorporate all the suggestions mentioned above.

  • Single function in utils that is called by the classifier and the regressor
  • Pass only_return_standard_out as a new arg to the inference.iter_outputs (default set to True like before, embedding passes False)
  • iter_outputs checks the output type and, if it is a dict, extracts the test_embeddings (not sure if this should be train; I was following Embeddings Functionality #111)
  • Tests have been updated

Everything was working as expected on my side; hope this change is acceptable.

# We'd rather just say unload from GPU, we already have it available on CPU.
model = model.cpu() # noqa: PLW2901

output = output['test_embeddings'] if isinstance(output, dict) else output
Collaborator review comment:

This change is a bit hacky for the application; ideally, we would retrieve the entire dict at this step and decide in the next step which entry to select. That would change the return type of this function.
Alternatively, we could pass a parameter return_field = 'test_embeddings' that selects which field to return and is linked to only_return_standard_out. The first option would be cleaner, though.
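The return_field alternative could look roughly like this; select_output is a hypothetical helper name for illustration, not part of the actual codebase:

```python
import numpy as np

def select_output(output, return_field="test_embeddings"):
    """Hypothetical helper: return the named field when the model
    emitted a dict; pass a plain tensor through unchanged."""
    if isinstance(output, dict):
        return output[return_field]
    return output

dict_out = {"test_embeddings": np.ones((4, 3)), "train_embeddings": np.zeros((4, 3))}
tensor_out = np.full((4, 2), 0.5)

print(select_output(dict_out).shape)    # (4, 3)
print(select_output(tensor_out).shape)  # (4, 2)
```

The dict-returning option avoids baking a field name into the iteration layer, which is why it reads as the cleaner of the two.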

@noahho (Collaborator) commented Feb 11, 2025

Great! I think this should be working already, I just had some comments to make it more maintainable.

@KulikDM (Contributor, Author) commented Feb 12, 2025

Hi @noahho

Should I make the changes to rather pass the dictionary to the _get_embeddings function and add an additional arg for the user to select which information they would like to retrieve?

@noahho (Collaborator) commented Feb 12, 2025

> Should I make the changes to rather pass the dictionary to the _get_embeddings function and add an additional arg for the user to select which information they would like to retrieve?

I would let iter_outputs return the whole dictionary when it is not returning the single standard output. Then let get_embeddings select the test or train embeddings, with the user specifying which they would like. Do you think that makes sense?

@KulikDM (Contributor, Author) commented Feb 12, 2025

Hi @noahho

Awesome, agreed!

I have added the additional arg data_source to either select the test (default) or train embeddings in the get_embeddings function.
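A minimal standalone sketch of this final design, assuming each ensemble member's output is a dict with train_embeddings/test_embeddings keys; get_embeddings_from_outputs is a hypothetical stand-in for the shared utility, and the shapes are synthetic:

```python
import numpy as np

def get_embeddings_from_outputs(outputs, data_source="test"):
    """Hypothetical stand-in for the shared embedding-selection utility."""
    if data_source not in ("train", "test"):
        raise ValueError(f"data_source must be 'train' or 'test', got {data_source!r}")
    key = f"{data_source}_embeddings"
    # One embedding matrix per ensemble config -> (n_configs, n_samples, dim).
    return np.stack([out[key] for out in outputs])

outputs = [
    {"train_embeddings": np.zeros((5, 8)), "test_embeddings": np.ones((5, 8))}
    for _ in range(3)
]
print(get_embeddings_from_outputs(outputs).shape)  # (3, 5, 8)
```

Validating data_source up front keeps a typo like "tset" from silently producing a KeyError deep inside the stacking loop.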

@noahho (Collaborator) commented Feb 12, 2025

Great! I just fixed the merge conflict and now CI should run.

@CLAassistant commented Feb 12, 2025

CLA assistant check
All committers have signed the CLA.

@KulikDM (Contributor, Author) commented Feb 12, 2025

Hi @noahho

Seems like it is failing for Python 3.9 but not 3.12.
The issue is an infinite value showing up.
Tested locally, it seems to fail due to passing a random_state value during regressor fit; removing the arg resolved the issue. Will add this change later today :)

@KulikDM KulikDM requested a review from noahho February 14, 2025 18:45
@hzhz2020 commented:
Dear Authors

Thanks for this great work! The performance in the paper is amazing, and I have heard many good things about it. I am eager to try it myself. Do you have an expected timeline for when this feature (obtaining embeddings) will be available?

Thanks!

@noahho (Collaborator) commented Feb 16, 2025

PR looks great, thank you for the contribution @KulikDM !

@noahho noahho merged commit b953e87 into PriorLabs:main Feb 16, 2025
6 checks passed
@LeoGrin LeoGrin mentioned this pull request Feb 17, 2025
@lincj1994 commented Mar 10, 2025

@KulikDM Hi. Thanks for the great work. I'm still confused about how to get embeddings from a fitted model, such as the clf in the code example below. Could you please provide some additional guidance?
Thanks.

from argparse import Namespace

import pandas as pd
import numpy as np
from tabpfn import TabPFNClassifier
from tabpfn_extensions.post_hoc_ensembles.sklearn_interface import AutoTabPFNClassifier
from sklearn.model_selection import train_test_split


df = pd.read_csv('df2.csv')
X_features = list(df.drop(['resp_crpr', 'Tumor_Sample_Barcode'], axis=1))
y_features = ['resp_crpr']
clf = AutoTabPFNClassifier(device='auto')


args = Namespace(
    seed=444444,
    test_size=0.3,
)
X_train, X_test, y_train, y_test = train_test_split(
    df[X_features].values.reshape(-1, 63),
    df[y_features],
    test_size=args.test_size,
    random_state=args.seed,
)
clf.fit(X_train, y_train)
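For what it's worth, here is a hedged sketch of how per-estimator embeddings might be used downstream. It assumes (an assumption, not confirmed in this thread) that get_embeddings returns an array shaped (n_estimators, n_samples, dim); the data below is purely synthetic:

```python
import numpy as np

# Synthetic stand-in for what clf.get_embeddings(X_test) is assumed to return:
# one embedding matrix per ensemble member, shaped (n_estimators, n_samples, dim).
rng = np.random.default_rng(0)
per_estimator = rng.normal(size=(4, 10, 16))

# Two common ways to reduce this to one vector per sample:
averaged = per_estimator.mean(axis=0)                                  # (10, 16)
concatenated = np.transpose(per_estimator, (1, 0, 2)).reshape(10, -1)  # (10, 64)

print(averaged.shape, concatenated.shape)
```

On the merged API, the call on a fitted base TabPFNClassifier would presumably be clf.get_embeddings(X_test, data_source="test"); note that the AutoTabPFNClassifier wrapper from tabpfn_extensions may not expose this method, so check the released documentation for the exact signature.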

oscarkey pushed a commit that referenced this pull request Nov 12, 2025
liu-qingyuan pushed a commit to liu-qingyuan/TabPFN that referenced this pull request Nov 24, 2025
6 participants