
Add perturbation functionality to geneformer's .predict() method #98

Merged 21 commits from sf_predict_with_perturbations into main on Feb 13, 2024

Conversation

@sjfleming (Contributor) commented Oct 23, 2023

Adds the optional kwargs

feature_deletion
feature_activation
feature_map

to Geneformer.predict()

The ideas of "deletion" and "activation" are described in the Geneformer paper. Deletion is achieved by zeroing a feature's expression before tokenization. Activation is achieved by setting a feature's expression to a high value so that it lands at the top of the rank-ordered token list after tokenization.

"Map" is a general kind of idea to implement other kinds of in silico experiments. It takes the data after tokenization, and maps tokens to other tokens. This could be used to implement the replacement of a specific token or set of tokens with a pad (0) or mask (1) token, or to switch the role of two tokens, etc.

@sjfleming (Author):

Closes #98

@sjfleming (Author):

So far there are no tests covering the new functionality.

@sjfleming (Author):

Added a test to cover the new functionality.

@sjfleming sjfleming requested a review from ordabayevy February 2, 2024 06:43
@sjfleming sjfleming marked this pull request as ready for review February 2, 2024 06:43
@sjfleming (Author):

@ordabayevy I think I might need some help with the mypy error here. It's coming from my added geneformer test. It doesn't seem to like the way I'm trying to pass **kwargs as an input. Do you know what's wrong? The pytest test itself passes.

@sjfleming sjfleming marked this pull request as draft February 2, 2024 06:47
@sjfleming (Author):

Okay, never mind. I did some extra copying and pasting and now mypy seems happy. I wonder how you're supposed to pass **kwargs as an input in a way that keeps mypy happy.
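One common mypy-friendly pattern, shown here as a sketch rather than what the test necessarily settled on: annotate the dict as dict[str, Any] before unpacking it, since Any values are compatible with any parameter type.

```python
from typing import Any

# Hypothetical kwargs; the key mirrors one of the new predict() arguments.
kwargs: dict[str, Any] = {"feature_activation": ["GENE_ACTIVATE"]}
output = geneformer.predict(x_ng=x_ng, **kwargs)
```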

@sjfleming sjfleming marked this pull request as ready for review February 2, 2024 07:45
@sjfleming sjfleming requested a review from mbabadi February 2, 2024 07:45
@ordabayevy (Contributor) left a review:

Looks great! Left some comments.

x_ng: torch.Tensor,
feature_deletion: list[str] | None = None,
feature_activation: list[str] | None = None,
feature_map: dict[str, int] | None = None,
@ordabayevy (Contributor): nit: sort inputs in the order feature_activation, feature_deletion, feature_map (here and elsewhere) so they appear in the same order everywhere, just to avoid confusion.

Returns:
A dictionary with the inference results.

NOTE:
@ordabayevy (Contributor): nit: change this to rst format .. note::

@@ -166,8 +202,14 @@ def predict(
output_attentions: bool = True,
output_input_ids: bool = True,
output_attention_mask: bool = True,
feature_map: dict[str, int] | None = None,
feature_activation: list[str] | None = None,
feature_deletion: list[str] | None = None,
@ordabayevy (Contributor): nit: sort inputs in the order feature_activation, feature_deletion, feature_map.

feature_deletion:
Specify features whose expression should be set to zero before tokenization (remove from inputs).
feature_activation:
Specify features whose expression should be set to > max(x_ng) before tokenization (top rank).
@ordabayevy (Contributor): nit: sort inputs in the order feature_activation, feature_deletion, feature_map.

if feature_deletion:
assert all(
[g in self.var_names_g for g in feature_deletion]
), "Some feature_deletion elements are not in self.var_names_g"
@ordabayevy (Contributor): nit: raise ValueError.
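One way this nit might be addressed inside predict(), sketched here rather than taken from the merged diff (the message text is illustrative):

```python
if feature_deletion:
    # Collect all missing gene names so the error reports them in one pass.
    missing = [g for g in feature_deletion if g not in self.var_names_g]
    if missing:
        raise ValueError(f"feature_deletion elements not in self.var_names_g: {missing}")
```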

assert all(
[g in self.var_names_g for g in feature_deletion]
), "Some feature_deletion elements are not in self.var_names_g"
deletion_logic_g = np.logical_or.reduce([(self.var_names_g == g) for g in feature_deletion])
@ordabayevy (Contributor): If I understand this correctly, there is already a numpy function for this: np.isin(self.var_names_g, feature_deletion)

@sjfleming (Author): Great call, I did not realize that!
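A self-contained check of the equivalence, with illustrative gene names:

```python
import numpy as np

var_names_g = np.array(["GENE_A", "GENE_B", "GENE_C", "GENE_D"])  # illustrative
feature_deletion = ["GENE_B", "GENE_D"]

# The reduction over one boolean mask per gene ...
deletion_logic_g = np.logical_or.reduce([(var_names_g == g) for g in feature_deletion])
# ... matches a single vectorized membership test.
assert (deletion_logic_g == np.isin(var_names_g, feature_deletion)).all()
```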

max_val = x_ng.max()
for i, g in enumerate(feature_activation[::-1]):
feature_logic_g = self.var_names_g == g
assert feature_logic_g.sum() == 1, f"feature_activation element {g} is not in self.var_names_g"
@ordabayevy (Contributor): nit: raise ValueError.

if feature_map:
for g, target_token in feature_map.items():
feature_logic_g = self.var_names_g == g
assert feature_logic_g.sum() == 1, f"feature_map key {g} not in self.var_names_g"
@ordabayevy (Contributor): nit: raise ValueError.

for g, target_token in feature_map.items():
feature_logic_g = self.var_names_g == g
assert feature_logic_g.sum() == 1, f"feature_map key {g} not in self.var_names_g"
initial_token = self.feature_ids[feature_logic_g]
@ordabayevy (Contributor): I wonder if there would be any benefit to having a gene_name -> gene_id dictionary so we don't have to do this.

@sjfleming (Author): Yeah, it might not hurt.
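A sketch of the suggested lookup table, with illustrative gene names and token ids; built once (e.g. at module init), it turns each boolean scan into an O(1) dictionary lookup:

```python
import numpy as np

var_names_g = np.array(["GENE_A", "GENE_B", "GENE_C"])  # illustrative gene names
feature_ids = np.array([27, 412, 9001])                 # illustrative token ids

# Build once; afterwards no per-gene array scan is needed.
gene_to_token = dict(zip(var_names_g.tolist(), feature_ids.tolist()))

initial_token = gene_to_token["GENE_B"]  # instead of feature_ids[var_names_g == "GENE_B"]
```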


print(f"Expected input_ids:\n{expected_input_ids}")
print(f"Actual input_ids:\n{input_ids}")
torch.testing.assert_close(input_ids, expected_input_ids)
@ordabayevy (Contributor): I didn't know that torch.testing existed!
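For reference, torch.testing.assert_close compares shape, dtype, device, and values (with tolerances for floating point) and raises an AssertionError with a readable diff on mismatch:

```python
import torch

expected_input_ids = torch.tensor([[2, 5, 9, 0]])
input_ids = torch.tensor([[2, 5, 9, 0]])

torch.testing.assert_close(input_ids, expected_input_ids)  # passes silently
```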

@ordabayevy (Contributor) commented Feb 6, 2024:

Notebook comment: you can directly index CellariumPipeline:

var_names_g = pipeline.get_submodule("2").var_names_g
# instead:
var_names_g = pipeline[2].var_names_g
# or better:
# var_names_g = pipeline.model.var_names_g  # sorry, this actually won't work; it is for CellariumModule.model

@sjfleming (Author):

Alright, I think I've addressed everything; worth double-checking.

@ordabayevy ordabayevy merged commit 55a902c into main Feb 13, 2024
5 checks passed
@ordabayevy ordabayevy deleted the sf_predict_with_perturbations branch February 13, 2024 18:22