👩‍🔬 Add LoRA layers to fine-tune protein language models for embeddings calculation #85

Open
7 tasks
SebieF opened this issue Sep 7, 2023 · 4 comments
Labels
enhancement New feature or request refactoring Code or standardization refactorings

Comments

@SebieF
Collaborator

SebieF commented Sep 7, 2023

After migrating from bio_embeddings to calculate embeddings directly in biotrainer for the provided sequences, it is now theoretically possible to allow for fine-tuning existing protein language models (pLMs) such as ProtTrans for specific tasks. Such tasks might include prediction of subcellular location, secondary structure or protein-protein interaction.

While fine-tuning a full pLM on a specific task is very costly, LoRA (Low-Rank Adaptation of Large Language Models) is one way to fine-tune a transformer model while training only a fraction of the original model's parameters.
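
To make "a fraction of the parameters" concrete, here is a rough back-of-the-envelope sketch. It assumes a single dense layer with weight matrix W of shape d_out x d_in that is kept frozen, with LoRA adding two small trainable matrices A (rank x d_in) and B (d_out x rank). The function name and the layer sizes are illustrative only and are not taken from biotrainer or any particular pLM.

```python
def lora_parameter_counts(d_in: int, d_out: int, rank: int) -> tuple[int, int]:
    """Return (full, lora) trainable-parameter counts for one linear layer."""
    # Full fine-tuning trains the whole weight matrix W (bias ignored here).
    full = d_in * d_out
    # LoRA keeps W frozen and trains only A (rank x d_in) and B (d_out x rank).
    lora = rank * (d_in + d_out)
    return full, lora


# Hypothetical layer size and rank, chosen only for illustration.
full, lora = lora_parameter_counts(d_in=1024, d_out=1024, rank=8)
fraction = lora / full
print(f"full={full}, lora={lora}, fraction={fraction:.4%}")
# prints: full=1048576, lora=16384, fraction=1.5625%
```

The saving grows with the layer size, since the LoRA count scales with d_in + d_out rather than d_in * d_out.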

Adding LoRA layers to biotrainer would, therefore, be a meaningful enhancement and would be in line with the overall premise of biotrainer: making protein prediction tasks easily accessible in a standardized and reproducible way. On the other hand, it also requires a significant change in the sequence of operations that biotrainer performs. Currently, all embeddings are loaded or calculated once at the beginning of training. With fine-tuning, embeddings would have to be recalculated on the fly for every epoch, because the embedder's weights change during training. A possible implementation could replace the current embeddings object with a function that is called every epoch and returns the same (cached) result if no fine-tuning is applied. Still, major adaptations must be made to the dataloader module.
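
A minimal sketch of the "function that is called every epoch" idea might look like the following. All names here (`cached_provider`, `recomputing_provider`, `CountingEmbedder`) are hypothetical and do not exist in biotrainer; the toy embedder only stands in for a real pLM so that the number of embedding calls can be counted.

```python
from typing import Callable, Dict, List

Embedding = List[float]
EmbeddingProvider = Callable[[], Dict[str, Embedding]]


class CountingEmbedder:
    """Toy stand-in for a pLM: embeds a sequence and counts how often it is called."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, sequence: str) -> Embedding:
        self.calls += 1
        return [float(len(sequence))]  # placeholder "embedding"


def cached_provider(embed, sequences: Dict[str, str]) -> EmbeddingProvider:
    """No fine-tuning: compute embeddings once, then return the same result."""
    cache = None

    def provide() -> Dict[str, Embedding]:
        nonlocal cache
        if cache is None:
            cache = {seq_id: embed(seq) for seq_id, seq in sequences.items()}
        return cache

    return provide


def recomputing_provider(embed, sequences: Dict[str, str]) -> EmbeddingProvider:
    """Fine-tuning: the embedder changes, so recompute on every call."""

    def provide() -> Dict[str, Embedding]:
        return {seq_id: embed(seq) for seq_id, seq in sequences.items()}

    return provide


sequences = {"seq1": "MKT", "seq2": "MKTAYIAK"}
frozen, tuned = CountingEmbedder(), CountingEmbedder()
frozen_provider = cached_provider(frozen, sequences)
tuned_provider = recomputing_provider(tuned, sequences)

for epoch in range(3):  # the training loop asks for embeddings once per epoch
    frozen_embeddings = frozen_provider()
    tuned_embeddings = tuned_provider()

print(frozen.calls, tuned.calls)  # prints: 2 6
```

With this shape, the training loop would not need to know whether fine-tuning is enabled; it only calls the provider once per epoch.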

List of required steps (non-exhaustive):

  • Refactor data loading and embeddings calculation to allow for re-calculation of embeddings for every epoch
  • Implement LoRA layers
  • Add configuration option(s) to enable fine-tuning
  • Add validation and tests for new config option(s)
  • Add documentation
  • Add a fine-tuning example
  • Evaluate implementation on (parts of) the FLIP dataset

Additional material:

@SebieF SebieF added enhancement New feature or request refactoring Code or standardization refactorings labels Sep 7, 2023
@wangjiaqi8710

Biotrainer is using bio_embeddings to calculate embeddings for the provided sequences. Currently, this does not allow for fine-tuning existing protein language models (pLMs) such as ProtTrans for specific tasks, which might include prediction of subcellular location, secondary structure or protein-protein interaction.

Does this mean biotrainer cannot be used to fine-tune pLMs at the moment?

Wishes,
JQ

@SebieF
Collaborator Author

SebieF commented Sep 9, 2024

Hello and thanks for your question. :) No, biotrainer is currently not designed for fine-tuning. We plan to implement fine-tuning via LoRA layers and hope to have the feature complete by mid-2025.

@mheinzinger
Collaborator

In the meantime, I would recommend using the notebooks provided here: https://github.com/RSchmirler/data-repo_plm-finetune-eval/tree/main/notebooks/finetune

@wangjiaqi8710

Cool. Thank you. Congratulations on the work!

JQ
