👩‍🔬 Add LoRA layers to fine-tune protein language models for embeddings calculation #85

SebieF · 2023-09-07T08:51:30Z

After migrating from bio_embeddings to calculate embeddings directly in biotrainer for the provided sequences, it is now theoretically possible to allow for fine-tuning existing protein language models (pLMs) such as ProtTrans for specific tasks. Such tasks might include prediction of subcellular location, secondary structure or protein-protein interaction.

While fine-tuning a full pLM on a specific task is very costly, LoRA (Low-Rank Adaptation of Large Language Models) is one way to fine-tune a transformer model while training only a fraction of the original model's parameters.
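
To make the parameter savings concrete, here is a minimal, dependency-free sketch of the LoRA idea: the pretrained weight W stays frozen and only a low-rank update (alpha / r) * B * A is trained. The names `LoRALinear` and `lora_trainable_fraction` are made up for illustration and are not part of biotrainer; a real implementation would presumably use PyTorch tensors rather than Python lists.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_trainable_fraction(d_in, d_out, rank):
    """Trainable LoRA parameters relative to the full weight matrix."""
    return rank * (d_in + d_out) / (d_in * d_out)


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / r) * B @ A.

    Hypothetical illustration only, not biotrainer code.
    """

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # frozen, shape (d_out, d_in)
        self.scale = alpha / rank
        # A gets a small random init and B starts at zero, so the layer
        # initially behaves exactly like the frozen base layer.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]


# For a 1024 x 1024 weight and rank 8, only about 1.6% of the parameters train.
fraction = lora_trainable_fraction(1024, 1024, 8)
```

Because B starts at zero, adding such a layer does not change the model's output before training, which is what makes it safe to attach to a pretrained pLM.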

Adding LoRA layers to biotrainer would therefore be a meaningful enhancement and would be in line with the overall premise of biotrainer: making protein prediction tasks easily accessible in a standardized and reproducible way. On the other hand, it also requires a significant change in the sequence of operations that biotrainer performs. Currently, all embeddings are loaded or calculated once at the beginning of training. With fine-tuning, embeddings would have to be recalculated on the fly for every epoch, because the pLM weights change during training. A possible implementation could replace the current embeddings object with a function that is called every epoch and returns constant values if no fine-tuning is applied. Still, major adaptations must be made to the dataloader module.
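
A rough sketch of what such a function could look like, assuming the embedder can be treated as a plain callable from sequence to vector. `EmbeddingProvider` and `toy_embed` are hypothetical names used only for illustration; they do not exist in biotrainer, and the toy embedder stands in for a real pLM.

```python
from typing import Callable, Dict, List, Optional

Embedding = List[float]


class EmbeddingProvider:
    """Callable that returns embeddings for all sequences.

    Hypothetical sketch: without fine-tuning the result is computed once and
    reused; with fine-tuning it is recomputed on every call (i.e. every epoch).
    """

    def __init__(
        self,
        embed_fn: Callable[[str], Embedding],
        sequences: Dict[str, str],
        finetuning: bool = False,
    ) -> None:
        self.embed_fn = embed_fn
        self.sequences = sequences
        self.finetuning = finetuning
        self._cache: Optional[Dict[str, Embedding]] = None

    def __call__(self) -> Dict[str, Embedding]:
        # Frozen embedder: embeddings never change, so reuse the cached result.
        if self._cache is not None and not self.finetuning:
            return self._cache
        embeddings = {
            seq_id: self.embed_fn(seq) for seq_id, seq in self.sequences.items()
        }
        if not self.finetuning:
            self._cache = embeddings
        return embeddings


def toy_embed(sequence: str) -> Embedding:
    """Stand-in for a pLM: sequence length and fraction of alanine residues."""
    return [float(len(sequence)), sequence.count("A") / len(sequence)]


provider = EmbeddingProvider(toy_embed, {"seq1": "MKTA", "seq2": "AAGG"})
for _epoch in range(3):
    epoch_embeddings = provider()  # computed once, then served from the cache
```

The training loop would call the provider once per epoch in both modes, so the dataloader would only need one code path.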

List of required steps (non-exhaustive):

Refactor data loading and embeddings calculation to allow for re-calculation of embeddings for every epoch
Implement LoRA layers
Add configuration option(s) to enable fine-tuning
Add validation and tests for new config option(s)
Add documentation
Add a fine-tuning example
Evaluate implementation on (parts of) the FLIP dataset
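
For the configuration and validation steps, something along these lines might work. The option names `finetuning`, `lora_rank` and `lora_alpha` and their defaults are placeholders invented for this sketch; they are not existing biotrainer config keys, and the final names are still to be decided.

```python
from typing import Any, Dict

# Hypothetical option names and defaults; not existing biotrainer config keys.
DEFAULTS: Dict[str, Any] = {"finetuning": False, "lora_rank": 8, "lora_alpha": 16.0}


def validate_finetuning_config(config: Dict[str, Any]) -> Dict[str, Any]:
    """Return the fine-tuning options with defaults applied, or raise ValueError."""
    merged = {**DEFAULTS, **{k: v for k, v in config.items() if k in DEFAULTS}}

    if not isinstance(merged["finetuning"], bool):
        raise ValueError("finetuning must be true or false")

    if not merged["finetuning"]:
        # LoRA options without fine-tuning are most likely a user mistake.
        stray = [k for k in ("lora_rank", "lora_alpha") if k in config]
        if stray:
            raise ValueError(f"{stray} set but finetuning is disabled")
        return merged

    rank = merged["lora_rank"]
    if isinstance(rank, bool) or not isinstance(rank, int) or rank < 1:
        raise ValueError("lora_rank must be a positive integer")

    alpha = merged["lora_alpha"]
    if isinstance(alpha, bool) or not isinstance(alpha, (int, float)) or alpha <= 0:
        raise ValueError("lora_alpha must be a positive number")

    return merged
```

Rejecting LoRA options when fine-tuning is disabled is a design choice in this sketch; silently ignoring them would also be possible but tends to hide configuration errors.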

Additional material:

wangjiaqi8710 · 2024-09-06T07:18:05Z

Biotrainer is using bio_embeddings to calculate embeddings for the provided sequences. Currently, this does not allow for fine-tuning existing protein language models (pLMs) such as ProtTrans for specific tasks, which might include prediction of subcellular location, secondary structure or protein-protein interaction.

While fine-tuning a full pLM on a specific task is very costly, LoRA (Low-Rank Adaptation of Large Language Models) is one way to fine-tune a transformer model while training only a fraction of the original model's parameters.

Adding LoRA layers to biotrainer would therefore be a meaningful enhancement and would be in line with the overall premise of biotrainer: making protein prediction tasks easily accessible in a standardized and reproducible way. On the other hand, it also requires a significant change in the sequence of operations that biotrainer performs. Currently, all embeddings are loaded or calculated once at the beginning of training. With fine-tuning, embeddings would have to be recalculated on the fly for every epoch, because the pLM weights change during training. A possible implementation could replace the current embeddings object with a function that is called every epoch and returns constant values if no fine-tuning is applied. Still, major adaptations must be made to the dataloader module.

List of required steps (non-exhaustive):

Refactor data loading and embeddings calculation to allow for re-calculation of embeddings for every epoch

Implement LoRA layers

Add configuration option(s) to enable fine-tuning

Add validation and tests for new config option(s)

Add documentation

Add a fine-tuning example

Evaluate implementation on (parts of) the FLIP dataset

Additional material:

Paper

LoRA python package

Fine-tuning ProtTrans with LoRA layers

LoRA layer implementation in t-few repository

Does this mean that biotrainer cannot be used to fine-tune pLMs at the moment?

Best wishes,
JQ

SebieF · 2024-09-09T12:34:19Z

Hello and thanks for your question. :) No, biotrainer is currently not designed for fine-tuning. We plan to implement fine-tuning via LoRA layers and hope to have the feature complete by mid-2025.

mheinzinger · 2024-09-09T13:49:29Z

In the meantime, I would recommend using the notebooks provided here: https://github.com/RSchmirler/data-repo_plm-finetune-eval/tree/main/notebooks/finetune

wangjiaqi8710 · 2024-09-10T13:20:19Z

Cool, thank you. Congratulations on the work!

JQ

SebieF added the "enhancement" and "refactoring" labels on Sep 7, 2023
