mutation-prediction

This is a tool for protein mutation effect prediction. It was developed as part of a bachelor's thesis at Forschungszentrum Jülich and RWTH Aachen University. While the thesis is licensed under CC BY 4.0, the source code is licensed under MIT.

Abstract

Enzyme engineering plays a crucial role in industry and research, but the expensive and time-consuming evaluation of mutants in the laboratory limits the number of variants that can be explored. Researchers have therefore begun to investigate machine learning for predicting enzyme mutation effects, and many different algorithms and embeddings have been developed in this context.

This work compares multiple embedding strategies and machine learning algorithms. It shows that embeddings based on transfer learning improve prediction accuracy across a variety of proteins. Both a novel technique for processing multiple sequence alignments with autoencoders and the application of state-of-the-art natural language processing techniques were investigated. On the model side, a novel k-convolutional neural network that uses the three-dimensional structure of the protein was developed and compared to the other models. Support vector machines that use transfer learning provided accurate predictions of protein mutation effects. All hyperparameters were optimized automatically using distributed computing.

Furthermore, the comparison is accompanied by a thorough sensitivity analysis, which revealed several noteworthy effects, including starting points for further improvements. Additionally, the effects of various properties inherent to the datasets were investigated. A larger dataset size correlates strongly with higher performance, and the lack of mutants with multiple mutations in the training dataset degrades performance significantly. Finally, a measure of epistasis was defined and shown to correlate negatively with model performance. The sensitivity analysis demonstrated that the proposed methods can learn some of the epistasis in the dataset. In conclusion, this work provides multiple contributions to the field that will aid further improvement of protein mutation effect prediction and its application in industry and research.

Performance

dataset	training samples	R²
Engqvist et al. 2015 (A75)	53	0.9534
Gumulya et al. 2012 (B75)	105	0.7967
Wu et al. 2019 (C75)	425	0.7364
Sarkisyan et al. 2016 (Dr500)	500	0.6958
Cadet et al. 2018 (E75)	32	0.8323
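
The R² column is presumably the coefficient of determination between measured and predicted values; this README does not say how it was evaluated (for example, on which test split). As a reading aid, here is a minimal sketch of the standard definition, R² = 1 - SS_res / SS_tot. The function name `r_squared` and the sample values are hypothetical and are not taken from the repository.

```python
from typing import Sequence


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Standard coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: error of the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all measured values are identical")
    return 1.0 - ss_res / ss_tot


# Hypothetical measured and predicted values, for illustration only.
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
score = r_squared(measured, predicted)
```

Under this definition, a score of 1 means perfect predictions, while a score of 0 means the model does no better than always predicting the mean of the measured values.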

Thesis

The thesis is licensed under CC BY 4.0 and available at /Bachelor_Thesis_Hoffbauer.pdf. To cite it:

@mastersthesis{Hoffbauer2021,
  document_type = {Bachelor's Thesis},
  timestamp = {20210913},
  author = {Tilman Hoffbauer},
  title = {Evaluation of various machine learning approaches to predicting enzyme mutation data},
  school = {RWTH Aachen University},
  year = {2021},
  type = {Bachelor Thesis},
  month = {September},
  doi = {10.18154/RWTH-2021-08460}
}

