learnMSA2: deep protein multiple alignments with large language and hidden Markov models

Our tool is under active development and feedback is very much appreciated.

learnMSA2: deep protein multiple alignments with large language and hidden Markov models

Features

Aligns large numbers of protein sequences with above state-of-the-art accuracy
Can utilize protein language models (--use_language_model) for significantly improved accuracy
Enables ultra-large alignment of millions of sequences
GPU acceleration, multi-GPU support
Scales linear in the number of sequences (does not require a guide tree)
Memory efficient (depending on sequence length, aligning millions of sequences on a laptop is possible)
Visualize a profile HMM or a sequence logo of the consensus motif

Current limitations

Requires many sequences (in most cases starting at 1000, a few 100 might still be enough) to achieve high accuracy
Only for protein sequences
Increasingly slow for long proteins with a length > 1000 residues

Installation

Recommended ways to install learnMSA along with Tensorflow + GPU:

Singularity/Docker

We provide a hassle-free docker image including GPU and pLM support.

singularity build learnmsa.sif docker://felbecker/learnmsa
singularity run --nv learnmsa.sif learnMSA

Running the container with --nv is required for GPU support.

conda/mamba and pip

Create a conda environment:

conda create -n learnMSA python=3.12
conda activate learnMSA

Install learnMSA (cuda toolkit included):

pip install learnMSA

Additional installs for sequence weights (recommended!):

conda install -c bioconda mmseqs2

You may have to set up Bioconda channels.

Additional installs for language model support (recommended!):

pip install torch==2.2.1 tf-keras==2.17.0

(optional) Verify that TensorFlow and learnMSA are correctly installed:

python3 -c "import tensorflow as tf; print(tf.__version__, tf.config.list_physical_devices('GPU'))"
learnMSA -h

bioconda (currently not recommended)

conda install -c bioconda -n learnMSA learnMSA

The additional installs (steps 3. and 4. above) are again highly recommended.

Unfortunately, there are errors to be expected when installing learnMSA with bioconda.

${\text{\color{red}Troubleshooting:}}$

Error:

tensorflow.python.framework.errors_impl.UnknownError: {{function_node __wrapped__Expm1_device_/job:localhost/replica:0/task:0/device:GPU:0}} JIT compilation failed.

Your root error is: TensorFlow libdevice not found

Fix:

Find nvvm directory: find / -type d -name nvvm 2>/dev/null

Expected outputs: <path>/nvvm

If there are multiple paths, choose the one matching your conda environment.

Run: export XLA_FLAGS=--xla_gpu_cuda_data_dir=<path>

Error:

ERROR: Flag 'minloglevel' was defined more than once (...)

Fix: pip install --no-deps --upgrade sentencepiece==0.1.99

Using learnMSA for alignment

Recommended way to align proteins with learnMSA version >= 2.0.10:

learnMSA -i INPUT_FILE -o OUTPUT_FILE --use_language_model

Recommended way to align proteins with learnMSA version < 2.0.10:

learnMSA -i INPUT_FILE -o OUTPUT_FILE --use_language_model --sequence_weights

Without language model support (recommended for speed and for proteins with very high sequence similarity):

learnMSA -i INPUT_FILE -o OUTPUT_FILE

Note: If you installed learnMSA via docker/singularity, you have to run singularity run --nv learnmsa.sif learnMSA -i ...

Note 2: Since v2.0.10 the default behavior changed: Sequence weights are now used by default and the --sequence_weights option was removed. Instead, a --no_sequence_weights option exists to align without sequence weights (not recommended). Users that installed learnMSA via pip have to manually install mmseqs2 (conda is recommended, see above).

To output a pdf with a sequence logo alongside the msa, use --logo. For a fun gif that visualizes the training process, you can use --logo_gif (attention, slows down training and should not be used for real alignments).

Interactive notebook with visualization:

Run the notebooks learnMSA_demo.ipynb or learnMSA_with_language_model_demo.ipynb with juypter.

Benchmark:

Publications

Becker F, Stanke M. learnMSA2: deep protein multiple alignments with large language and hidden Markov models. Bioinformatics. 2024

Becker F, Stanke M. learnMSA: learning and aligning large protein families. GigaScience. 2022

Name		Name	Last commit message	Last commit date
Latest commit History 403 Commits
learnMSA		learnMSA
logo		logo
test		test
util		util
.gitignore		.gitignore
LICENSE.md		LICENSE.md
README.md		README.md
learnMSA.py		learnMSA.py
learnMSA_demo.ipynb		learnMSA_demo.ipynb
learnMSA_with_language_model_demo.ipynb		learnMSA_with_language_model_demo.ipynb
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

learnMSA2: deep protein multiple alignments with large language and hidden Markov models

Features

Current limitations

Installation

Singularity/Docker

conda/mamba and pip

bioconda (currently not recommended)

${\text{\color{red}Troubleshooting:}}$

Error:

Error:

Using learnMSA for alignment

Interactive notebook with visualization:

Benchmark:

Publications

About

Releases 19

Packages

Contributors 2

Languages

License

Gaius-Augustus/learnMSA

Folders and files

Latest commit

History

Repository files navigation

learnMSA2: deep protein multiple alignments with large language and hidden Markov models

Features

Current limitations

Installation

Singularity/Docker

conda/mamba and pip

bioconda (currently not recommended)

${\text{\color{red}Troubleshooting:}}$

Error:

Error:

Using learnMSA for alignment

Interactive notebook with visualization:

Benchmark:

Publications

About

Resources

License

Stars

Watchers

Forks

Releases 19

Packages 0

Contributors 2

Languages

Packages