Monomerizer (also known as SMILES2Seq or SMILES2FASTA) is a pipeline that converts peptides and peptidomimetics, represented as SMILES strings (a line notation for chemical structures), into sequences of amino acids and terminal modifications.
For more information, see our paper (coming soon).
To use the output data to fine-tune our foundation language model for peptidomimetics, see GPepT.
To run a Monomerizer demo, use the following command:
python3 run_pipeline.py --input_file demo/example_smiles.txt
- By default, results will be saved to the output/<datetime> directory. The raw directory contains the raw result, and the standard directory contains the sequences after standardizing them to the standard dictionary accepted by GPepT.
- Replace demo/example_smiles.txt with the path to your input file containing SMILES strings. (The input file must follow the format of the example files in the demo directory.)
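Because each run writes to a new output/<datetime> directory, it can be handy to locate the most recent run from a script. The following is a minimal sketch and not part of Monomerizer: the helper names `latest_run_dir` and `result_dirs` are hypothetical. It picks the newest subdirectory by modification time, so it makes no assumption about the exact <datetime> format. It relies only on the raw and standard subdirectories described above and says nothing about the files inside them.

```python
from pathlib import Path
from typing import Dict, Optional


def latest_run_dir(output_root: str = "output") -> Optional[Path]:
    """Return the most recently modified run directory under output_root, or None."""
    root = Path(output_root)
    if not root.is_dir():
        return None
    runs = [p for p in root.iterdir() if p.is_dir()]
    # Choose by modification time instead of parsing the <datetime> name,
    # since the exact naming format is not documented here.
    return max(runs, key=lambda p: p.stat().st_mtime, default=None)


def result_dirs(run_dir: Path) -> Dict[str, Path]:
    """Map 'raw' and 'standard' to their expected paths inside a run directory."""
    return {name: run_dir / name for name in ("raw", "standard")}


if __name__ == "__main__":
    run = latest_run_dir()
    if run is None:
        print("No runs found under output/")
    else:
        for name, path in result_dirs(run).items():
            print(f"{name}: {path} (exists: {path.is_dir()})")
```

If you pass a custom location with --output_dir, point `latest_run_dir` at the parent of that location or use the path directly.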
Optional arguments:
- --output_dir <path>: Directory where results are saved. Default is output/<datetime>.
- --min_amino_acids <int>: Minimum number of amino acids required for processing. Default is 3.
- --batch_size <int>: Number of SMILES to process in each batch. Default is 100.
- --max_workers <int>: Maximum number of parallel workers. Default is the number of available CPU cores.
- -draw: Draws the output file.
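The options above can be combined in a single invocation. As a hedged sketch, the hypothetical helper `build_command` below assembles the argument list for run_pipeline.py using only the flags listed above. The example values are placeholders, and -draw is written with a single dash, as it appears in the list. Options left unset are omitted so the script's own defaults apply.

```python
from typing import List, Optional


def build_command(
    input_file: str,
    output_dir: Optional[str] = None,
    min_amino_acids: Optional[int] = None,
    batch_size: Optional[int] = None,
    max_workers: Optional[int] = None,
    draw: bool = False,
) -> List[str]:
    """Build the argument list for run_pipeline.py from the documented flags."""
    cmd = ["python3", "run_pipeline.py", "--input_file", input_file]
    optional = {
        "--output_dir": output_dir,
        "--min_amino_acids": min_amino_acids,
        "--batch_size": batch_size,
        "--max_workers": max_workers,
    }
    for flag, value in optional.items():
        # Skip unset options so the pipeline falls back to its defaults.
        if value is not None:
            cmd += [flag, str(value)]
    if draw:
        cmd.append("-draw")
    return cmd


if __name__ == "__main__":
    # Print the command for the demo input with a smaller batch and 4 workers.
    print(" ".join(build_command("demo/example_smiles.txt", batch_size=50, max_workers=4)))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root to launch the pipeline.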