Import sequence format for RNA, DNA and PEPTIDES #1426

even1024 · 2023-12-07T10:17:25Z

Background
Indigo needs to parse RNA, DNA, and peptide sequences.
These sequences are represented as plain strings composed of the following symbols:

for peptides:
A - Alanine
C - Cysteine
D - Aspartic Acid
E - Glutamic Acid
F - Phenylalanine
G - Glycine
H - Histidine
I - Isoleucine
K - Lysine
L - Leucine
M - Methionine
N - Asparagine
P - Proline
Q - Glutamine
R - Arginine
S - Serine
T - Threonine
V - Valine
W - Tryptophan
Y - Tyrosine

for RNA nucleotides:
A - AMP (Adenosine monophosphate)
C - CMP (Cytidine monophosphate)
G - GMP (Guanosine monophosphate)
U - UMP (Uridine monophosphate)
T - rTMP (Ribothymidine monophosphate)

for DNA nucleotides:
A - dAMP (Deoxyadenosine monophosphate)
C - dCMP (Deoxycytidine monophosphate)
G - dGMP (Deoxyguanosine monophosphate)
U - dUMP (Deoxyuridine monophosphate)
T - TMP (Thymidine monophosphate)
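
The three alphabets above can be checked before any parsing starts. The sketch below is illustrative only: the names `ALPHABETS` and `invalid_symbols` are hypothetical and not part of Indigo, and it assumes input is upper-case with no separators, which the issue does not state.

```python
from typing import List, Tuple

# Symbol sets copied from the lists above. Names are hypothetical.
ALPHABETS = {
    "PEPTIDE": set("ACDEFGHIKLMNPQRSTVWY"),
    "RNA": set("ACGUT"),
    "DNA": set("ACGUT"),
}


def invalid_symbols(sequence: str, sequence_type: str) -> List[Tuple[int, str]]:
    """Return (position, symbol) pairs that are not in the chosen alphabet."""
    alphabet = ALPHABETS[sequence_type]
    return [(i, ch) for i, ch in enumerate(sequence) if ch not in alphabet]
```

An empty result means every symbol is valid for the requested type; a caller could turn a non-empty result into an error message that points at the first offending position.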

The sequence parser should split each nucleotide into its components (phosphate, sugar, and nucleobase) as follows:

For RNA case:
Input: <string> (of RNA letters)
Input visualization (for illustration only):

Output: all supported Indigo formats, including the KET format (JSON) for Ketcher.
Output visualization (for illustration only):

Algorithm: wrap every RNA letter in r(...)p, i.e. A -> r(A)p, C -> r(C)p, and so on.

Example:
Input: "ACGU"
Output: r(A)p, r(C)p, r(G)p, r(U) (in ket format)

For DNA case:
Input: <string> (of DNA letters)
Input visualization:

Output: all supported Indigo formats, including the KET format (JSON) for Ketcher.
Output visualization:

Algorithm: wrap every DNA letter in d(...)p, i.e. A -> d(A)p, C -> d(C)p, and so on.

Example:
Input: "ACGT"
Output: d(A)p, d(C)p, d(G)p, d(T) (in ket format)
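
The two examples can be reproduced with a small sketch. It uses the sugar prefixes "r" for RNA and "d" for DNA, and it follows the examples in leaving the last nucleotide without a trailing phosphate. That last point is an assumption read off the examples, not a stated rule, and the function name `expand_nucleotides` is hypothetical.

```python
from typing import List

# Sugar prefix per sequence type, as used in the examples above.
SUGAR_PREFIX = {"RNA": "r", "DNA": "d"}


def expand_nucleotides(sequence: str, sequence_type: str) -> List[str]:
    """Expand each letter into sugar(base)p notation.

    Assumption: the final nucleotide gets no phosphate, matching the examples.
    """
    sugar = SUGAR_PREFIX[sequence_type]
    last = len(sequence) - 1
    expanded = []
    for i, base in enumerate(sequence):
        phosphate = "p" if i < last else ""
        expanded.append(f"{sugar}({base}){phosphate}")
    return expanded
```

Calling `expand_nucleotides("ACGU", "RNA")` yields the four items listed in the RNA example, and `expand_nucleotides("ACGT", "DNA")` yields those in the DNA example.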

Please note that the phosphate component "p" appears on the right side of an expanded nucleotide.
"r" means ribose and "d" means deoxyribose.

Solution

Implement a C++ class SequenceLoader alongside the existing Indigo loaders for molecular formats.
Implement the following functions for the C API, where type can be one of "RNA", "DNA" or "PEPTIDE":

int indigoLoadSequence(int source, const char* type);
int indigoLoadSequenceFromString(const char* string, const char* type);
int indigoLoadSequenceFromFile(const char* filename, const char* type);
int indigoLoadSequenceFromBuffer(const char* buffer, int size, const char* type);
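
One way the string, file, and buffer entry points could relate is for the file and buffer variants to delegate to a single string-based loader. The sketch below shows that shape using pure-Python stand-ins. Every function name is hypothetical, the returned dictionary is a placeholder rather than an Indigo object handle, and both the ASCII decoding and the whitespace stripping are assumptions not stated in the issue.

```python
from pathlib import Path

VALID_TYPES = ("RNA", "DNA", "PEPTIDE")


def load_sequence_from_string(text: str, sequence_type: str) -> dict:
    """Stand-in for the string entry point; validates the type argument."""
    if sequence_type not in VALID_TYPES:
        raise ValueError(f"unsupported sequence type: {sequence_type!r}")
    # Assumption: whitespace (e.g. a trailing newline from a file) is ignored.
    symbols = "".join(text.split())
    return {"type": sequence_type, "symbols": symbols}


def load_sequence_from_buffer(buffer: bytes, sequence_type: str) -> dict:
    """Stand-in for the buffer entry point; decodes, then delegates."""
    return load_sequence_from_string(buffer.decode("ascii"), sequence_type)


def load_sequence_from_file(filename: str, sequence_type: str) -> dict:
    """Stand-in for the file entry point; reads bytes, then delegates."""
    return load_sequence_from_buffer(Path(filename).read_bytes(), sequence_type)
```

Keeping the type check in one place means all entry points reject an unknown type in the same way.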

Add language bindings for Python, Java, and C#.
Python binding functions:
def loadSequence(self, input_string: str, sequence_type: str):
def loadSequenceFromFile(self, input_file: str, sequence_type: str):
Add the following content types to WASM "loadMoleculeOrReaction" and Indigo service "convert" API:

chemical/x-rna-sequence, chemical/x-dna-sequence, chemical/x-peptide-sequence
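
A service handler could map these content types onto the type argument of the loader functions. The sketch below is a minimal illustration: the table and helper names are hypothetical, and tolerating media-type parameters and mixed case is an assumption rather than documented behaviour of the Indigo service.

```python
from typing import Optional

# Content types from the issue mapped to the loader's type strings.
CONTENT_TYPE_TO_SEQUENCE_TYPE = {
    "chemical/x-rna-sequence": "RNA",
    "chemical/x-dna-sequence": "DNA",
    "chemical/x-peptide-sequence": "PEPTIDE",
}


def sequence_type_for(content_type: str) -> Optional[str]:
    """Return "RNA", "DNA" or "PEPTIDE", or None for non-sequence types."""
    # Drop any parameters such as "; charset=utf-8" and normalise case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return CONTENT_TYPE_TO_SEQUENCE_TYPE.get(media_type)
```

Returning None for anything else lets the caller fall through to the existing molecule and reaction handling.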

Coordinates for monomers are calculated according to the pictures above. Backbone monomer coordinates are calculated from left to right. Branch monomers are positioned under the sugars they are connected to.
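
The placement rule can be sketched as follows. The referenced pictures are not reproduced here, so the unit spacing, the use of a lower y value to mean "under", and the omission of a trailing phosphate are all assumptions, and `layout_nucleotides` is a hypothetical name.

```python
from typing import List, Tuple

Placed = Tuple[str, float, float]  # (monomer label, x, y)


def layout_nucleotides(bases: str, sugar: str, spacing: float = 1.0) -> List[Placed]:
    """Place sugars and phosphates left to right; put each base under its sugar.

    Assumptions: unit spacing, y decreases downward, no trailing phosphate.
    """
    placed: List[Placed] = []
    x = 0.0
    last = len(bases) - 1
    for i, base in enumerate(bases):
        placed.append((sugar, x, 0.0))      # backbone: sugar
        placed.append((base, x, -spacing))  # branch: base under its sugar
        x += spacing
        if i < last:
            placed.append(("p", x, 0.0))    # backbone: phosphate
            x += spacing
    return placed
```

For a peptide there are no branch monomers in this scheme, so the same idea reduces to placing one monomer per step along the x axis.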

even1024 added the epic: macromolecules label Dec 7, 2023

even1024 added this to the Indigo-1.17.0-rc.1 milestone Dec 7, 2023

even1024 self-assigned this Dec 7, 2023

even1024 changed the title ~~Monomers sequences parser~~ Import sequence format for RNA, DNA and PEPTIDES Dec 7, 2023

even1024 modified the milestones: Indigo-1.17.0-rc.1, Indigo-1.18.0-rc.1 Dec 8, 2023

olganaz mentioned this issue Dec 27, 2023

Macro: Remove and insert nucleotides in sequences (sequence representation) epam/ketcher#3650

Closed

even1024 modified the milestones: Indigo-1.18.0-rc.1, Indigo-1.19.0-rc.1 Dec 28, 2023

even1024 mentioned this issue Jan 3, 2024

Disable layout for macromolecules #1469

Closed

even1024 linked a pull request Jan 4, 2024 that will close this issue

#1426 Import sequence format for RNA, DNA and PEPTIDES #1462

Merged

even1024 closed this as completed Jan 4, 2024

even1024 added a commit that referenced this issue Jan 4, 2024

#1426 Import sequence format for RNA, DNA and PEPTIDES (#1462)

940d5e0

olganaz mentioned this issue Jun 20, 2024

Import/Export of variant monomers from Fasta/Sequence #2015

Closed

ljubica-milovic mentioned this issue Jul 19, 2024

Starting new sequence by typing in sequence mode epam/ketcher#5136

Closed
