Import sequence format for RNA, DNA and PEPTIDES #1426

Closed
even1024 opened this issue Dec 7, 2023 · 0 comments · Fixed by #1462

Background
Indigo needs to parse RNA, DNA, and peptide sequences.
These sequences are represented as plain strings composed of the following symbols:

for peptides:
A - Alanine
C - Cysteine
D - Aspartic Acid
E - Glutamic Acid
F - Phenylalanine
G - Glycine
H - Histidine
I - Isoleucine
K - Lysine
L - Leucine
M - Methionine
N - Asparagine
P - Proline
Q - Glutamine
R - Arginine
S - Serine
T - Threonine
V - Valine
W - Tryptophan
Y - Tyrosine

for RNA nucleotides:
A - AMP (Adenosine monophosphate)
C - CMP (Cytidine monophosphate)
G - GMP (Guanosine monophosphate)
U - UMP (Uridine monophosphate)
T - rTMP (Ribothymidine monophosphate)

for DNA nucleotides:
A - dAMP (Deoxyadenosine monophosphate)
C - dCMP (Deoxycytidine monophosphate)
G - dGMP (Deoxyguanosine monophosphate)
U - dUMP (Deoxyuridine monophosphate)
T - TMP (Thymidine monophosphate)
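The three alphabets above can be captured as lookup sets for input validation. The following is a minimal sketch, not part of the proposed API: the names `ALPHABETS` and `invalid_symbols` are hypothetical, and it assumes that only the uppercase letters listed above are accepted.

```python
from typing import List

# Alphabets copied from the lists above; all names here are illustrative only.
PEPTIDE_LETTERS = set("ACDEFGHIKLMNPQRSTVWY")
RNA_LETTERS = set("ACGUT")
DNA_LETTERS = set("ACGUT")

ALPHABETS = {
    "PEPTIDE": PEPTIDE_LETTERS,
    "RNA": RNA_LETTERS,
    "DNA": DNA_LETTERS,
}


def invalid_symbols(sequence: str, sequence_type: str) -> List[str]:
    """Return symbols not listed for sequence_type, in order of first appearance."""
    allowed = ALPHABETS[sequence_type]
    seen: List[str] = []
    for ch in sequence:
        # Assumption: matching is case-sensitive, so lowercase input is rejected.
        if ch not in allowed and ch not in seen:
            seen.append(ch)
    return seen
```

An empty result means every symbol in the string belongs to the chosen alphabet.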

The sequence parser should split each nucleotide into its components (phosphate, sugar, and nucleobase) as follows:

For RNA case:
Input: <string> (of RNA letters)
Input Visualization (given just for better explanation):
image

Output: All supported indigo formats including ket format in JSON for Ketcher.
Output Visualization (given just for better explanation):
image

Algorithm: wrap every RNA letter in r(...)p, i.e. A -> r(A)p, C -> r(C)p, and so on.

Example:
Input: "ACGU"
Output: r(A)p, r(C)p, r(G)p, r(U) (in ket format)

For DNA case:
Input: <string> (of DNA letters)
Input Visualization:
image
Output: All supported indigo formats including ket format in JSON for Ketcher.
Output Visualization:
image

Algorithm: wrap every DNA letter in d(...)p, i.e. A -> d(A)p, C -> d(C)p, and so on.

Example:
Input: "ACGT"
Output: d(A)p, d(C)p, d(G)p, d(T) (in ket format)

Please note that the phosphate component "p" is appended on the right side of each expanded nucleotide; in the examples above, the last nucleotide carries no trailing phosphate.
"r" means ribose and "d" means deoxyribose.
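The expansion rule can be sketched in a few lines of Python. This is only an illustration of the text notation used in the examples, not the proposed SequenceLoader: the name `expand_nucleotides` is hypothetical, and dropping the phosphate on the final nucleotide is inferred from the two examples rather than stated as a rule.

```python
from typing import List

# Sugar prefix per sequence type: "r" for ribose, "d" for deoxyribose.
SUGAR_PREFIX = {"RNA": "r", "DNA": "d"}


def expand_nucleotides(sequence: str, sequence_type: str) -> List[str]:
    """Wrap each letter as sugar(base)p, omitting "p" on the last nucleotide."""
    sugar = SUGAR_PREFIX[sequence_type]
    last = len(sequence) - 1
    expanded = []
    for i, base in enumerate(sequence):
        # Assumption taken from the examples: no trailing phosphate at the end.
        phosphate = "p" if i < last else ""
        expanded.append(f"{sugar}({base}){phosphate}")
    return expanded


print(", ".join(expand_nucleotides("ACGU", "RNA")))  # r(A)p, r(C)p, r(G)p, r(U)
print(", ".join(expand_nucleotides("ACGT", "DNA")))  # d(A)p, d(C)p, d(G)p, d(T)
```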

Solution

  1. Implement a C++ class SequenceLoader alongside the existing Indigo loaders for molecular formats.
  2. Implement the following functions in the C API, where type is one of "RNA", "DNA" or "PEPTIDE":
int indigoLoadSequence(int source, const char* type);
int indigoLoadSequenceFromString(const char* string, const char* type);
int indigoLoadSequenceFromFile(const char* filename, const char* type);
int indigoLoadSequenceFromBuffer(const char* buffer, int size, const char* type);

  3. Add language bindings for Python, Java and C#.
    Python binding functions:
    def loadSequence(self, input_string: str, sequence_type: str):
    def loadSequenceFromFile(self, input_file: str, sequence_type: str):

  4. Add the following content types to the WASM "loadMoleculeOrReaction" and Indigo service "convert" APIs:

chemical/x-rna-sequence, chemical/x-dna-sequence, chemical/x-peptide-sequence

  5. Monomer coordinates are calculated according to the pictures above. Backbone monomer coordinates are calculated from left to right. Branch monomers are positioned under the sugars they are connected to.
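The coordinate rule for nucleotides could look roughly like the sketch below. It is an assumption-laden illustration, not the implementation: the function name `layout_nucleotides`, the unit step between backbone monomers, the dictionary layout, and the convention that "under" means a lower y value are all hypothetical choices made here for clarity.

```python
from typing import Dict, List


def layout_nucleotides(bases: str, sugar: str, step: float = 1.0) -> List[Dict[str, object]]:
    """Place sugars and phosphates left to right; put each base under its sugar."""
    monomers: List[Dict[str, object]] = []
    x = 0.0
    last = len(bases) - 1
    for i, base in enumerate(bases):
        # Backbone monomer: the sugar sits on the baseline y = 0.
        monomers.append({"kind": "sugar", "label": sugar, "pos": (x, 0.0)})
        # Branch monomer: the base shares the sugar's x and sits one step below.
        monomers.append({"kind": "base", "label": base, "pos": (x, -step)})
        x += step
        if i < last:
            # Backbone monomer: phosphate to the right of the sugar.
            monomers.append({"kind": "phosphate", "label": "p", "pos": (x, 0.0)})
            x += step
    return monomers


for monomer in layout_nucleotides("AC", "r"):
    print(monomer["kind"], monomer["label"], monomer["pos"])
```

Under these assumptions, every base has the same x coordinate as the sugar it branches from, and backbone x coordinates increase strictly from left to right.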