Non-standard Amino Acids #69

Open

ericmjl opened this issue Aug 18, 2020 · 4 comments

Comments

@ericmjl
Collaborator

ericmjl commented Aug 18, 2020

@goraj raised this at work: being able to represent non-standard amino acids could be useful.

The semantics of how this would work probably needs a bit of definition. For example, we would need to:

  1. Add new embedding slots.
  2. Re-train the model using proteins that have synthetic/unnatural amino acids incorporated.
  3. Rework how repping happens.

Leaving this here for further discussion. This seems like a fairly niche use case, though with the increasing use of synthetic amino acids in protein design applications, it could become handy.

@ElArkk
Owner

ElArkk commented Aug 18, 2020

Interesting! IIRC, if we simply indexed the new embedding slots from 26 onwards, we would not break the existing embedding that uses the standard weights.
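A minimal sketch of that idea (the shapes here are hypothetical — 26 standard slots with 10-dim vectors — and the pretrained weights are stand-in random values, not the actual mLSTM weights):

```python
import numpy as np

# Hypothetical shapes: a standard vocabulary of 26 slots, each mapped
# to a 10-dimensional embedding vector.
n_standard, emb_dim = 26, 10
rng = np.random.default_rng(42)
standard_emb = rng.normal(size=(n_standard, emb_dim))  # stand-in for pretrained weights

# Append rows for two hypothetical non-standard residues at indices 26 and 27.
n_new = 2
new_rows = rng.normal(scale=0.1, size=(n_new, emb_dim))  # randomly initialised
extended_emb = np.vstack([standard_emb, new_rows])

# Indices 0-25 still resolve to the original pretrained vectors, so
# existing sequences embed exactly as before.
assert np.array_equal(extended_emb[:n_standard], standard_emb)
print(extended_emb.shape)  # (28, 10)
```

Because the new slots are appended rather than inserted, nothing that indexes into the first 26 rows changes.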

One thing to note: we have so far not implemented the initial 10-dim embedding, which in the original model is also learnable. I guess we would implement that at this point, so that we can generate 10-dim embeddings for the non-natural AAs while retraining the network?

@ElArkk
Owner

ElArkk commented Sep 14, 2020

Pinging back in here. I have been looking at how best to implement this, and got overwhelmed by the number of non-standard residues out there. The first question that arose was how to handle the fact that we can never hard-code every possible non-natural AA that people might have in their sequences:

  • We make an "expert-curated" list of frequently used non-standard AAs that we hard-code into new embedding slots, or
  • We do not hard-code anything, and instead give the user the ability to define new embedding slots. The embeddings would always be initialised randomly, and it would be the user's responsibility to have enough examples of a given non-natural AA to obtain a meaningful embedding through re-training.

We will also run out of one-letter AA codes pretty quickly, so a re-work of how embedding happens is unavoidable (some sort of substring search, or maybe passing sequences containing non-natural AAs as lists of single AAs rather than as strings?)
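The list-of-tokens idea could look something like this. Everything here is hypothetical: the 20-letter vocabulary, the `register_residue` helper, and the `nmeLEU` token are illustrations, not part of jax-unirep's actual API or index mapping:

```python
# Represent a sequence as a list of residue tokens rather than a string,
# so multi-character codes like "nmeLEU" fit naturally alongside
# single-letter codes.
standard = "ACDEFGHIKLMNPQRSTVWY"  # hypothetical 20-letter base vocabulary
aa_to_idx = {aa: i for i, aa in enumerate(standard)}

def register_residue(token, mapping):
    """Assign the next free embedding slot to a user-defined residue token."""
    if token not in mapping:
        mapping[token] = len(mapping)
    return mapping[token]

register_residue("nmeLEU", aa_to_idx)  # lands in slot 20

seq = ["M", "T", "L", "D", "nmeLEU", "A", "T", "T"]
indices = [aa_to_idx[tok] for tok in seq]
print(indices)  # [10, 16, 9, 2, 20, 0, 16, 16]
```

This sidesteps the one-letter-code problem entirely: tokens can be any length, and no substring search is needed because the sequence arrives pre-split.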

Your thoughts, @ericmjl and @ivanjayapurna ?

@ivanjayapurna
Contributor

My initial thoughts on this (having admittedly not spent much time thinking about it): you'll never be able to satisfy everyone with a manually/expertly curated list, so there will always be a need for a custom-add option. In that case, I think it makes sense not to hard-code anything and to provide only a custom-add option, where the user defines an amino acid dictionary from scratch containing whatever letterings they want, then uses the jax-unirep library to retrain on relevant examples that they provide themselves. It also means each user could assign codes to exactly the minimal list of AAs they need, so hopefully you either wouldn't run out of one-letter AA codes, or at least any substring search time would be minimized.

@ElArkk
Owner

ElArkk commented Sep 15, 2020

Good point. I also think a custom-add option would be the way to go. I don't know what the standard way of representing sequences with non-standard AAs is. We only really have a problem if it's still a single string of AAs, with some AAs not being single letters (e.g. if a sequence containing nme-LEU is written as MTLDnme-LEUATT). If there's some delimiter anyway, a simple lookup in a dictionary that the user provides would suffice. Maybe we should just enforce the constraint that there must be a delimiter between each AA in those cases.
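With an enforced delimiter, the parsing step becomes trivial. A sketch, assuming a hypothetical `-` delimiter and a user-supplied set of known residue codes (neither is an existing jax-unirep convention):

```python
def tokenize(seq, known, delimiter="-"):
    """Split a delimited sequence string into residue tokens, validating
    each token against a user-supplied set of known residue codes."""
    tokens = seq.split(delimiter)
    unknown = [t for t in tokens if t not in known]
    if unknown:
        raise ValueError(f"Unknown residue codes: {unknown}")
    return tokens

# Hypothetical vocabulary: the 20 standard one-letter codes plus one
# user-registered non-natural residue.
known = set("ACDEFGHIKLMNPQRSTVWY") | {"nmeLEU"}
print(tokenize("M-T-L-D-nmeLEU-A-T-T", known))
# ['M', 'T', 'L', 'D', 'nmeLEU', 'A', 'T', 'T']
```

Validation at parse time also gives users an immediate, readable error when a sequence contains a residue they forgot to register.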

I will in the meantime start implementing a layer that multiplies the initial seq_len x n_AA matrix with a randomly initialised, learnable n_AA x 10 embedding matrix.
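That multiplication can be sketched in a few lines. The sizes below are hypothetical (22 vocabulary slots, 10-dim embedding, an 8-residue sequence), and this is plain numpy rather than the actual trainable JAX layer:

```python
import numpy as np

n_AA, emb_dim = 22, 10  # hypothetical vocabulary size and embedding width
rng = np.random.default_rng(0)
embedding = rng.normal(scale=0.1, size=(n_AA, emb_dim))  # the learnable parameters

indices = np.array([10, 16, 9, 2, 20, 0, 16, 16])  # a tokenized example sequence
one_hot = np.eye(n_AA)[indices]          # (seq_len, n_AA) one-hot matrix
embedded = one_hot @ embedding           # (seq_len, 10) embedded sequence

# The matmul is mathematically equivalent to a plain row lookup,
# which is the cheaper way to compute it in practice.
assert np.allclose(embedded, embedding[indices])
print(embedded.shape)  # (8, 10)
```

Writing it as a matrix product keeps the operation differentiable with respect to the embedding matrix, which is what makes the embedding learnable during retraining.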
