Non-standard Amino Acids #69

Open

ericmjl opened this issue Aug 18, 2020 · 4 comments

Comments

@ericmjl
Collaborator

ericmjl commented Aug 18, 2020

@goraj raised this at work: being able to represent non-standard amino acids could be useful.

The semantics of how this would work probably needs a bit of definition. For example, we would need to:

  1. Add new embedding slots.
  2. Re-train the model using proteins that have synthetic/unnatural amino acids incorporated.
  3. Rework how repping happens.

Leaving this here for further discussion. This seems like a fairly niche use case, though with the increasing use of synthetic amino acids in protein design applications, it could become handy.

@ElArkk
Owner

ElArkk commented Aug 18, 2020

Interesting! IIRC, if we simply indexed the new embedding slots from 26 onwards, we would not break the existing embedding that uses the standard weights.
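A minimal sketch of that idea (the shapes here are hypothetical — 26 standard slots with 10-dim vectors — and the pretrained weights are stand-in random values, not the actual mLSTM weights):

```python
import numpy as np

# Hypothetical shapes: a standard vocabulary of 26 slots, each mapped
# to a 10-dimensional embedding vector.
n_standard, emb_dim = 26, 10
rng = np.random.default_rng(42)
standard_emb = rng.normal(size=(n_standard, emb_dim))  # stand-in for pretrained weights

# Append rows for two hypothetical non-standard residues at indices 26 and 27.
n_new = 2
new_rows = rng.normal(scale=0.1, size=(n_new, emb_dim))  # randomly initialised
extended_emb = np.vstack([standard_emb, new_rows])

# Indices 0-25 still resolve to the original pretrained vectors, so
# existing sequences embed exactly as before.
assert np.array_equal(extended_emb[:n_standard], standard_emb)
print(extended_emb.shape)  # (28, 10)
```

Because the new slots are appended rather than inserted, nothing that indexes into the first 26 rows changes.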

One thing to note: we have so far not implemented the initial 10-dim embedding, which in the original model is also learnable. I guess we would implement that at this point, so that we can generate 10-dim embeddings for the non-natural AAs while retraining the network?

@ElArkk
Owner

ElArkk commented Sep 14, 2020

Pinging back in here. I have been looking at how best to implement this, and got overwhelmed by the number of non-standard residues out there. The first question that arose was how to handle the fact that we can never hard-code every possible non-natural AA that people might have in their sequences:

  • We make an "expert-curated" list of frequently used non-standard AAs that we hard-code into new embedding slots, or
  • We do not hard-code anything, and instead give the user the ability to define new embedding slots. The embeddings would always be initialised randomly, and it would be the user's responsibility to have enough examples of a given non-natural AA to obtain a meaningful embedding through re-training.

We will also run out of one-letter AA codes pretty quickly, so a re-work of how embedding happens is unavoidable (some sort of substring search, or maybe passing sequences containing non-natural AAs as lists of single AAs rather than as strings?)
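The list-of-tokens idea could look something like this. Everything here is hypothetical: the 20-letter vocabulary, the `register_residue` helper, and the `nmeLEU` token are illustrations, not part of jax-unirep's actual API or index mapping:

```python
# Represent a sequence as a list of residue tokens rather than a string,
# so multi-character codes like "nmeLEU" fit naturally alongside
# single-letter codes.
standard = "ACDEFGHIKLMNPQRSTVWY"  # hypothetical 20-letter base vocabulary
aa_to_idx = {aa: i for i, aa in enumerate(standard)}

def register_residue(token, mapping):
    """Assign the next free embedding slot to a user-defined residue token."""
    if token not in mapping:
        mapping[token] = len(mapping)
    return mapping[token]

register_residue("nmeLEU", aa_to_idx)  # lands in slot 20

seq = ["M", "T", "L", "D", "nmeLEU", "A", "T", "T"]
indices = [aa_to_idx[tok] for tok in seq]
print(indices)  # [10, 16, 9, 2, 20, 0, 16, 16]
```

This sidesteps the one-letter-code problem entirely: tokens can be any length, and no substring search is needed because the sequence arrives pre-split.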

Your thoughts, @ericmjl and @ivanjayapurna ?

@ivanjayapurna
Contributor

My initial thoughts on this (having admittedly not spent much time thinking about it): you'll never be able to satisfy everyone with a manually/expertly curated list, so there will always be a need for a custom-add option. In that case, I think it makes sense not to hard-code anything and to provide only a custom-add option, where the user defines an amino acid dictionary from scratch containing whatever letterings they want, then uses the jax-unirep library to retrain on relevant examples that they provide themselves. It also means each user could assign codes to exactly the minimal list of AAs they need, so hopefully you either wouldn't run out of one-letter AA codes, or at least any substring search time would be minimized.

@ElArkk
Owner

ElArkk commented Sep 15, 2020

Good point. I also think a custom-add option would be the way to go. I don't know what the standard way of representing sequences with non-standard AAs is. We only really have a problem if it's still a single string of AAs, with some AAs not being single letters (e.g. if a sequence containing nme-LEU is written as MTLDnme-LEUATT). If there's some delimiter anyway, a simple lookup in a dictionary that the user provides would suffice. Maybe we should just enforce the constraint that there must be a delimiter between each AA in those cases.
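With an enforced delimiter, the parsing step becomes trivial. A sketch, assuming a hypothetical `-` delimiter and a user-supplied set of known residue codes (neither is an existing jax-unirep convention):

```python
def tokenize(seq, known, delimiter="-"):
    """Split a delimited sequence string into residue tokens, validating
    each token against a user-supplied set of known residue codes."""
    tokens = seq.split(delimiter)
    unknown = [t for t in tokens if t not in known]
    if unknown:
        raise ValueError(f"Unknown residue codes: {unknown}")
    return tokens

# Hypothetical vocabulary: the 20 standard one-letter codes plus one
# user-registered non-natural residue.
known = set("ACDEFGHIKLMNPQRSTVWY") | {"nmeLEU"}
print(tokenize("M-T-L-D-nmeLEU-A-T-T", known))
# ['M', 'T', 'L', 'D', 'nmeLEU', 'A', 'T', 'T']
```

Validation at parse time also gives users an immediate, readable error when a sequence contains a residue they forgot to register.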

I will in the meantime start implementing a layer that multiplies the initial seq_len x n_AA matrix with a randomly initialised, learnable n_AA x 10 embedding matrix.
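That multiplication can be sketched in a few lines. The sizes below are hypothetical (22 vocabulary slots, 10-dim embedding, an 8-residue sequence), and this is plain numpy rather than the actual trainable JAX layer:

```python
import numpy as np

n_AA, emb_dim = 22, 10  # hypothetical vocabulary size and embedding width
rng = np.random.default_rng(0)
embedding = rng.normal(scale=0.1, size=(n_AA, emb_dim))  # the learnable parameters

indices = np.array([10, 16, 9, 2, 20, 0, 16, 16])  # a tokenized example sequence
one_hot = np.eye(n_AA)[indices]          # (seq_len, n_AA) one-hot matrix
embedded = one_hot @ embedding           # (seq_len, 10) embedded sequence

# The matmul is mathematically equivalent to a plain row lookup,
# which is the cheaper way to compute it in practice.
assert np.allclose(embedded, embedding[indices])
print(embedded.shape)  # (8, 10)
```

Writing it as a matrix product keeps the operation differentiable with respect to the embedding matrix, which is what makes the embedding learnable during retraining.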
