Non-standard Amino Acids #69
Comments
Interesting! IIRC, I think if we would simply index the new embedding slots at 26 onwards, we would not break the existing embedding using the standard weights. One thing is that we so far did not implement the initial 10-dim embedding that in the original model is also learnable. I guess we would implement that at this point, to generate 10-dim embeddings for the non-natural AAs while retraining the network?
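To make the "index at 26 onwards" idea concrete, here is a minimal sketch, not jax-unirep code. It assumes the existing embedding table has 26 rows (indices 0 to 25) of 10 dimensions each; the function name `extend_embeddings` and the random stand-in weights are hypothetical. New rows are appended, so the existing rows and their indices stay untouched.

```python
import numpy as np


def extend_embeddings(pretrained, n_new, seed=0):
    """Append n_new freshly initialised rows below the existing embedding rows.

    Hypothetical helper: existing rows keep their indices, so sequences made
    only of standard residues embed exactly as before.
    """
    rng = np.random.default_rng(seed)
    # Assumption: initialise new rows at the same scale as the existing ones.
    scale = pretrained.std()
    new_rows = rng.normal(0.0, scale, size=(n_new, pretrained.shape[1]))
    return np.concatenate([pretrained, new_rows], axis=0)


# Stand-in for the standard weights (assumed shape: 26 tokens x 10 dims).
pretrained = np.random.default_rng(1).normal(size=(26, 10))

# Three non-natural AAs would then occupy indices 26, 27 and 28.
extended = extend_embeddings(pretrained, n_new=3)
```

Only the appended rows would need to be learned from scratch during retraining; whether the original rows are frozen or fine-tuned is a separate choice.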
Pinging back in here. I have been looking at how to best implement this, and got overwhelmed by the amount of non-standard residues out there. The first question that arose was how we handle the fact that we can never hard-code all possible non-natural AA that people might have in their sequences:
We will also run out of one-letter AA codes pretty quickly, meaning a re-work of how embedding happens is unavoidable (some sort of substring search, or passing sequences containing non-natural AA as lists of single AA's rather than strings maybe?) Your thoughts, @ericmjl and @ivanjayapurna?
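A rough sketch of both options mentioned above, under stated assumptions: the bracket notation (`[Orn]` for a multi-letter residue inside a string) is a hypothetical convention chosen for illustration, not an established standard or anything jax-unirep supports, and `split_residues` is a made-up name. Lists of residues are passed through unchanged; strings are split into single-letter and bracketed tokens.

```python
import re

# One token is either a bracketed multi-letter code or a single letter.
TOKEN = re.compile(r"\[([^\[\]]+)\]|([A-Za-z])")


def split_residues(seq):
    """Return a list of residue tokens from a string or a list of residues."""
    if not isinstance(seq, str):
        # Already a list (or tuple) of residues: nothing to parse.
        return list(seq)
    tokens = []
    pos = 0
    for match in TOKEN.finditer(seq):
        if match.start() != pos:
            # finditer silently skips unmatched characters, so check for gaps.
            raise ValueError(f"cannot parse {seq!r} at position {pos}")
        tokens.append(match.group(1) or match.group(2))
        pos = match.end()
    if pos != len(seq):
        raise ValueError(f"cannot parse {seq!r} at position {pos}")
    return tokens
```

With this, `split_residues("MK[Orn]LV")` and `split_residues(["M", "K", "Orn", "L", "V"])` give the same token list, so downstream embedding code only ever has to deal with lists.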
My initial thoughts on this (having admittedly not spent that much time thinking about it) are that you'll never be able to satisfy everyone with a manually / expertly curated list - so there will always be a need for a custom-add option. In which case, I think it makes sense not to hard-code anything and only create a custom-add option where the user defines an amino acid dictionary containing whatever letterings they would want from scratch by themselves, then using the jax-unirep library to retrain on relevant examples that they would provide on their own. It also means each user could just assign codes to the exact minimal list of AA's they need, meaning hopefully you either wouldn't run out of 1 letter AA codes, or at least any substring search time would be minimized.
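As a sketch of what such a custom-add option could look like (all names here, including `make_encoder` and the example dictionary, are hypothetical and not part of jax-unirep): the user supplies the whole residue-to-index dictionary, and the helper only validates it and turns residue lists into integer indices.

```python
def make_encoder(aa_to_index):
    """Build an encoder from a user-defined residue -> index dictionary."""
    indices = list(aa_to_index.values())
    if len(set(indices)) != len(indices):
        raise ValueError("each residue must map to a unique index")

    def encode(residues):
        # Fail loudly on residues the user did not define.
        missing = [r for r in residues if r not in aa_to_index]
        if missing:
            raise KeyError(f"undefined residues: {missing}")
        return [aa_to_index[r] for r in residues]

    return encode


# Example: a user who only needs three residues defines exactly those.
encode = make_encoder({"A": 0, "G": 1, "Orn": 2})
indices = encode(["A", "Orn", "G"])
```

Keeping the dictionary entirely user-defined means the library never has to ship a curated list; the cost is that the user must retrain on their own examples, as noted above.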
Good point. I also think a custom-add option would be the way to go. I don't know what the standard way of representing sequences with non-standard AA's is. We only really have a problem if it's still a single string of AA's, with some AA's not being single letters (e.g. if a sequence containing I will in the meantime start implementing a layer for multiplying the initial
@goraj raised this at work, that being able to rep non-standard amino acids could be good.
The semantics of how this would work probably needs a bit of definition. For example, we would need to:
Leaving this here for further discussion. This seems to be quite a niche use case, though with synthetic amino acids increasingly used in protein design applications, it could become handy.