Now that we have a method to compute candidate GO terms, we should investigate which protein-term pairs to make predictions for (the competition limit is 15k predictions, while the test set contains >140k proteins and there are tens of thousands of GO terms).
For a first analysis, let's start with the direct child terms of those already assigned to the proteins in the test set. Each term has a score based on its rarity over the whole protein universe (Information Accretion; here is a full explanation of this term). Let's call this IA(term).
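As a rough illustration of the "direct child terms" idea, here is a minimal sketch. It assumes two plain dictionaries that are not part of the package: `annotations` (protein ID to the set of GO terms already assigned) and `children` (GO term to its direct child terms). The function name `direct_child_candidates` is hypothetical; the package's own functions may look different.

```python
def direct_child_candidates(annotations, children):
    """Return, per protein, the direct children of its assigned GO terms.

    annotations: dict mapping protein ID -> iterable of assigned GO terms (assumed input)
    children:    dict mapping GO term -> set of direct child GO terms (assumed input)
    """
    candidates = {}
    for protein, assigned in annotations.items():
        assigned = set(assigned)
        kids = set()
        for term in assigned:
            # Terms without children (leaves) simply contribute nothing.
            kids |= set(children.get(term, ()))
        # Keep only candidate terms, not terms the protein already has.
        candidates[protein] = kids - assigned
    return candidates


if __name__ == "__main__":
    toy_annotations = {"P1": {"GO:1"}, "P2": {"GO:1", "GO:2"}}
    toy_children = {"GO:1": {"GO:2", "GO:3"}, "GO:2": {"GO:4"}}
    print(direct_child_candidates(toy_annotations, toy_children))
```

Subtracting the already assigned terms keeps the result limited to genuinely new candidates for each protein.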
We should create an analysis where we compute:
* All the direct child GO terms over the test set of proteins, saving for each candidate term the number of proteins it is a candidate for (let's call this #proteins(term)).
* We will go naive: we will rank the terms to be predicted by the product #proteins(term) * IA(term) of each term.
* The GO term-protein pairs to be predicted, with a cutoff at the limit of 15k predictions.
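The three steps above could be sketched as follows. This is only an illustration under assumptions: `candidates` (protein ID to set of candidate GO terms) and `ia` (GO term to its IA value) are hypothetical inputs, the function names are made up for this example, terms missing from `ia` are treated as having IA 0, and ties are broken alphabetically just to make the output deterministic.

```python
from collections import Counter


def rank_terms(candidates, ia):
    """Rank candidate terms by #proteins(term) * IA(term), highest first."""
    # #proteins(term): number of proteins for which the term is a candidate.
    counts = Counter(term for terms in candidates.values() for term in terms)
    # Assumption: a term without a known IA value scores 0.
    scores = {term: n * ia.get(term, 0.0) for term, n in counts.items()}
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked, counts, scores


def select_pairs(candidates, ia, limit=15_000):
    """Walk the ranked terms and collect (protein, term) pairs up to the limit."""
    ranked, _, _ = rank_terms(candidates, ia)
    pairs = []
    for term in ranked:
        for protein in sorted(candidates):
            if term in candidates[protein]:
                if len(pairs) >= limit:
                    return pairs
                pairs.append((protein, term))
    return pairs


if __name__ == "__main__":
    toy_candidates = {"P1": {"GO:A", "GO:B"}, "P2": {"GO:A"}, "P3": {"GO:B", "GO:C"}}
    toy_ia = {"GO:A": 1.0, "GO:B": 2.0, "GO:C": 5.0}
    print(select_pairs(toy_candidates, toy_ia, limit=3))
```

Note that with this naive cutoff the last term selected may be only partially covered: some of its proteins make it into the 15k pairs and others do not.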
Jupyter notebooks are usually preferred over scripts for analyses, since you can describe each step with markdown, plot things, etc., which will be useful for communicating the decisions behind our final predictions. GitHub displays notebooks well: it formats the markdown, displays the plots, etc. Let's include the generated notebook in a folder called analyses and "consume" the functionality of the package already developed, which also serves as a first example of its use.
* ia() and get_parents() for #28
* filtering to ensure children and parents are candidate terms, not actual terms
* ancestors_within_distance for max_distance param to get_parents()