Analysis: get a list of candidate GO terms #28

Open
leandroradusky opened this issue May 12, 2023 — with Manas.Tech Commit · 1 comment
leandroradusky commented May 12, 2023

Now that we have a method to compute candidate GO terms, we should investigate which protein-term pairs to make predictions for (the competition limit is 15k predictions, while the test set has more than 140k proteins and there are tens of thousands of GO terms).

For a first analysis, let's start with the direct child terms of the terms already assigned to the proteins in the test set. Each term has a score based on its rarity over the whole protein universe (Information Accretion; here is a full explanation of this term). Let's call this IA(term).

We should create an analysis where we compute:

  1. All the direct child GO terms over the test set of proteins, saving for each candidate term the number of proteins this term is a candidate for (let's call this #proteins(term)).
  2. We will go naive: we will rank the terms to be predicted by multiplying #proteins(term) * IA(term) for each term.
  3. We should compute the pairs of GO terms and proteins to be predicted, with a cutoff at the 15k-prediction limit.
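
As a rough illustration of the three steps above, here is a minimal sketch in plain Python. It does not use the package's actual API: `candidate_terms`, `rank_pairs`, the `children` map, the toy annotations, the IA values, and the `GO:A`-style term IDs are all hypothetical placeholders, and the 15k limit is treated as a cap on the total number of protein-term pairs.

```python
from collections import Counter


def candidate_terms(assigned, children):
    """Direct children of a protein's assigned terms, minus terms it already has."""
    found = set()
    for term in assigned:
        found.update(children.get(term, ()))
    return found - set(assigned)


def rank_pairs(annotations, children, ia, limit=15_000):
    """Rank (protein, term) pairs by #proteins(term) * IA(term) and apply the cutoff."""
    # Step 1: candidate terms per protein, and #proteins(term) per candidate term.
    candidates = {p: candidate_terms(t, children) for p, t in annotations.items()}
    n_proteins = Counter(term for terms in candidates.values() for term in terms)

    # Step 2: naive score per term; terms without a known IA score 0 (an assumption).
    score = {term: n * ia.get(term, 0.0) for term, n in n_proteins.items()}

    # Step 3: all candidate pairs, best-scoring terms first, ties broken by name.
    pairs = [(p, term) for p, terms in candidates.items() for term in terms]
    pairs.sort(key=lambda pt: (-score[pt[1]], pt[1], pt[0]))
    return pairs[:limit], score


# Toy data (hypothetical): a tiny parent -> children map, annotations, IA values.
toy_children = {"GO:A": ["GO:B", "GO:C"], "GO:B": ["GO:D"]}
toy_annotations = {"P1": {"GO:A"}, "P2": {"GO:A", "GO:B"}, "P3": {"GO:B"}}
toy_ia = {"GO:B": 1.0, "GO:C": 2.0, "GO:D": 3.0}

top_pairs, term_scores = rank_pairs(toy_annotations, toy_children, toy_ia, limit=3)
for protein, term in top_pairs:
    print(protein, term, term_scores[term])
```

In the notebook, the toy dictionaries would be replaced by the real ontology, test-set annotations, and IA table, loaded through whatever the package exposes for that.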

Jupyter notebooks are usually preferred over scripts for analyses, since you can describe each step in markdown, plot results, etc., which will be useful for communicating our decisions toward the final predictions. GitHub renders notebooks well: it formats the markdown and displays the plots. Let's put the generated notebook in a folder called analyses and have it "consume" the functionality of the package already developed, which will also serve as a first example of its use.

nthiad self-assigned this May 23, 2023
nthiad added a commit that referenced this issue May 30, 2023
nthiad added a commit that referenced this issue Jun 1, 2023
* ia() and get_parents() for #28

* filtering to ensure children and parents are candidate terms, not actual terms

* ancestors_within_distance for max_distance param to get_parents()
nthiad commented Jun 1, 2023

partly added in #33 but jupyter notebook needs to be written

nthiad closed this as completed Jun 1, 2023
nthiad reopened this Jun 1, 2023