-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support v2-beta2 kb.phenoscape API #235
Comments
Problem 1: /term/all_descendants parts not workingThe rphenoscape/tests/testthat/test-classif.R Lines 61 to 63 in 531601b
The problem is the following line returns FALSE instead TRUE.
The The production and v2-beta api return 500+ records for finding term descendants of
The v2-beta2 api returns 28 records for finding term descendants of
If you have
Perhaps the |
Problem 2: /similarity/corpus_size always returns 0The corpus_size() function returns 0 for taxa and genes.
This is expected to be at least 100. This can be reproduced from the command line:
|
Problem 3: /similarity/frequency API changeThe
We are currently passing
I think the 'E.g. /' is a rendering problem and should be I am not sure how to include the corpus(taxa or genes) into the path parameter. |
Problem 4: /similarity/matrix API failingThe
The /similarity/matrix endpoint has a new
I tried hard coding the
|
@balhoff Please see the above problems I encountered with v2-beta2. In addition I noticed something strange in swagger. |
Fix for problem 1: phenoscape/phenoscape-kb-services#472 |
Fix for swagger issue : phenoscape/phenoscape-kb-services#473 |
Re: problem 2 and problem 3 — parameters have changed for all services related to similarity corpora. Instead of using an IRI to name one, you provide a SPARQL property path for which the subjects are items in the corpus (e.g. taxa), and the objects are the annotations (e.g. phenotype classes). You can also (optionally) provide a For the previous "taxa" corpus, use:
For the previous "genes" corpus, use:
|
Problem 4 should be fixed by phenoscape/phenoscape-kb-services#476 (which has been deployed). For the subsumer matrix, you may want to consider allowing a choice of relations to traverse (new feature). |
@balhoff can you explain (or link to the documentation that explains) what the specifier parameters are for? I don't recall these from our biweekly discussion, but I may have missed it. Do these essentially act as filters for the initial subject of the property chain? (The parameter name seems rather confusing - can you say where that's coming from?) |
That's exactly right. I invented these today, so feedback on the name is entirely welcome! I realized that some additional specification was needed to target the corpus items. What do you think? |
Also (but perhaps the documentation explains this?) for a path of |
If they're in essence a subject filter, then maybe just call them such? I.e., |
The |
Why not the Swagger docs? Isn't that where someone would go to find it? Of course, if it's lengthy, you could put it on the wiki (but I would use the phenoscape-kb-services repo wiki), and then link to it from the Swagger docs. |
So when you say to connect to the phenotypes what you mean is connect to the phenotype class(es) because that, not the instance(s), is what we're interested in and where the semantics are codified. |
@hlapp I updated the parameter names as you suggested: phenoscape/phenoscape-kb-services#481 |
Problem 5: /similarity/frequency 500 ErrorI updated the
The above code uses the POST /similarity/frequency endpoint. To reproduce in R I do the following:
If I reduce the terms IRIs to 185 the API doesn't fail but does take 2m10s. In testing this out I noticed some data differences between the v2-beta and v2-beta2 API results:
The subsumer matrix is created by calling the /similarity/matrix API endpoint. v2-beta2 IRIs: iris.txt |
@johnbradley could you paste the list of terms here? |
@balhoff I updated my comment to include a link to a text file of IRIs. |
Problem 6: /similarity/matrix returning less results for "basihyal bone" phenotypesFailing testA test started failing when switching to v2-beta2 API:
The R code to reproduce the problem is:
Explanation of the testThe test fetches phenotype IRIs curl exampleExample of only receiving 24 IRI back from /similarity/matrix for a "basihyal bone" phenotype IRI:
If you switch to the v2-beta API in the above curl command 385 items are returned. The phenotype IRI in question:
QuestionIs the difference in number of returned IRIs is expected? |
Problem 7: /similarity/matrix returning IRIs that have no labelsFailing testA test started failing when switching to v2-beta2 API:
The R test code: rphenoscape/tests/testthat/test-pk.R Lines 180 to 190 in f45e325
A quick way to see the data in R is:
Explanation of the testThe femur IRI is sent to the /similarity/matrix API and from the results 30 IRIs are sampled. Comparing v2-beta results vs v2-beta2 resultsI ran the quick R example above using v2-beta and v2-beta2. Example v2-beta2 IRI that has no label:
This IRI is not valid for v2-beta. Some "has part ..." labels show up in v2-beta but did not show up in v2-beta2. QuestionShould the http://purl.org/phenoscape/term/relation/ IRIs have labels? If not I can filter them out like we do the CARO IRIs. |
Problem 8: Unable to determine term categories for IRIs returned by /similarity/matrixFailing testA test started failing when switching to v2-beta2 API:
The R test code: rphenoscape/tests/testthat/test-freqs.R Lines 61 to 65 in 531601b
A quick way to see the IRIs in R is:
In the above example scroll to the right to see the Explanation of the testThe test looks up IRIs for "fin ray", "dorsal fin", and "caudal fin". The IRIs that we can't determine term category have the
curl ExampleFetch ancestors for a relation IRI:
Fetch term classification for a relation IRI:
QuestionShould these ../term/relation IRIs return data for /term/classification and/or /term/all_ancestors? |
Problem 9: Resnik similarity zero for some IRI in a matrix returned by /similarity/matrixFailing testA test started failing when switching to v2-beta2 API:
The R test code: rphenoscape/tests/testthat/test-semsim.R Lines 63 to 74 in f45e325
So The IRI for
Explanation of the testThe test uses /phenotype/query with "basihyal bone" and taxon = "Cyprinidae" to create a list of IRIs. Example IRI that has 0 Resnik similarityFor the an IRI that had 0 Resnik similarity we received 797 for the "frequency score (subsumed items)" returned by
797 the size as the corpus of taxa. Since we take QuestionIt seems like the IRIs that are problematic have a label ending in "... absent". These IRIs are coming from /phenotype/query. Should these "absent" IRI be filtered out at some point? |
Re: problem 9, this (Resnik similarity score of zero) can only really come about if two terms do not have any common subsumers in the matrix. This could be because of an error in the Lines 249 to 259 in 4855e6c
This may inadvertently for some terms remove the only common subsumer(s) that there are. It seems more likely that there are some common subsumers that are being returned as from the matrix endpoint, but then erroneously receive no count or a count of zero in the frequencies endpoint. |
@hlapp Re: problem 9: I removed the logic that removes rows and the problem persisted. It looks like I missed a pretty big part of what happens in problem 9 ( fetching frequencies from |
For problem 5—I made a PR to perform many queries instead of one big one: phenoscape/phenoscape-kb-services#489 |
@johnbradley for problem 6, the reduced number of subsumers in the matrix for a phenotype is expected. You will get more if you add more arguments to the relations parameter. For the phenotype IRI you mentioned, if you add the relation Note to myself—object properties for different situations is one of the topics that needs documentation. |
Yes. For example, when would I and would I not want to add |
For problem 7, the |
For problem 6: I will send an example relation list to @johnbradley which most closely mimics the previous results. |
@johnbradley and @balhoff just to clarify from our discussion: The Resnick similarity between two terms is zero if (a) they have no subsumers in common (in a graph with a root shared by all terms this should never happen), or if (b) the subsumer(s) that they do have in common either are the root term(s) or have the same frequency as the root term (i.e., for which the frequency is equal to the corpus size). Note that, unlike Jaccard, Resnick cannot distinguish between a root term and a term descending from the root term that has nonetheless the same frequency as the root term. (This means for example that if the only change we made to a graph is adding a line of subsumer terms to a term that currently is the root term in a graph, then Resnick similarities for any pair of terms would be unchanged. Jaccard similarities would change, however, because now we've added terms into the union and intersection sets of subsumers for any pair of terms.) Hence, if there isn't a bug with frequency calculations, one question is, is it "correct" (however we define this) that for |
Problem 9: The test currently samples 10 rows from the subsumer matrix (subs.mat): rphenoscape/tests/testthat/test-semsim.R Lines 66 to 69 in 531601b
Could the test be removing the only common subsumer for some terms? Using the idea from #239 I checked jaccard similarity on the sampled subsumer matrix:
|
@johnbradley good catch, and it seems your check shows this to indeed be a (the?) problem. The subsampling is there because originally obtaining the frequencies took more time than seemed tolerable. If you disable the subsampling, does the runtime become prohibitive for a test suite? You can disable the subsampling simply by reassigning # subs1 <- rownames(subs.mat)[s]
# subs.mat1 <- subs.mat[s,]
subs.mat1 <- subs.mat
# rownames(subs.mat1) <- subs1 |
Problem 9: Even after removing subsampling the test is still failing.
The code then produced the following grid:
The above matrix is creating by combining the subsumer matrix with the frequency values. Below is the subsumer matrix with an additional nlog_term_freq column:
If you scroll to the right you can see the only subsumer with Details about 15th subsumer IRI:
So these two phenotypes have a common subsumer of "implies_presence_of multicellular organism" but this subsumer has a term_frequency of 1, which -nlog converts to 0. |
It seems this shows that the cause of zero Resnick similarity is in the database. It's certainly expected that the frequency of "implies_presence_of some 'multicellular organism'" would be equal to the corpus size for corpus "taxa". There are nevertheless two things that are surprising, but they're both presumably due to the database content and how it's generated. One is why |
@johnbradley which relations are you requesting for Problem 9? I think this may be the cause of missing common subsumers. Also I think I neglected to send you a suggested list to use, is that right? |
When creating the subsumer matrix using the /similarity/matrix endpoint we only specify You did give me some defaults for |
Problem 5 is no longer occurring. Fixed by #235 (comment) |
Removes sampling that occasionally caused failures implementing fix suggested here: #235 (comment) Fixes #246
Removes sampling that occasionally caused failures implementing fix suggested here: #235 (comment) Fixes #246
The KB v2-beta2 API was replaced by a new version (currently Note the issue mentioned here #235 (comment) was fixed by 10b2206. Closing this issue since we aren't using the v2-beta2 KB API and the items found in this issue have been resolved. |
Add support for the v2-beta2 API.
Problems
The text was updated successfully, but these errors were encountered: