Pagination / offset of results for the SPARQL endpoint is faulty #150

dhimmel · 2020-12-03T20:28:38Z

Thanks for making the MeSH RDF SPARQL API. It's been convenient for quick access to MeSH.

I'd like to do a query that returns over 1000 results, and therefore need to figure out how to use pagination with the SPARQL API at https://id.nlm.nih.gov/mesh/sparql. Here's my query to return a table of descriptors:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX meshv: <http://id.nlm.nih.gov/mesh/vocab#>
SELECT *
FROM <http://id.nlm.nih.gov/mesh/2020>
WHERE {
  ?mesh_uri a meshv:Descriptor .
  ?mesh_uri meshv:identifier ?mesh_id.
  ?mesh_uri rdfs:label ?mesh_label .
} 
ORDER BY ?mesh_uri

But I'm having trouble incrementing limit and offset to retrieve all results.

In search of a more reproducible example, I've simplified it to the this API call, generated by this python code:

import requests

query = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX meshv: <http://id.nlm.nih.gov/mesh/vocab#>
SELECT *
FROM <http://id.nlm.nih.gov/mesh/2020>
WHERE {
  ?mesh_uri a meshv:Descriptor .
  ?mesh_uri meshv:identifier ?mesh_id.
  ?mesh_uri rdfs:label ?mesh_label .
} 
ORDER BY ?mesh_uri
LIMIT 5
"""

params = {
    "query": query,
    "format": "json",
    "inference": True,
    "limit": 10,
    "offset": 4,
    "year": 2020,
}
api_url = "https://id.nlm.nih.gov/mesh/sparql"
response = requests.get(api_url, params)
print(response.url)
len(response.json()["results"]["bindings"])

The expected result is to receive a single record (the 5th record), because the query should return 5 records, and the offset is 4. Instead, 5 records are returned. The returned records under results.bindings start with:

      {
        "mesh_uri": { "type": "uri" , "value": "http://id.nlm.nih.gov/mesh/2020/D000005" } ,
        "mesh_id": { "type": "literal" , "value": "D000005" } ,
        "mesh_label": { "type": "literal" , "xml:lang": "en" , "value": "Abdomen" }
      } ,

So it looks like offset was respected, but something about the SPARQL LIMIT 5 or API parameter limit=10 does not work.

The text was updated successfully, but these errors were encountered:

dhimmel · 2020-12-08T18:29:10Z

@danizen any ideas here or examples of successfully using the API for multi-page results?

danizen · 2020-12-08T18:37:43Z

When we were in the prototype phase, I argued that SPARQL is fine as a query language, but the whole deal with federated SPARQL and paging is flawed. Virtuoso also may truncate results. I went to bat at work for another team having direct access to the Virtuoso SQL interface (over JDBC) because of this.

So, I have devoted less time to validating this than I should, and will look into it. I have long wanted to automatically add a count of results, but I don't think I have time to do that soon. If I can recreate your issue, I will do it as an automated test and keep the fix DRY.

danizen · 2020-12-08T18:52:12Z

I suspect that there is some string processing that would add the LIMIT and OFFSET to the query via string processing rather than paging from a temporary graph or model. If so, this could be hard to fix. To fix the results you see would be a pretty fundamental change - we would need a query cache, with a query ID based on the sessionID and the query text. A similar cache to the one I propose is described in RFC 1813 and applies to NFS. Virtuoso may also support this itself, and I would have to involve @simonjupp from https://github.com/EBISPOT/lodestar as to a proper solution.

As a shorter term alternative, you can simply drop the LIMIT 5 clause from your query and it should work. Also, by doing a COUNT() aggregation query before your query, you may be able to discover if Virtuoso has truncated the query.

dhimmel · 2020-12-08T19:20:24Z

I suspect that there is some string processing that would add the LIMIT and OFFSET to the query via string processing rather than paging

Ah that makes sense, the limit and offset parameters get appended to the query, such that my earlier example probably became:

LIMIT 5
LIMIT 10
OFFSET 4

It looks like limit is set to 1000 when no limit is specified or a limit > 1000 is specified. The easy solution would be to allow for higher limits, perhaps limiting queries by some other mechanism like execution time if required.

But since the limit is capped at 1000, I tried two methods of determining when pagination has exhausted all results:

Using SPARQL's COUNT function (example), but that returned 1000 results like:

[{'callret-0': {'type': 'literal',
'datatype': 'http://www.w3.org/2001/XMLSchema#integer',
'value': '1'}},
{'callret-0': {'type': 'literal',
'datatype': 'http://www.w3.org/2001/XMLSchema#integer',
'value': '1'}},

Iterative queries that increment offset until len(results) < limit, but it seems that the API always returns 1000 results for the final call rather than the actual number of records beyond the offset.

danizen · 2020-12-08T19:24:08Z

Let me think about this a bit.

danizen · 2020-12-09T00:29:43Z

I see no problems in the code or differences with the EBISPOT upstream, so I will try to duplicate your results. One thing this won't change is whether we return more than 1000 results to a single http request/response.

The implementation of limit and offset do not use string concatenation, but the parsed Apache Jena query.
See uk.ac.ebi.fgpt.lode.impl.JenaSparqlService:207 and uk.ac.ebi.fgpt.lode.impl.JenaSparqlService:256.

danizen · 2020-12-09T00:43:07Z

After a review of the code, an attempt to reproduce, and a re-read of the SPARQL spec., I think this is working as designed. The LIMIT is from the OFFSET, and therefore designed to allow you to page through a large result set:

https://www.w3.org/TR/rdf-sparql-query/#modOffset

There may however be some Virtuoso setting limiting the the actual result set, and I will check on this tomorrow.

dhimmel · 2020-12-10T02:35:47Z

Thanks @danizen for looking into this and narrowing down where things are going wrong.

Here's a versioned link for JenaSparqlService in case that file changes.

dhimmel mentioned this issue Dec 7, 2020

rdfs:subClassOf relationships missing from MeSH RDF #153

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Pagination / offset of results for the SPARQL endpoint is faulty #150

Pagination / offset of results for the SPARQL endpoint is faulty #150

dhimmel commented Dec 3, 2020

dhimmel commented Dec 8, 2020

danizen commented Dec 8, 2020

danizen commented Dec 8, 2020

dhimmel commented Dec 8, 2020 •

edited

Loading

danizen commented Dec 8, 2020

danizen commented Dec 9, 2020

danizen commented Dec 9, 2020 •

edited

Loading

dhimmel commented Dec 10, 2020 •

edited

Loading

Pagination / offset of results for the SPARQL endpoint is faulty #150

Pagination / offset of results for the SPARQL endpoint is faulty #150

Comments

dhimmel commented Dec 3, 2020

dhimmel commented Dec 8, 2020

danizen commented Dec 8, 2020

danizen commented Dec 8, 2020

dhimmel commented Dec 8, 2020 • edited Loading

danizen commented Dec 8, 2020

danizen commented Dec 9, 2020

danizen commented Dec 9, 2020 • edited Loading

dhimmel commented Dec 10, 2020 • edited Loading

dhimmel commented Dec 8, 2020 •

edited

Loading

danizen commented Dec 9, 2020 •

edited

Loading

dhimmel commented Dec 10, 2020 •

edited

Loading