Caching system #372
Comments
In the case of nested service clauses, this is sufficient to avoid multiple re-engineerings of the same file: the RDF is kept in memory, and subsequent queries are answered from the cache. The cache is per-execution; this makes it less useful in a server setting (the Fuseki runnable). In the CLI case there is a single execution (except when queries are parametrised). We need to verify what happens with parametrised queries: in that case there are multiple executions within the same runtime, and we should check whether the cache is carried over or wiped.
I forgot to mention the PySPARQL-Anything setting, where multiple executions are performed within the same runtime (and probably the same execution context, but this should be verified).
Each query has its own execution context, so I think this makes caching useful only in the case of nested service clauses.
My fear is that the system caches a lot of DatasetGraphs (which has a memory cost) that will be used very rarely.
Indeed.
We may disable caching by default.
Let's first add the option to disable it so that we can experiment with the effects. The nested query use case is quite common for me: I use it to speed up joins between large sources (e.g. two large CSVs); without the cache, a large CSV will be re-read and re-triplified from the file system for each query solution of the sub-SERVICE clause.
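A minimal sketch of the nested-SERVICE join described above (file names and CSV headers are illustrative, not from this issue):

```sparql
PREFIX xyz: <http://sparql.xyz/facade-x/data/>

SELECT ?name ?amount
WHERE {
  SERVICE <x-sparql-anything:location=orders.csv,csv.headers=true> {
    ?order xyz:customer ?cid ; xyz:amount ?amount .
    # Without caching, customers.csv is re-read and re-triplified
    # for every solution of the outer pattern; with caching, its
    # DatasetGraph is built once and reused.
    SERVICE <x-sparql-anything:location=customers.csv,csv.headers=true> {
      ?cust xyz:id ?cid ; xyz:name ?name .
    }
  }
}
```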
Agreed. I'm struggling to find a good example of cache usage. I've created a spreadsheet using
After a discussion with @enridaga, we agreed on rethinking the caching system. In particular, the key used for storing and retrieving cached data must be redesigned. At the moment, the key is the concatenation of the options used for the triplification with a string representation of the operation (e.g. the algebra of the service clause). While the cache key must depend on the options, using the whole operation seems too restrictive. An idea could be extracting and verbalising (turning into strings) the triple patterns within the operation, as they affect the triplification when triple filtering is enabled.
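A minimal sketch of the proposed key, assuming options arrive as a dict and triple patterns as term tuples (both hypothetical representations, not the actual Java API):

```python
import hashlib

def cache_key(options: dict, triple_patterns: list) -> str:
    """Hypothetical cache key: triplification options plus the
    verbalised triple patterns of the SERVICE clause, instead of
    the full string representation of the algebra."""
    opts = ";".join(f"{k}={options[k]}" for k in sorted(options))
    # Sort the verbalised patterns so that a purely syntactic
    # reordering of the BGP does not miss the cache.
    tps = ";".join(sorted(" ".join(map(str, tp)) for tp in triple_patterns))
    return hashlib.sha256(f"{opts}|{tps}".encode()).hexdigest()
```

With this key, two SERVICE clauses over the same source, same options, and the same (possibly reordered) triple patterns hit the same cache entry.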
Quick analysis of the general options and their influence on caching:
So probably all the options should be considered |
Including the format-specific ones
I don't know; maybe we look at each of them and decide. I think the main issue at the moment is that BGPs are bringing the outer context in. Also, the cache should be valid if a BGP that is more restrictive than the cached one is queried (considering triple filtering).
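The "more restrictive BGP" check could be sketched as a simple subsumption test; this is an illustration of the idea only, with variables encoded as `?`-prefixed strings rather than the actual Jena node types:

```python
def covers(cached_tp, query_tp):
    """A cached pattern covers a query pattern if every concrete
    term of the cached pattern equals the query's term; variables
    (strings starting with '?') match anything."""
    return all(c.startswith("?") or c == q for c, q in zip(cached_tp, query_tp))

def cache_valid(cached_bgp, query_bgp):
    """The cached, triple-filtered dataset is reusable when every
    pattern of the query BGP is covered by some cached pattern,
    i.e. the query is at least as restrictive as the cached one."""
    return all(any(covers(c, q) for c in cached_bgp) for q in query_bgp)
```

Under triple filtering, a dataset cached for `?s xyz:name ?o` would then also serve a query asking only for `?s xyz:name "Alice"`, but not one asking for a property that was filtered out.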
The cache is disabled by default as there is a cost (memory and time) in storing the dataset graph. |
The cache is wiped at every query execution, as it is stored as a hash map within the execution context. Therefore, the cache is effective only when the same resource is queried multiple times within the same query (e.g. multiple service clauses with the same resource as location, the same properties, and the same sub-operation). This makes me question the usefulness of the cache.
An alternative could be decoupling the caching from the execution context, and initialising the cache once at the startup of the system.
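A minimal sketch of that alternative, assuming a process-wide cache created once at startup (the class, its size bound, and the FIFO eviction are all illustrative choices, not the project's design):

```python
from threading import Lock

class DatasetGraphCache:
    """Hypothetical process-wide cache, initialised once at system
    startup and shared across query executions, instead of a
    per-execution hash map inside the execution context."""

    def __init__(self, max_entries: int = 16):
        self._store = {}
        self._lock = Lock()
        self._max = max_entries

    def get_or_load(self, key, loader):
        with self._lock:
            if key not in self._store:
                if len(self._store) >= self._max:
                    # Bound the memory cost: evict the oldest entry (FIFO).
                    self._store.pop(next(iter(self._store)))
                self._store[key] = loader()  # expensive triplification
            return self._store[key]

# Shared instance, created once at startup rather than per execution.
CACHE = DatasetGraphCache()
```

This would make the cache survive across executions (CLI with parametrised queries, Fuseki, PySPARQL-Anything), at the price of needing an explicit eviction policy and thread safety.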