
Caching system #372

Closed
luigi-asprino opened this issue Jun 6, 2023 · 14 comments
Labels: Bug (Something isn't working)

Comments

@luigi-asprino
Member

The cache is wiped at every query execution, as it is stored as a hash map within the execution context. Therefore, the cache is effective only when the same resource is queried multiple times within the same query (e.g. multiple SERVICE clauses having the same resource as location, the same properties, and the same sub-operation). This makes me question the usefulness of the cache.

An alternative could be to decouple the cache from the execution context and to initialise it once at system startup.
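
For illustration, a minimal sketch of what such decoupling could look like: a cache held at process level rather than inside the per-query execution context (class and method names below are hypothetical, not the SPARQL Anything code).

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

import org.apache.jena.sparql.core.DatasetGraph;

public final class ProcessLevelCache {

    // Lives for the whole JVM process instead of a per-query execution context.
    private static final Map<String, DatasetGraph> CACHE = new ConcurrentHashMap<>();

    // 'key' would encode the location and the triplification options;
    // 'triplify' performs the expensive transformation on a cache miss.
    public static DatasetGraph getOrTriplify(String key, Supplier<DatasetGraph> triplify) {
        return CACHE.computeIfAbsent(key, k -> triplify.get());
    }

    private ProcessLevelCache() { }
}
```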

luigi-asprino added the Bug label Jun 6, 2023
@enridaga
Member

enridaga commented Jun 6, 2023

This makes me question the usefulness of the cache.

In the case of nested SERVICE clauses, this is sufficient to avoid re-engineering the same file multiple times: the RDF is kept in memory and the subsequent evaluations are performed on the cache.

The cache is indeed per-execution; this makes it less effective in a server setting (Fuseki runnable).

In the case of the CLI, there is one execution only (except when queries are parametrised).

We need to verify what happens with parametrised queries; in that case, there are multiple executions within the same runtime, and we should check if the cache is brought over or wiped.

@enridaga
Member

enridaga commented Jun 6, 2023

I forgot to mention the PySPARQL-Anything setting, where multiple executions are performed within the same runtime (and probably the same execution context -- but this should be verified).

@luigi-asprino
Member Author

Each query has its own execution context, so I think this makes caching useful only in the case of nested SERVICE clauses.
However, the nested clauses must have the same sub-operation and the same configuration options, which greatly reduces its applicability.

@luigi-asprino
Member Author

My fear is that the system caches (which has a cost) a lot of DatasetGraphs that will be used very rarely.

@enridaga
Copy link
Member

enridaga commented Jun 7, 2023

My fear is that the system caches (which has a cost) a lot of DatasetGraphs that will be used very rarely.

Indeed.

@luigi-asprino
Member Author

We may disable caching by default

@enridaga
Member

enridaga commented Jun 8, 2023

We may disable caching by default

Let's first add the option to disable it so that we can experiment with the effects.

The nested query use case is quite common for me -- I use it to speed up joins between large sources (e.g. two large CSVs); without cache, a large CSV will be re-read and re-triplified from the file system for each query solution of the sub-SERVICE clause.

@luigi-asprino
Member Author

Agreed.

I'm struggling to find a good example of cache usage.

I've created a spreadsheet using the =NOW() formula, which returns the "serial number" (as it is called in the Office documentation) of the datetime at which it is evaluated.
If the triplification of the spreadsheet is cached, then the result of the formula is always the same even if the file is transformed multiple times.
Then, I drafted this query:

PREFIX fx: <http://sparql.xyz/facade-x/ns/>
PREFIX xyz: <http://sparql.xyz/facade-x/data/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT * WHERE {
  SERVICE <x-sparql-anything:spreadsheet.evaluate-formulas=true> {
    fx:properties fx:location "%%%LOCATION%%%" .
    [] rdf:type fx:root ;
       fx:anySlot ?row .
    ?row rdf:_1 ?n .
    ?row rdf:_2 ?now .
    SERVICE <x-sparql-anything:> {
      fx:properties fx:content "[1.0,2.0,3.0]" .
      fx:properties fx:media-type "application/json" .
      ?s fx:anySlot ?n .
    }
  }
}

(%%%LOCATION%%% is substituted with the file path of the spreadsheet at runtime).
However, this query transforms the spreadsheet just once (damn nested queries!).

@luigi-asprino
Member Author

After a discussion with @enridaga, we agreed on rethinking the caching system. In particular, the key used for storing and retrieving cached data must be redesigned. At the moment, the key is the concatenation of the options used for the triplification with a string representation of the operation (e.g. the algebra of the SERVICE clause).

While the cache key must depend on the properties, using the whole operation seems too restrictive. An idea could be to extract and verbalise (turn into strings) the triple patterns within the operation, as they affect the triplification when triple filtering is enabled.
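
A rough sketch of that idea using Jena's algebra API, assuming the key becomes the sorted options plus the verbalised triple patterns (class, method names, and key layout are hypothetical, not the actual SPARQL Anything code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

import org.apache.jena.graph.Triple;
import org.apache.jena.sparql.algebra.Op;
import org.apache.jena.sparql.algebra.OpVisitorBase;
import org.apache.jena.sparql.algebra.OpWalker;
import org.apache.jena.sparql.algebra.op.OpBGP;

public final class CacheKeySketch {

    // Collect the triple patterns of every BGP inside the SERVICE operation.
    static List<Triple> triplePatterns(Op op) {
        List<Triple> patterns = new ArrayList<>();
        OpWalker.walk(op, new OpVisitorBase() {
            @Override
            public void visit(OpBGP opBGP) {
                patterns.addAll(opBGP.getPattern().getList());
            }
        });
        return patterns;
    }

    // Key = sorted triplification options + verbalised triple patterns,
    // instead of the string representation of the whole operation.
    static String key(Map<String, String> options, Op op) {
        StringBuilder sb = new StringBuilder();
        new TreeMap<>(options).forEach((k, v) -> sb.append(k).append('=').append(v).append(';'));
        for (Triple t : triplePatterns(op)) {
            sb.append(t).append(' ');
        }
        return sb.toString();
    }

    private CacheKeySketch() { }
}
```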

enridaga added this to the v0.9.0 milestone Jul 14, 2023
@enridaga
Member

Quick analysis of the general options and their influence on caching:

| Property | Note | Cache Key |
| --- | --- | --- |
| location* | The URL of the data source. | Yes |
| content* | The content to be transformed. | Yes |
| command* | An external command line to be executed. The output is handled according to the option 'media-type'. | Yes |
| from-archive | The filename of the resource to be triplified within an archive. | Yes |
| root | The IRI of the generated root resource. | Yes |
| media-type | The media-type of the data source. | Yes (different formats, different triples) |
| namespace | The namespace prefix for the properties that will be generated. | Yes |
| blank-nodes | It tells SPARQL Anything to generate blank nodes or not. | Yes |
| trim-strings | Trim all string literals. | Yes |
| null-string | Do not produce triples where the specified string would be in the object position of the triple. | Yes |
| http.* | A set of options for customising the HTTP request method, headers, querystring, and others. More details on the HTTP request configuration. | Yes? |
| triplifier | It forces SPARQL Anything to use a specific triplifier for transforming the data source. | Yes? |
| charset | The charset of the data source. | Yes? |
| metadata | It tells SPARQL Anything to extract metadata from the data source and to store it in the named graph with URI http://sparql.xyz/facade-x/data/metadata. More details. | Yes |
| ondisk | It tells SPARQL Anything to use an on-disk graph (instead of the default in-memory graph). The string should be a path to a directory where the on-disk graph will be stored. Using an on-disk graph is almost always slower than the default in-memory graph, but it allows triplifying large files without running out of memory. | I don't know |
| ondisk.reuse | When using an on-disk graph, it tells SPARQL Anything to reuse the previous on-disk graph. | I don't know |
| strategy | The execution strategy. 0 = in memory, all triples; 1 = in memory, only triples matching any of the triple patterns in the WHERE clause. | Yes |
| slice | The resource is sliced and the SPARQL query is executed on each one of the parts. Supported by: CSV (row by row); JSON (arrays are sliced by item, JSON objects require json.path); XML (requires xml.path). | Yes (maybe incompatible with caching?) |
| use-rdfs-member | It tells SPARQL Anything to use the (super)property rdfs:member instead of container membership properties (rdf:_1, rdf:_2, ...). | Yes |

@luigi-asprino
Member Author

So probably all the options should be considered

@luigi-asprino
Member Author

Including the format-specific ones

@enridaga
Member

Including the format-specific ones

I don't know; maybe we look at each of them and decide. I think the main issue at the moment is that BGPs are bringing the outer context in. Also, the cache should remain valid if a BGP that is more restrictive than the cached one is queried... (considering the triple filtering).
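
To make the last point concrete, here is a possible shape of a "more restrictive BGP" check (a simplified illustration, not existing code): a cached graph built with triple filtering could still serve a queried BGP whose patterns are each covered by a more general cached pattern.

```java
import java.util.List;

import org.apache.jena.graph.Node;
import org.apache.jena.graph.Triple;

public final class BgpSubsumptionSketch {

    // Every queried pattern must be covered by at least one cached pattern.
    static boolean covers(List<Triple> cached, List<Triple> queried) {
        return queried.stream().allMatch(q -> cached.stream().anyMatch(c -> covers(c, q)));
    }

    static boolean covers(Triple cached, Triple queried) {
        return covers(cached.getSubject(), queried.getSubject())
                && covers(cached.getPredicate(), queried.getPredicate())
                && covers(cached.getObject(), queried.getObject());
    }

    // An unrestricted cached node (variable or ANY) covers anything;
    // otherwise the nodes must be identical.
    static boolean covers(Node cached, Node queried) {
        return cached.isVariable() || Node.ANY.equals(cached) || cached.equals(queried);
    }

    private BgpSubsumptionSketch() { }
}
```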

enridaga mentioned this issue Nov 6, 2023
luigi-asprino modified the milestones: v0.9.0, v1.0.0 Dec 5, 2023
luigi-asprino added a commit that referenced this issue Sep 2, 2024
Include no-cache option #371
Review caching system #372
Add information whether or not a cached graph was used #149
luigi-asprino added a commit that referenced this issue Sep 2, 2024
luigi-asprino added a commit that referenced this issue Sep 11, 2024
@luigi-asprino
Member Author

The cache is disabled by default, as there is a cost (memory and time) in storing the dataset graph.
The cache is maintained for as long as the process runs.
The key is the string obtained by concatenating the SPARQL algebra translation of the query with the execution properties (either extracted from the query or passed as arguments via the CLI).
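
Illustratively, the key construction could be sketched as follows (the exact concatenation and property handling in SPARQL Anything may differ):

```java
import java.util.Map;
import java.util.TreeMap;

import org.apache.jena.query.Query;
import org.apache.jena.sparql.algebra.Algebra;
import org.apache.jena.sparql.algebra.Op;

public final class AlgebraKeySketch {

    // Translate the query to SPARQL algebra and append the sorted execution properties.
    static String cacheKey(Query query, Map<String, String> properties) {
        Op algebra = Algebra.compile(query);
        StringBuilder sb = new StringBuilder(algebra.toString());
        new TreeMap<>(properties).forEach((k, v) -> sb.append(k).append('=').append(v).append(';'));
        return sb.toString();
    }

    private AlgebraKeySketch() { }
}
```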
