
Caching system #372

Closed
luigi-asprino opened this issue Jun 6, 2023 · 14 comments
Labels: Bug (Something isn't working)

Comments

@luigi-asprino
Member

The cache is wiped at every query execution, as it is stored as a hash map within the execution context. Therefore, the cache is effective only when the same resource is queried multiple times within the same query (e.g. multiple SERVICE clauses having the same resource as location, the same properties, and the same sub-operation). This makes me question the usefulness of the cache.

An alternative could be to decouple the cache from the execution context and to initialise it once at system startup.
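
For illustration, a minimal sketch of what such decoupling could look like: a cache held at process level rather than inside the per-query execution context (class and method names below are hypothetical, not the SPARQL Anything code).

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

import org.apache.jena.sparql.core.DatasetGraph;

public final class ProcessLevelCache {

    // Lives for the whole JVM process instead of a per-query execution context.
    private static final Map<String, DatasetGraph> CACHE = new ConcurrentHashMap<>();

    // 'key' would encode the location and the triplification options;
    // 'triplify' performs the expensive transformation on a cache miss.
    public static DatasetGraph getOrTriplify(String key, Supplier<DatasetGraph> triplify) {
        return CACHE.computeIfAbsent(key, k -> triplify.get());
    }

    private ProcessLevelCache() { }
}
```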

luigi-asprino added the Bug label Jun 6, 2023
@enridaga
Member

enridaga commented Jun 6, 2023

This makes me question the usefulness of the cache.

In the case of nested SERVICE clauses, this is sufficient to avoid re-engineering the same file multiple times: the RDF is kept in memory and the subsequent evaluations are performed on the cache.

The cache is indeed per-execution; this makes it less effective in a server setting (Fuseki runnable).

In the case of the CLI, there is one execution only (except when queries are parametrised).

We need to verify what happens with parametrised queries; in that case, there are multiple executions within the same runtime, and we should check if the cache is brought over or wiped.

@enridaga
Member

enridaga commented Jun 6, 2023

I forgot to mention the PySPARQL-Anything setting, where multiple executions are performed within the same runtime (and probably the same execution context -- but this should be verified).

@luigi-asprino
Member Author

Each query has its own execution context, so I think this makes caching useful only in the case of nested SERVICE clauses.
However, the nested clauses must have the same sub-operation and the same configuration options, which greatly reduces its applicability.

@luigi-asprino
Member Author

My fear is that the system caches (which has a cost) a lot of DatasetGraphs that will be used very rarely.

@enridaga
Copy link
Member

enridaga commented Jun 7, 2023

My fear is that the system caches (which has a cost) a lot of DatasetGraphs that will be used very rarely.

Indeed.

@luigi-asprino
Member Author

We may disable caching by default

@enridaga
Member

enridaga commented Jun 8, 2023

We may disable caching by default

Let's first add the option to disable it so that we can experiment with the effects.

The nested query use case is quite common for me -- I use it to speed up joins between large sources (e.g. two large CSVs); without cache, a large CSV will be re-read and re-triplified from the file system for each query solution of the sub-SERVICE clause.

@luigi-asprino
Member Author

Agreed.

I'm struggling to find a good example of cache usage.

I've created a spreadsheet using the =NOW() formula, which returns the "serial number" (as it is called in the Office documentation) of the datetime at which it is evaluated.
If the triplification of the spreadsheet is cached, then the result of the formula is always the same even if the file is transformed multiple times.
Then, I drafted this query:

PREFIX fx: <http://sparql.xyz/facade-x/ns/>
PREFIX xyz: <http://sparql.xyz/facade-x/data/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT * WHERE {
  SERVICE <x-sparql-anything:spreadsheet.evaluate-formulas=true> {
    fx:properties fx:location "%%%LOCATION%%%" .
    [] rdf:type fx:root ;
       fx:anySlot ?row .
    ?row rdf:_1 ?n .
    ?row rdf:_2 ?now .
    SERVICE <x-sparql-anything:> {
      fx:properties fx:content "[1.0,2.0,3.0]" .
      fx:properties fx:media-type "application/json" .
      ?s fx:anySlot ?n .
    }
  }
}

(%%%LOCATION%%% is substituted with the file path of the spreadsheet at runtime).
However, this query transforms the spreadsheet just once (damn nested queries!).

@luigi-asprino
Member Author

After a discussion with @enridaga, we agreed on rethinking the caching system. In particular, the key used for storing and retrieving cached data must be redesigned. At the moment, the key is the concatenation of the options used for the triplification with a string representation of the operation (e.g. the algebra of the SERVICE clause).

While the cache key must depend on the properties, using the whole operation seems too restrictive. An idea could be to extract and verbalise (turn into strings) the triple patterns within the operation, as they affect the triplification when triple filtering is enabled.
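
A rough sketch of that idea using Jena's algebra API, assuming the key becomes the sorted options plus the verbalised triple patterns (class, method names, and key layout are hypothetical, not the actual SPARQL Anything code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

import org.apache.jena.graph.Triple;
import org.apache.jena.sparql.algebra.Op;
import org.apache.jena.sparql.algebra.OpVisitorBase;
import org.apache.jena.sparql.algebra.OpWalker;
import org.apache.jena.sparql.algebra.op.OpBGP;

public final class CacheKeySketch {

    // Collect the triple patterns of every BGP inside the SERVICE operation.
    static List<Triple> triplePatterns(Op op) {
        List<Triple> patterns = new ArrayList<>();
        OpWalker.walk(op, new OpVisitorBase() {
            @Override
            public void visit(OpBGP opBGP) {
                patterns.addAll(opBGP.getPattern().getList());
            }
        });
        return patterns;
    }

    // Key = sorted triplification options + verbalised triple patterns,
    // instead of the string representation of the whole operation.
    static String key(Map<String, String> options, Op op) {
        StringBuilder sb = new StringBuilder();
        new TreeMap<>(options).forEach((k, v) -> sb.append(k).append('=').append(v).append(';'));
        for (Triple t : triplePatterns(op)) {
            sb.append(t).append(' ');
        }
        return sb.toString();
    }

    private CacheKeySketch() { }
}
```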

enridaga added this to the v0.9.0 milestone Jul 14, 2023
@enridaga
Member

Quick analysis of the general options and their influence on caching:

| Property | Note | Cache Key |
| --- | --- | --- |
| location* | The URL of the data source. | Yes |
| content* | The content to be transformed. | Yes |
| command* | An external command line to be executed. The output is handled according to the option 'media-type'. | Yes |
| from-archive | The filename of the resource to be triplified within an archive. | Yes |
| root | The IRI of the generated root resource. | Yes |
| media-type | The media-type of the data source. | Yes (different formats, different triples) |
| namespace | The namespace prefix for the properties that will be generated. | Yes |
| blank-nodes | It tells SPARQL Anything to generate blank nodes or not. | Yes |
| trim-strings | Trim all string literals. | Yes |
| null-string | Do not produce triples where the specified string would be in the object position of the triple. | Yes |
| http.* | A set of options for customising the HTTP request method, headers, querystring, and others. More details on the HTTP request configuration. | Yes? |
| triplifier | It forces SPARQL Anything to use a specific triplifier for transforming the data source. | Yes? |
| charset | The charset of the data source. | Yes? |
| metadata | It tells SPARQL Anything to extract metadata from the data source and to store it in the named graph with URI http://sparql.xyz/facade-x/data/metadata. More details. | Yes |
| ondisk | It tells SPARQL Anything to use an on-disk graph (instead of the default in-memory graph). The string should be a path to a directory where the on-disk graph will be stored. Using an on-disk graph is almost always slower than the default in-memory graph, but it allows triplifying large files without running out of memory. | I don't know |
| ondisk.reuse | When using an on-disk graph, it tells SPARQL Anything to reuse the previous on-disk graph. | I don't know |
| strategy | The execution strategy. 0 = in memory, all triples; 1 = in memory, only triples matching any of the triple patterns in the WHERE clause. | Yes |
| slice | The resource is sliced and the SPARQL query is executed on each one of the parts. Supported by: CSV (row by row); JSON (arrays are sliced by item, JSON objects require json.path); XML (requires xml.path). | Yes (maybe incompatible with caching?) |
| use-rdfs-member | It tells SPARQL Anything to use the (super)property rdfs:member instead of container membership properties (rdf:_1, rdf:_2, ...). | Yes |

@luigi-asprino
Member Author

So probably all the options should be considered

@luigi-asprino
Member Author

Including the format-specific ones

@enridaga
Member

Including the format-specific ones

I don't know; maybe we look at each of them and decide. I think the main issue at the moment is that BGPs are bringing the outer context in. Also, the cache should remain valid if a BGP that is more restrictive than the cached one is queried... (considering the triple filtering).
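
To make the last point concrete, here is a possible shape of a "more restrictive BGP" check (a simplified illustration, not existing code): a cached graph built with triple filtering could still serve a queried BGP whose patterns are each covered by a more general cached pattern.

```java
import java.util.List;

import org.apache.jena.graph.Node;
import org.apache.jena.graph.Triple;

public final class BgpSubsumptionSketch {

    // Every queried pattern must be covered by at least one cached pattern.
    static boolean covers(List<Triple> cached, List<Triple> queried) {
        return queried.stream().allMatch(q -> cached.stream().anyMatch(c -> covers(c, q)));
    }

    static boolean covers(Triple cached, Triple queried) {
        return covers(cached.getSubject(), queried.getSubject())
                && covers(cached.getPredicate(), queried.getPredicate())
                && covers(cached.getObject(), queried.getObject());
    }

    // An unrestricted cached node (variable or ANY) covers anything;
    // otherwise the nodes must be identical.
    static boolean covers(Node cached, Node queried) {
        return cached.isVariable() || Node.ANY.equals(cached) || cached.equals(queried);
    }

    private BgpSubsumptionSketch() { }
}
```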

enridaga mentioned this issue Nov 6, 2023
luigi-asprino modified the milestones: v0.9.0, v1.0.0 Dec 5, 2023
luigi-asprino added a commit that referenced this issue Sep 2, 2024
Include no-cache option #371
Review caching system #372
Add information whether or not a cached graph was used #149
luigi-asprino added a commit that referenced this issue Sep 2, 2024
luigi-asprino added a commit that referenced this issue Sep 11, 2024
@luigi-asprino
Member Author

The cache is disabled by default, as there is a cost (memory and time) in storing the dataset graph.
The cache is maintained for as long as the process runs.
The key is the string obtained by concatenating the SPARQL algebra translation of the query with the execution properties (either extracted from the query or passed as arguments via the CLI).
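
Illustratively, the key construction could be sketched as follows (the exact concatenation and property handling in SPARQL Anything may differ):

```java
import java.util.Map;
import java.util.TreeMap;

import org.apache.jena.query.Query;
import org.apache.jena.sparql.algebra.Algebra;
import org.apache.jena.sparql.algebra.Op;

public final class AlgebraKeySketch {

    // Translate the query to SPARQL algebra and append the sorted execution properties.
    static String cacheKey(Query query, Map<String, String> properties) {
        Op algebra = Algebra.compile(query);
        StringBuilder sb = new StringBuilder(algebra.toString());
        new TreeMap<>(properties).forEach((k, v) -> sb.append(k).append('=').append(v).append(';'));
        return sb.toString();
    }

    private AlgebraKeySketch() { }
}
```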
