Collecting simple statistics from online public SPARQL endpoints is hampered by their fair usage policies. This restriction obstructs several critical operations, such as aggregate query processing, portal development, and data summarization. Online sampling makes it possible to collect statistics while respecting fair usage policies. However, sampling has not yet been integrated into the SPARQL standard. Although integrating sampling into the SPARQL standard appears beneficial, its effectiveness must first be demonstrated in a practical Semantic Web context. This poster investigates whether online sampling can generate summaries for use in cutting-edge SPARQL federation engines. Our experimental studies indicate that sampling enables the creation and maintenance of summaries by exploring less than 20% of datasets.
Keywords: Semantic Web, SPARQL, Sampling, Summary
Note: Installation instructions have only been tested on Ubuntu 20.04.6 LTS.
- Maven
- Java 11 and 20 (JDK)
- Load the dataset into Apache Jena with the TDB2 XLoader.
- Queries are provided in this repo.
- Install project dependencies
Warning: The project uses Maven Toolchains. Make sure that the locations of Java 11 and Java 20 are defined in ~/.m2/toolchains.xml.
mvn clean install
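The toolchain requirement can be checked before running the build. The following is a minimal sketch, not part of this project: the class `ToolchainsCheck` and the method `declaredJdkVersions` are hypothetical names, and it assumes that the versions are declared literally as `11` and `20` inside `<provides><version>` elements of `jdk` toolchains (a version range or a value such as `11.0.2` would not match).

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Set;
import java.util.TreeSet;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of the project: lists the JDK versions
// declared in a Maven toolchains.xml file.
class ToolchainsCheck {

    /** Returns the <version> values of all toolchains whose <type> is "jdk". */
    static Set<String> declaredJdkVersions(InputStream xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(xml);
            Set<String> versions = new TreeSet<>();
            NodeList toolchains = doc.getElementsByTagName("toolchain");
            for (int i = 0; i < toolchains.getLength(); i++) {
                Element toolchain = (Element) toolchains.item(i);
                NodeList types = toolchain.getElementsByTagName("type");
                boolean isJdk = types.getLength() > 0
                        && "jdk".equals(types.item(0).getTextContent().trim());
                if (!isJdk) {
                    continue; // skip non-JDK toolchains
                }
                NodeList found = toolchain.getElementsByTagName("version");
                if (found.getLength() > 0) {
                    versions.add(found.item(0).getTextContent().trim());
                }
            }
            return versions;
        } catch (Exception e) {
            throw new IllegalStateException("Cannot read toolchains.xml", e);
        }
    }

    public static void main(String[] args) throws Exception {
        Path file = Paths.get(System.getProperty("user.home"), ".m2", "toolchains.xml");
        try (InputStream in = Files.newInputStream(file)) {
            Set<String> declared = declaredJdkVersions(in);
            // Assumption: exact string match on the versions this README asks for.
            for (String required : new String[] {"11", "20"}) {
                String status = declared.contains(required) ? "declared" : "MISSING";
                System.out.println("JDK " + required + ": " + status);
            }
        }
    }
}
```

Saved as `ToolchainsCheck.java`, the sketch can be launched directly with `java ToolchainsCheck.java` on Java 11 or later.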
- The ground truth summary size is 5800 for FedShop200 and 6070 for LargeRDFBench.
- Run sampling on SPO and write the result to a file (each result line has the format "number of random walks-corresponding summary size").
- WARNING: This may converge, but it will take a very long time!
java OnlineSummary --dataset pathToTDB2dataset --create_summary pathToNewSummary --GT groundtruth --spo --sampling > result.txt
- This is equivalent to executing the query that builds the summary, so it may also take a very long time to finish!
java OnlineSummary --dataset pathToTDB2dataset --query pathToQuery --create_summary pathToNewSummary --wa
- This is the recommended mode. You can change the ground truth value to set the size of the summary that you want to obtain.
java OnlineSummary --dataset pathToTDB2dataset --query pathToQuery --create_summary pathToNewSummary --GT desiredsummarysize --wa --sampling
When the sampling flag is used, the number of random walks drawn and the corresponding summary size are printed, so you can redirect them to a file as described for SPO sampling.
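Once the output has been redirected to a file such as result.txt, it can be post-processed to find how many random walks were needed to reach a given summary size. The following is a minimal sketch, not part of this project: the class `SamplingResult` and the method `walksToReach` are hypothetical names, and it assumes each relevant line looks like `walks-size` with two non-negative integers (for example `900-5800`); any other line is ignored.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.OptionalLong;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the project: reads lines of the form
// "<number of random walks>-<summary size>" and reports when a target is reached.
class SamplingResult {

    // Assumed line format: two non-negative integers separated by a dash.
    private static final Pattern LINE = Pattern.compile("^\\s*(\\d+)\\s*-\\s*(\\d+)\\s*$");

    /** Smallest number of walks whose summary size is at least target, if any. */
    static OptionalLong walksToReach(List<String> lines, long target) {
        long best = Long.MAX_VALUE;
        boolean found = false;
        for (String line : lines) {
            Matcher m = LINE.matcher(line);
            if (!m.matches()) {
                continue; // ignore anything that is not a "walks-size" pair
            }
            long walks = Long.parseLong(m.group(1));
            long size = Long.parseLong(m.group(2));
            if (size >= target && walks < best) {
                best = walks;
                found = true;
            }
        }
        return found ? OptionalLong.of(best) : OptionalLong.empty();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java SamplingResult.java <result file> <target summary size>");
            return;
        }
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        long target = Long.parseLong(args[1]);
        OptionalLong walks = walksToReach(lines, target);
        if (walks.isPresent()) {
            System.out.println("Target " + target + " reached after " + walks.getAsLong() + " walks");
        } else {
            System.out.println("Target " + target + " not reached in this run");
        }
    }
}
```

For instance, `java SamplingResult.java result.txt 5800` would report the number of walks at which the FedShop200 ground truth size stated above was first reached, provided the run got that far.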
[1] Julien Aimonier-Davat, Minh-Hoang Dang, Pascal Molli, Brice Nédelec, and Hala Skaf-Molli. 2023. RAW-JENA: Approximate Query Processing for SPARQL Endpoints.
[2] Julien Aimonier-Davat, Minh-Hoang Dang, Pascal Molli, Brice Nédelec, and Hala Skaf-Molli. 2024. FedUP: Querying Large-Scale Federations of SPARQL Endpoints.