hdt::QueryProcessor.searchJoin() gives incorrect results #265
… BasicVarBindingString class to its own pair of implementation files to improve readability.
Strange indeed. Are you following up on this?
I have read through some of the code, but I am still investigating the cause. My next step is to trace through the execution of the test case and see if I can find where the logic breaks. I have a sandbox set up with the Python and C++ repositories working together. So far, I have only run clang-format on the relevant C++ classes and moved BasicVarBindingString to its own .hpp/.cpp files. Based on the comments in the code, it looks like hdt::QueryProcessor.searchJoin() was a work in progress and was never fully implemented. @mielvds - if you or anyone else knows the history of QueryProcessor.searchJoin(), please let me know.
Only @MarioAriasGa and, if you're lucky, @LaurensRietveld might know more.
Around https://github.com/rdfhdt/hdt-cpp/blob/develop/libhdt/src/sparql/QueryProcessor.cpp#L90, I suspect the triplePatID variable is being assigned "0" in two cases that should be distinct. A "0" is used when a subject, predicate, or object is a variable and will therefore match anything. However, a "0" is also used when the string does not match anything in the dictionary. Thus, strings that have no match are effectively treated as variables that match anything.
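To make the suspected ambiguity concrete, here is a minimal sketch. It is not the hdt-cpp code: `Dict`, `lookupId`, `toIdAmbiguous`, and `toIdChecked` are hypothetical names, and the `std::map` only stands in for the real dictionary. It assumes, as described above, that a failed lookup yields 0 and that 0 is also the wildcard ID.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <optional>
#include <string>

// Hypothetical stand-in for the HDT dictionary: term string -> ID (IDs start at 1).
using Dict = std::map<std::string, std::size_t>;

// Assumed lookup behaviour: 0 is returned when the term is not in the dictionary.
std::size_t lookupId(const Dict& dict, const std::string& term) {
    auto it = dict.find(term);
    return it == dict.end() ? 0 : it->second;
}

// Ambiguous conversion: 0 means both "variable" and "term not found".
std::size_t toIdAmbiguous(const Dict& dict, const std::string& term) {
    if (!term.empty() && term[0] == '?') return 0;  // variable -> wildcard
    return lookupId(dict, term);                     // unknown term -> also 0
}

// Disambiguated conversion: std::nullopt means "this pattern can never match",
// while 0 is reserved for genuine variables.
std::optional<std::size_t> toIdChecked(const Dict& dict, const std::string& term) {
    if (!term.empty() && term[0] == '?') return 0;
    std::size_t id = lookupId(dict, term);
    if (id == 0) return std::nullopt;
    return id;
}

// Small demonstration of the difference between the two conversions.
void demo() {
    Dict dict{{"http://example.org/a", 1}};
    // Both a variable and an unknown IRI collapse to the same ID here.
    assert(toIdAmbiguous(dict, "?x") == toIdAmbiguous(dict, "http://example.org/missing"));
    // The checked version keeps them apart.
    assert(!toIdChecked(dict, "http://example.org/missing").has_value());
}
```

With a conversion along the lines of `toIdChecked`, a caller could return an empty result as soon as any constant in a pattern is missing from the dictionary, instead of running the pattern with an unintended wildcard.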
TBH, I didn't even know that there was a (partial) SPARQL implementation in HDT-CPP. My guess is that it is used nowhere and was probably never finished. In the Java version, the query processing is offloaded to Jena; maybe something similar is possible with oxigraph or even rdflib.
It looks like hdt::QueryProcessor is limited to processing Basic Graph Patterns (BGP), so it is still short of a SPARQL implementation. I understand that one branch of code queries triples via a single BGP at a time. The RDFLib/rdflib-hdt library uses that approach by default. The hdt::QueryProcessor appears to extend that capability to add efficiencies for the case of multiple BGPs at once. This is critical for performance and for leveraging the Dictionary (index). The rdflib-hdt library has an optimize_sparql() function that causes it to use the QueryProcessor for multiple BGPs instead of querying one BGP at a time and then aggregating the results in the rdflib SPARQL engine.

I suspect that any pluggable SPARQL engine sitting on top of HDT (e.g., Apache Jena ARQ, Python rdflib, etc.) can interface with the HDT function for querying a single BGP at a time. But whenever that approach is taken, performance is left on the table, because the Dictionary may remain underutilized for specific optimizations. An HDT function (hdt::QueryProcessor.searchJoin) that accepts a set of BGPs and gives an efficient response does seem like an essential interface underneath any SPARQL engine.

It would be interesting to compare how the HDT Java version handles this. If anyone familiar with that codebase could confirm my assumptions about how things work and point me to the relevant Java code for comparison, that would be very helpful.
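As a rough illustration of what joining patterns at the ID level could look like, here is a toy sketch. It is not how hdt::QueryProcessor is implemented: `IdTriple`, `search`, and `joinOnSubject` are hypothetical names, the triples live in a plain `std::vector`, and the only assumption carried over from the discussion is that 0 acts as a wildcard. It answers the two-pattern query `?x p1 o1 . ?x p2 o2` by feeding each binding of `?x` from the first pattern into the second.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy ID-level triple; in a pattern, 0 is treated as "matches anything".
struct IdTriple {
    std::size_t s, p, o;
};

// True when triple t satisfies pattern pat.
bool matches(const IdTriple& t, const IdTriple& pat) {
    return (pat.s == 0 || pat.s == t.s) &&
           (pat.p == 0 || pat.p == t.p) &&
           (pat.o == 0 || pat.o == t.o);
}

// Single-pattern search: a linear scan standing in for an indexed lookup.
std::vector<IdTriple> search(const std::vector<IdTriple>& data, const IdTriple& pat) {
    std::vector<IdTriple> out;
    for (const auto& t : data) {
        if (matches(t, pat)) out.push_back(t);
    }
    return out;
}

// Subjects ?x that satisfy both (?x p1 o1) and (?x p2 o2).
// Each binding from the first pattern is substituted into the second,
// so the second search is always restricted to a single subject ID.
std::vector<std::size_t> joinOnSubject(const std::vector<IdTriple>& data,
                                       std::size_t p1, std::size_t o1,
                                       std::size_t p2, std::size_t o2) {
    std::vector<std::size_t> subjects;
    for (const auto& t : search(data, {0, p1, o1})) {
        if (!search(data, {t.s, p2, o2}).empty()) subjects.push_back(t.s);
    }
    return subjects;
}

// Small demonstration on four triples.
void demo() {
    std::vector<IdTriple> data{{1, 10, 100}, {1, 11, 101}, {2, 10, 100}, {3, 11, 101}};
    std::vector<std::size_t> result = joinOnSubject(data, 10, 100, 11, 101);
    assert(result.size() == 1 && result[0] == 1);  // only subject 1 has both
}
```

Everything here stays in ID space, so no strings are decoded until the final bindings are known. An engine that calls the single-pattern interface and joins the results itself would typically have to materialize both result sets as strings first, which is the overhead the comment above is concerned about.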
Filing this issue for the hdt-cpp work on RDFLib/rdflib-hdt#14. See that issue for the test case and additional details.