Loops in triggers #94
Comments
Patch proposed in https://github.com/bbcarchdev/spindle/tree/SPINDLE-94 |
Stage |
The main issue here is that the named graph in the sample is set to "http://example.org/works/as_you_like_it#id". If the named graph in the sample is changed to "http://example.org/works/as_you_like_it", there is no issue. The triple "http://example.org/works/as_you_like_it http://xmlns.com/foaf/0.1/primaryTopic http://example.org/works/as_you_like_it#id" creates a membership trigger (because http://example.org/works/as_you_like_it#id partOf http://example.org/works/as_you_like_it). Both "http://example.org/works/as_you_like_it" and "http://example.org/works/as_you_like_it#id" also get triggers from the named graph. If that named graph is "http://example.org/works/as_you_like_it#id", we also get "http://example.org/works/as_you_like_it" partOf "http://example.org/works/as_you_like_it#id", which closes a loop. |
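To make the loop concrete, here is a minimal sketch in Python of the two trigger sources described above. It is not Spindle's code: the function names, the tuple representation of quads, and the choice to drop self-referencing edges are all assumptions made for illustration.

```python
FOAF_PRIMARY_TOPIC = "http://xmlns.com/foaf/0.1/primaryTopic"
DOC = "http://example.org/works/as_you_like_it"
TOPIC = "http://example.org/works/as_you_like_it#id"


def part_of_edges(quads):
    """Return (member, container) pairs implied by a list of (s, p, o, g) quads.

    Hypothetical model of the behaviour described in this thread:
    - a foaf:primaryTopic triple makes the topic partOf the document;
    - the subject and object are each partOf the named graph.
    Self-edges are dropped (an assumption, not confirmed by the source).
    """
    edges = set()
    for s, p, o, g in quads:
        if p == FOAF_PRIMARY_TOPIC and s != o:
            edges.add((o, s))
        for resource in (s, o):
            if resource != g:
                edges.add((resource, g))
    return edges


def has_cycle(edges):
    """Depth-first search for a directed cycle in the partOf edges."""
    graph = {}
    for member, container in edges:
        graph.setdefault(member, set()).add(container)
    visiting, done = set(), set()

    def visit(node):
        if node in done:
            return False
        if node in visiting:
            return True  # reached a node already on the current path
        visiting.add(node)
        found = any(visit(nxt) for nxt in graph.get(node, ()))
        visiting.discard(node)
        done.add(node)
        return found

    return any(visit(node) for node in list(graph))


# Named graph with the fragment: the two partOf edges point at each other.
print(has_cycle(part_of_edges([(DOC, FOAF_PRIMARY_TOPIC, TOPIC, TOPIC)])))
# Named graph without the fragment: only TOPIC partOf DOC remains.
print(has_cycle(part_of_edges([(DOC, FOAF_PRIMARY_TOPIC, TOPIC, DOC)])))
```

Under these assumptions the first call reports a cycle and the second does not, which matches the behaviour reported for the two versions of the sample.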
Rather than preventing the creation of loops we can implement a system to make the processing stop at some point. Capturing some mail discussion:
|
… as part of local resultset filtering (#94)
The issue is explained at length earlier in this thread. The problem is that every resource becomes a part of the information resource describing it and, in the test sample, that information resource is actually the non-information one, which creates a loop. Here is another screenshot of what I currently have in the triggers. See in particular the lines ee4 -> 8c6 and 8c6 -> ee4. |
Ah, yes. Okay, two moderately quick fixes needed: one is to apply the fix from df10c5a to membership, not to triggers. The other is the named graph fragment fix (which will need a bit of care and attention to the database to resolve). Once that's done, a proper implementation of refcounting circular dependency tracking can be implemented. |
Is there a document or specification anywhere about how the "refcounting circular dependency tracking" will work? |
It's described above |
I did an ingest of some Shakespeare data from Richard Light last week, which was OK. But I think this is because I created the nquads myself, and did so in such a way that I stripped /#.+$/ from resource URIs to get the named graph URIs. This prevents the situation Christophe describes, as information resources (without a #) end up in the same graph as non-information resources (with #id), instead of in a circular relationship. I've since tried to ingest some Shakespeare data from the dataset-sampler project, which comes from shakespeare.acropolis.org.uk. This data does appear to have triples in a relationship which causes the loop, as described by Christophe. The attached file is a minimal case which demonstrates the problem (attempting to import these two triples never terminates): loop_creator.nq.txt |
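The fragment-stripping step mentioned above could look roughly like this in Python. This is a sketch only: the helper names are hypothetical, it handles URI objects only (no literals or blank nodes), and deriving the graph from the subject is an assumption about how the quads were built.

```python
import re


def graph_uri_for(resource_uri: str) -> str:
    """Strip a trailing '#fragment' (the /#.+$/ pattern) to get the graph URI."""
    return re.sub(r"#.+$", "", resource_uri)


def to_nquad(subject: str, predicate: str, obj: str) -> str:
    """Serialise a URI-only triple as an N-Quads line.

    The named graph is derived from the subject with its fragment removed,
    so a document and its '#id' resource land in the same graph.
    """
    return f"<{subject}> <{predicate}> <{obj}> <{graph_uri_for(subject)}> ."


print(to_nquad(
    "http://example.org/works/as_you_like_it",
    "http://xmlns.com/foaf/0.1/primaryTopic",
    "http://example.org/works/as_you_like_it#id",
))
```

Note that the pattern requires at least one character after the `#`, so a URI ending in a bare `#` is left unchanged.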
Oh, yes, I was hoping for more of a formal / pseudocode kind of specification + some examples of how it would work. The description above is a bit vague and difficult to visualise without a worked example showing what happens when and how it prevents loops. |
It might be difficult to visualise, but the vagueness is troubling me, can you elaborate? |
"when processing begins, check the reference count in the state table" - doesn't specify which reference count (assumption is that it's the URI you're about to create triggers from?) "skip processing and further triggering when the count reaches our threshold" - what happens if the threshold is 4 and we need to trigger 5 things from this URI? Is the count evaluated on every potential trigger or just at the start of processing a URI? "which may be slightly more than the number of triggers which are actually activated" - if the reference count goes up by 5 but only 3 triggers are activated, how do the extra 2 get accounted for? "when processing is completed (whether skipped or not)" - how can you complete skipped processing? "decrement the count of each inbound trigger's originating proxy by one" - if X triggers A, B, C, then presumably processing A, B, C, will lead to a decrement on X because A->X, B->X, C->X, etc? Or is this suggesting that X triggering A leads to a decrement on A? A worked example would help to clarify these a lot. |
@rjpwork - happy to clarify well, the processing cycle in as this is in the context of generate a proxy, "skip processing" expands to "skip the contents of "which may be slightly more than…” —the only reason you'd skip applying a trigger is because it's already been triggered "when processing is completed (whether skipped or not)" means even if you skip the bulk of because triggers are always directional (i.e., proxy X triggers updates on A, B, and C), inbound trigger implies the reverse; thus at the end of processing of each of A, B and C would be decremented. |
With respect to this test example, it turns out it did not come from the Shakespeare data dump but was generated, with errors, by another script. So the Shakespeare data is actually fine and there is nothing to fix there... |
@nevali Ta. Just a couple more. "thus at the end of processing of each of A, B and C would be decremented" - presumably this is decrementing X at the end of processing each of A, B, C. What happens if we increment X by 4 (because we'd activate 4 triggers) but only 3 triggers are actually activated? Can this happen? Does this algorithm work in isolation or does it need to include things like "if we've already got A:X as DIRTY in the state table, ignore another A:X trigger"? Or even if A:X is COMPLETE? Or both? Apologies for the questions but without a state diagram or worked example, it's difficult to visualise how it works and how it prevents loops. |
@rjpwork yes, sorry, decrementing X. mismatched increments and decrements shouldn't happen; the exception is if we decide a trigger is actually invalid (e.g., it's an item triggering itself), at which point we should just ignore it. This particular algorithm should work in isolation — it certainly ought not to be status-dependent. Don't apologise for the questions — although tracing the flow on paper to aid visualisation might be a good way to quickly get up to speed with what's going on generally! |
@nevali Is it possible that processing proxy A can queue up a trigger (or set of triggers) which then starts processing the next proxy B before the full set is read from proxy A? ie. the reference count can be "partial"? |
Right, I'm going to work through this graph and see if I can follow this algorithm.
|
There is still a loop when ingesting certain types of data, e.g. where there is a non-information resource like:
and an associated information resource like:
I've attached a set of n-quads in this kind of relationship, which causes a loop to occur. Ingesting this with twine on the command-line should demonstrate the issue. |
In those quads having |
@townxelliot okay, there's a GIGO issue in that case which we should prevent from cascading (that actual triple is nonsense and we should arguably special-case triples matching that general pattern and ignore them). |
For the web-semantically challenged amongst us (ie me), why is that triple nonsense? (Admittedly, this ticket isn't the best place for this discussion but since it came up here...) |
because it's back to front; it states that something that can only be a non-information resource (i.e., a conceptual thing of some kind) has as its primary topic something which we know must be an information resource — the latter is in principle valid in the general case, the former isn't. It's saying "my chair is about this invoice" rather than "this invoice is about my chair". |
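As a rough illustration of the "back to front" pattern, the check could be sketched in Python as below. It assumes the convention argued in this thread (an HTTP(S) URI with a fragment denotes a non-information resource, so it should not be the subject of foaf:primaryTopic); the function name is hypothetical and this is not taken from Spindle or Twine.

```python
from urllib.parse import urldefrag

FOAF_PRIMARY_TOPIC = "http://xmlns.com/foaf/0.1/primaryTopic"


def is_back_to_front(subject: str, predicate: str) -> bool:
    """True when a fragment URI is used as the subject of foaf:primaryTopic.

    Assumption: a URI with a non-empty fragment is a non-information
    resource, so it cannot be "about" anything in the primaryTopic sense.
    """
    has_fragment = urldefrag(subject).fragment != ""
    return predicate == FOAF_PRIMARY_TOPIC and has_fragment


# "my chair is about this invoice": the fragment URI is the subject.
print(is_back_to_front("http://example.org/works/as_you_like_it#id",
                       FOAF_PRIMARY_TOPIC))
# "this invoice is about my chair": the document is the subject.
print(is_back_to_front("http://example.org/works/as_you_like_it",
                       FOAF_PRIMARY_TOPIC))
```

A filter like this only encodes the hash-URI convention; data that uses hash URIs for information resources would be flagged too, so whether to ignore or merely log such triples is a separate decision.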
I'd argue that the triple isn't nonsense: there are plenty of instances where non-information resources have non-hash URIs, and information resources have hash URIs, whether or not that's correct according to semantic web conventions (where it's usually non-information resources which have hash URIs). (Note that http://example.com/foo#data is an information resource, so it can have the non-information resource http://example.com/foo as its primaryTopic. It's just the URIs which are "wrong".) For example, I'm not sure of the status of terms in an ontology (whether they are information or non-information resources; I'd say non-information), but I've seen both non-hash URIs (http://xmlns.com/foaf/0.1/Agent) and hash URIs (http://purl.org/NET/c4dm/event.owl#Event) used to define them. They can't both be "correct" by your definition. Also, there are bound to be cases in the wild where the two are mixed up or used incorrectly (according to the convention), so some guard against this seems advisable. |
|
I agree with one thing @nevali said; the URI It would also not resolve logic errors for other predicates which we don't understand, for example when:
is defined to be illogical by the foo namespace, yet:
is fine. Having not been aware of this issue until yesterday, and not yet seen any code, I expect I have misunderstood at least two or three important points somewhere along the line so please point them out to me. |
Just to be devil's advocate: we may come to a point in the future where non-information resources (dogs, people) can be transferred over the internet as information. Even within my lifetime, books (in the sense of "a thing your eyes can physically read") have gone from being non-information resources to information resources. What happens to the NIR hash URIs when the objects they denote can be transmitted as information? Surely the NIR/IR distinction is purely a technological one? |
I don't agree that books have gone from non-information resources to information resources. The information resource you're referring to is a digital encoding of the text and images within a particular book (perhaps even a specific edition or publication of a book). Editions and publications are intangible non-information resources. Tangible non-information resources would include "this copy of such and such, which I've owned since I was eight and once dropped in a puddle". All four of those would have different URIs, and only one is retrievable. |
@townxelliot The point remains that an HTTP/HTTPS URI with a fragment cannot ever in LOD terms be an IR because the protocol doesn't work that way. @nickshanks FWIW the crawler itself isn't relevant here, this is purely about the data processing. |
When ingesting data that uses a resource as the named graph, a loop is created in the triggers.
Example:
the following file
sample.txt
generates: