Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

getSolidDataset is blocking #1158

Closed
mrkvon opened this issue Jul 20, 2021 · 5 comments · Fixed by #1173
Closed

getSolidDataset is blocking #1158

mrkvon opened this issue Jul 20, 2021 · 5 comments · Fixed by #1173
Labels
bug Something isn't working

Comments

@mrkvon
Copy link

mrkvon commented Jul 20, 2021

Search terms you've used

  • skimmed through all issues
  • solidDataset, getSolidDataset, blocking, synchronous

Bug description

Overview

When executing getSolidDataset with a particularly large turtle file, a javascript app freezes for a while
The async nature of getSolidDataset would suggest that the code is non-blocking, but the actual behaviour suggests otherwise.

Example

An app would freeze for 71 seconds when running getSolidDataset('https://ruben.verborgh.org/profile/#me')

Detail

On a closer look, it seems that the blocking code is the following:

Blocks for 60 seconds:
https://github.com/inrupt/solid-client-js/blob/main/src/rdfjs.internal.ts#L317-L319

Blocks for 11 seconds:
https://github.com/inrupt/solid-client-js/blob/main/src/rdfjs.internal.ts#L326-L347

Maybe these pieces of code could be more efficient? Maybe they run with high time complexity? (Didn't study it in detail seems like a long-running cycle detecting algorithm)

The time is just an example. Of course it will take different times in different context.

To Reproduce

  1. import { getSolidDataset } from '@inrupt/solid-client'
  2. run await getSolidDataset('https://ruben.verborgh.org/profile/#me')
  3. see that things freeze for a long time

Minimal reproduction

Used a custom code sandbox, not the one provided here.

code sandbox: https://codesandbox.io/s/gracious-buck-7xyvj?file=/src/index.ts
resulting app: https://7xyvj.csb.app/

Expected result

When await getSolidDataset('https://ruben.verborgh.org/profile/#me') is executed within an app, the app keeps running.
The getSolidDataset either runs faster, or runs asynchronously or both.

Actual result

The app freezes for a long time.

Environment

$ npx envinfo --system --npmPackages --binaries --npmGlobalPackages --browsers
 System:
    OS: Linux 5.12 Arch Linux
    CPU: (4) x64 Intel(R) Core(TM) i5-3320M CPU @ 2.60GHz
    Memory: 229.06 MB / 5.62 GB
    Container: Yes
    Shell: 😛 not relevant
  Binaries:
    Node: 16.2.0 - /usr/bin/node
    Yarn: 1.22.10 - /usr/bin/yarn
    npm: 7.15.0 - /usr/bin/npm
  Browsers:
    😛 not relevant
  npmPackages:
    @inrupt/solid-client: ^1.10.0 => 1.10.0 
    node-fetch: ^2.6.1 => 2.6.1 
  npmGlobalPackages:
    😛 not relevant

Additional

If you want to see a production app running into this issue, you can open a solid friends crawler, stay logged out and watch for a while, until the app reaches the profile https://ruben.verborgh.org/profile/#me and freezes

@mrkvon mrkvon added the bug Something isn't working label Jul 20, 2021
@NSeydoux
Copy link
Contributor

Hi @mrkvon, thanks for reaching out ! I'm not entierly sure, and I'm still investigating the issue, but what you are describing may be the symptom of a performance issue : it looks like the event loop gets stuck when processing the fetched data, which blocks other async calls from happening. I can reproduce the behaviour in NodeJS as well with the following snippet:

import { getSolidDataset } from "@inrupt/solid-client";

const wait = (time) => new Promise((res) => setTimeout(res, time));

(async () => {
  await getSolidDataset("https://ruben.verborgh.org/profile/#me");
  console.log("finished processing");
})();

(async () => {
let i=0;
while(i<100) {
        console.log("foo", i);
        i+=1;
        await wait(100);
}
})()

This shows that getSolidDataset isn't technically "blocking", because a couple of console log happen appropriately, but then as in your example shows the event loop gets hung on processing some operations in getSolidDataset, preventing the while loop to keep on going. Is that consistent with your experience ? I'll keep you updated as we work on resolving this.

@mrkvon
Copy link
Author

mrkvon commented Jul 24, 2021

@NSeydoux

Indeed that's consistent:

First, the turtle document gets fetched, async.
Then it gets parsed with n3, probably async, in any case fast.

So the event loop is still available for other stuff, like console.log.

Then the parsed data gets further processed, synchronously. As you mentioned, the event loop gets stuck here.

Specifically:

Blocks for 60 seconds:

blankNodeObjects.forEach((blankNodeObject) => {
cycleBlankNodes.push(...getCycleBlankNodes(blankNodeObject, quads));
});

Blocks for 11 seconds:

const chainBlankNodes = blankNodeSubjects
.concat(blankNodeObjects)
.filter((blankNode) => {
// ....removing those Blank Nodes that are part of a cycle...
if (
cycleBlankNodes.some((cycleBlankNode) =>
cycleBlankNode.equals(blankNode)
)
) {
return false;
}
// ...and then returning only those Blank Nodes that only occur in the
// Object position for a single Subject, i.e. that are part of a single
// chain:
const subjectsWithThisNodeAsObject = quads
.filter((quad) => quad.object.equals(blankNode))
.map((quad) => quad.subject);
return (
subjectsWithThisNodeAsObject.length > 0 &&
everyNodeTheSame(subjectsWithThisNodeAsObject)
);
});

I figured that out by putting a couple console.log inside the library and building it locally.

Even more specifically, the culprit seems to be getCycleBlankNodes. That's supposedly a recursive function that detects cycles in a graph. So the time complexity may be high, depending on the algorithm used. But can we do any better?

To fix this

It would be useful to understand:

  • why we need to detect the cycles in the first place,
  • how the cycle-detecting algorithm works, and
  • if it can be made more efficient.

There seem to be O(|V| + |E|) algorithms for detecting cycles. I suspect the algorithm used by getCycleBlackNodes is less efficient.

Here my knowledge of the issue ends. I hope the above analysis may help...


Alternatives?

Are there other, less powerful and less heavy methods in this library to fetch and process turtle documents? Which don't attempt to create full SolidDataset?

Otherwise one can always use n3 or something for parsing turtle and process the triples/quads on lower level; but that doesn't sound like fun.

Or do you have other ideas?

@Vinnl
Copy link
Contributor

Vinnl commented Jul 28, 2021

Thanks for the very detailed analysis @mrkvon! It even helped me find and fix a related bug :)

In #1173 I have submitted a PR that simply skips the chain detection algorithm if there are more than 20 Blank Nodes, since it's non-essential. That should solve/work around the 60-second block. You can give it a try today by installing the preview release using

npm install @inrupt/solid-client@fix1158-blocking

If you could let us know whether it improves the situation for you, that would be great :)

Parsing a Resource with many statements is unfortunately still expensive, and likely will be for a bit. For future reference, here's some ideas of things we could do:

  • Use something like requestIdleCallback to split up the parsing of individual Quads into small units of work that are executed when the script is not doing anything more important. A challenge here is that Node does not currently implement that, so we'd need some wrapper that either implements something similar there, or just keeps the work to be blocking when running in Node. Additionally, this would only make the generation of our data structures non-blocking - n3 would need to implement something similar to improve that half of the work.
  • Implement a direct parser from Turtle to our data structure. This would break the ability to implement your own parser though, and would only improve parsing complexity from O(2n) to O(n) (currently, n3 first does a pass to generate Quads, and then we do a pass over every Quad to generate our data structure).

@mrkvon
Copy link
Author

mrkvon commented Jul 28, 2021

@Vinnl
Thank you for looking into this!

Indeed with the fix, the blocking (of the getSolidDataset('https://ruben.verborgh.org/profile/#me')) was cut down to about 6-7 seconds:

  • parsing the document with n3 seems to block other code for 1 second
  • the cycles detection doesn't take any time now (well, it was skipped, wasn't it)
  • the remaining blocking time (~5 s) is taken by
    solidDataset = quadsWithoutChainBlankNodeSubjects.reduce(
    (datasetAcc, quad) =>
    addRdfJsQuadToDataset(datasetAcc, quad, {
    otherQuads: allQuads,
    chainBlankNodes: chainBlankNodes,
    }),
    solidDataset
    );

🎉

so I suppose this is as good as it gets. (also the size of the document at https://ruben.verborgh.org/profile/#me seems to be an exception rather than a regularity, so hopefully people won't bump into this issue very often)

requestIdleCallback sounds exciting! Like exactly what's needed in a case like this. Didn't know it.

Further thoughts

Maybe for some use cases the developer could make their own n3 wrapper which would filter out the irrelevant triples/quads. Then they would pass that custom parser to getSolidDataset. getSolidDataset seems to provide that option. This would cut away all the processing (that this library does) that's not relevant for the particular app. But maybe pruning the SolidDocument like this would mess up with e.g. saveSolidDatasetAt, i don't know. In any case should be good enough for reading data.

@Vinnl
Copy link
Contributor

Vinnl commented Aug 3, 2021

The change has now been shipped in 1.10.1. I'll keep this issue open for posterity in case someone wants to look into making parsing non-blocking too.

@Vinnl Vinnl reopened this Aug 3, 2021
mrkvon added a commit to mrkvon/friend-crawler that referenced this issue Aug 6, 2021
Try upgrading all other packages, too
Because there is a fix of the parsers getting stuck on big files
They'll still get stuck, but not that long.

inrupt/solid-client-js#1158
mrkvon added a commit to ditup/solid-interests that referenced this issue Aug 14, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants