
OrderedCollection: persist items order #568

Open
srosset81 opened this issue Feb 3, 2021 · 39 comments · May be fixed by #1355


@srosset81
Contributor

srosset81 commented Feb 3, 2021

Issue
Currently, to manage an OrderedCollection, items are ordered according to a predicate (for example as:published). But normally, the order of the items should itself be persisted.

Proposal

@srosset81
Contributor Author

srosset81 commented Oct 25, 2024

This would be important for performance!

For an inbox with 6000 activities, and with WAC permission checks, it takes 25-30s to load the first page (and every other page), because all activities must be ordered by the as:published predicate. If we correctly persisted the order of the items, we would not need to load and sort this whole list! And it would be better for standards compliance too.

@srosset81
Contributor Author

One possibility could be to persist the OrderedCollectionPages (SemApps does not persist them at the moment, they are auto-generated by the middleware). It would be faster to go to a specific page.

The ActivityStreams spec has a long section about collections:
https://www.w3.org/TR/activitystreams-core/#collections

@srosset81 srosset81 changed the title from "Persister l'ordre des items des OrderedCollection avec rdf:Seq" ("Persist the order of OrderedCollection items with rdf:Seq") to "OrderedCollection: persist items order" Oct 25, 2024
@Laurin-W
Contributor

Maybe it makes sense to distinguish here between the RDF representation and the implementation.
Personally, I quite like the sort predicate as a concept, because it fits better into the world of RDF and also local-first. Here it is the same: the RDF representation is not reflected in the structure of the RDF data persisted in the graph DB.

One possibility could be to persist the OrderedCollectionPages

You mean in the Redis cache, for example?

Spinning this further, we could even build a custom index (for at least some predicates) in the middleware to handle this, but it's not really pretty either...
Ideally, the database would support indexing based on a predicate, as SQL databases do.

@srosset81
Contributor Author

srosset81 commented Oct 27, 2024

Ideally, the database would support indexing based on a predicate, as SQL databases do.

The problem is not indexing. Fuseki handles indexing very well. The problem is that, by definition, a triplestore cannot keep track of the order of a list, so we need to rely on rdf:Seq or rdf:List to do that.

One possibility could be to persist the OrderedCollectionPages

You mean in the redis cache for example?

No, I was suggesting to persist it in the triple store. Right now, when we fetch a collection, we build "on the fly" the special API for collections, such as totalItems, first, last, etc. We also build "on the fly" the CollectionPage or OrderedCollectionPage. I was suggesting that maybe ALL this information could be persisted in the triple store, so that it could be fetched like any other resource (which would be much cleaner from a Solid point of view).

I'm not sure, however, that it's a good idea, because it would be a lot of work to "rearrange" the triplestore whenever there is a change. For example, the last predicate would need to be updated for all pages whenever a new page is added. The other problem is that, in the inbox and outbox, we are supposed to put the latest activity at the very top. So potentially we need to rework all the pages whenever there is a new item.

So with this solution, the GET may be faster, but the POST may be longer. This is not a big problem, but I'm more afraid of the kind of bugs that could come out of this.

@srosset81
Contributor Author

srosset81 commented Nov 5, 2024

FYI, I found out that the LDP paging specs allow some sort of ordering, but it is not persisted: rather, it is only a way for a server to inform the client of the way the resources have been ordered when paging is requested.

See the example of what can be added to a container:

<> ldp:pageSortCriteria  ( <#Sort-o.marketValue-Ascending> ).

<#Sort-o.marketValue-Ascending>
   a ldp:pageSortCriterion;
   ldp:pageSortOrder ldp:Ascending;
   ldp:pageSortPredicate o:marketValue.

That looks a bit like what we implement now for ActivityStreams ordered collections (like in this example). If we applied this part of the LDP paging spec, we would certainly have the same kind of performance issues.

Anyway, if LDP cannot handle ordered lists, then it means we probably can't consider ActivityStreams collections as another "representation" of LDP containers, as @elf-pavlik suggested in our recent call. Or at least not ordered collections (used for inboxes and outboxes).

@niko-ng

niko-ng commented Nov 13, 2024

This would be important for performance!

For an inbox with 6000 activities, and with WAC permission checks, it takes 25-30s to load the first page (and every other page), because all activities must be ordered by the as:published predicate. If we correctly persisted the order of the items, we would not need to load and sort this whole list! And it would be better for standards compliance too.

We will have to be very careful here on this topic.

I am planning to do some major work in NextGraph to have ordered collections (arrays) work well, performance wise, and also compatible with CRDT. But for now, we do not have that.

My first question is about this query that takes 30s. This is not normal and can probably be optimised.

The first problem I see is that you fetch all the triples into memory and order them in JS. This is suboptimal.
RDF and SPARQL can do that for you, and can also do range queries, but only if the datatype of as:published is an integer. Unfortunately, I see in the specs that it is a dateTime. Triplestores in general cannot index dates: Jena is one of them, and Oxigraph the same. So we have to convert the date into a long integer (unix timestamp), and then queries on that value (filter, order, range) will work with much better performance (including in Jena). More importantly, we will not need to load all the items into JS memory in order to do the ordering.
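The dateTime-to-timestamp conversion described above can be sketched in a few lines (a minimal sketch; the helper names are illustrative, not an existing SemApps API):

```javascript
// Sketch: store as:published both as an xsd:dateTime (for ActivityPub
// output) and as a long integer (for fast ordering/range queries).

// ISO 8601 dateTime -> unix timestamp (milliseconds)
const toTimestamp = (dateTime) => Date.parse(dateTime);

// unix timestamp -> ISO 8601 dateTime (to regenerate the date on output)
const toDateTime = (timestamp) => new Date(timestamp).toISOString();

const published = '2021-02-03T10:00:00.000Z';
const ts = toTimestamp(published);
console.log(ts); // 1612346400000
console.log(toDateTime(ts) === published); // the round trip is lossless
```

The integer form would only live in the store for indexing; the dateTime is what ActivityPub clients would see.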

Then the discussion about pagination is not an easy one. I started to talk about it on Matrix with Laurin, and we have to understand exactly what the use cases are, because that has an impact on how we implement it and on what kind of features we can skip in order to improve performance.

As far as I understood, we need to be able to add an item at the beginning of the list, and at the end of it, but not in the middle. Please confirm or debate on that here as it is important to know the answer now.

Another question: do we need random access to arbitrary page numbers "out of nowhere", like ?page=10, or can we reduce the use case to timeline scrolling, where we start from the last item in the collection and repeatedly request the pages one by one, going down into the "past" of the timeline? If this is the case, the API can be simpler, with just ?nextPageAfter=XXX&page_size=10, where XXX would be the long integer of the last activity we received in the previous page. Then it is easy to ask SPARQL for the 10 items preceding XXX (10 being the page size here, as an example).
There is nothing in the as:Collection specs that hints at a way to retrieve random pages! It seems that their use case is rather to start from the first page and iteratively go from one page to the next. This is indeed the behaviour of a timeline. Accessing random pages is more usually seen in a "table" representation of some data, where a pagination widget at the bottom lets you click on any page number. Is that something that we need for collections?

Then there is the question of how to retrieve the pages. They can be generated in memory on the fly (on request), or they can be persisted (I previously came up with this idea), but less easily if we can insert items at the beginning of the collection (it would create a first page that is not full, which is not very nice in terms of UX). And it is not possible at all if we want to insert random items in the middle of the list.

If we can persist the pages, then we do not need to persist the individual order of each item, as it is easily fetched by knowing the starting item of a page and querying for the next X ordered items (X being the page size). Jena will do that easily.

Please note that Jena has no support for rdf:Seq at all and will not index the data according to the rdf:_1 predicates. SPARQL will not help us here if we use this kind of order persisting, so I think this is a dead end.
Jena has support for rdf:List (which are linked lists), but only if the ARQ engine is loaded, which is not the case for us. And anyway, we do not want rdf:List (unless we change the use case and accept that random page lookup is not possible).

Please detail below the use cases (inbox, outbox, other ordered collections) and the requirements for each (insert at the beginning, at the end, in the middle), along with the rationale behind those requirements.

@niko-ng

niko-ng commented Nov 13, 2024

@srosset81

I was suggesting to persist [the OrderedCollectionPage]s in the triple store

in fact, I was suggesting that ;)

@srosset81
Contributor Author

Hi @niko-ng, thanks for your comment! Just a few quick answers, in the hope that they help you. I'll read your comment more thoroughly later, in case I missed some important information.

The first problem I see is that you fetch all the triples in memory, and order them in JS. This is suboptimal.
RDF and SPARQL can do that for you, and can also do range queries, but only if the datatype of as:published is an integer.

No, we don't sort them with JS. The SPARQL query that takes 30s is here:

const result = await ctx.call('triplestore.query', {
  query: `
    PREFIX as: <https://www.w3.org/ns/activitystreams#>
    SELECT DISTINCT ?itemUri
    WHERE {
      <${collectionUri}> a as:Collection .
      OPTIONAL {
        <${collectionUri}> as:items ?itemUri .
        ${options.ordered ? `OPTIONAL { ?itemUri <${options.sortPredicate}> ?order . }` : ''}
      }
    }
    ${
      options.ordered
        ? `ORDER BY ${options.sortOrder === 'http://semapps.org/ns/core#DescOrder' ? 'DESC' : 'ASC'}( ?order )`
        : ''
    }
  `,
  accept: MIME_TYPES.JSON,
  webId
});

Triplestores in general cannot index dates: Jena is one of them, and Oxigraph the same. So we have to convert the date into a long integer (unix timestamp), and then queries on that value (filter, order, range) will work with much better performance (including in Jena). More importantly, we will not need to load all the items into JS memory in order to do the ordering.

OK, but then this would be something internal to NextGraph? Because if we want to conform to the ActivityPub specs, the published date must be a... date.

As far as I understood, we need to be able to add an item at the beginning of the list, and at the end of it, but not in the middle. Please confirm or debate on that here as it is important to know the answer now.

Well, the specs about OrderedCollections do not specify that, so in theory we should be able to add items anywhere we want. But in ActivityPub, OrderedCollections are used only for inboxes and outboxes (and I don't know of other usages in the fediverse, but I'm not an expert), so naturally items should be added at the beginning or the end (the most recent activities should appear on the first page, but we could use a reverse order).

Another question: do we need random access to arbitrary page numbers "out of nowhere", like ?page=10, or can we reduce the use case to timeline scrolling, where we start from the last item in the collection and repeatedly request the pages one by one, going down into the "past" of the timeline? If this is the case, the API can be simpler, with just ?nextPageAfter=XXX&page_size=10, where XXX would be the long integer of the last activity we received in the previous page. Then it is easy to ask SPARQL for the 10 items preceding XXX (10 being the page size here, as an example).
There is nothing in the as:Collection specs that hints at a way to retrieve random pages! It seems that their use case is rather to start from the first page and iteratively go from one page to the next. This is indeed the behaviour of a timeline. Accessing random pages is more usually seen in a "table" representation of some data, where a pagination widget at the bottom lets you click on any page number. Is that something that we need for collections?

Collections only need to implement the first, prev, next and last predicates (and I'm not even sure they are all required, but they are probably recommended). So we can use any query strings we want. What you are suggesting reminds me of the way pagination is usually handled with GraphQL (with what they call "edges").

Please note that Jena has no support for rdf:seq at all, and will not index the data according to the rdf:_1 predicates. SPARQL will not help us here if we use this kind of order persisting. So I think this is a dead end.

Do we really need special SPARQL queries here? If we want items 20 to 30 of an rdf:Seq, can't we just retrieve rdf:_20, rdf:_21, and so on?
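For illustration, fetching such a slice via the rdf:_n membership predicates could look like this (a sketch; `seqUri` and the query shape are assumptions, not SemApps code):

```javascript
// Fetch items `from`..`to` of an rdf:Seq directly via their rdf:_n
// membership predicates, without sorting the whole list.
const RDF = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#';

const seqSliceQuery = (seqUri, from, to) => {
  const predicates = [];
  for (let i = from; i <= to; i++) predicates.push(`<${RDF}_${i}>`);
  return `
    SELECT ?index ?itemUri
    WHERE {
      VALUES ?index { ${predicates.join(' ')} }
      <${seqUri}> ?index ?itemUri .
    }
  `;
};

console.log(seqSliceQuery('http://example.org/inbox-seq', 20, 30));
```

Reading such a slice is cheap; the catch is rewriting all the rdf:_n triples when items are inserted or removed before the slice.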

@Laurin-W
Contributor

Laurin-W commented Nov 13, 2024

As far as I understood, we need to be able to add an item at the beginning of the list, and at the end of it, but not in the middle. Please confirm or debate on that here as it is important to know the answer now.

I'd say so too. For removing items, though, it's not the same. E.g. if I undo a Like activity, I don't want it to show up in the liked collection, about which the spec says:

Every actor MAY have a liked collection. This is a list of every object from all of the actor's Like activities, added as a side effect. The liked collection MUST be either an OrderedCollection or a Collection and MAY be filtered on privileges of an authenticated user or as appropriate when no authentication is given.

So we also get the problem of filtering (which we have already though).

do we need to randomly access arbitrary page numbers "out of nowhere" ? like ?page=10 or can we reduce the usecase to a timeline scrolling where we start from the last item in the collection, and we repeatedly want the pages one by one going down in the "past" of the timeline [...]
It seems that their usecase is more to start from the first page, and iteratively go from one page to the next

I would say so too. Though I see a use case where you want to see all posts that were added before, e.g., 2023.

This is indeed the behaviour of a timeline. accessing random pages is more usually seen on a "table" representation of some data, where you have a pagination widget at the bottom that lets you click on any page number. Is that something that we need for collections?

I don't think so; you might want to apply a filter first, though, as described above. Can we reduce the problem to applying filters and then scrolling from there sequentially?

@niko-ng

niko-ng commented Nov 13, 2024

OK, but then this would be something internal to NextGraph ? Because if we want to conform to ActivityPub specs, the published date must be a ... date.

My bad, if you use SPARQL for ordering, even better! But indeed, the 30 seconds come from the fact that SPARQL cannot index dates. You would store both a date and a timestamp, and return only the date to ActivityPub-compliant APIs; the timestamp would just be used internally. Or, if you want to save space, you would not save the dateTime, and instead regenerate it from the timestamp at the moment of outputting the data towards ActivityPub systems. Nothing specific to NextGraph here. I tried to stay away from anything specific to NextGraph in my answer, because I know it concerns a task that is starting now.

@niko-ng

niko-ng commented Nov 13, 2024

Well, the specs about OrderedCollections do not specify that, so in theory we should be able to add items anywhere we want. But in ActivityPub, OrderedCollections are used only for inboxes and outboxes (and I don't know of other usages in the fediverse, but I'm not an expert), so naturally items should be added at the beginning or the end (the most recent activities should appear on the first page, but we could use a reverse order).

If we have the choice, then it is better to add at the end (and reverse the order if needed), because adding at the end is always simpler for managing pages.

The question of "insert in the middle" might occur if you have activities added to the inbox, for example, by a remote server that was offline for some time (for some technical reason) and that will add "outdated" activities, with an as:published date that is slightly behind the current time. If some more recent activities are already present in the inbox, then those incoming "old" activities will be inserted "in the middle"... What do you think of this case? Can it occur?

@niko-ng

niko-ng commented Nov 13, 2024

Do we really need special SPARQL queries here? If we want items 20 to 30 of an rdf:Seq, can't we just retrieve rdf:_20, rdf:_21, and so on?

Yes we can, but each "fetch" will be inefficient.
And we need to take into account that if there are inserts in the middle, then all the rdf:_xx predicates above the insertion point must be modified. Same if there is a delete in the middle.

Eventually, it will be good to know if we really need "random access to pages" or if we can use only next and previous.

And also, maybe use a special kind of URL for next and previous that, in fact, contains the URI of the last resource in the current page (for next) or the first resource of the current page (for previous), so that it will be easy to find the next/previous items and build the page on the fly (like GraphQL is doing, as you say). This mechanism also works if there are concurrent inserts or deletes between the navigation from one page to another, so that the page size is always correct. And it works without persisting any metadata about the pages.
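Such cursor-style links could be built along these lines (a sketch; the `after`/`before`/`pageSize` parameter names are assumptions, not part of any spec):

```javascript
// Build `next`/`previous` links for a page: `next` embeds the last item of
// the current page, `previous` the first one. The server decodes the cursor
// to build the requested page on the fly.
const pageLinks = (collectionUri, pageItems, pageSize = 10) => {
  const next = new URL(collectionUri);
  next.searchParams.set('after', pageItems[pageItems.length - 1]);
  next.searchParams.set('pageSize', String(pageSize));

  const previous = new URL(collectionUri);
  previous.searchParams.set('before', pageItems[0]);
  previous.searchParams.set('pageSize', String(pageSize));

  return { next: next.toString(), previous: previous.toString() };
};

const links = pageLinks('http://localhost:3000/alice/outbox', [
  'http://localhost:3000/alice/activity-41',
  'http://localhost:3000/alice/activity-50'
]);
console.log(links.next);
console.log(links.previous);
```

To the client these URIs stay opaque hypermedia links, which keeps the approach compatible with the AS paging model.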

@niko-ng

niko-ng commented Nov 13, 2024

I don't think so; you might want to apply a filter first, though, as described above. Can we reduce the problem to applying filters and then scrolling from there sequentially?

Yes, that would be great. It also solves the "before 2023" question: you just need to find the starting point, and then iterate pages with next. Finding the starting point is one SPARQL query, which can be quick.

Filtering can also be quick, especially if the ordering is based on long integer timestamps, as explained above. The 30 sec. you currently see is because Jena doesn't keep indexes on dateTime values.

@niko-ng

niko-ng commented Nov 14, 2024

BTW, I just saw that Oxigraph automatically converts dateTime datatypes to long in storage, and back to dateTime when outputting. This is smart, because indexing on long is then done automatically. I remember that Jena does not do that and does not index dateTime.

@srosset81 and @Laurin-W would you have a moment to answer my questions above so we can have a final decision on this topic?

@Laurin-W
Contributor

About servers coming back online and synchronizing that with your collection: I think that makes sense if you want to have a chronological order. I'd say we can live without that if it makes things easier, but it's definitely a nice-to-have!

Eventually, it will be good to know if we really need "random access to pages" or if we can use only next and previous.

Agree, I don't think we need that. But filtering would be nice, as detailed above.
With that, I can't think of a good use case where you want to have items 580 to 590 if you don't know what's before or after...
The only thing that came to my mind was that if you had search results ranked by relevance somehow and you wanted to know what lower-ranked results look like, you could jump to page 15 or so. But that's not really a use case for social media, if at all.

And also, maybe use a special kind of URL for next and previous that in fact, contains the URI of the last resource in the current page (for next) or the first resource of the current page (for previous) so that it will be easy to find the next/previous items and build the page on the fly.

I like that idea! :)

BTW, I just saw that OxiGraph is automatically converting DateTime dataTypes to long

Nice! :D

@niko-ng

niko-ng commented Nov 18, 2024

Thanks for your reply @Laurin-W

So if I summarise, the solution could be:

  • We do not persist pages; they are generated on the fly.
  • We do not allow "random access to pages", saving us a lot of trouble.
  • Instead, each page has a next and previous link that includes the last or first item of the current page (respectively), so that the pod provider (the middleware) can construct the next or previous page.
  • All pages have a first and last link, which is just "URI of collection" + "/last" or + "/first", and never changes. It is the middleware that will generate those pages dynamically.
  • A parameter can be passed to define the page size (called page_size?); it can also have a default value set system-wide.
  • We do not use rdf:Seq, as it isn't needed.
  • The ordering is done by the SPARQL engine (Oxigraph) with the timestamps that are indexed automatically. On Jena, we will have to convert the dateTime values into long (unix timestamp) and store them this way, then convert the long back to a dateTime when presenting the data in a page.
  • When a new page needs to be generated, we take the starting item and FILTER (?time > [last]) with it, then ORDER BY and LIMIT.
  • Additional filtering can be done with SPARQL before the LIMIT; any combination is possible.
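The last two points of the summary could be sketched like this (a hedged sketch; the `semapps:publishedTimestamp` predicate name is a hypothetical placeholder for illustration):

```javascript
// Build the SPARQL for "the page after item X", using an indexed integer
// timestamp: FILTER past the cursor, then ORDER BY and LIMIT.
const nextPageQuery = (collectionUri, lastTimestamp, pageSize = 10) => `
  PREFIX as: <https://www.w3.org/ns/activitystreams#>
  PREFIX semapps: <http://semapps.org/ns/core#>
  SELECT ?itemUri ?time
  WHERE {
    <${collectionUri}> as:items ?itemUri .
    ?itemUri semapps:publishedTimestamp ?time .
    FILTER (?time > ${lastTimestamp})
  }
  ORDER BY ASC(?time)
  LIMIT ${pageSize}
`;

console.log(nextPageQuery('http://localhost:3000/alice/outbox', 1612346400000));
```

Additional filters would simply be added to the WHERE block before the LIMIT, as the last bullet suggests.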

@Laurin-W
Contributor

Perfect, that sounds good!
Thanks for writing this summary :)

@elf-pavlik

I only skimmed the details of this conversation. When AS paged collections were drafted, besides the parallel work on LDP pagination, there was also an exploration of pagination in the Hydra CG: https://www.w3.org/community/hydra/wiki/Collection_Design#Pagination

The one in Hydra was my favorite at the time.

BTW you might consider submitting your pagination, ordering and filtering use cases to the Linked Web Storage WG https://github.com/w3c/lws-ucs

@niko-ng

niko-ng commented Nov 19, 2024

Thanks @elf-pavlik for looking into this issue and finding it worthy of being presented to the LWS WG.
We did have a look at the Hydra collections, but they have the caveat of relying on ?page=2, which is a "random access to pages" mechanism that we want to avoid, as explained above.

BTW, while you are here, I want to ask your opinion:
What do you think about this feature of "random access to pages", which is hard to implement and which we specifically want not to implement? Do you see some use case for accessing a random page number?
I believe that starting from somewhere and then doing next > next > next or previous < previous < previous is enough for all cases of scrolling through a collection, streaming it, and going to the "current head" and "first item".

@srosset81
Contributor Author

srosset81 commented Nov 19, 2024

Thanks @niko-ng for your proposal and sorry for my late reply.

So I understand that if we store the as:published predicate as a timestamp, the ordering will be faster. But if there are 6000 items in the collection, Fuseki will still need to go through the timestamps of these 6000 items (and, more importantly, verify that the user has the permission to see each item, because otherwise it won't be able to access the timestamp) to sort them correctly. So will performance be much better? I wonder. And does it matter so much to go through pages with the URI of the first/last item if Fuseki has already sorted the 6000 activities anyway?

  • a parameter can be passed to define the page size (called page_size ?), that can also have a default value set system-wide.

This is actually persisted in the collection, with the semapps:itemsPerPage predicate. We could allow changing this parameter with a query string, but this is not standardized. Maybe it's better to add this option when we tackle the issue of filtering collections (which is also not standardized).

If performance can really be improved with this proposal, the only thing that bothers me is that we will not conform to the ActivityPub specs, which expect a list, as we can see in the official JSON-LD context. I found again this thread where people criticize this choice of not persisting the order. It does not have an impact for ActivityPub servers, which don't care about how you store data. But we will still store data in a way that is particular to SemApps.

It's a choice we can make, but it's important to be aware of this downside.

If performance is much better with Niko's proposal than if we persist the order, then I'd say performance matters more than standards compliance in this case.

@elf-pavlik

We did have a look at the Hydra collections, but they have the caveat of relying on ?page=2, which is a "random access to pages" mechanism that we want to avoid, as explained above.

Looking at Hydra's PartialCollectionView hypermedia controls, I see only four affordances: first, previous, next, and last. The edges would only have two; for example, the first part would have next and last.
I imagine a URI Template would be required to provide a functional hypermedia control that allows access to a random page. Otherwise, clients have to treat all URIs as opaque.

I don't find accessing a random page/part based on its index useful. Filtering the collection before paging it would be more practical, e.g., all the activity from the last month. I still haven't looked at your issue about filtering.

@niko-ng

niko-ng commented Nov 19, 2024

I see only four affordances: first, previous, next, and last

Yes. Those next and last are similar to what as:OrderedCollectionPage has, and this is what we will use. But the question is what to put as the URL value in the next and last properties.

And I was just saying that in their example, they put URLs that have a query parameter ?page=2. It is just an example, I guess, but it denotes that they consider their PartialCollectionView to have a URL of this form, which is not something we would like to implement, because it comes with a high performance cost.

But yes, generally I agree that the Hydra and AS ontologies for pagination are very similar. And honestly, I don't know which one is best.

@pietercolpaert

Let me add another alternative next to Hydra and AS:

I proposed the TREE specification for that use case: the server has a fixed pagination and the client needs to follow links. However, these links are explained: i.e., you can follow a link to a page that is indicated to contain, for example, entities later in time. This way you can build search trees, instead of just providing traversal in one or two directions with next and previous, and the client can understand a sense of ordering.

@SlyRock SlyRock self-assigned this Dec 2, 2024
@SlyRock
Contributor

SlyRock commented Dec 5, 2024

Hi guys,

The ordering is done by the SPARQL engine (Oxigraph) with the timestamps that are indexed automatically. On Jena, we will have to convert the dateTime values into long (unix timestamp) and store them this way, then convert the long back to a dateTime when presenting the data in a page.

After discussing this issue with @srosset81, I did a "quick" benchmark to see if using a timestamp had any impact on the performance of the query in question. Sadly, I found it didn't...

Here's the methodology I used:

  1. Generate a respectable amount of data in the outbox of a pod, with a supplementary publishedTimestamp field alongside the published field. With all the side effects, it was quite time-consuming, so I stopped the process at 1000 activities, which was enough to start getting some long response times.
  2. Query that list while sorting by one field then the other, ASC and DESC (for good measure), with ten iterations of each case.

The results were consistent: around 7 seconds for all cases alike.

{
  collectionUri: 'http://localhost:3000/sylvain/outbox',
  iterations: 10,
  metrics: {
    publishedAsc: {
      mean: 7160.1521171,
      max: 7439.259866,
      min: 6972.5529559999995,
      stddev: 183.10672135111417
    },
    publishedDesc: {
      mean: 7135.4899212,
      max: 7425.90527,
      min: 6953.325168,
      stddev: 142.5905536293837
    },
    publishedTimestampAsc: {
      mean: 7176.426645899999,
      max: 7468.686462,
      min: 7037.848225,
      stddev: 114.1708708671262
    },
    publishedTimestampDesc: {
      mean: 7258.4688469,
      max: 7573.864117,
      min: 6995.655813,
      stddev: 211.2586274176565
    }
  }
}

The same query without checking the WAC (webid system) is around 110ms:

{
  collectionUri: 'http://localhost:3000/sylvain/outbox',
  iterations: 10,
  metrics: {
    publishedAsc: {
      mean: 134.372504,
      max: 196.038779,
      min: 101.399353,
      stddev: 27.674281999092027
    },
    publishedDesc: {
      mean: 120.93124560000001,
      max: 164.976534,
      min: 97.509598,
      stddev: 16.894840022577135
    },
    publishedTimestampAsc: {
      mean: 110.37560160000001,
      max: 124.70949,
      min: 100.440995,
      stddev: 7.989068604510643
    },
    publishedTimestampDesc: {
      mean: 107.18265260000001,
      max: 122.531356,
      min: 88.062927,
      stddev: 9.926224379452332
    }
  }
}

We reviewed the benchmarking process with @srosset81 this morning and found no fault.
So unless we missed something, it means that using the timestamp isn't a solution to this issue...
@niko-ng, you seem really sure about that option, so maybe it would be worthwhile for you to take a look at my benchmarking process and see if we missed something!

The code of the benchmark is on a branch of the activitypods repo: https://github.com/activitypods/activitypods/tree/benchmark-query
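For reference, metrics like the ones reported above (mean/max/min/stddev over ten iterations) can be computed with a harness along these lines (a sketch; the timed function is a stand-in, not the actual benchmark code from the branch):

```javascript
// Population statistics over a set of duration samples (milliseconds).
const stats = (samples) => {
  const mean = samples.reduce((a, b) => a + b, 0) / samples.length;
  const variance =
    samples.reduce((acc, x) => acc + (x - mean) ** 2, 0) / samples.length;
  return {
    mean,
    max: Math.max(...samples),
    min: Math.min(...samples),
    stddev: Math.sqrt(variance)
  };
};

// Time an async function over N iterations and summarise the samples.
const timeIt = async (fn, iterations = 10) => {
  const durations = [];
  for (let i = 0; i < iterations; i++) {
    const start = performance.now();
    await fn();
    durations.push(performance.now() - start);
  }
  return stats(durations);
};

// With fixed samples instead of real timings:
console.log(stats([7160, 7439, 6972]));
```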

@niko-ng

niko-ng commented Dec 5, 2024

Hello :)
That's great that you ran a benchmark! Congrats! It will be useful to have this in the future.
I understand that the very bad performance already occurs with 1000 records, so the problem is not with the indexing capacity of Jena, but with the WAC. Your benchmark proves it very well.
And yes, I still think that indexing on an xsd:integer is much more efficient in Jena than on an xsd:dateTime, but with 1000 records you cannot see the difference. Maybe we see a little difference in performance when not using WAC (110ms instead of 134ms, 107 instead of 120), but overall the main problem is with the WACs, and that's where all the performance is degraded, rendering the gain from using the integer insignificant, especially on a 1000-activity sample.

For the time being, I can only advise you to try not using the WAC if you want reasonable performance.
Another solution could be to store the publishedTimestamp predicate inside the container itself. But that would imply that the subject of this predicate is the activity, and maybe it is a problem to store, in a container, some triples with a subject other than the container itself?

# inside the container's resource
<activity1> as:publishedTimestamp "123456" .
<activity2> as:publishedTimestamp "123457" .
<activity3> as:publishedTimestamp "123458" .

If you can try this quickly, I think the WACs will have less impact on perfs if done like this.
I am not saying it is a solution we can adopt in general in SemApps. But just for the sake of benchmarking, I think you will get much better perfs if all the as:publishedTimestamp triples are in the same resource (a container).

Anyway thank you very much for your efforts in solving this issue!

@niko-ng

niko-ng commented Dec 5, 2024

And in general I am kind of surprised to see such bad perfs with WAC. @srosset81, are those the kind of perfs you are used to when using WACs? For 1000 records, it shouldn't be that bad.

@srosset81
Contributor Author

srosset81 commented Dec 5, 2024

For the time being, I can only advise you to try not using the WAC if you want reasonable perfs.

One idea we had with @SlyRock is indeed to run this long SPARQL query without WAC. So basically it would return all activities sorted by date, even those the logged-in user is not allowed to see. We would then go through these activities one by one, ignoring errors thrown for activities the user is not allowed to see, until we have the requested number of activities.

The performance would be much better, except in the case where a user fetches a collection containing thousands of activities but is only allowed to see a few of them. In that case, performance would probably be much worse. It's difficult to evaluate which scenario is the most probable.

The other problem is that we will not be able to give the number of activities in the collection, since we don't know how many of them are visible to the logged-in user without running the long SPARQL query. But that's a problem we would probably also have if we used rdf:Seq or rdf:List.
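A minimal sketch of this filtering loop, in JavaScript (all names here are hypothetical, not actual SemApps APIs): `sortedUris` stands for the full list of item URIs returned by the SPARQL query run without WAC, already sorted by date, and `canRead` stands for the permission check of the logged-in user on a single resource.

```javascript
// Sketch of the "query without WAC, filter afterwards" idea.
// sortedUris: full sorted list of item URIs (WAC checks disabled)
// canRead: permission check for one resource; may throw for forbidden items
async function fillPage(sortedUris, canRead, pageSize) {
  const page = [];
  for (const uri of sortedUris) {
    try {
      if (await canRead(uri)) page.push(uri);
    } catch (e) {
      // Forbidden item: ignore the error and move on
    }
    if (page.length === pageSize) break; // page is full, stop early
  }
  return page;
}
```

The try/catch mirrors the "ignoring errors thrown" part of the idea: a forbidden item just doesn't count towards the page.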

Another solution could be to store the publishedTimestamp predicate inside the container itself. But that would imply that the subject of this predicate is the activity, and maybe it is a problem to store, in a container, some triples with a subject other than the container itself?

```turtle
# inside the container's resource
<activity1> as:publishedTimestamp "123456" .
<activity2> as:publishedTimestamp "123457" .
<activity3> as:publishedTimestamp "123458" .
```

Unfortunately I don't think it's possible at the moment, since an RDF document cannot have subjects other than the document URI. :( At least not until ActivityPods 3.0... That would be a good idea though; could it be done in another way?

@srosset81
Contributor Author

srosset81 commented Dec 5, 2024

And in general I am kind of surprised to see such bad perfs with WAC. @srosset81, are those the kind of perfs you are used to when using WACs? For 1000 records, it shouldn't be that bad.

I'm not surprised. As soon as there are many permissions to check, queries can take several seconds to run. And in the worst cases, it can be several minutes. @simonLouvet can tell you about it.

@niko-ng

niko-ng commented Dec 5, 2024

This is not normal.
I should check it. The only problem is that I don't really have time for that, and we want to reimplement WACs anyway with NextGraph, so I don't know what to do.
The fact that you have a benchmark now will encourage me to look into the Java code again.
Is the data set that you use, @SlyRock, only containing one user and this collection with 1000 activities?

@SlyRock
Contributor

SlyRock commented Dec 6, 2024

Is the data set that you use, @SlyRock, only containing one user and this collection with 1000 activities?

@niko-ng I started with a fresh install, created 2 pods, ran this service's only action through moleculer console to send Notes from one pod to the other.

So I have one pod with a 1000-activity outbox and the other one with a smaller inbox, because I interrupted the script before all the queued tasks were handled.

I then ran the benchmark on the 1000-activity outbox, using this service's only action through the moleculer console.

Don't hesitate to ask if it needs clarification :)

@simonLouvet
Contributor

@mguihal to look at date ordering

@srosset81
Contributor Author

srosset81 commented Jan 14, 2025

One idea we had with @SlyRock is indeed to run this long SPARQL query without WAC. So basically it would return all activities sorted by date, even those the logged-in user is not allowed to see. We would then go through these activities one by one, ignoring errors thrown for activities the user is not allowed to see, until we have the requested number of activities.

The performance would be much better, except in the case where a user fetches a collection containing thousands of activities but is only allowed to see a few of them. In that case, performance would probably be much worse. It's difficult to evaluate which scenario is the most probable.

The other problem is that we will not be able to give the number of activities in the collection, since we don't know how many of them are visible to the logged-in user without running the long SPARQL query. But that's a problem we would probably also have if we used rdf:Seq or rdf:List.

On further thinking, we will also have the performance problems mentioned above if we use rdf:Seq or rdf:List: the items will be ordered, but we will still need to go through them to check whether they are visible to the logged-in user.

With that in mind, I would suggest running the long SPARQL query without WAC (with webacl: 'system'), and then fetching these activities with the webId of the logged-in user. We will avoid a complete refactoring of all existing collections, and performance will be much better.

There is still the issue of pagination: if a collection has 100 items, and the first page returned the items N°4, 8, 15, 28 and 35 (if we have 5 items per page), when loading the second page we want to make sure we start looking AFTER item N°35. So a pointer will be necessary instead of the page param we use now.

  • next: ?after={URI of last item returned in the page} (or no link if we are at the end of the list)
  • prev: ?before={URI of first item returned in the page}
  • first: ?after={URI of first item} (or first=true if it's complicated to find/keep this URI)
  • last: ?before={URI of the last item} (or last=true if it's complicated to find/keep this URI)

As for totalItems, it is optional, so we won't return it (alternatively, we could display the total number of items, including those not visible to the logged-in user, but that would be misleading).
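As an illustration, the next/prev links above could be built from a returned page like this (a sketch only; `collectionUri` and the `hasMore*` flags are assumptions for the example):

```javascript
// Build cursor-based pagination links from the page that was just returned.
// pageItems: ordered list of item URIs visible to the logged-in user
function buildPageLinks(collectionUri, pageItems, hasMoreAfter, hasMoreBefore) {
  const links = {};
  if (pageItems.length > 0) {
    const lastUri = pageItems[pageItems.length - 1];
    const firstUri = pageItems[0];
    if (hasMoreAfter) links.next = `${collectionUri}?after=${encodeURIComponent(lastUri)}`;
    if (hasMoreBefore) links.prev = `${collectionUri}?before=${encodeURIComponent(firstUri)}`;
  }
  return links;
}
```

Since the cursors are item URIs rather than page numbers, the links stay valid even when items invisible to the user sit between two pages.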

@niko-ng @Laurin-W Does it sound good to you? @SlyRock will need to work on this soon, so it's important we make a decision.

@pietercolpaert

I still think a search tree design would be more efficient than a list of pages that can be traversed in one (or two) directions. It would also be a more generic solution: with TREE you can describe the links to the child nodes based on a comparison of entities, and thus allow for any of the above potential solutions without having to hard-code them in a client implementation.

@niko-ng

niko-ng commented Jan 14, 2025

TREE is indeed very elegant, but in this specific case, we would need to persist a different tree:Collection for each user, as we have to enforce permissions and different users would see different resources listed in the Collection. Also, every time the permissions change, we would have to rebuild all those Collections.

The major issue we are trying to solve right now is the latency of the permission system in SemApps.
I find the solution proposed here good enough. The next and prev described by Seb now are exactly what I proposed, and I find this solution suits us for now. As for rdf:Seq and rdf:List, I had discarded them a long time ago anyway, as they do not contribute to improving performance and only add clutter and cumbersome metadata.

One thing to take into account: when you list the items in the Collection, you will want to order them by timestamp (as I explained above), filter them so they are all after the next (or before the prev), and probably also limit the results to a fixed quantity. Exactly which quantity is not easy to answer, because if you need to produce a page of 5 items, asking for 5 items will probably not be enough: you will go over them and test their permissions, returning only those the user has access to. So you might want to ask for 10 items in the first place (say, double the page size). But this can also fail, if the first 10 items returned are all forbidden to the user. In that case, you can loop back to get the next 10 items and try again, and so on, until you eventually get your 5 items for the page, or maybe fewer, or maybe even none.
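The batch-and-retry loop described here could be sketched as follows (hypothetical names; `fetchBatch` stands for the SPARQL query returning the next chunk of sorted item URIs, `canRead` for the permission check):

```javascript
// Over-fetch in batches and retry until the page is full or the list ends.
// fetchBatch(offset, limit): next `limit` sorted item URIs, WAC disabled
// canRead(uri): permission check for one item
async function fillPageInBatches(fetchBatch, canRead, pageSize) {
  const page = [];
  let offset = 0;
  const batchSize = pageSize * 2; // over-fetch, e.g. double the page size
  while (page.length < pageSize) {
    const batch = await fetchBatch(offset, batchSize);
    if (batch.length === 0) break; // end of the collection reached
    for (const uri of batch) {
      if (await canRead(uri)) {
        page.push(uri);
        if (page.length === pageSize) break;
      }
    }
    offset += batch.length;
  }
  return page; // may hold fewer than pageSize items, or even none
}
```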

@srosset81
Contributor Author

srosset81 commented Jan 14, 2025

TREE is indeed very elegant, but in this specific case, we would need to persist a different tree:Collection for each user, as we have to enforce permissions and different users would see different resources listed in the Collection. Also, every time the permissions change, we would have to rebuild all those Collections.

Indeed, we need to stay compatible with ActivityPub, so using TREE would mean persisting all data twice. @pietercolpaert Maybe this is an issue to raise in the ActivityPub community? (Evan Prodromou, one of the core AP editors, has been working on improving/fixing the specs in the past few months.)

...you will want to order them by timestamp (as I explained above)...

Since we have seen that performance was not much improved by using a timestamp, and since NextGraph/Oxigraph will handle this natively (as you explained), I would not bother creating timestamps.

Exactly which quantity is not easy to answer, because if you need to produce a page of 5 items, asking for 5 items will probably not be enough: you will go over them and test their permissions, returning only those the user has access to. So you might want to ask for 10 items in the first place (say, double the page size). But this can also fail, if the first 10 items returned are all forbidden to the user. In that case, you can loop back to get the next 10 items and try again, and so on, until you eventually get your 5 items for the page, or maybe fewer, or maybe even none.

Please remember that sorting cannot be done on a partial subset: we are forced to go through the whole list to establish the full order.

Once we have the full sorted list, we can probably request items one by one, discarding items we don't have the right to read, until we have the right number of items. I don't think performance will be much improved by doing "batch requests", and it will be more complicated to check permissions with Moleculer (we must keep in mind that very soon all permission checks will be done on the Moleculer side).

@niko-ng

niko-ng commented Jan 14, 2025

I meant timestamp or whatever you use for sorting ;) Not specifically integer timestamps. Sorry if it wasn't clear.

So you are saying that you prefer to make a query that returns ALL the items in the Collection, even if this collection is really big? And then take that list into memory, find the right item to start with, and then iterate and try to fill the page?

I thought you would at least give a filter to SPARQL so that it returns only the items after the next (or before the prev), which will save you at least that amount of JS memory. SPARQL will order first and filter after, obviously.

About the optimisation of whether to ask for all items until the end (or the beginning) or to go by chunks and repeat, I don't know what is best; you would have to try. But I am just afraid of the case where there are millions of items and you have to keep them all in memory until you fill your page.

@srosset81
Contributor Author

srosset81 commented Jan 14, 2025

So you are saying that you prefer to make a query that returns ALL the items in the Collection, even if this collection is really big? And then take that list into memory, find the right item to start with, and then iterate and try to fill the page?

I thought you would at least give a filter to SPARQL so that it returns only the items after the next (or before the prev), which will save you at least that amount of JS memory. SPARQL will order first and filter after, obviously.

Indeed, in the case where we have a pointer, we can add a filter to the SPARQL query to immediately discard items that are before or after the publication date of the given pointer, so as to reduce the number of returned items. But Fuseki will still need to go through ALL items; this is unavoidable when sorting.

On the JS side, we will then receive a nicely ordered list of URIs and can go through the list after or before the URI of the pointer. Yes, we will need to keep the list in memory, at least for a brief moment, but we won't have millions of items; the maximum now is several thousand (I'm generally against over-optimization - I prefer to optimize when performance problems arise).
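For illustration, the filtered query could be assembled like this (a sketch only; the as:items and as:published predicate IRIs are assumptions for the example, not SemApps' actual query builder):

```javascript
// Build a SPARQL query that orders the whole collection by publication date
// but only returns the items published after the pointer's date.
// collectionUri and pointerDate are hypothetical parameters.
const buildPageQuery = (collectionUri, pointerDate) => `
  SELECT ?item WHERE {
    <${collectionUri}> <https://www.w3.org/ns/activitystreams#items> ?item .
    ?item <https://www.w3.org/ns/activitystreams#published> ?published .
    FILTER(?published > "${pointerDate}"^^<http://www.w3.org/2001/XMLSchema#dateTime>)
  }
  ORDER BY ?published
`;
```

The triple store still has to order the full set to evaluate ORDER BY, but the FILTER keeps the result set (and thus the JS-side memory) small.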

@niko-ng

niko-ng commented Jan 14, 2025

But Fuseki will still need to go through ALL items; this is unavoidable when sorting.

Yes, this is exactly what I just wrote above. We both know very well how Fuseki, or sorting in general, works; no doubt about that.

On the JS side, we will then receive a nicely ordered list of URIs

Yes, you get an ordered list in JS. It's better if it already starts from the point where you actually want to start checking ACLs: it saves you data in memory, and it also saves you from having to find that starting point yourself. That's all I was saying.

For the other optimisation (fetching all of the remaining items until the end of the list, or just a subset), it is for you to see whether you can handle thousands in memory, or whether it is better to requery in small batches (in case you didn't get enough).
I remember I always fetched double what I needed for a page in a previous software (Par le Peuple), and then if I didn't have enough, I would fetch again from the database. It is for you to see... as you say.
