OrderedCollection: persist items order #568
Comments
This would be important for performance! For an inbox with 6000 activities, and with WAC permission checks, it takes 25-30s to load the first page (and all other pages) because all activities must be ordered by the |
Some docs about RDF ordering: |
One possibility could be to persist the OrderedCollectionPages (SemApps does not persist them at the moment, they are auto-generated by the middleware). It would be faster to go to a specific page. The ActivityStreams spec has a long section about collections: |
Maybe it makes sense to distinguish here between RDF representation and implementation.
You mean in the redis cache for example? Spinning this further, we could even build a custom index (for at least some predicates) in the middleware to handle this but it's not really pretty either... |
The problem is not indexing. Fuseki handles indexing very well. The problem is that, by definition, a triplestore cannot keep track of the order of a list, so we need to rely on the
No I was suggesting to persist it in the triple store. Right now when we fetch a collection, we build "on the fly" the special API for collections, such as I'm not sure however it's a good idea, because it would be a lot of work to "rearrange" the triplestore whenever there is a change. For example the So with this solution, the GET may be faster, but the POST may be longer. This is not a big problem, but I'm more afraid of the kind of bugs that could come out of this. |
FYI I found out that the LDP paging specs allow some sort of ordering, but it is not persisted: rather, it is only a way for a server to inform the client of the way the resources have been ordered when paging is requested. See the example of what can be added to a container:

    <> ldp:pageSortCriteria ( <#Sort-o.marketValue-Ascending> ).
    <#Sort-o.marketValue-Ascending>
      a ldp:pageSortCriterion;
      ldp:pageSortOrder ldp:Ascending;
      ldp:pageSortPredicate o:marketValue.

That looks a bit like what we implement now for ActivityStreams ordered collections (like in this example). If we applied this part of the LDP paging spec, we would certainly have the same kind of performance issues. Anyway, if LDP cannot handle ordered lists, then it means we probably can't consider ActivityStreams collections as another "representation" of LDP containers, as @elf-pavlik suggested in our recent call. Or at least not ordered collections (used for inboxes and outboxes). |
We will have to be very careful on this topic. I am planning to do some major work in NextGraph to have ordered collections (arrays) work well, performance-wise, and also be compatible with CRDT. But for now, we do not have that.

My first question is about this query that takes 30s. This is not normal and can probably be optimised. The first problem I see is that you fetch all the triples in memory, and order them in JS. This is suboptimal.

Then the discussion about pagination is not an easy one. I started to talk about it in Matrix with Laurin and we have to understand exactly what the use cases are, because it has an impact on how we implement it and what kind of features we can skip in order to improve perf. As far as I understood, we need to be able to add an item at the beginning of the list, and at the end of it, but not in the middle. Please confirm or debate that here, as it is important to know the answer now.

Another question: do we need to randomly access arbitrary page numbers "out of nowhere"? like

Then there is the question of how to retrieve the pages. They can be generated in memory on the fly (on request), or they can be persisted (I previously came up with this idea), but less easily if we can insert items at the beginning of the collection (it will create a first page that is not full, which is not very nice in terms of UX). And it is not possible at all if we want to insert random items in the middle of the list. If we can persist the pages, then we do not need to persist the individual order of each item, as this is easily fetched by knowing the starting item of a page and querying the next X ordered items (X being the page size). Jena will do that easily. Please note that Jena has no support for

Please detail below the use cases (Inbox, Outbox, other ordered collections) and the requirements for them (insert at the beginning, at the end, in the middle, and the rationale behind those requirements). |
in fact, I was suggesting that ;) |
Hi @niko-ng, thanks for your comment! Just a few quick answers, in the hope they help you. I'll read your comment more thoroughly later, in case I missed some important information.
No, we don't sort them with JS. The SPARQL query that takes 30s is here: semapps/src/middleware/packages/activitypub/services/activitypub/subservices/collection.js Lines 264 to 283 in 83e7a7e
OK, but then this would be something internal to NextGraph ? Because if we want to conform to ActivityPub specs, the published date must be a ... date.
Well the specs about OrderedCollections do not specify that, so in theory we should be able to add items anywhere we want. But in ActivityPub, OrderedCollections are used only for inboxes and outboxes (and I don't know of other usages in the fediverse, but I'm not an expert), so naturally it should be added at the beginning or the end (the most recent activities should appear on the first page, but we could use a reverse order).
Collections only need to implement
Do we really need special SPARQL queries here ? If we want the items 20 to 30 in a |
I'd say so too. For removing items though, it's not the same. E.g. if I undo a
So we also get the problem of filtering (which we have already though).
I would say so too. Though I see a use-case where you want to see all posts that were added before e.g. 2023.
I don't think so, you might want to apply a filter before though, as described above. Can we reduce the problem to applying filters and then scrolling from there sequentially? |
My bad. If you use SPARQL for ordering, all the better! But indeed, the 30 seconds come from the fact that SPARQL cannot index dates. You would store both a date and a timestamp, and return only the date to ActivityPub-compliant APIs. The timestamp would just be used internally. Or if you want to save space, you would not save the dateTime, and instead regenerate it from the timestamp at the moment of outputting the data towards ActivityPub systems. Nothing specific to NextGraph here. I tried to stay away from anything specific to NextGraph in my answer, because I know it concerns a task that is starting now. |
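To make the "store a timestamp, regenerate the dateTime on output" idea concrete, here is a minimal sketch. The helper names and the choice of epoch milliseconds are assumptions made for illustration; nothing here comes from the SemApps codebase.

```javascript
// Hypothetical helpers (not part of SemApps): keep an integer timestamp for
// internal sorting and rebuild the xsd:dateTime only when serializing.

// Convert an ISO 8601 / xsd:dateTime string to epoch milliseconds.
function toTimestamp(isoDateTime) {
  const ms = Date.parse(isoDateTime);
  if (Number.isNaN(ms)) {
    throw new Error(`Invalid dateTime: ${isoDateTime}`);
  }
  return ms;
}

// Rebuild a dateTime string (always UTC) from the stored timestamp.
function toDateTime(timestamp) {
  return new Date(timestamp).toISOString();
}

const stored = toTimestamp('2023-05-01T12:30:00Z');
console.log(stored, toDateTime(stored));
```

Note that the round trip normalizes the value to UTC with millisecond precision, so the original timezone offset is not preserved if only the timestamp is kept.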
If we have the choice, then it is better to add at the end (and reverse the order if needed), because adding at the end is always simpler for managing pages. The question of "insert in the middle" might occur if, for example, activities are added to the inbox by a remote server that was offline for some time (for some technical reason) and that will add "outdated" activities, that have an |
Yes we can, but each "fetch" will be inefficient. Eventually, it will be good to know if we really need "random access to pages" or if we can use only next and previous. We could also use a special kind of URL for next and previous that in fact contains the URI of the last resource in the current page (for next) or the first resource of the current page (for previous), so that it will be easy to find the next/previous items and build the page on the fly (like GraphQL is doing, as you say). This mechanism also works if some concurrent inserts or deletes occur while navigating from one page to another, so that the page size is always correct. And it works without persisting any metadata about the pages. |
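The next/previous mechanism described above can be sketched as follows. This is only an illustration over an in-memory, already sorted array of item URIs; `getPage` and its option names are hypothetical, and a real implementation would need a fallback when the cursor item itself has been deleted.

```javascript
// Hypothetical cursor-based paging: the "next" link carries the URI of the
// last item of the current page, the "prev" link the URI of the first one.
function getPage(sortedIds, { after, before, pageSize = 10 } = {}) {
  let start = 0;
  let end = Math.min(pageSize, sortedIds.length);

  if (after !== undefined) {
    const idx = sortedIds.indexOf(after);
    if (idx === -1) throw new Error(`Unknown cursor: ${after}`);
    start = idx + 1;
    end = Math.min(start + pageSize, sortedIds.length);
  } else if (before !== undefined) {
    const idx = sortedIds.indexOf(before);
    if (idx === -1) throw new Error(`Unknown cursor: ${before}`);
    end = idx;
    start = Math.max(0, end - pageSize);
  }

  const items = sortedIds.slice(start, end);
  return {
    items,
    // Cursors are item URIs, not page numbers, so they stay valid when
    // other items are inserted or deleted between two requests.
    next: items.length > 0 && end < sortedIds.length ? items[items.length - 1] : null,
    prev: items.length > 0 && start > 0 ? items[0] : null
  };
}

const firstPage = getPage(['a', 'b', 'c', 'd', 'e'], { pageSize: 2 });
console.log(firstPage);
```

Because the cursor is an item URI, removing `'a'` between two requests still yields `['c', 'd']` for `after: 'b'`, whereas an offset-based page 2 would have shifted.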
Yes, that would be great. It also solves the "before 2023" question. You just need to find the starting point, and then iterate pages with
Filtering can also be quick, especially if the ordering is based on long integer timestamps, as explained above. The 30 sec. you see for now is because Jena doesn't keep indexes on DateTime values. |
BTW, I just saw that OxiGraph automatically converts DateTime datatypes to long in the storage, and back to DateTime when outputting. This is smart because indexing on long is done automatically. I remember that Jena does not do that and does not index DateTime. @srosset81 and @Laurin-W would you have a moment to answer my questions above so we can have a final decision on this topic? |
About servers coming back online and synchronizing that with your collection -- I think that makes sense, if you want to have a chronological order. I'd say we can live without that, if it makes things easier but it's definitely a nice to have!
Agree, I don't think we need that. But filtering would be nice, as detailed above.
I like that idea! :)
Nice! :D |
Thanks for your reply @Laurin-W So if I summarise, the solution could be :
|
Perfect, that sounds good! |
I only skimmed over the details of this conversation. When AS paged collections were drafted, besides parallel work on LDP pagination, there was also an exploration of pagination in Hydra CG https://www.w3.org/community/hydra/wiki/Collection_Design#Pagination The one in Hydra was my favorite at the time. BTW you might consider submitting your pagination, ordering and filtering use cases to the Linked Web Storage WG https://github.com/w3c/lws-ucs |
Thanks @elf-pavlik for looking into this issue and finding it worthy to be presented to LWS WG. BTW, while you are here, I want to ask you your opinion: |
Thanks @niko-ng for your proposal and sorry for my late reply. So I understand that if we store the
This is actually persisted in the collection, with the
If the performance can really be improved with this proposal, the only thing that bothers me is that we will not conform with the ActivityPub specs, which expect a list, as we can see in the official JSON-LD context. I found this thread again, where people criticize this choice of not persisting the order. It does not have an impact for ActivityPub servers, which don't care about how you store data. But we will still store data in a way that is particular to SemApps. It's a choice we can make, but it's important to be aware of this downside. If the performance is much better with Niko's proposal than if we persist the order, then I'd say performance matters more than standards compliance in that case. |
Looking at Hydra's I don't find accessing a random page/part based on its index useful. Filtering the collection before paging it would be more practical, e.g., all the activity from the last month. I still haven't looked at your issue about filtering |
Yes. Those
And I was just saying that in their example, they put URLs that have a query parameter
But yes, generally I agree that the Hydra and AS ontologies for pagination are very similar. And honestly I don't know which one is best. |
Let me add another alternative next to Hydra and AS: I proposed the TREE specification for that use case: the server has a fixed pagination and the client needs to follow links. However, these links are explained: i.e., you can follow a link to a page that is indicated to contain for example, entities later in time. This way you can build search trees instead of just providing traversal in 1 or two directions with next and previous, and the client can understand a sense of ordering. |
Hi guys,
After discussing that issue with @srosset81, I did a "quick" benchmark to see if using a timestamp had any impact on the performance of the query in question. Sadly, I found it didn't... Here's the methodology I used:
The results were consistent: around 7 seconds for all cases alike.
The same query without checking the WAC (webid system) is around 110ms:
We reviewed the benchmarking process with @srosset81 this morning and found no fault. the code of the benchmark is on a branch of the activitypods repo : https://github.com/activitypods/activitypods/tree/benchmark-query |
Hello :) For the time being, I can only advise you to try not using the WAC if you want reasonable perf.
If you can try this quickly, I think the WACs will have less impact on perfs if done like this. Anyway thank you very much for your efforts in solving this issue! |
And in general I am kind of surprised to see such bad perfs with WAC. @srosset81 are those the kind of perfs you are used to in general when using WACs? for 1000 records, it shouldn't be that bad |
One idea we had with @SlyRock is indeed to run this long SPARQL query without WAC. Basically, it would return all activities sorted by date, even if the logged-in user is not allowed to see them. Then we go through these activities one by one, ignoring errors thrown for activities the user is not allowed to see, until we have the requested number of activities. The performance would be much better, except in the case where a user fetches a collection containing thousands of activities but is only allowed to see a few of them. In this case, the performance would probably be much worse. It's difficult to evaluate which scenario is the most probable. The other problem is that we will not be able to give the number of activities in the collection, since we don't know how many of them can be seen by the logged-in user without running the long SPARQL query. But that's a problem we will probably also have if we use
Unfortunately I don't think it's possible at the moment, since an RDF document cannot have subjects which are not the same as the document URI. :( At least not until ActivityPods 3.0 ... That would be a good idea though, could it be done in another way? |
I'm not surprised. As soon as there are many permissions to check, queries can take several seconds to run. And in the worst cases, it can be several minutes. @simonLouvet can tell you about it. |
It is not normal. |
@niko-ng I started with a fresh install, created 2 pods, ran this service's only action through moleculer console to send Notes from one pod to the other. So I have one pod with a 1000 activities outbox and the other one with smaller inbox cause I interrupted the script before the queued tasks were all handled. I then ran the benchmark on the 1000 activities outbox using this service only action through the moleculer console. Don't hesitate to ask if it needs clarification :) |
@mguihal to look date ordering |
On further thought, we will also have the performance problems mentioned above if we use
With that perspective, I would suggest running the long SPARQL query without WAC (with
There is still the issue of pagination: if a collection has 100 items, and the first page returned items N°4, 8, 15, 28 and 35 (if we have 5 items per page), when loading the second page we want to make sure we start looking AFTER item N°35. So a pointer will be necessary instead of the
As for @niko-ng @Laurin-W Does it sound good to you ? @SlyRock will need to work on this soon so it's important we make a decision. |
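The proposal above (sort everything without WAC, then walk the list and keep only readable items, resuming from a pointer) could look roughly like this. It is a sketch under stated assumptions: `fillPage` and `canRead` are hypothetical names, the permission check is synchronous here for brevity whereas a real check in the middleware would be asynchronous, and the example reuses the N°4, 8, 15, 28, 35 scenario.

```javascript
// sortedUris: every item URI of the collection, sorted, fetched WITHOUT WAC.
// canRead: hypothetical permission check (true if the user may see the item).
function fillPage(sortedUris, canRead, { afterUri, pageSize = 5 } = {}) {
  let start = 0;
  if (afterUri !== undefined) {
    const idx = sortedUris.indexOf(afterUri);
    if (idx === -1) throw new Error(`Unknown pointer: ${afterUri}`);
    start = idx + 1; // resume strictly AFTER the pointer item
  }

  const items = [];
  let i = start;
  for (; i < sortedUris.length && items.length < pageSize; i++) {
    if (canRead(sortedUris[i])) items.push(sortedUris[i]);
  }

  // The pointer for the next page is the last item actually returned.
  // It is null when the end of the list has been reached.
  const nextPointer =
    items.length > 0 && i < sortedUris.length ? items[items.length - 1] : null;
  return { items, nextPointer };
}

// 100 items, of which the user may only read a handful.
const uris = Array.from({ length: 100 }, (_, n) => `item-${n + 1}`);
const readable = new Set(['item-4', 'item-8', 'item-15', 'item-28', 'item-35', 'item-42']);
const page1 = fillPage(uris, uri => readable.has(uri));
const page2 = fillPage(uris, uri => readable.has(uri), { afterUri: page1.nextPointer });
console.log(page1, page2);
```

With this shape, a page that follows a non-null pointer can still come back empty when none of the remaining items are readable, which is the worst case discussed earlier in the thread.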
Still think a search tree design would be more efficient than a list of pages that can be traversed in one (or two) directions. It would also be a more generic solution: with TREE you can describe the links to the child nodes based on a comparison of entities, and thus allow for any of the above potential solutions without having to hard code them in a client implementation. |
TREE is indeed very elegant, but in this specific case, we would need to persist a different The major issue we are trying to solve right now, is related to the latency of the permission system in SemApps at the moment. One thing to take into account: When you will list the items in the Collection, you will want to order them by timestamp (as I explained above), filter them so they are all after the |
Indeed, we need to stay compatible with ActivityPub, so using TREE would mean persisting all data twice. @pietercolpaert Maybe this is an issue to be raised in the ActivityPub community ? (Evan Prodromou, one of the core AP editor, has been working on improving/fixing the specs in the past few months)
Since we have seen the performances were not much improved by using a timestamp, and since NextGraph/Oxigraph will handle this natively (as you explained), I would not bother about creating timestamps.
Please remember that sorting cannot be done in a partial subset: we are forced to go through the whole list to find the full order. Once we have the full sorted list, we can probably ask items one by one, discarding items we don't have the right to read, until we have the right number of items. I don't think the performances will be much improved by doing "batch requests", and it will be more complicated to check permissions with Moleculer (we must keep in mind that very soon all permissions checks will be done on Moleculer side). |
I meant timestamp or whatever you use for sorting ;) Not specifically integer timestamps. Sorry if it wasn't clear. So you are saying that you prefer to make a query that returns ALL the items in the Collection? Even if this collection is really big? And then take that list in memory, find the right item to start with, and then iterate and try to fill the page? I thought you would at least give a filter to SPARQL so it will return only the items after the
As for the optimisation, whether to ask for all items until the end (or the beginning) or to go by chunks and repeat, I don't know which is best; you would have to try. But I am just afraid of the case where there are millions of items and you have to keep them in memory until you fill your page |
Indeed, in the case where we have a pointer, we can add a filter to the SPARQL query to immediately discard items that are before or after the publication date of the given pointer, so that we reduce the number of returned items. But Fuseki will still need to go through ALL items; this is unavoidable when sorting. On the JS side, we will then receive a nicely ordered list of URIs and we can go through the list after or before the URI of the pointer. Yes, we will need to keep the list in memory, at least for a brief moment, but we won't have millions of items; the maximum now is several thousand (I'm generally against over-optimization: I prefer to optimize when the performance problem arises). |
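As an illustration of adding such a filter, here is a hedged sketch that builds the query string. It is not the query from collection.js (which is not reproduced in this thread); the function name, the `publishedBefore` option and the use of `as:items` / `as:published` are assumptions.

```javascript
const AS_NS = 'https://www.w3.org/ns/activitystreams#';

// Hypothetical builder: lists the items of a collection, most recent first,
// optionally keeping only items published before the pointer's date.
function buildOrderedItemsQuery(collectionUri, { publishedBefore } = {}) {
  // Reject characters that are not allowed inside a SPARQL IRI reference.
  if (/[<>"{}|\\^`\s]/.test(collectionUri)) {
    throw new Error('Invalid collection URI');
  }

  let filter = '';
  if (publishedBefore !== undefined) {
    const ms = Date.parse(publishedBefore);
    if (Number.isNaN(ms)) throw new Error('Invalid pointer date');
    filter = `FILTER(?published < "${new Date(ms).toISOString()}"^^xsd:dateTime)`;
  }

  return [
    `PREFIX as: <${AS_NS}>`,
    'PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>',
    'SELECT ?item WHERE {',
    `  <${collectionUri}> as:items ?item .`,
    '  ?item as:published ?published .',
    filter && `  ${filter}`,
    '}',
    'ORDER BY DESC(?published)'
  ].filter(Boolean).join('\n');
}

console.log(
  buildOrderedItemsQuery('https://example.org/alice/inbox', {
    publishedBefore: '2023-01-01T00:00:00Z'
  })
);
```

The strict `<` comparison skips other items sharing exactly the pointer's publication date, so a real implementation would probably need a tie-breaker such as the item URI.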
yes this is exactly what I just wrote above. We both know very well how Fuseki or sorting in general works. no doubt about that.
Yes, you get an ordered list in JS. It is better if it already starts from the point where you actually want to start checking ACLs, so it saves you data in memory, and it also saves you from having to find that starting point yourself. That's all I was saying. As for the other optimisation at the end of the list (all of the remaining items, or just a subset), it is for you to see whether you can handle thousands in memory, or whether it is better to requery in small batches (in case you didn't get enough). |
Issue
Currently, to manage OrderedCollection, items are ordered according to a predicate (for example as:published). But normally the order of items is persisted.
Proposal
rdf:List to persist the order of items.
semapps:sortPredicate and semapps:sortOrder predicates from ordered collections