
[Bug]: Federation throughput in a synchronous manner is limited to network distance #4529

Closed
5 tasks done
ticoombs opened this issue Mar 13, 2024 · 60 comments
Labels
area: federation (support federation via activitypub) · area: performance · bug (Something isn't working)

Comments

@ticoombs
Contributor

Requirements

  • Is this a bug report? For questions or discussions use https://lemmy.ml/c/lemmy_support
  • Did you check to see if this issue already exists?
  • Is this only a single bug? Do not put multiple bugs in one issue.
  • Do you agree to follow the rules in our Code of Conduct?
  • Is this a backend issue? Use the lemmy-ui repo for UI / frontend issues.

Summary

Problem: Activities are processed sequentially, but many require external data that doesn't come with the request and has to be validated/fetched separately.
Server B -> A, says "here is an activity". That request can be a like/comment/new post. For a new post, Server A has to fetch the post's URL in order to show the post metadata (such as a subtitle or image).

Every one of these outbound requests that the receiving server makes is:

  • Sequential (every request must happen in order: 1, 2, 3, 4, ...)
  • Blocking. Server B, which sent a message to Server A, must wait for Server A to say "I'm finished" before sending the next item in the queue.
  • Subject to network latency (20ms to 600ms)
    • Australia to NL is 278ms (round trip 556ms)
    • NL to LA is 145ms (round trip 290ms)
    • I picked NL because it is geographically, and literally, on the other side of the world from Australia. This is one of, if not the, longest route between two Lemmy servers.

Actual Problem

So every activity that triggers a remote fetch delays everything behind it. If activities arrive at a rate of more than 1 per ~0.6s, the server physically cannot and will never be able to catch up.
As such, our decentralised solution to a problem now requires low latency. Without intervention this will effectively force every server to exist in only one region: EU or NA or APAC (etc.) (or nothing will exist in APAC, and it will make me sad).
To combat this we need to streamline activities and how Lemmy handles them.
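For intuition, a back-of-the-envelope sketch of that ceiling using the latency figures above (illustrative only, not Lemmy code):

```rust
fn main() {
    // One synchronous activity per round trip: AU <-> NL is ~0.556s round trip.
    let round_trip_secs: f64 = 0.556;
    let max_activities_per_sec = 1.0 / round_trip_secs; // ~1.8 activities/s
    println!("ceiling: ~{max_activities_per_sec:.1} activities/s from this sender");
    // Anything arriving faster than that accumulates in the queue forever.
}
```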

Steps to Reproduce

  1. Have a Lemmy server in NL send activities faster than 1 request every 0.6 seconds to a Lemmy server in Australia.
  2. New Post activities affect processing the most (they take the longest), so they are the best way to validate the PoC.

Technical Details

Trace 1:

Lemmy has to verify that a user is valid, so it connects to their server for information. AU -> X round trip (0.6s) + time for the server to respond = 2.28s, and that is all that happened.

- 2.28s receive:verify:verify_person_in_community: activitypub_federation::fetch: Fetching remote object http://server-c/u/user
- request completes and closed connection

Trace 2:

Similar to the previous trace, but after it verified the user, it then had to do another from_json request to the instance itself. (No caching here?) As you can see, of the 0.74s the server on the other end responded in a super fast fashion (0.14s); the handshake + travel time eats up the rest.

- 2.58s receive:verify:verify_person_in_community: activitypub_federation::fetch: Fetching remote object http://server-b/u/user
- 0.74s receive:verify:verify_person_in_community:from_json: activitypub_federation::fetch: Fetching remote object http://server-b/
- request continues

Trace 3:

Fetching external content. I've seen external servers take upwards of 10 seconds to report data, especially because whenever a fediverse link is shared, every server refreshes its own copy of the data. As such you basically create a mini-DoS when you post something.

- inside a request already
- 4.27s receive:receive:from_json:fetch_site_data:fetch_site_metadata: lemmy_api_common::request: Fetching site metadata for url: https://example-tech-news-site/bitcoin-is-crashing-sell-sell-sell-yes-im-making-a-joke-here-but-its-still-a-serious-issue-lemmy-that-is-not-bitcoin

Trace 4:

Sometimes a lemmy server takes a while to respond for comments.

- 1.70s receive:community: activitypub_federation::fetch: Fetching remote object http://server-g/comment/09988776

Version

0.19.3

Lemmy Instance URL

No response

@ticoombs ticoombs added the bug label Mar 13, 2024
@sunaurus
Collaborator

sunaurus commented Mar 13, 2024

Just collecting some further points from discussions on Matrix:

  • The core issue is that even if Lemmy did absolutely no processing for incoming events, a ping of ~300ms still limits us to ~3 incoming activities per second. This is right on the edge of what lemmy.world is generally sending out, so any kind of extra delays on top of this (or growth from lemmy.world) will guarantee falling behind.
  • Sending/receiving activities in parallel without any further logic (as was done before 0.19) can speed things up, but will make federation unreliable once again (as it was before 0.19) - different instances will be very likely to show different versions of posts, comments, votes, etc, as all edits and vote changes will not propagate in the correct order through the network.
  • Splitting up the queue on the sending side (for example, by the community where the activity is in) could improve the situation on the receiving side, but would add a lot of overhead on the sending side
  • Batching activities together would be ideal, as it would allow us to maintain correct ordering, but it appears that ActivityPub does not support such a thing at the moment
  • @phiresky mentioned the option of switching to unreliable federation (as in parallelized, without trying to guarantee the correct order of activities) whenever the sender detects they are falling behind - this is perhaps the quickest band-aid solution, with the trade-off of still creating opportunities for instances to go out of sync

@sunaurus sunaurus added the area: federation and area: performance labels Mar 13, 2024
@Nutomic
Member

Nutomic commented Mar 13, 2024

Here are some potential solutions.

  • Apparently our HTTP client doesn't keep connections alive. By enabling that and other options we can get rid of some overhead (see the sketch below). Relevant code and documentation.
  • Lemmy processes incoming activities immediately, which can be slow when additional data needs to be fetched. Mastodon and other platforms instead put received activities in a queue for later processing. We could consider doing the same.
  • A major fraction of all activities are probably votes. We could mark votes (or only comment votes) as low priority in the outgoing federation queue, and if federation falls behind, drop some of the vote activities. Similar to add option to disable & discard comment votes #4527, but it could probably be automatic.
  • Batching multiple activities in a single request would be ideal, but I'm not aware of any previous work in this direction. It would be good to open an FEP to discuss this and get input from other Fediverse devs.
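A minimal sketch of the kind of connection-reuse settings meant in the first point, using reqwest's ClientBuilder; the specific values are illustrative assumptions, not Lemmy's actual configuration:

```rust
use std::time::Duration;

fn build_federation_client() -> reqwest::Result<reqwest::Client> {
    reqwest::Client::builder()
        // keep idle connections to each instance around so follow-up activities
        // skip the TCP/TLS handshake
        .pool_idle_timeout(Duration::from_secs(90))
        .pool_max_idle_per_host(4)
        // TCP keepalive so long-lived connections aren't silently dropped
        .tcp_keepalive(Duration::from_secs(60))
        .timeout(Duration::from_secs(30))
        .build()
}
```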

@db0
Contributor

db0 commented Mar 13, 2024

A solution from brainstorming on Matrix to parallelize sending requests:

  • Pick up the last X requests needing to be sent to the target server, where X = 10 (with 10 adjustable), or perhaps more dynamically X = ceil(queue_size / 1000) (with 1000 adjustable). Currently X == 1.
  • For every request we pick up, split it into a different thread per parent post, or add it to the "parentless" queue. So let's say I pick these 10:

activity                     | parent     | ID
post A                       | post A     | 1
comment A                    | post A     | 2
vote A                       | post A     | 3
vote B                       | post B     | 4
ban user                     | parentless | 5
edit comment A (1st attempt) | post A     | 6
edit comment A (2nd attempt) | post A     | 7
edit comment B               | post B     | 8

We end up with the following sending queues

Queue A ids: 1,2,3,6,7
Queue B ids: 4,8
Queue C ids: 5

Now each queue, parallel to and independent from the other queues, follows the existing logic to send its items to the target instance. So queue A will send 1, then 2, then 3, etc. If any send fails, the queue aborts everything subsequent; so if queue A failed to send ID 3, it aborts, and 3, 6 and 7 remain in the overarching queue to send.

However the failure of 3 will not stop 4, 8 and 5 from going through.

On the next iteration of sending, 1, 2, 4, 8 and 5 are gone, so the next pass will pick up 3, 6, 7 to send, along with 7 other IDs, and split them into individual queues.

This would allow an instance to send 1-n requests in parallel without ever running the risk of sending them in the wrong chronological order.
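A rough Rust sketch of this splitting step (hypothetical types, not actual Lemmy code):

```rust
use std::collections::HashMap;

struct Activity {
    id: i64,
    parent_post_id: Option<i64>,
}

/// Key used to group activities: one queue per parent post, plus a shared "parentless" queue.
#[derive(PartialEq, Eq, Hash)]
enum QueueKey {
    Post(i64),
    Parentless,
}

/// Split a batch of picked-up activities into independent sending queues,
/// preserving the original order within each queue.
fn split_into_queues(batch: Vec<Activity>) -> HashMap<QueueKey, Vec<Activity>> {
    let mut queues: HashMap<QueueKey, Vec<Activity>> = HashMap::new();
    for activity in batch {
        let key = match activity.parent_post_id {
            Some(post_id) => QueueKey::Post(post_id),
            None => QueueKey::Parentless,
        };
        queues.entry(key).or_default().push(activity);
    }
    queues
}

// Each queue can then be sent concurrently; a failure at ID 3 in queue A only
// blocks 6 and 7 (same queue), never 4, 8 or 5 in the other queues.
```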

@Nutomic
Member

Nutomic commented Mar 13, 2024

@db0 In particular we can assign activities to a specific queue by post_id: queue_id = post_id modulo 10 (or alternatively using community_id).

@db0
Contributor

db0 commented Mar 13, 2024

You mean so that you have different posts per queue? Sure, that could work as well. I like the idea of using the exact post ID myself, for easier troubleshooting.

@phiresky
Collaborator

Apparently our HTTP client doesn't keep connections alive

Do you have a source on that? As far as I'm aware it does keep them alive, and thus for replication has one persistently open connection.

Other than that, I'll restate what I wrote on Matrix:

I think we can solve this in a fairly simple way without any robustness issues:

Instead of sending one activity, waiting for the 200 response, then sending the next activity, we can instead send multiple activities (sequentially down the same HTTP connection) and only start waiting for a response when 8 are currently in transit.

The only thing unclear to me is how this can be implemented. On tokio's side using a channel with .buffered(8) on the receiving end should work, but I'm not sure how to make reqwest send multiple requests while still ensuring they are in the same sequential connection.

This way, internet latency does not matter as long as the chosen parallelism (8) is greater than the internet latency divided by the internal processing time. E.g. if ping is 320ms and internal processing is 10ms, then 32 requests would need to be in the queue at once to not slow things down.

Note that this does not change anything about the internal latency of an instance; that would still be sequential per receiving instance. But it removes all the latency outside of the Rust process.
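A rough sketch of this idea using a futures stream with buffered(8); `Activity` and `send_activity` here are hypothetical stand-ins, not Lemmy's real types:

```rust
use futures::stream::{self, StreamExt};

struct Activity {
    id: i64,
    json: String,
    inbox: String,
}

async fn send_activity(client: &reqwest::Client, activity: Activity) -> reqwest::Result<i64> {
    client
        .post(activity.inbox.as_str())
        .body(activity.json)
        .send()
        .await?
        .error_for_status()?;
    Ok(activity.id)
}

/// Keep up to 8 sends in flight at once; `buffered` still yields results in the
/// original order, so the lowest fully-acknowledged id stays easy to track.
async fn send_pipelined(client: &reqwest::Client, activities: Vec<Activity>) {
    let mut results = stream::iter(activities)
        .map(|a| send_activity(client, a))
        .buffered(8);
    while let Some(result) = results.next().await {
        match result {
            Ok(_id) => { /* advance last_successful_id up to _id */ }
            Err(_) => break, // stop here; resume from this activity on the next attempt
        }
    }
}
```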

@Nutomic
Member

Nutomic commented Mar 14, 2024

Maybe we can use something like this to get notified when the entire activity is sent, and then send out the next one. However, that would require changes inside the federation library and exposing it in the public API. It would be easier, and probably sufficient in practice, to simply wait e.g. 100ms between requests.

@wereii
Contributor

wereii commented Mar 14, 2024

@phiresky
The only thing unclear to me is how this can be implemented. On tokio's side using a channel with .buffered(8) on the receiving end should work, but I'm not sure how to make reqwest send multiple requests while still ensuring they are in the same sequential connection.

Is this HTTP response streaming (Transfer-Encoding: chunked), or batching multiple activities into one request [{...}, {...}, ...]?

For chunked these seem to be relevant in Reqwest:
https://docs.rs/reqwest/latest/reqwest/struct.Body.html#method.wrap_stream
https://docs.rs/reqwest/latest/reqwest/struct.Response.html#method.bytes_stream

Also worth noting: HTTP/2 always streams bodies in its own frames (no chunked encoding needed), so the above applies only to HTTP/1.1.

@db0
Contributor

db0 commented Mar 14, 2024

but I'm not sure how to make reqwest send multiple requests while still ensuring they are in the same sequential connection.

That is exactly the problem with parallel requests in isolation, and it's non-trivial to solve, which is why I suggested the more robust split-queue approach.

@phiresky
Collaborator

Why is it non-trivial to solve? If the library provides an option to do it, it's trivial. The only problem is that by default it apparently uses HTTP/2 pipelining, which is not in-order.

@db0
Contributor

db0 commented Mar 14, 2024

if the library provides an option to do it it's trivial to solve

Does the library provide an option then?

@phiresky
Collaborator

phiresky commented Mar 14, 2024

Is this HTTP Response Streaming

It's HTTP pipelining:

(diagram of HTTP pipelining omitted)

The docs you sent seem to be about responses, but here we really only care about the request and the request body.

@db0
Contributor

db0 commented Mar 14, 2024

This helps you send multiple requests, not ensure they're processed in order, which is the non-trivial part.

@phiresky
Collaborator

They are guaranteed to be sent in order and received in order, so processing them in order should not be hard.

@db0
Contributor

db0 commented Mar 14, 2024

so processing them in order should not be hard

famous last words :D

I don't think HTTP pipelining ensures that the response you get will be 200 before sending the next request, only that there will be a response. How does it guarantee, for example, that a request with ID 1 won't take too long to process before ID 2 hits the target?

@phiresky
Collaborator

On the receiving side the requests can still be processed fully sequentially, no need for parallelism. Note this only solves the problem in this ticket, not the one you had with internal latency.

@db0
Contributor

db0 commented Mar 14, 2024

Very important, from the Wikipedia article:

(screenshot omitted: the article warns that non-idempotent methods such as POST should not be pipelined)

From the linked source:

(screenshot omitted)

If I'm not mistaken, the POSTs to the inbox are not idempotent. A connection loss while sending will cause issues.

@phiresky
Collaborator

phiresky commented Mar 14, 2024

It does look like reqwest doesn't really have options for this, probably since HTTP/1.1 pipelining is mostly seen as superseded by HTTP/2 multiplexing, which is better in most ways except that it doesn't guarantee any sequentiality. So we should probably just use normal HTTP/2 multiplexing (which should just work with .buffered()).

I think we can semi-ignore the sequentiality problem (e.g. just add a 10ms delay between sends) though, because:

If an instance is catching up and processing e.g. 100ms per activity, and we have a parallelism of 8, then while the first 8 events will be sent simultaneously and potentially processed out of order, after that each next request will wait for the OK response of the request 8 positions earlier, which means it will have an average delay of 10ms after the previous activity.

@db0
Contributor

db0 commented Mar 14, 2024

If you don't ensure sequential processing, you can end up with situations where previous edits or votes override newer edits/votes if they are in the same pipeline. Likewise, a connection loss in the middle of a request will cause issues which you can't resolve by resending, since the requests aren't idempotent.

@phiresky
Collaborator

Another alternative that's still simpler and lower-overhead than adding per-thread or per-community queues: add a field to each JSON activity (or an HTTP request header) with a sequential id. On the receiving side, add a priority queue (if the number is given) to which events are added; an event is processed immediately if its sequential id was incremented by 1, and with a max delay of 10s or so if it was incremented by more than 1.

@db0
Contributor

db0 commented Mar 14, 2024

Anything which extends the ActivityPub protocol will not be handled properly by other software.

@phiresky
Collaborator

It's a compatible change, so it doesn't affect anything else.

@db0
Contributor

db0 commented Mar 14, 2024

What do you mean by "compatible change"?

@phiresky
Collaborator

phiresky commented Mar 14, 2024

It just adds a header or JSON field that will be ignored if not understood.

pseudo code:

post(/inbox).handle(activity):
     queues[activity.instance].add(priority=activity.sequence_id, activity)

def queue_process():
     queue = queues["lemmy.world"]
     last_id = 0
     while true:
         if queue.peek != last_id + 1: sleep 10s
         process(queue.take())
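A minimal Rust sketch of that receiving-side buffer; the per-sender `sequence_id` is the hypothetical field proposed above, not something Lemmy or ActivityPub has today:

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;
use std::time::{Duration, Instant};

/// Reordering buffer for activities from one sending instance.
struct InboxQueue {
    heap: BinaryHeap<Reverse<(u64, String)>>, // (sequence_id, activity JSON), smallest id first
    last_processed: u64,
    waiting_since: Option<Instant>, // set while the head of the heap is out of order
}

impl InboxQueue {
    fn new() -> Self {
        InboxQueue { heap: BinaryHeap::new(), last_processed: 0, waiting_since: None }
    }

    fn push(&mut self, sequence_id: u64, activity_json: String) {
        self.heap.push(Reverse((sequence_id, activity_json)));
    }

    /// Next activity to process, or None if we should keep waiting (up to 10s)
    /// for the gap in sequence ids to be filled.
    fn pop_ready(&mut self) -> Option<(u64, String)> {
        let next_id = match self.heap.peek() {
            Some(Reverse((id, _))) => *id,
            None => return None,
        };
        if next_id != self.last_processed + 1 {
            let started = *self.waiting_since.get_or_insert_with(Instant::now);
            if started.elapsed() < Duration::from_secs(10) {
                return None; // hold back, the missing activity may still arrive
            }
        }
        self.waiting_since = None;
        let Reverse((id, activity_json)) = self.heap.pop()?;
        self.last_processed = id;
        Some((id, activity_json))
    }
}
```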

@db0
Contributor

db0 commented Mar 14, 2024

That's my point. If you rely on something that can be ignored, then non-Lemmy software (kbin, sublinks, mastodon, piefed, etc.) can experience significant issues, such as bad edits or missing content, when problems occur.

@phiresky
Collaborator

But I have to say: since completely ignoring activity sequence (in a much, much worse way than the thing above) is how all of Lemmy worked before 0.19, I don't think this minor sequencing issue is really going to cause trouble. On high-volume instances with a parallelism of around 10, it's very unlikely that a conflicting event lands within 10 events of another conflicting one, because you don't edit a post 10 times per second; and on less active instances it's also not going to be an issue, because the parallelism doesn't actually have to come into effect.

@db0
Contributor

db0 commented Mar 14, 2024

You can't know that though. We can't foresee all potential issues, but we know that this approach has potential pitfalls from simple connection problems. I don't see the point of refactoring to a method which includes known risks and does exactly what the designers of HTTP pipelining tell you not to do.

PS: I am known to hit the wrong vote and correct it within the same second. Sometimes I even set whole posts as stickies through fat fingers.

@Nutomic
Member

Nutomic commented Mar 25, 2024

  • Send activities in parallel with max-inflight = e.g. 10
    • This should work well with the persistent queue - the last_successful_sent activity id would just be the lower bound of activity IDs - in case of issues, the worst case would be to resend up to 10 duplicate activities.

This is actually a problem in #4559: it immediately updates last_successful_id when an activity is sent, regardless of whether the previous activity was successful or not. So I suppose all send results need to be collected in a single place, and last_successful_id only written to the db once all preceding sends have succeeded. This could be done by moving federation state writes into receive_print_stats(), storing the most recent sends in memory. However it gets tricky, as not every activity gets delivered to every instance.

  • Make all activities commutative - that is, make their order not matter.

We should definitely do this and use timestamps, because sending sequentially won't use all available server resources, so we need to be able to send in parallel. I don't think there is a general solution here; we need to handle the problems individually for each activity type (e.g. checking the edit timestamp like you say).

  7. On the sending side, create N queues per receiver and split activities by dependency chains (as suggested by db0 above)
  9. On the sending side, still have a single persistent queue per receiver, BUT create a buffer of N (e.g. = 20) activities. Send the HTTP requests in parallel, but for each affected community id within the buffer send the activities in sequence.

I don't see any difference between these. Anyway, it could be implemented by storing the community_id in the SentActivity table, and then assigning each worker tasks as community_id mod N (see the sketch below). I don't see a need for something more complicated here. As discussed above we could instead assign by post_id, so that a single large community can still be spread across multiple workers.
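A tiny illustrative sketch of that assignment (names and N are arbitrary):

```rust
/// Map an activity's community_id (or post_id) onto one of N send workers so
/// that activities sharing a key always land on the same worker and keep their
/// relative order.
fn assign_worker(key_id: i64, num_workers: i64) -> i64 {
    key_id.rem_euclid(num_workers)
}
```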

  • On the receiving side, create a receiving queue per sending instance, delay receives by 1-5s, and process them per instance by order of timestamp

This is implemented in #4559.

Also I moved metadata fetches to a background task in #4564.

@phiresky
Collaborator

phiresky commented Mar 25, 2024

Here's an implementation in pseudocode of what I wrote above in point (9). It would be great to get comments (especially from @Nutomic, since it's an alternative to their PR) before I implement it properly.

To recap, the advantage of doing this on the sending side is that the queue is guaranteed to be persistent - if done on the receiving side we would either need to add a second persistent queue to the database, or risk losing activities, since the 200 response happens before anything reaches a storage medium. Backpressure comes from limiting the number of in-flight requests.

/// For each community, we keep a queue of sends.
/// Sends happen in parallel for different communities, but in sequence for one community.
/// This is a map that tracks which community we are currently sending an activity for
/// and the dependent activities we can only send afterwards
struct PendingSendsMap {
    // the first entry in each map value is always the activity that is currently being sent
    map: HashMap<CommunityId, VecDeque<Activity>>,
    // total count of activities in all map values so we don't use unbounded memory
    total_count: i64,
}
impl PendingSendsMap {
    fn new() -> Self {
        PendingSendsMap { map: HashMap::new(), total_count: 0 }
    }
    /// Add an activity to the blocklist ONLY if there's already an activity send currently happening for that community
    fn push_if_has(&mut self, activity: Activity) -> bool {
        if let Some(entry) = self.map.get_mut(&activity.community_id) {
            self.total_count += 1;
            entry.push_back(activity);
            true
        } else {
            false
        }
    }
    fn pop_next(&mut self, community_id: CommunityId) -> Option<Activity> {
        if let Some(entry) = self.map.get_mut(&community_id) {
            let activity = entry.pop_front();
            assert!(activity.is_some()); // empty lists can't happen
            self.total_count -= 1;
            if entry.is_empty() {
                self.map.remove(&community_id);
            }
            activity
        } else {
            None
        }
    }
    /// how many activities we currently have in RAM
    fn total_in_ram(&self) -> i64 {
        self.total_count
    }
    /// how many activities are currently being sent out (regardless of how many blocked ones are held in RAM)
    fn total_in_flight(&self) -> i64 {
        self.map.len() as i64
    }
}
let blocked = PendingSendsMap::new();

fn sending_loop() {
    loop {
        while blocked.total_in_ram() >= 1000 || blocked.total_in_flight() >= 10 {
            wait_for_one_send_to_complete().await;
        }
        let Some(activity) = get_next_activity() else {
            // everything already sent, end loop.
            break;
        };
        if !blocked.push_if_has(activity.clone()) {
            // nothing in flight for this community yet: record this activity as the head of a
            // new queue (so later activities for the same community queue up behind it), then send it
            blocked.map.insert(activity.community_id, VecDeque::from([activity.clone()]));
            blocked.total_count += 1;
            spawn(send_task(activity));
        }
    }
}

fn send_task(activity) {
    retry_loop { // as before
        http_request(activity).await;
    }
    // success! remove ourselves from the front of our community's queue
    let act = blocked.pop_next(activity.community_id);
    assert!(act.as_ref() == Some(&activity));
    // if more activities queued up for this community while we were sending, the new head
    // is next; peek (don't pop) so it stays the front until it completes
    if let Some(next) = blocked.map.get(&activity.community_id).and_then(|q| q.front()).cloned() {
        spawn(send_task(next));
    }
    signal_send_complete(); // can be done with an mpsc channel sending `()`
    // todo: keep track of last_successful_id
}

Note that in real code I would not use community_id anywhere, and would instead use an abstract "queue id", which can for now be community_id but may later be changed to coalesce(affected_post_id, affected_community_id) when appropriate, for more parallelisation.

@phiresky
Collaborator

phiresky commented Mar 25, 2024

I don't see any difference between these [(7 and 9)]

The difference is only in what state is persisted and needs to be kept track of. I imagined (7) as having fully separate persistent queues per community or per post, which means the federation_queue_state table would grow from 1k entries to 100k+ entries, with 100k+ tasks reading and writing to that table. (9) on the other hand keeps one persistent queue per instance, but within each one creates tiny in-memory queues with a bounded total size (e.g. 100 in the above code). On crashes, the in-memory queues are lost and sending restarts from at most 100 events earlier.

@phiresky
Collaborator

phiresky commented Mar 25, 2024

This is actually a problem in #4559: it immediately updates last_successful_id when an activity is sent, regardless of whether the previous activity was successful or not

If we don't care about send order at all (as you do in your PR), this can be solved with a tokio Stream with .buffered(N). That way N futures from the stream run in parallel, but the results retrieved from the stream are still guaranteed to be in order.

For my above implementation, I'm not sure yet how to get the lowest successful id cleanly.

@Nutomic
Member

Nutomic commented Mar 25, 2024

I'm not entirely sure how your BlockListMap would work. Instead of that, it would also be possible to change the inflight_requests variable in my PR from an int to a Vec, then when sending an activity, check if there is already a request in flight for this community and wait for it to finish.

Also it's probably a bad idea to specify the number of send workers per target instance. I think it makes more sense to have a pool of send workers, so you can configure the total number of concurrent outgoing requests (for all target instances). Then more workers can be assigned to a larger target instance when necessary. This could be implemented by having InstanceWorker only loop through SentActivity to determine which activities need to be delivered, and pass them to the send worker pool via an UnboundedChannel. Finally each send worker passes the result to receive_print_stats(), where it gets cached and then written to the db. That way database reads and writes are each done sequentially in a single place, and only the network requests are actually in parallel.
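A minimal sketch of that shared cap on concurrent outgoing requests, using a single tokio Semaphore; the names and the limit of 50 are illustrative assumptions, not the actual implementation:

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;

async fn send_with_global_limit(
    permits: Arc<Semaphore>,
    client: reqwest::Client,
    inbox: String,
    body: String,
) -> reqwest::Result<()> {
    // Waits here if 50 sends are already in flight anywhere in the process.
    let _permit = permits.acquire_owned().await.expect("semaphore closed");
    client.post(inbox).body(body).send().await?.error_for_status()?;
    Ok(())
}

// Created once at startup and cloned into each send task:
// let permits = Arc::new(Semaphore::new(50));
```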

@phiresky
Collaborator

then when sending an activity, check if there is already a request in flight for this community and wait for it to finish.

That alone doesn't work, because you don't just need to wait for the current in-flight activity to finish, you also need to wait for the other not-yet-in-flight activities that are already waiting. So you need a Vec per inflight_requests entry. That's what the BlockListMap is: a map from community id to the list of waiting requests for that community.

I think it makes more sense to have a pool of send workers

Maybe, but then how exactly do you assign the numbers? How would you prevent a huge instance from starving out smaller instances?

@phiresky
Collaborator

phiresky commented Mar 25, 2024

Also remember that it's not really "workers"; rather there's a single permanent task per instance, and then each individual HTTP request is one task. The concurrency is limited per instance, but for instances that are up to date nothing is actually running apart from the permanent task.

and pass them to the send worker pool via an UnboundedChannel

It definitely can't be unbounded, because then you'll just load millions of messages into memory again if an instance is out of date.

@Nutomic
Member

Nutomic commented Mar 25, 2024

Does it really make sense to have the per-community queue on top of the per-instance queue? That seems like two kinds of parallelism stacked on top of each other. So maybe it would make sense to replace the current InstanceWorker with a CommunityWorker that does essentially the same thing, per community (or per post). Though it would also require a way to clean up inactive senders.

@phiresky
Collaborator

phiresky commented Mar 25, 2024

That's what I meant with variant 7 vs 9: #4529 (comment)

What you describe is not really possible because:

  • Each instance worker has a pretty high overhead and expects to be fairly long-lived (especially for reading and writing its state to the database)
  • It is also what decides, by its id, in which horizontally scaled process it runs. If workers get added and removed often, that's more of a mess.
  • Each instance worker expects to care about a significant portion (>50%?) of activities. Every worker has to look at every single activity, even if the receiving instance doesn't want it, because filtering at the database level here is complicated and conflicts with caching. If you split by community, this number drops to <1% and there would be huge overhead.

In addition, we probably do want to limit requests per instance and not per community, which is not possible if each community were handled fully separately.

@phiresky
Collaborator

phiresky commented Mar 25, 2024

As an alternative to the tiny in-memory queues per community, here is another option I thought of:

buffer = [] // the next 100 activities we might want to send
currently_sending_communities = Set()

every time a send succeeds (or every time interval):
    linear search the buffer for elements that are not in currently sending communities.
         remove activity from the buffer, add to the sending set, and send it

But I revised it to the Map<id, List<Activity>> because it feels better

@Nutomic
Member

Nutomic commented Mar 25, 2024

Honestly I don't think sending all activities sequentially is the solution. Even a single community can become so large that processing it sequentially is too much for a small remote instance, especially when there is another spam attack, or when the target instance needs to catch up after some downtime. We could further split by post_id, but even that may not be enough.

The idea of my PR is that things don't need to be in order during sending, but can be put into the correct order by the receiving side. This has unlimited scaling potential simply by increasing the number of parallel workers, and is also much easier to implement.

@phiresky
Collaborator

phiresky commented Mar 25, 2024

I kinda agree, and it's what I thought too before I read your PR code and realized all those complexities, and that it's not really simpler. You still need a solution for what happens during a server restart or crash so that no activities are lost, and right now you have the same sequential processing per community due to the per-community queue, plus the issue you otherwise have with activities being processed out of order. To fix that you (I think) need the same data structure I made above (it could also be on the receiving side, but that's not really any better).

@phiresky
Collaborator

phiresky commented Mar 25, 2024

Maybe the way forward is to implement only the fully-parallel sending (limited to N=10 in flight per instance) and go through every activity type to make sure we can make them commutative (basically just skip activities that have already been overwritten; points 5+6 from my list above).

@phiresky
Collaborator

We could further split by post_id, but even that may not be enough.

I kinda disagree with this part though. We can easily move to post_id (community_id is only needed for non-post-related actions) and it's very unlikely that we will ever need more. Even if we do, using coalesce(comment_thread_id, post_id, community_id) would still be a possible later change.
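A tiny sketch of that coalescing queue key (hypothetical field names, illustration only):

```rust
struct ActivityRouting {
    comment_thread_id: Option<i64>,
    post_id: Option<i64>,
    community_id: i64,
}

/// coalesce(comment_thread_id, post_id, community_id): activities that share a
/// key stay ordered relative to each other; everything else can go in parallel.
fn queue_key(a: &ActivityRouting) -> i64 {
    a.comment_thread_id.or(a.post_id).unwrap_or(a.community_id)
}
```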

@Nutomic
Member

Nutomic commented Mar 25, 2024

I kinda agree, and it's what I thought too before I read your PR code and realized all those complexities, and that it's not really simpler. You still need a solution for what happens during a server restart or crash so that no activities are lost, and right now you have the same sequential processing per community due to the per-community queue, plus the issue you otherwise have with activities being processed out of order. To fix that you (I think) need the same data structure I made above (it could also be on the receiving side, but that's not really any better).

Not true, my PR doesn't have any per-community queue. And handling restarts or crashes works just like before, using last_activity_id (I still need to find a way to update that correctly). If an instance is crashed or restarted it will resend some of the same activities again, but those duplicates already get rejected by Lemmy when receiving, so it's fine. And activities are put in the correct order on the receiving side.

The only problem now is with configuration: having 5 workers per instance could mean 10 instances * 5 workers = 50 workers, or could be 1000 * 5 = 5000. So a shared worker pool for all instances would be better, so that you can configure exactly 50 workers to be active at any time. Anyway, the PR implementation works and is better than what we have now.

Maybe the way forward is to implement only the fully-parallel sending (limited to N=10 in flight per instance) and go through every activity type to make sure we can make them commutative (basically just skip activities that have already been overwritten; points 5+6 from my list above).

I agree this is better than going by published timestamp. It doesn't require a complicated queue and works even if an activity is out of order by more than a second.

I kinda disagree with this part though. We can easily move to post_id (community_id is only needed for non-post-related actions) and it's very unlikely that we will ever need more. Even if we do, using coalesce(comment_thread_id, post_id, community_id) would still be a possible later change.

Sure it's possible, but I believe it would result in much more complex code than the above. And I don't see any benefit that would justify the extra complexity.

@phiresky
Collaborator

phiresky commented Mar 26, 2024

Not true, my pr doesn't have any per-community queue

Sorry, I misspoke: you need a per-something queue. In your case you added a per-instance queue on the receiving side, which means you solve the network latency problem (this issue, point 2 above) but not the internal latency issue (the one db0 had, point 3 above). If you wanted to solve 3, you'd have to split the receiving queue further while still keeping sequentiality somewhere.

And handling restarts or crashes works just like before

I was talking about crashes or restarts on the receiving side. You respond with 200 before actually having processed anything, which means the sender thinks everything has been delivered and will never resend it; if the receiver crashes with many activities only in memory, those are lost. That's why I don't think any of those receiving-side queues should be added.

Sure it's possible, but I believe it would result in much more complex code than the above. And I don't see any benefit that would justify the extra complexity.

My main point is that you already need that in your PR if you want to solve all the problems we have, since compared to 0.19 you don't actually remove any sequentiality, just network latency, whether on the sending or receiving side.

@Nutomic
Member

Nutomic commented Mar 26, 2024

Right, I also think that the receiving queue is not such a good idea; it's better to change the individual activity handlers to check timestamps, so that older edits or votes are ignored. Then the sending side should work in parallel without problems.

dessalines pushed a commit that referenced this issue Mar 27, 2024
* Generate post thumbnail/metadata in background (ref #4529)

* fix api test

* Apply suggestions from code review

Co-authored-by: SleeplessOne1917 <28871516+SleeplessOne1917@users.noreply.github.com>

* fix test

---------

Co-authored-by: SleeplessOne1917 <28871516+SleeplessOne1917@users.noreply.github.com>
@Nutomic
Member

Nutomic commented Apr 2, 2024

Thinking about this more, I don't think it's possible to handle all activities correctly when they are received in the wrong order. It's fine for post or comment edits, as we can read the existing post and compare timestamps. But imagine you upvote a post and then immediately undo the vote. If another instance receives them in the wrong order, it would get the Undo/Vote first and then the Vote. The undo would do nothing, as the corresponding vote doesn't exist yet; then the vote gets received and stored in the database. So there will be one vote counted wrong, and I think this action is rather common. The same would happen if you remove a post and immediately restore it, add a mod and immediately remove him again, or sticky and then unsticky a post. Handling that for each activity type separately would get too complex, so I don't think there is any other way than a time-based queue for incoming activities.

@phiresky
Collaborator

phiresky commented Apr 2, 2024

it would be Undo/Vote first and then Vote.

Right now maybe, but those could be changed, right? For example, PostLike::remove currently deletes the row, but it could be changed to update(post_like).where(new_published.gt(published)).set(score.eq(0)).set(published.eq(...)). This would mean unvotes remain in the DB, but that shouldn't be an issue.

The same goes for post remove + restore. A post update can't use the published field, but it can use the updated field (or a new field): a post removal would only go through if the published timestamp of the Remove activity is larger than the updated timestamp of the post.

I would consider these part of what I meant above by making them commutative. It doesn't seem too complex to me, though it would probably require thinking through every case individually in detail. It seems like it should be possible to get through all or most of them by just comparing timestamps.
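A minimal sketch of this timestamp-based merging for votes (hypothetical types, not Lemmy's actual schema):

```rust
// Keep whichever vote state was published last, so Vote and Undo/Vote can be
// applied in either order and still converge to the same result.
#[derive(Clone, Copy)]
struct VoteState {
    score: i16,     // +1, -1, or 0 for "undone"
    published: i64, // unix timestamp of the activity
}

fn merge_vote(existing: Option<VoteState>, incoming: VoteState) -> VoteState {
    match existing {
        // the stored state is newer (or the same age): the incoming activity is stale, ignore it
        Some(current) if current.published >= incoming.published => current,
        // otherwise the incoming activity wins
        _ => incoming,
    }
}
```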

@Nutomic
Member

Nutomic commented Apr 2, 2024

You're right, that should work.

dessalines pushed a commit that referenced this issue Apr 10, 2024
* Ignore old federated post edits (ref #4529)

* use filter on insert

* coalesce(updated, published)

* avoid comment conflict clause

---------

Co-authored-by: SleeplessOne1917 <28871516+SleeplessOne1917@users.noreply.github.com>
@Fmstrat

Fmstrat commented Jun 8, 2024

Two questions (hopefully not dumb):

  • Why can't there be batch queues? If 12 comments are marked to federate, do up to n at once? Reducing overhead is a common way to solve these types of problems when ordering is important. I'm guessing this has been considered before. It may not be standard AP, but there's no reason supporting instances can't interact in nonstandard ways.

  • If ordering is the problem, why not add a parent key to activities? Then if an activity comes in with a parent, but that parent doesn't yet exist, it can be placed in a local queue until the parent arrives. This allows for parallel processing over the wire, and garbage collection if something sits for say, 48 hours.

@Nutomic
Member

Nutomic commented Sep 9, 2024

Fixed in #4623

@Nutomic Nutomic closed this as completed Sep 9, 2024