V4 Job Completion Duration Performance Hit #338
Hi, thanks for reporting this! Can you provide some additional info to help us track down the source of this regression?
There are a few reasons behind the change. One of them is that, yes, there is a lot of overhead for a high-throughput system in doing a

I have a hunch that the specific number of clients and insert rate in this example is basically at odds with the chosen default debounce timer in the app-level notifier. Hopefully it's something we can trivially improve by tuning a couple of internal settings, or worst case by exposing a new setting.
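For illustration only, here's a rough sketch of what a debounce timer in a notifier looks like: a burst of insert notifications gets coalesced into a single wakeup for the listener. The type, channel names, and 100 ms window below are hypothetical and are not River's actual internals.

```go
package main

import (
	"fmt"
	"time"
)

// debounceNotifier coalesces rapid-fire notifications so that downstream
// listeners are woken at most once per debounce window.
type debounceNotifier struct {
	in     chan struct{}
	out    chan struct{}
	window time.Duration
}

func newDebounceNotifier(window time.Duration) *debounceNotifier {
	n := &debounceNotifier{
		in:     make(chan struct{}, 64),
		out:    make(chan struct{}, 1),
		window: window,
	}
	go n.run()
	return n
}

func (n *debounceNotifier) run() {
	timer := time.NewTimer(n.window)
	if !timer.Stop() {
		<-timer.C
	}
	pending := false

	for {
		select {
		case <-n.in:
			// The first notification in a quiet period arms the timer; any
			// further notifications inside the window share the same wakeup.
			if !pending {
				pending = true
				timer.Reset(n.window)
			}
		case <-timer.C:
			if pending {
				pending = false
				select {
				case n.out <- struct{}{}:
				default: // a wakeup is already pending; drop this one
				}
			}
		}
	}
}

func main() {
	n := newDebounceNotifier(100 * time.Millisecond)
	for i := 0; i < 10; i++ {
		n.in <- struct{}{} // a burst of insert notifications
	}
	<-n.out
	fmt.Println("listener woken once for the whole burst")
}
```

The tradeoff is exactly the one under discussion: fewer wakeups and round trips at high throughput, at the cost of a small fixed delay when traffic is very light.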
Hi,
@malonaz ah, I may have misinterpreted this report initially. Instead of being the result of #301 and the v4 migration, I suspect you are seeing this stat increase as a result of the changes in #258 to introduce a batching async job completer. That change has the effect of tremendously increasing total throughput, with the tradeoff of some additional latency, because completed jobs are batched up in memory in the client to be marked completed in groups. You can see the core of how it works here: river/internal/jobcompleter/job_completer.go, lines 160 to 266 at 30a97ff.
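To make that tradeoff concrete, here's a minimal, hypothetical sketch of the batching idea (completed job IDs accumulated in memory and flushed in groups). It is not the actual code from job_completer.go, just the general shape of it:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// batchCompleter accumulates completed job IDs in memory and marks them
// completed in the database in groups, trading a little completion latency
// for far fewer round trips and much higher total throughput.
type batchCompleter struct {
	completions chan int64
	flushEvery  time.Duration
	maxBatch    int
	markDone    func(ctx context.Context, ids []int64) error
}

func (c *batchCompleter) run(ctx context.Context) {
	ticker := time.NewTicker(c.flushEvery)
	defer ticker.Stop()

	batch := make([]int64, 0, c.maxBatch)
	flush := func() {
		if len(batch) == 0 {
			return
		}
		// In a real implementation this would be a single statement covering
		// the whole group, e.g. UPDATE ... WHERE id = ANY($1).
		if err := c.markDone(ctx, batch); err != nil {
			fmt.Println("flush failed:", err)
		}
		batch = batch[:0]
	}

	for {
		select {
		case <-ctx.Done():
			flush() // final flush on shutdown
			return
		case id := <-c.completions:
			batch = append(batch, id)
			if len(batch) >= c.maxBatch {
				flush() // batch is full, don't wait for the ticker
			}
		case <-ticker.C:
			flush()
		}
	}
}

func main() {
	c := &batchCompleter{
		completions: make(chan int64, 1024),
		flushEvery:  50 * time.Millisecond,
		maxBatch:    100,
		markDone: func(ctx context.Context, ids []int64) error {
			fmt.Printf("marked %d jobs completed in one round trip\n", len(ids))
			return nil
		},
	}

	ctx, cancel := context.WithTimeout(context.Background(), 200*time.Millisecond)
	defer cancel()
	go c.run(ctx)

	for i := int64(1); i <= 250; i++ {
		c.completions <- i
	}
	<-ctx.Done()
	time.Sleep(10 * time.Millisecond) // give the final flush a moment to log
}
```

The win is that a group of completions becomes one database round trip instead of one per job; the cost is that an individual job's completed-at timestamp can land a bit later than the moment its work actually finished.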
As of now, there are no user-facing knobs for tuning this behavior. We even still have the old async completer in code, as well as an inline one (the latter has much worse throughput). There is no way to activate these as a user though, and no way to customize the thresholds for batching. I wonder if @brandur has thoughts on whether this much of an increase is expected, and whether/how we might want to allow users to customize the behavior here? @malonaz Could you add some more detail on the throughput rate for these queues? Additionally, I want to ask about this bit:
The phrase "the same transaction" struck me once I realized you were measuring using the |
Yeah, this is a little nuanced, but although the batch completer's ticker fires every 50 ms, it'll only complete a batch on every 50 ms tick if a minimum batch size has accumulated during that time. Otherwise it waits until a tick that's a multiple of 5 to complete a batch, so you can expect up to a 5 * 50 = 250 ms delay. Let me caveat that, though, by saying that the delay doesn't actually matter — the job still finished in the same amount of time, and it won't be reworked (unless there's a hard crash). It's just that setting it to fully completed in the database might take a little bit longer. I'd hold off on further customization unless we find a good reason for it. The defaults should be pretty good for everyone. In terms of measuring statistics, it might make more sense to observe
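To illustrate that tick/threshold interaction, here is a small sketch of the rule as described above, with made-up constants standing in for the real ones:

```go
package main

import "fmt"

// The 50 ms interval and the thresholds below are illustrative stand-ins,
// not River's actual constants.
const (
	tickIntervalMS = 50
	minBatchSize   = 100
	ticksPerForced = 5
)

// shouldFlush models the rule described above: flush on any tick once the
// batch has reached a minimum size, otherwise only on every Nth tick.
func shouldFlush(tickCount, batchLen int) bool {
	if batchLen >= minBatchSize {
		return true // enough work has accumulated, complete it now
	}
	return tickCount%ticksPerForced == 0 // otherwise wait for the slower cadence
}

func main() {
	// A lightly loaded client: only 3 jobs finish per tick, so the batch
	// rides along until the 5th tick, i.e. up to 5 * 50 = 250 ms.
	batchLen := 0
	for tick := 1; tick <= 5; tick++ {
		batchLen += 3
		if shouldFlush(tick, batchLen) {
			fmt.Printf("flushed %d jobs after %d ms\n", batchLen, tick*tickIntervalMS)
			batchLen = 0
		}
	}
}
```

At a rate like 1 job per second, the minimum-size threshold is essentially never hit, which is why an otherwise idle client can see the completed-at timestamp lag by a few hundred milliseconds even though the job itself finished on time.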
@malonaz haven't heard anything back on whether or not you still believe there's an issue here, though I think it's mostly a matter of how the durations were being measured. Please let us know if you think otherwise! ❤️
Hi,
just wanted to share some data on our measured impact of the new notification system on the job completion duration:
We're observing a roughly 400% increase at p90 (from 50 ms to 200 ms), with occasional spikes to 500 ms.
This is with 1 job inserted / second and 1 job executed / second.
Big fan of your work 👍 - this isn't really a dealbreaker, but I'm curious to hear about your internal discussions around the performance tradeoffs of this new notification system. Was the previous NOTIFY implementation so penalizing at a certain scale that you felt you should make this change?
Thanks