bug: `barrier_inflight_latency` is too high and keeps growing #6571
Why
In short,
I can consistently reproduce this issue with
I originally suspected this was caused by high storage read latency, but the cache hit rate is high and the read tail latency shown in the metrics is not significant. I will continue investigating the issue.
Actor traces:
While the barrier in-flight latency is high, the source throughput is not extremely low in this issue. I suspect this is somehow related to the buffered messages when backpressure happens. With some experiments, I can conclude that this is indeed the cause, and the behaviour changes come from the permit-based backpressure introduced by #6170. I deployed one CN using
In summary, the default setting in #6170 makes the number of buffered messages larger than before. This leads to more messages being processed between barriers when backpressure happens, so the barrier in-flight latency increases. It won't affect source throughput, because the buffer size doesn't affect the actor throughput. IMO, we can decrease the default buffer size (i.e. lower
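To make this concrete, here is a rough back-of-the-envelope model, not RisingWave code: the channel count and the slow-downstream throughput below are hypothetical, while the ~70k rows/s figure comes from the release-build measurement later in this thread. A barrier entering the graph has to wait behind whatever is already buffered in the exchange channels, so the extra in-flight latency is roughly the buffered volume divided by the downstream processing rate.

```rust
/// Extra barrier in-flight latency contributed by data already buffered in the
/// exchange channels: buffered volume divided by the downstream processing rate.
fn extra_barrier_latency_secs(
    buffered_rows_per_channel: u64, // e.g. the exchange permit budget, in rows
    channels_on_path: u64,          // exchange channels the barrier must pass
    downstream_rows_per_sec: u64,   // how fast the downstream actors drain them
) -> f64 {
    (buffered_rows_per_channel * channels_on_path) as f64 / downstream_rows_per_sec as f64
}

fn main() {
    // Illustrative numbers only: a 32K-row buffer per channel in front of a slow
    // downstream (hypothetical 1k rows/s, e.g. a debug build) adds minutes of
    // latency, while at ~70k rows/s the same buffer costs only a couple of seconds.
    println!("slow downstream: {:.0}s", extra_barrier_latency_secs(32_768, 4, 1_000));
    println!("fast downstream: {:.1}s", extra_barrier_latency_secs(32_768, 4, 70_000));
}
```

Under these assumptions the slow case lands at roughly two minutes, which is in the same ballpark as the latency observed in this issue.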
Great work!!! In short, the messages were piled up in the exchange channels. I agree that we may need to decrease the
Some other ideas:
Some ideas:
The CPU usage is constantly at 800%. In my setting (3 fragments on 1 CN with worker_node_parallelism=4), I think the bottleneck is the CPU.
True. But in the experiment I did, only the local channel is used, so the RTT is significantly smaller. In that case, a 32K buffer can be a problem. Maybe we can use
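A bandwidth-delay style estimate makes the RTT point concrete. The RTT figures below are hypothetical; the 70k rows/s throughput is the release-build figure mentioned later in this thread. The buffer only needs to cover throughput × RTT to keep the channel busy, so a local channel needs a buffer orders of magnitude smaller than 32K.

```rust
/// Rows that need to be in flight to keep a channel busy: throughput * RTT.
fn needed_buffer_rows(throughput_rows_per_sec: f64, rtt_secs: f64) -> f64 {
    throughput_rows_per_sec * rtt_secs
}

fn main() {
    // Hypothetical RTTs: ~1 ms for a remote exchange, ~10 µs for a local channel.
    println!("remote: {:.1} rows", needed_buffer_rows(70_000.0, 1e-3)); // ~70 rows
    println!("local:  {:.1} rows", needed_buffer_rows(70_000.0, 1e-5)); // ~0.7 rows
}
```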
Agree. After we hit the in-flight barrier limit, the barrier injection interval will be lengthened to the e2e barrier latency (in this issue it is >2min) instead of the configured barrier send interval (250ms). I think @fuyufjh's proposal of throttling the source when we hit the barrier limit makes sense. To extend on that, we can bound the total size of messages ingested between two barriers to throttle the source.
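A minimal sketch of that proposal, with hypothetical names and not an actual RisingWave executor: track how much the source has emitted since the last barrier and stop emitting once a per-epoch budget is exhausted, resetting the counter whenever a barrier is injected.

```rust
/// Hypothetical per-epoch budget for a source: bounds the data ingested
/// between two consecutive barriers instead of bounding in-flight barriers.
struct SourceEpochBudget {
    budget_rows: usize,
    emitted_rows: usize,
}

impl SourceEpochBudget {
    fn new(budget_rows: usize) -> Self {
        Self { budget_rows, emitted_rows: 0 }
    }

    /// Returns true if a chunk of `chunk_rows` may still be emitted in this
    /// epoch; otherwise the caller should wait for the next barrier.
    fn try_emit(&mut self, chunk_rows: usize) -> bool {
        if self.emitted_rows + chunk_rows > self.budget_rows {
            return false;
        }
        self.emitted_rows += chunk_rows;
        true
    }

    /// Called when a barrier is injected: the budget resets for the new epoch.
    fn on_barrier(&mut self) {
        self.emitted_rows = 0;
    }
}

fn main() {
    let mut budget = SourceEpochBudget::new(10_000);
    assert!(budget.try_emit(8_000));
    assert!(!budget.try_emit(4_000)); // over budget: pause until the next barrier
    budget.on_barrier();
    assert!(budget.try_emit(4_000)); // new epoch, budget refreshed
}
```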
It's indeed a potential problem that could lead to benchmark results that are far from real-world cases. 😇
@wangrunji0408 points out that the data generator will sleep and delay to meet the ratio of 46:3:1, only if we have set the
The throughput of actors 1-4 in my case is ~70,000 rows per second, which is much larger than the result from @hzxa21. 🤔 This is interesting. Did we test under the debug profile before?
I use
I guess the main cause is the "debug build". 🤔 Maybe we should set smaller initial permits in debug builds.
Indeed. I observed much higher actor throughput (70k rows/s) with a release build, and the barrier in-flight latency becomes normal.
The original test I did was run against a release build.
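To illustrate the smaller-initial-permits-in-debug-builds idea above, the default permit budget could simply branch on the build profile. The constant name and both values below are hypothetical, not RisingWave's actual configuration.

```rust
/// Hypothetical default for the exchange permit budget: keep the buffer small
/// in debug builds, where actors process rows far more slowly, so buffered
/// data does not translate into minutes of barrier in-flight latency.
const fn default_exchange_permits() -> usize {
    if cfg!(debug_assertions) {
        256 // debug build: only a few chunks buffered per channel
    } else {
        32_768 // release build: larger buffer to hide the exchange RTT
    }
}

fn main() {
    println!("exchange permits = {}", default_exchange_permits());
}
```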
I'm designing a fine-grained backpressure mechanism to control the join actors' event consumption from two sides. Will first write a design document for this. |
See #6571 for background. After some discussion with @BugenZhao @yezizp2012 @tabVersion, we realized that it's futile to attempt to "backpressure" the streaming jobs by limiting `in_flight_barrier_nums`. On the contrary, it makes things trickier: the interval between barriers becomes much larger, which in turn results in uncontrolled memory consumption.

This PR removes the limit of `in_flight_barrier_nums` in a quick & dirty way and reduces the permits of exchange channels to mitigate the high-latency issue. Of course, we must still remove the related code properly; this quick fix is just for the imminent version release.

Approved-By: BugenZhao
Approved-By: yezizp2012
Co-Authored-By: Eric Fu <eric@singularity-data.com>
I’m running NexMark with a self-made query, which consists of one Join and one Agg.
What could be the reason that the barrier latency keeps growing? It has grown to over 8 minutes.
The query is
Is it stuck? -- No
After pausing, the number of barriers starts to drop, so it's not stuck. The pause succeeded after 324.06s 😂 and the in-flight barriers were cleared.
So, what could be the reason?