
feat: limit num of barriers between Meta -> Source #9980

Closed

fuyufjh opened this issue May 24, 2023 · 7 comments

fuyufjh (Member) commented May 24, 2023

Is your feature request related to a problem? Please describe.

Currently, the channel between the Meta server and the CN (i.e. SourceExecutor) is unbounded, while other channels between actors limit the number of in-flight barriers to 2 (#9427). As a result, when the system is overloaded, hundreds of barriers can get stuck in that Meta->Source channel and then be released downstream in an instant after some time, which may have unpredictable consequences for cluster performance.
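
To make the difference concrete, here is a minimal sketch using tokio-style channels. It is a stand-in model, not the actual RisingWave code path, and the `Barrier` struct is hypothetical: the actor-to-actor channels are bounded and back-pressure the sender, while the Meta -> Source channel is unbounded and lets barriers pile up.

```rust
// Stand-in model (not RisingWave code): bounded actor-to-actor barrier channels
// vs. the unbounded Meta -> SourceExecutor channel.
use tokio::sync::mpsc;

#[derive(Debug)]
struct Barrier {
    epoch: u64,
    is_checkpoint: bool,
}

#[tokio::main]
async fn main() {
    // Between actors: a bounded channel caps the number of in-flight barriers
    // (capacity 2 here), so a slow consumer back-pressures the producer.
    let (bounded_tx, mut bounded_rx) = mpsc::channel::<Barrier>(2);

    // Meta -> SourceExecutor today: an unbounded channel, so barriers pile up
    // without limit when the downstream is overloaded.
    let (unbounded_tx, mut unbounded_rx) = mpsc::unbounded_channel::<Barrier>();

    // `send` on the bounded channel awaits when the buffer is full...
    bounded_tx.send(Barrier { epoch: 1, is_checkpoint: false }).await.unwrap();
    // ...while `send` on the unbounded channel never waits, regardless of backlog.
    unbounded_tx.send(Barrier { epoch: 1, is_checkpoint: false }).unwrap();

    println!("{:?}", bounded_rx.recv().await);
    println!("{:?}", unbounded_rx.recv().await);
}
```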

Describe the solution you'd like

Proposed by @hzxa21

We can try to limit that channel's size and back-pressure Meta to "drop" new barriers if there is already 1 barrier pending in the Meta->Source channel.

I am not sure whether this can solve the fluctuation problem, but it should not have a negative effect.
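
A minimal sketch of the proposed behaviour, assuming the channel can be modeled as a bounded tokio mpsc channel of capacity 1; `try_inject` and `Barrier` are hypothetical names, not the real implementation:

```rust
use tokio::sync::mpsc::{self, error::TrySendError};

#[derive(Debug)]
struct Barrier {
    epoch: u64,
    is_checkpoint: bool,
}

/// Returns `true` if the barrier was handed to the source, `false` if it was
/// "dropped" because one barrier is already pending in the channel.
fn try_inject(tx: &mpsc::Sender<Barrier>, barrier: Barrier) -> bool {
    match tx.try_send(barrier) {
        Ok(()) => true,
        Err(TrySendError::Full(dropped)) => {
            eprintln!("channel full, dropping barrier for epoch {}", dropped.epoch);
            false
        }
        Err(TrySendError::Closed(_)) => panic!("source executor has exited"),
    }
}

fn main() {
    let (tx, _rx) = mpsc::channel::<Barrier>(1);
    assert!(try_inject(&tx, Barrier { epoch: 1, is_checkpoint: false }));
    // The first barrier has not been consumed yet, so the second one is dropped.
    assert!(!try_inject(&tx, Barrier { epoch: 2, is_checkpoint: false }));
}
```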

Describe alternatives you've considered

No response

Additional context

No response

github-actions bot added this to the release-0.20 milestone May 24, 2023
fuyufjh (Member, Author) commented May 24, 2023

@st1page has proposed to control checkpoints based on time (e.g. every 10 seconds) instead of barrier frequency (e.g. every 10 barriers) to mitigate the fluctuation problem. I think my proposal will achieve a similar effect.

For example, assume the Meta->CN channel currently has 1 barrier stuck in it, and Meta has already dropped 4 non-checkpoint barriers and 1 checkpoint barrier. In this case, the next barrier that Meta tries to emit should always be a checkpoint barrier, until one is successfully emitted; then it can go back to the normal frequency.
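
A rough sketch of that policy (the `BarrierScheduler` below is hypothetical, not Meta's real barrier manager): once any barrier has been dropped due to back-pressure, keep promoting the next barrier to a checkpoint barrier until one is actually emitted, then fall back to the configured checkpoint frequency.

```rust
#[derive(Default)]
struct BarrierScheduler {
    /// Non-checkpoint barriers emitted since the last checkpoint barrier.
    barriers_since_checkpoint: u32,
    /// How often a checkpoint barrier is emitted under normal conditions.
    checkpoint_frequency: u32,
    /// Set when at least one barrier was dropped because the channel was full.
    dropped_since_last_emit: bool,
}

impl BarrierScheduler {
    fn new(checkpoint_frequency: u32) -> Self {
        Self { checkpoint_frequency, ..Default::default() }
    }

    /// Should the next barrier Meta tries to emit be a checkpoint barrier?
    fn next_is_checkpoint(&self) -> bool {
        // After any drop, always try a checkpoint barrier so that dropped
        // checkpoints are not delayed indefinitely.
        self.dropped_since_last_emit
            || self.barriers_since_checkpoint + 1 >= self.checkpoint_frequency
    }

    /// Record that a barrier was successfully emitted into the channel.
    fn on_emitted(&mut self, was_checkpoint: bool) {
        self.dropped_since_last_emit = false;
        if was_checkpoint {
            self.barriers_since_checkpoint = 0;
        } else {
            self.barriers_since_checkpoint += 1;
        }
    }

    /// Record that a barrier was dropped because the channel was full.
    fn on_dropped(&mut self) {
        self.dropped_since_last_emit = true;
    }
}

fn main() {
    let mut sched = BarrierScheduler::new(10);
    // 4 non-checkpoint barriers and 1 checkpoint barrier were dropped:
    for _ in 0..5 {
        sched.on_dropped();
    }
    // The next barrier Meta tries to emit must be a checkpoint barrier...
    assert!(sched.next_is_checkpoint());
    // ...and once one is emitted successfully, the normal frequency resumes.
    sched.on_emitted(true);
    assert!(!sched.next_is_checkpoint());
}
```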

BugenZhao (Member) commented May 24, 2023

The challenge I see with this approach is how to "drop" the barrier. According to the above, once some source gets back-pressured, all sources in all compute nodes should reject this barrier atomically. Here are some possible ways to achieve this:

  • Use two-phase barrier injection. The first RPC tries to reserve (similar to try_reserve) a slot, and if that succeeds, send the second RPC to inject (sketched after this list).
  • Back-pressure based on the information on the meta side, since we know the maximum depth of the streaming graph and the current number of in-flight barriers.
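
A minimal sketch of the first option, assuming each compute node's barrier budget can be modeled as a semaphore; `ComputeNode` and `try_reserve` here are hypothetical names, not the real RPC definitions:

```rust
use std::sync::Arc;
use tokio::sync::{OwnedSemaphorePermit, Semaphore};

/// Stand-in for one compute node's bounded barrier budget towards its sources.
struct ComputeNode {
    slots: Arc<Semaphore>,
}

impl ComputeNode {
    fn new(max_in_flight: usize) -> Self {
        Self { slots: Arc::new(Semaphore::new(max_in_flight)) }
    }

    /// Phase 1: try to reserve a slot without waiting (in the spirit of `try_reserve`).
    fn try_reserve(&self) -> Option<OwnedSemaphorePermit> {
        self.slots.clone().try_acquire_owned().ok()
    }
}

fn main() {
    let nodes = vec![ComputeNode::new(1), ComputeNode::new(1)];

    // Phase 1: reserve on every node; keep the permits alive until phase 2.
    let permits: Option<Vec<_>> = nodes.iter().map(|n| n.try_reserve()).collect();

    match permits {
        Some(permits) => {
            // Phase 2: every node accepted, so the barrier can be injected
            // atomically; the permits are released once injection completes.
            println!("injecting barrier on {} nodes", permits.len());
        }
        None => {
            // Some node is back-pressured: all nodes consistently skip this
            // barrier (already-acquired permits are released on drop).
            println!("dropping barrier: some node has no free slot");
        }
    }
}
```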

Another approach I find might be simpler to implement is to let the compute node block the inject_barrier RPC from the meta service until there are free slots in the bounded barrier receiver in the source. During the blocking, new barriers (whether manually issued or periodic) are also blocked, and we would have to adopt the idea of "checkpoint based on time" for this.
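
A sketch of that blocking alternative, assuming the compute-node side can be modeled as an async handler writing into a bounded channel; `BarrierInjectionService` and its `inject_barrier` method are hypothetical stand-ins for the real RPC handler:

```rust
use tokio::sync::mpsc;

#[derive(Debug)]
struct Barrier {
    epoch: u64,
}

struct BarrierInjectionService {
    /// Bounded channel towards the SourceExecutor; capacity = allowed in-flight barriers.
    to_source: mpsc::Sender<Barrier>,
}

impl BarrierInjectionService {
    /// Stand-in for handling the inject-barrier RPC: the future does not resolve
    /// until the bounded receiver in the source has a free slot, so the RPC
    /// itself back-pressures Meta.
    async fn inject_barrier(&self, barrier: Barrier) -> Result<(), &'static str> {
        self.to_source
            .send(barrier)
            .await
            .map_err(|_| "source executor has exited")
    }
}

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel::<Barrier>(1);
    let service = BarrierInjectionService { to_source: tx };

    // The first injection succeeds immediately; the second may have to wait
    // until the source (simulated by the spawned task) consumes a barrier.
    service.inject_barrier(Barrier { epoch: 1 }).await.unwrap();
    let consumer = tokio::spawn(async move {
        while let Some(b) = rx.recv().await {
            println!("source consumed barrier at epoch {}", b.epoch);
        }
    });
    service.inject_barrier(Barrier { epoch: 2 }).await.unwrap();

    drop(service); // close the channel so the consumer task can finish
    consumer.await.unwrap();
}
```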

fuyufjh (Member, Author) commented May 26, 2023

Suddenly I realized a problem... The Meta->Source channel we discussed here is shared among all streaming tasks. But in our case (barrier stuck for minutes #9723), actually only one job (Q16) is problematic, and the other jobs should not be affected by it. Specifically, if we stop injecting barriers for all jobs, the other jobs will see a longer barrier duration / more events between two barriers, which may become another issue.

Is it reasonable to block all of these jobs? cc. @hzxa21 @BugenZhao

hzxa21 (Collaborator) commented May 26, 2023

> Suddenly I realized a problem... The Meta->Source channel we discussed here is shared among all streaming tasks. But in our case (barrier stuck for minutes #9723), actually only one job (Q16) is problematic, and other jobs should not be affected by it, so is it reasonable to block all of these jobs? cc. @hzxa21 @BugenZhao

I guess this is unavoidable since we are doing global checkpointing. If we allow independent checkpointing for non-connected streaming graphs (proposed a long time ago), we can reduce the impact of a slow job. Are all nexmark jobs run on a shared table (materialized source)? If that is the case, the streaming graphs of different jobs are still connected.

fuyufjh (Member, Author) commented May 26, 2023

> Are all nexmark jobs run on a shared table (materialized source)? If that is the case, the streaming graphs of different jobs are still connected.

In this case, no, they are using a Kafka source.

But anyway, it seems hard to explain "source can have better isolation while table can't" to users.

BugenZhao (Member) commented

> it seems hard to explain "source can have better isolation while table can't" to users.

This reminds me of a long-resolved question of why we don't have "physical resources" for sources. 😂

fuyufjh (Member, Author) commented Jul 14, 2023

Can't solve the problem

fuyufjh closed this as not planned Jul 14, 2023