bug(recovery): DynamicFilterExecutor exit unexpected when randomly kill meta #5726

Closed
Tracked by #4527 ...
yezizp2012 opened this issue Oct 9, 2022 · 7 comments · Fixed by #6016
Labels
help wanted Issues that need help from contributors type/bug Something isn't working

Comments

@yezizp2012
Member

yezizp2012 commented Oct 9, 2022

Command

RUST_BACKTRACE=1 MADSIM_TEST_SEED=1 ./risedev sslt --release -- --kill-meta "e2e_test/streaming/tpch_upstream.slt"

Bug

Actors exit unexpectedly while executing include ../tpch/insert_orders.slt.part:

2022-01-02T02:50:32.590521Z ERROR node{id=8 name="compute-3"}:task{id=966416}: risingwave_stream::task::stream_manager: actor exit actor=76007 error=Executor error: Deleting non-existent element
2022-01-02T02:50:32.590523Z ERROR node{id=8 name="compute-3"}:task{id=966412}: risingwave_stream::executor::actor: actor exit without stop barrier actor_id=76001
2022-01-02T02:50:32.590524Z ERROR node{id=6 name="compute-1"}:task{id=970788}: risingwave_stream::executor::actor: actor exit without stop barrier actor_id=76006
2022-01-02T02:50:32.590527Z ERROR node{id=8 name="compute-3"}:task{id=966418}: risingwave_stream::executor::subtask: actor downstream subtask failed
2022-01-02T02:50:32.590527Z ERROR node{id=8 name="compute-3"}:task{id=966418}: risingwave_stream::task::stream_manager: actor exit actor=76008 error=Executor error: Deleting non-existent element
...
2022-01-02T02:50:32.664685Z ERROR node{id=8 name="compute-3"}:task{id=966220}: risingwave_stream::task::stream_manager: actor exit actor=25019 error=Executor error: failed to pull left message, stream closed unexpectedly
2022-01-02T02:50:32.664695Z ERROR node{id=7 name="compute-2"}:task{id=974637}: risingwave_stream::task::stream_manager: actor exit actor=25028 error=Executor error: right barrier received while left stream end
2022-01-02T02:50:32.664747Z ERROR node{id=7 name="compute-2"}:task{id=974635}: risingwave_stream::task::stream_manager: actor exit actor=25027 error=Executor error: right barrier received while left stream end
2022-01-02T02:50:32.664839Z ERROR node{id=6 name="compute-1"}:task{id=970246}: risingwave_stream::task::stream_manager: actor exit actor=40012 error=Executor error: right barrier received while left stream end
2022-01-02T02:50:32.664850Z ERROR node{id=6 name="compute-1"}:task{id=970674}: risingwave_stream::task::stream_manager: actor exit actor=49011 error=Executor error: left barrier received while right stream end
2022-01-02T02:50:32.664888Z ERROR node{id=8 name="compute-3"}:task{id=966502}: risingwave_stream::task::stream_manager: actor exit actor=46008 error=Executor error: right barrier received while left stream end
2022-01-02T02:50:32.664935Z ERROR node{id=8 name="compute-3"}:task{id=966276}: risingwave_stream::task::stream_manager: actor exit actor=19015 error=Executor error: left barrier received while right stream end

It seems this bug is caused by the range cache of the dynamic_filter state.
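
To make the suspected failure mode concrete, here is a toy Rust sketch. It is not RisingWave code: `ToyRangeCache` and its methods are hypothetical names invented for illustration. It assumes the in-memory cache comes back empty after a restart while upstream still sends a delete for a row that was inserted before the restart, which is one plausible way to end up with an error like "Deleting non-existent element". The actual root cause may differ.

```rust
use std::collections::BTreeMap;

/// Toy stand-in for an in-memory range cache: key -> number of rows with that key.
/// Hypothetical type for illustration only; not RisingWave's implementation.
#[derive(Debug, Default)]
pub struct ToyRangeCache {
    rows: BTreeMap<i32, usize>,
}

impl ToyRangeCache {
    /// Rebuild the cache from rows that survived in persistent state
    /// (assumed recovery step).
    pub fn rebuild<I: IntoIterator<Item = i32>>(persisted: I) -> Self {
        let mut cache = Self::default();
        for key in persisted {
            cache.insert(key);
        }
        cache
    }

    /// Record one more row with the given key.
    pub fn insert(&mut self, key: i32) {
        *self.rows.entry(key).or_insert(0) += 1;
    }

    /// Remove one row with the given key; fails if the cache never saw it.
    pub fn delete(&mut self, key: i32) -> Result<(), String> {
        let count = self.rows.get(&key).copied().unwrap_or(0);
        match count {
            0 => Err("Deleting non-existent element".to_string()),
            1 => {
                self.rows.remove(&key);
                Ok(())
            }
            _ => {
                self.rows.insert(key, count - 1);
                Ok(())
            }
        }
    }
}
```

In this model, calling `delete(1)` on a fresh `ToyRangeCache::default()` returns the error, while `ToyRangeCache::rebuild(vec![1])` followed by `delete(1)` succeeds. The point is only that the cache has to be repopulated from persisted state before post-recovery deletes are applied.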

@github-actions github-actions bot added this to the release-0.1.14 milestone Oct 9, 2022
@yezizp2012 yezizp2012 changed the title DynamicFilterExecutor exist unexpected when executing include ../tpch/insert_orders.slt.part bug(recovery): DynamicFilterExecutor exist unexpected when randomly kill meta Oct 9, 2022
@yezizp2012 yezizp2012 added type/bug Something isn't working help wanted Issues that need help from contributors labels Oct 9, 2022
@yezizp2012 yezizp2012 changed the title bug(recovery): DynamicFilterExecutor exist unexpected when randomly kill meta bug(recovery): DynamicFilterExecutor exit unexpected when randomly kill meta Oct 9, 2022
@yezizp2012
Member Author

@jon-chuang would you please help to take a look?

@jon-chuang
Contributor

jon-chuang commented Oct 10, 2022

Yes, we currently do not support recovery on DynamicFilter. Tracking: #3419

Is this a critical bug? I guess it is. I'll try to prioritize it; the corresponding logic has already been outlined here:

In the future, this cache will need to support vnode-based eviction: #5567

@jon-chuang
Contributor

jon-chuang commented Oct 26, 2022

@yezizp2012 Is the required behaviour that there are no errors at all? With my fix PR, we now seem to be flooded with error messages like:

2022-01-02T02:22:48.841810Z ERROR node{id=7 name="compute-2"}:task{id=201064}: risingwave_stream::executor::actor: actor exit without stop barrier actor_id=25057

@yezizp2012
Member Author

yezizp2012 commented Oct 26, 2022

> @yezizp2012 Is the required behaviour that there are no errors at all? With my fix PR, we now seem to be flooded with error messages like:
>
> 2022-01-02T02:22:48.841810Z ERROR node{id=7 name="compute-2"}:task{id=201064}: risingwave_stream::executor::actor: actor exit without stop barrier actor_id=25057

Cool!!! Those error messages are expected in the deterministic recovery test. Then I guess this bug will be fixed by your PR.

@jon-chuang
Contributor

But it seems we have another deterministic test bug now. I will investigate.

@yezizp2012
Member Author

It seems the DynamicFilterExecutor still exits unexpectedly. We can reproduce it using the following command:

RUST_BACKTRACE=1 ./risedev sslt -- --kill-meta "e2e_test/streaming/dynamic_filter.slt"

The test fails at L44: insert into t2 values (2);
But the real cause is the cluster recovery failure. The detailed error log:

2022-08-12T08:31:46.209577Z ERROR node{id=8 name="compute-3"}:task{id=23369}: risingwave_stream::task::stream_manager: actor exit actor=4004 error=Executor error: Inconsistent Delete - current: None, delete: [Some(Int32(0))]
2022-08-12T08:31:46.212059Z ERROR node{id=7 name="compute-2"}:task{id=23473}: risingwave_stream::executor::subtask: actor downstream subtask failed
2022-08-12T08:31:46.212059Z ERROR node{id=7 name="compute-2"}:task{id=23473}: risingwave_stream::task::stream_manager: actor exit actor=4006 error=Executor error: Inconsistent Delete - current: None, delete: [Some(Int32(0))]
2022-08-12T08:31:46.213262Z ERROR node{id=6 name="compute-1"}:task{id=23258}: risingwave_stream::executor::subtask: actor downstream subtask failed
2022-08-12T08:31:46.213262Z ERROR node{id=6 name="compute-1"}:task{id=23258}: risingwave_stream::task::stream_manager: actor exit actor=4002 error=Executor error: Inconsistent Delete - current: None, delete: [Some(Int32(0))]
2022-08-12T08:31:46.213568Z ERROR node{id=6 name="compute-1"}:task{id=23256}: risingwave_stream::executor::subtask: actor downstream subtask failed
2022-08-12T08:31:46.213568Z ERROR node{id=6 name="compute-1"}:task{id=23256}: risingwave_stream::task::stream_manager: actor exit actor=4001 error=Executor error: Inconsistent Delete - current: None, delete: [Some(Int32(0))]
2022-08-12T08:31:46.213634Z ERROR node{id=7 name="compute-2"}:task{id=23471}: risingwave_stream::executor::subtask: actor downstream subtask failed
2022-08-12T08:31:46.213634Z ERROR node{id=7 name="compute-2"}:task{id=23471}: risingwave_stream::task::stream_manager: actor exit actor=4005 error=Executor error: Inconsistent Delete - current: None, delete: [Some(Int32(0))]
2022-08-12T08:31:46.214473Z ERROR node{id=8 name="compute-3"}:task{id=23367}: risingwave_stream::executor::subtask: actor downstream subtask failed
2022-08-12T08:31:46.214473Z ERROR node{id=8 name="compute-3"}:task{id=23367}: risingwave_stream::task::stream_manager: actor exit actor=4003 error=Executor error: Inconsistent Delete - current: None, delete: [Some(Int32(0))]

Cc @jon-chuang , could you please help to take a look again? 😂

@jon-chuang
Contributor

Indeed, I only fixed the cluster recovery error on the LHS, not the RHS 😢 Looks like I will have to fix the RHS now ~
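
As a rough illustration of why the RHS would need its own recovery handling, here is a toy Rust sketch. It is not RisingWave code: `ToyRhsState` and its methods are hypothetical. It assumes the RHS is tracked as a single current value and that this value is not restored after recovery, so a later delete of the old value finds `None`. That reading is inferred from the "Inconsistent Delete - current: None" log lines above and is not a confirmed root cause.

```rust
/// Toy model of a single-value right-hand side (RHS) state.
/// Hypothetical type for illustration only; not RisingWave's implementation.
#[derive(Debug, Default, PartialEq)]
pub struct ToyRhsState {
    current: Option<i32>,
}

impl ToyRhsState {
    /// Restore the state from a persisted value (assumed recovery step).
    pub fn restore(persisted: Option<i32>) -> Self {
        Self { current: persisted }
    }

    /// Set the current RHS value.
    pub fn insert(&mut self, value: i32) {
        self.current = Some(value);
    }

    /// Retract the current RHS value; fails if it does not match what we hold.
    pub fn delete(&mut self, value: i32) -> Result<(), String> {
        if self.current == Some(value) {
            self.current = None;
            Ok(())
        } else {
            Err(format!(
                "inconsistent delete: current {:?}, delete {:?}",
                self.current, value
            ))
        }
    }
}
```

In this model, `ToyRhsState::default().delete(0)` fails because nothing was restored, whereas `ToyRhsState::restore(Some(0))` accepts the same delete. Restoring only the LHS side would leave this second path broken.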
