bug(streaming): insert sanity check failed in hashagg #5911

Closed
Tracked by #4543
xiangjinwu opened this issue Oct 18, 2022 · 13 comments
Labels
type/bug Something isn't working

Comments

@xiangjinwu
Contributor

xiangjinwu commented Oct 18, 2022

Describe the bug

The e2e test (parallel, in-memory) fails.

1st run:
https://buildkite.com/risingwavelabs/pull-request/builds/10452#0183eb82-3791-450c-a8b4-0b4e285d2c6a

2nd run:
https://buildkite.com/risingwavelabs/pull-request/builds/10454#0183eb93-6dd2-4262-91b3-20b0217b7b6e

```
thread 'risingwave-streaming-actor' panicked at 'overwrites an existing key!
table_id: 2381, vnode: 99, key: Row([Some(Utf8("30"))])
value in storage: Row([Some(Int64(0)), Some(Int64(0)), Some(Decimal(Normalized(0.00)))])
value to write: Row([Some(Int64(1)), Some(Int64(1)), Some(Decimal(Normalized(7638.57)))])', /risingwave/src/storage/src/table/streaming_table/state_table.rs:647:13
stack backtrace:
2022-10-18T14:45:16.676305Z DEBUG risingwave_stream::task::stream_manager: drop actors actors=[847, 848, 849, 850, 859, 860, 861, 862, 871, 872, 873, 874]
   0: rust_begin_unwind
             at /rustc/9067d5277d10f0f32a49ec9c125a33828e26a32b/library/std/src/panicking.rs:584:5
   1: core::panicking::panic_fmt
             at /rustc/9067d5277d10f0f32a49ec9c125a33828e26a32b/library/core/src/panicking.rs:142:14
   2: risingwave_storage::table::streaming_table::state_table::StateTable<S>::do_insert_sanity_check::{{closure}}
             at ./src/storage/src/table/streaming_table/state_table.rs:647:13
   3: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
             at /rustc/9067d5277d10f0f32a49ec9c125a33828e26a32b/library/core/src/future/mod.rs:91:19
   4: risingwave_storage::table::streaming_table::state_table::StateTable<S>::batch_write_rows::{{closure}}
             at ./src/storage/src/table/streaming_table/state_table.rs:608:70
   5: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
             at /rustc/9067d5277d10f0f32a49ec9c125a33828e26a32b/library/core/src/future/mod.rs:91:19
   6: risingwave_storage::table::streaming_table::state_table::StateTable<S>::commit::{{closure}}
             at ./src/storage/src/table/streaming_table/state_table.rs:570:57
   7: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
             at /rustc/9067d5277d10f0f32a49ec9c125a33828e26a32b/library/core/src/future/mod.rs:91:19
   8: risingwave_stream::executor::hash_agg::HashAggExecutor<K,S>::flush_data::{{closure}}
             at ./src/stream/src/executor/hash_agg.rs:437:40
```
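
For readers unfamiliar with this kind of check, here is a minimal sketch of what an insert sanity check does: an insert is rejected if the key is already present, because the caller should have issued an update instead. `ToyStateTable` and its methods are hypothetical names made up for illustration and are not RisingWave's `StateTable` API. The real check panics, while this sketch returns an error so that it is easy to test.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a state table: maps a key to a row of values.
#[derive(Default)]
struct ToyStateTable {
    rows: BTreeMap<String, Vec<i64>>,
}

impl ToyStateTable {
    /// Insert a row. Fails if the key already exists, mirroring the
    /// "overwrites an existing key!" condition in the panic message.
    fn insert(&mut self, key: &str, value: Vec<i64>) -> Result<(), String> {
        if let Some(existing) = self.rows.get(key) {
            return Err(format!(
                "overwrites an existing key! key: {key}, \
                 value in storage: {existing:?}, value to write: {value:?}"
            ));
        }
        self.rows.insert(key.to_string(), value);
        Ok(())
    }

    /// Update a row. Fails if the key is missing.
    fn update(&mut self, key: &str, value: Vec<i64>) -> Result<(), String> {
        match self.rows.get_mut(key) {
            Some(slot) => {
                *slot = value;
                Ok(())
            }
            None => Err(format!("update of a missing key: {key}")),
        }
    }
}

fn main() {
    let mut table = ToyStateTable::default();
    table.insert("30", vec![0, 0]).unwrap();

    // A second insert for the same key trips the sanity check.
    let err = table.insert("30", vec![1, 1]).unwrap_err();
    println!("{err}");

    // The write the caller should have issued is an update.
    table.update("30", vec![1, 1]).unwrap();
}
```

In the panic above, the stored row holds zero counts and the row being written holds counts of one, which is the shape this check is designed to catch.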

To Reproduce

No response

Expected behavior

No response

Additional context

No response

@yuhao-su
Contributor

Might be caused by the same bug in #5891

@xiangjinwu xiangjinwu changed the title from "ene-to-end test (parallel, in-memory) fails with transport error" to "end-to-end test (parallel, in-memory) fails with transport error" Oct 18, 2022
@xxchan xxchan changed the title from "end-to-end test (parallel, in-memory) fails with transport error" to "end-to-end test (parallel, in-memory) fails with insert sanity check" Oct 18, 2022
@xxchan xxchan changed the title from "end-to-end test (parallel, in-memory) fails with insert sanity check" to "end-to-end test (parallel, in-memory) fails with insert sanity check in hashagg" Oct 18, 2022
@xxchan
Member

xxchan commented Oct 18, 2022

> Might be caused by the same bug in #5891

It's in hash agg, not hash join? 👀

@xxchan xxchan changed the title from "end-to-end test (parallel, in-memory) fails with insert sanity check in hashagg" to "bug(streaming): insert sanity check failed in hashagg" Oct 18, 2022
@chenzl25
Contributor

chenzl25 commented Oct 19, 2022

It might be caused by #5882. I will also take a look.

@BugenZhao
Member

Seems not resolved yet?

https://buildkite.com/risingwavelabs/main/builds/1982

@chenzl25
Contributor

> Seems not resolved yet?
>
> https://buildkite.com/risingwavelabs/main/builds/1982

It looks like this test ran before the fix.

@xxchan
Member

xxchan commented Oct 19, 2022

image

It's after the fix 😇

@chenzl25
Contributor

> image
>
> It's after the fix 😇

I mean that the PR branch is stale if it hadn't merged the latest main at that time.

@BugenZhao
Member

This is on the main branch.

@BugenZhao
Member

> > Might be caused by the same bug in #5891
>
> It's in hash agg, not hash join? 👀

It's possible there's a hash agg in the upstream. Note the error pattern is similar to #5913. 👀

@chenzl25
Contributor

chenzl25 commented Oct 19, 2022

> This is on the main branch.

I see, it needs more investigation. Let me take a look.

@BugenZhao
Member

BugenZhao commented Oct 24, 2022

Another occurrence: https://buildkite.com/risingwavelabs/pull-request/builds/10708#01840934-31bd-4e6b-946c-3e516c15d800 👀

```
thread 'risingwave-streaming-actor' panicked at 'overwrites an existing key!
table_id: 1187, vnode: 169, key: Row([None])
value in storage: Row([Some(Int64(0)), Some(Int64(0))])
value to write: Row([Some(Int64(1)), Some(Int64(1))])', /risingwave/src/storage/src/table/streaming_table/state_table.rs:651:13
stack backtrace:

...

*** async stack trace context of current task ***

Actor 1192: `SELECT ad_clicks.ad_id AS ad_id, CAST(ad_clicks.clicks_count AS NUMERIC) / ad_impressions.impressions_count AS ctr FROM (SELECT ad_impression.ad_id AS ad_id, COUNT(*) AS impressions_count FROM ad_impression GROUP BY ad_id) AS ad_impressions JOIN (SELECT ai.ad_id, COUNT(*) AS clicks_count FROM ad_click AS ac LEFT JOIN ad_impression AS ai ON ac.bid_id = ai.bid_id GROUP BY ai.ad_id) AS ad_clicks ON ad_impressions.ad_id = ad_clicks.ad_id` [5.252011895s]
  Epoch 3235312993894400 [!!! 3.556008053s]
    MaterializeExecutor 4A8000000CB (actor 1192, executor 203) [!!! 3.556008053s]
      ProjectExecutor 4A8000000CA (actor 1192, executor 202) [!!! 3.556008053s]
        HashJoinExecutor 4A8000000C8 (actor 1192, executor 200) [!!! 3.556008053s]
          hash_join_barrier_align [!!! 3.556008053s]
            ProjectExecutor 4A8000000C6 (actor 1192, executor 198) [!!! 3.556008053s]
  Subtask [!!! 5.252011895s]
  Subtask [!!! 5.248011886s]
    HashAggExecutor 4A8000000C4 (actor 1192, executor 196) [!!! 3.404007709s]  <== current
```

The query is from e2e_test/streaming/demo/ad_ctr.slt.
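
As an illustration only, the sketch below shows one way a count aggregation could end up with "value in storage" counts of 0 and "value to write" counts of 1 for a `NULL` group key such as `Row([None])`. It assumes a toy aggregator that picks insert or update from its in-memory cache alone and that leaves a zero-count row in storage when a group empties. `ToyCountAgg`, `Write`, `apply`, and `commit` are hypothetical names, and nothing here is taken from RisingWave's `HashAggExecutor`. The thread does not state the actual root cause, so treat this as a guess at the failure shape rather than a description of the bug.

```rust
use std::collections::HashMap;

/// The write a toy executor issues against storage for one group.
#[derive(Debug, PartialEq)]
enum Write {
    Insert(i64),
    Update(i64),
}

/// Hypothetical count aggregator keyed by a nullable group key.
struct ToyCountAgg {
    cache: HashMap<Option<String>, i64>,   // in-memory counts
    storage: HashMap<Option<String>, i64>, // persisted counts
}

impl ToyCountAgg {
    fn new() -> Self {
        Self { cache: HashMap::new(), storage: HashMap::new() }
    }

    /// Apply a delta and choose the write using the cache only (the assumed flaw).
    fn apply(&mut self, key: Option<String>, delta: i64) -> Write {
        let prev = self.cache.get(&key).copied();
        let next = prev.unwrap_or(0) + delta;
        if next == 0 {
            // The group is evicted from the cache, but its row stays in storage.
            self.cache.remove(&key);
        } else {
            self.cache.insert(key, next);
        }
        match prev {
            Some(_) => Write::Update(next),
            None => Write::Insert(next),
        }
    }

    /// Commit a write, rejecting inserts over an existing key.
    fn commit(&mut self, key: Option<String>, write: Write) -> Result<(), String> {
        match write {
            Write::Insert(v) => {
                if let Some(old) = self.storage.get(&key) {
                    return Err(format!(
                        "overwrites an existing key! key: {key:?}, \
                         value in storage: {old}, value to write: {v}"
                    ));
                }
                self.storage.insert(key, v);
            }
            Write::Update(v) => {
                self.storage.insert(key, v);
            }
        }
        Ok(())
    }
}

fn main() {
    let mut agg = ToyCountAgg::new();
    // The NULL group appears, disappears, then reappears.
    for delta in [1, -1, 1] {
        let write = agg.apply(None, delta);
        println!("{:?} -> {:?}", write, agg.commit(None, write_copy(&write)));
    }
}

/// Helper so `main` can print the write and still pass it to `commit`.
fn write_copy(w: &Write) -> Write {
    match w {
        Write::Insert(v) => Write::Insert(*v),
        Write::Update(v) => Write::Update(*v),
    }
}
```

In this toy model the third step issues an insert of 1 over a stored 0. A design that deletes the stored row when the count reaches zero, or that consults storage before choosing insert over update, would avoid the conflict.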

@chenzl25
Contributor

chenzl25 commented Oct 24, 2022

Thanks for this useful information. I figured out how to reproduce it and fixed it in #6007.

@chenzl25
Contributor

Closed by #6007.
