Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error After change_membership: assertion failed: value > prev #584

Closed
fredfortier opened this issue Oct 28, 2022 · 5 comments · Fixed by #585
Closed

Error After change_membership: assertion failed: value > prev #584

fredfortier opened this issue Oct 28, 2022 · 5 comments · Fixed by #585
Assignees
Labels
A-replication Area: replication protocol C-bug Category: bug

Comments

@fredfortier
Copy link

Sorry this isn't well isolated into a unit test, but can you make sense of this error:

{"timestamp":" 289.320893297s","level":"DEBUG","message":"append_entries resp: Success","target":"openraft::replication","span":{"name":"send_append_entries"},"spans":[{"id":"0","name":"Node"},{"name":"init_raft"},{"cluster":"ddx-epoch-0","id":"0","name":"new"},{"cluster":"ddx-epoch-0","id":"0","name":"RaftCore"},{"id":"0","target":"1","name":"replication"},{"cluster":"ddx-epoch-0","target":"1","vote":"vote:1-0","name":"main"},{"name":"line_rate_loop"},{"name":"send_append_entries"}],"threadId":"ThreadId(4)"}
{"timestamp":" 289.321038838s","level":"DEBUG","message":"update_matched","self.max_possible_matched_index":418,"self.matched":"None","new_matched":"Some(LogId { leader_id: LeaderId { term: 1, node_id: 0 }, index: 349 })","target":"openraft::replication","span":{"name":"send_append_entries"},"spans":[{"id":"0","name":"Node"},{"name":"init_raft"},{"cluster":"ddx-epoch-0","id":"0","name":"new"},{"cluster":"ddx-epoch-0","id":"0","name":"RaftCore"},{"id":"0","target":"1","name":"replication"},{"cluster":"ddx-epoch-0","target":"1","vote":"vote:1-0","name":"main"},{"name":"line_rate_loop"},{"name":"send_append_entries"}],"threadId":"ThreadId(4)"}
{"timestamp":" 289.321200359s","level":"DEBUG","message":"matched updated","target":"1","matched":"Some(LogId { leader_id: LeaderId { term: 1, node_id: 0 }, index: 349 })","target":"openraft::replication","span":{"name":"send_append_entries"},"spans":[{"id":"0","name":"Node"},{"name":"init_raft"},{"cluster":"ddx-epoch-0","id":"0","name":"new"},{"cluster":"ddx-epoch-0","id":"0","name":"RaftCore"},{"id":"0","target":"1","name":"replication"},{"cluster":"ddx-epoch-0","target":"1","vote":"vote:1-0","name":"main"},{"name":"line_rate_loop"},{"name":"send_append_entries"}],"threadId":"ThreadId(4)"}
{"timestamp":" 289.321383408s","level":"DEBUG","message":"close","time.busy":"2.66ms","time.idle":"8.28ms","target":"openraft::replication","span":{"name":"send_append_entries"},"spans":[{"id":"0","name":"Node"},{"name":"init_raft"},{"cluster":"ddx-epoch-0","id":"0","name":"new"},{"cluster":"ddx-epoch-0","id":"0","name":"RaftCore"},{"id":"0","target":"1","name":"replication"},{"cluster":"ddx-epoch-0","target":"1","vote":"vote:1-0","name":"main"},{"name":"line_rate_loop"}],"threadId":"ThreadId(4)"}
{"timestamp":" 289.321538169s","level":"DEBUG","message":"replication res","target":"1","res":"Ok(())","target":"openraft::replication","span":{"name":"line_rate_loop"},"spans":[{"id":"0","name":"Node"},{"name":"init_raft"},{"cluster":"ddx-epoch-0","id":"0","name":"new"},{"cluster":"ddx-epoch-0","id":"0","name":"RaftCore"},{"id":"0","target":"1","name":"replication"},{"cluster":"ddx-epoch-0","target":"1","vote":"vote:1-0","name":"main"},{"name":"line_rate_loop"}],"threadId":"ThreadId(4)"}
{"timestamp":" 289.321678224s","level":"DEBUG","message":"current matched: Some(LogId { leader_id: LeaderId { term: 1, node_id: 0 }, index: 349 }) max_possible_matched_index: Some(418)","target":"openraft::replication","span":{"name":"line_rate_loop"},"spans":[{"id":"0","name":"Node"},{"name":"init_raft"},{"cluster":"ddx-epoch-0","id":"0","name":"new"},{"cluster":"ddx-epoch-0","id":"0","name":"RaftCore"},{"id":"0","target":"1","name":"replication"},{"cluster":"ddx-epoch-0","target":"1","vote":"vote:1-0","name":"main"},{"name":"line_rate_loop"}],"threadId":"ThreadId(4)"}
{"timestamp":" 289.321878102s","level":"DEBUG","message":"check_consecutive","last_purged":"Some(LogId { leader_id: LeaderId { term: 1, node_id: 0 }, index: 299 })","self.max_possible_matched_index":"Some(418)","target":"openraft::replication","span":{"name":"send_append_entries"},"spans":[{"id":"0","name":"Node"},{"name":"init_raft"},{"cluster":"ddx-epoch-0","id":"0","name":"new"},{"cluster":"ddx-epoch-0","id":"0","name":"RaftCore"},{"id":"0","target":"1","name":"replication"},{"cluster":"ddx-epoch-0","target":"1","vote":"vote:1-0","name":"main"},{"name":"line_rate_loop"},{"name":"send_append_entries"}],"threadId":"ThreadId(4)"}
{"timestamp":" 289.322047737s","level":"DEBUG","message":"load entries","self.matched":"Some(LogId { leader_id: LeaderId { term: 1, node_id: 0 }, index: 349 })","self.max_possible_matched_index":"Some(418)","last_purged":"Some(LogId { leader_id: LeaderId { term: 1, node_id: 0 }, index: 299 })","prev_index":"Some(381)","end":419,"target":"openraft::replication","span":{"name":"send_append_entries"},"spans":[{"id":"0","name":"Node"},{"name":"init_raft"},{"cluster":"ddx-epoch-0","id":"0","name":"new"},{"cluster":"ddx-epoch-0","id":"0","name":"RaftCore"},{"id":"0","target":"1","name":"replication"},{"cluster":"ddx-epoch-0","target":"1","vote":"vote:1-0","name":"main"},{"name":"line_rate_loop"},{"name":"send_append_entries"}],"threadId":"ThreadId(4)"}
{"timestamp":" 289.322382185s","level":"DEBUG","message":"start sending append_entries, timeout: 100","payload":"vote=vote:1-0, prev_log_id=1-0-381, leader_commit=1-0-417, entries=1-0-382:normal ... 1-0-418:membership: members:[{0:{OperatorNode { url: \"http://localhost:11000/\" }}},{0:{OperatorNode { url: \"http://localhost:11000/\" }},1:{OperatorNode { url: \"http://localhost:11001/\" }}}],learners:[]","target":"openraft::replication","span":{"name":"send_append_entries"},"spans":[{"id":"0","name":"Node"},{"name":"init_raft"},{"cluster":"ddx-epoch-0","id":"0","name":"new"},{"cluster":"ddx-epoch-0","id":"0","name":"RaftCore"},{"id":"0","target":"1","name":"replication"},{"cluster":"ddx-epoch-0","target":"1","vote":"vote:1-0","name":"main"},{"name":"line_rate_loop"},{"name":"send_append_entries"}],"threadId":"ThreadId(4)"}

{"timestamp":" 289.323796724s","level":"DEBUG","message":"recv from rx_api: UpdateMatchIndex: target: 1, result: Ok(LogId { leader_id: LeaderId { term: 1, node_id: 0 }, index: 349 }), server_state_vote: vote:1-0, membership_log_id: 1-0-418","target":"openraft::core::raft_core","span":{"id":"0","state":"Leader","name":"handle_api_msg"},"spans":[{"id":"0","name":"Node"},{"name":"init_raft"},{"cluster":"ddx-epoch-0","id":"0","name":"new"},{"cluster":"ddx-epoch-0","id":"0","name":"RaftCore"},{"name":"main"},{"id":"0","name":"runtime_loop"},{"id":"0","state":"Leader","name":"handle_api_msg"}],"threadId":"ThreadId(3)"}
{"timestamp":" 289.324032213s","level":"DEBUG","message":"handle_update_matched","target":"1","result":"Ok(LogId { leader_id: LeaderId { term: 1, node_id: 0 }, index: 349 })","target":"openraft::core::raft_core","span":{"name":"handle_update_matched"},"spans":[{"id":"0","name":"Node"},{"name":"init_raft"},{"cluster":"ddx-epoch-0","id":"0","name":"new"},{"cluster":"ddx-epoch-0","id":"0","name":"RaftCore"},{"name":"main"},{"id":"0","name":"runtime_loop"},{"id":"0","state":"Leader","name":"handle_api_msg"},{"name":"handle_update_matched"}],"threadId":"ThreadId(3)"}
{"timestamp":" 289.324265909s","level":"DEBUG","message":"update_progress: node_id:1 log_id:Some(LogId { leader_id: LeaderId { term: 1, node_id: 0 }, index: 349 })","target":"openraft::engine::engine_impl","span":{"name":"update_progress"},"spans":[{"id":"0","name":"Node"},{"name":"init_raft"},{"cluster":"ddx-epoch-0","id":"0","name":"new"},{"cluster":"ddx-epoch-0","id":"0","name":"RaftCore"},{"name":"main"},{"id":"0","name":"runtime_loop"},{"id":"0","state":"Leader","name":"handle_api_msg"},{"name":"handle_update_matched"},{"name":"update_progress"}],"threadId":"ThreadId(3)"}
{"timestamp":" 289.324459389s","level":"DEBUG","message":"leader progress","progress":"VecProgress { quorum_set: Joint { data: [[0], [0, 1]], _p: PhantomData }, granted: Some(LogId { leader_id: LeaderId { term: 1, node_id: 0 }, index: 417 }), voter_count: 2, vector: [(0, Some(LogId { leader_id: LeaderId { term: 1, node_id: 0 }, index: 418 })), (1, Some(LogId { leader_id: LeaderId { term: 1, node_id: 0 }, index: 417 }))], stat: Stat { update_count: 430, move_count: 421, is_quorum_count: 421 } }","target":"openraft::engine::engine_impl","span":{"name":"update_progress"},"spans":[{"id":"0","name":"Node"},{"name":"init_raft"},{"cluster":"ddx-epoch-0","id":"0","name":"new"},{"cluster":"ddx-epoch-0","id":"0","name":"RaftCore"},{"name":"main"},{"id":"0","name":"runtime_loop"},{"id":"0","state":"Leader","name":"handle_api_msg"},{"name":"handle_update_matched"},{"name":"update_progress"}],"threadId":"ThreadId(3)"}
thread 'tokio-runtime-worker' panicked at 'assertion failed: value > prev', /root/.cargo/git/checkouts/openraft-63cdbcc3b9ffe82f/a314bdd/openraft/src/progress/mod.rs:207:9
stack backtrace:
   0: rust_begin_unwind
             at /rustc/285fa7ecd05dcbfdaf2faaf20400f5f92b39b3c6/library/std/src/panicking.rs:584:5
   1: core::panicking::panic_fmt
             at /rustc/285fa7ecd05dcbfdaf2faaf20400f5f92b39b3c6/library/core/src/panicking.rs:143:14
   2: core::panicking::panic
             at /rustc/285fa7ecd05dcbfdaf2faaf20400f5f92b39b3c6/library/core/src/panicking.rs:48:5
   3: <openraft::progress::VecProgress<ID,V,QS> as openraft::progress::Progress<ID,V,QS>>::update
             at /root/.cargo/git/checkouts/openraft-63cdbcc3b9ffe82f/a314bdd/openraft/src/progress/mod.rs:207:9
   4: openraft::engine::engine_impl::Engine<NID,N>::update_progress
             at /root/.cargo/git/checkouts/openraft-63cdbcc3b9ffe82f/a314bdd/openraft/src/engine/engine_impl.rs:696:23
   5: openraft::core::raft_core::RaftCore<C,N,S>::handle_update_matched::{{closure}}::{{closure}}
             at /root/.cargo/git/checkouts/openraft-63cdbcc3b9ffe82f/a314bdd/openraft/src/core/raft_core.rs:1413:9
   6: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
             at /rustc/285fa7ecd05dcbfdaf2faaf20400f5f92b39b3c6/library/core/src/future/mod.rs:91:19
   7: <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
             at /root/.cargo/registry/src/github.com-1ecc6299db9ec823/tracing-0.1.37/src/instrument.rs:272:9
   8: openraft::core::raft_core::RaftCore<C,N,S>::handle_update_matched::{{closure}}
             at /root/.cargo/git/checkouts/openraft-63cdbcc3b9ffe82f/a314bdd/openraft/src/core/raft_core.rs:1384:5
   9: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
             at /rustc/285fa7ecd05dcbfdaf2faaf20400f5f92b39b3c6/library/core/src/future/mod.rs:91:19
  10: openraft::core::raft_core::RaftCore<C,N,S>::handle_api_msg::{{closure}}::{{closure}}
             at /root/.cargo/git/checkouts/openraft-63cdbcc3b9ffe82f/a314bdd/openraft/src/core/raft_core.rs:1367:67
  11: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
             at /rustc/285fa7ecd05dcbfdaf2faaf20400f5f92b39b3c6/library/core/src/future/mod.rs:91:19
  12: <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
             at /root/.cargo/registry/src/github.com-1ecc6299db9ec823/tracing-0.1.37/src/instrument.rs:272:9

This seems to happen following a change_membership, promoting a learner and keeping the same leader. This started after upgrading to the latest master recently. Any pointers would be appreciated to inform my own investigation.

@github-actions
Copy link

👋 Thanks for opening this issue!

Get help or engage by:

  • /help : to print help messages.
  • /assignme : to assign this issue to you.

@drmingdrmer drmingdrmer self-assigned this Oct 28, 2022
@drmingdrmer
Copy link
Member

drmingdrmer commented Oct 28, 2022

Thanks for reporting this bug!

This error said that it tried to update the replication state to a former state.

I will need some more time to figure out what was wrong. My guess is either of the following:

  • Snapshot replication updated the replication state, but the line-rate replication was unaware.
  • When new nodes are added, the replication tasks are re-spawned. Thus it forgot the latest already reported replication state.

Do you have a complete log file about this error? I believe the former logs have recorded what happened.

@drmingdrmer drmingdrmer added C-bug Category: bug A-replication Area: replication protocol labels Oct 28, 2022
@drmingdrmer
Copy link
Member

I can reproduce it in a unit test, it's the second reason. Let me fix it.

drmingdrmer added a commit to drmingdrmer/openraft that referenced this issue Oct 28, 2022
Error After change_membership: `assertion failed: value > prev`,
when changing membership by converting a learner to a voter.

Because the replication streams are re-spawned, thus progress reverts to
zero. Then a reverted progress causes the panic.
drmingdrmer added a commit to drmingdrmer/openraft that referenced this issue Oct 28, 2022
…atabendlabs#584

Problem:

Error After change_membership: `assertion failed: value > prev`,
when changing membership by converting a learner to a voter.

Because the replication streams are re-spawned, thus progress reverts to
zero. Then a reverted progress causes the panic.

Solution:

When re-spawning replications, remember the previous progress.

- Fix: databendlabs#584
@fredfortier
Copy link
Author

Thanks for the stellar support as usual! I see you’ve added related commits so I’ll retry. I’m happy to provide a longer log if the issue persists.

drmingdrmer added a commit that referenced this issue Oct 29, 2022
Fix: Error After change_membership: `assertion failed: value > prev`: #584
@drmingdrmer
Copy link
Member

I’m happy to provide a longer log if the issue persists.

It does not look necessary:) I can 100% reproduce this bug.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-replication Area: replication protocol C-bug Category: bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants