consistency issues #587

Open
jerrytesting opened this issue Mar 27, 2023 · 11 comments
Labels
Bug (Confirmed to be a bug), transferred-from-raft

Comments

@jerrytesting

jerrytesting commented Mar 27, 2023

We found a consistency issue, a violation of "not read uncommitted", in the latest raft version. The bug log is here: https://github.com/jerrytesting/inconsistent-bugs-canonical-raft/tree/main/not_read_uncommitted. We hope it helps you debug.

@jerrytesting jerrytesting changed the title from "inconsistent bugs" to "Inconsistent bugs about read uncomitted data" Mar 27, 2023
@jerrytesting jerrytesting changed the title from "Inconsistent bugs about read uncomitted data" to "Inconsistent bugs about not read uncomitted" Mar 27, 2023
@jerrytesting jerrytesting changed the title from "Inconsistent bugs about not read uncomitted" to "consistency issues" Mar 27, 2023
@jerrytesting
Author

We also found a consistency issue violating "not read committed": https://github.com/jerrytesting/inconsistent-bugs-canonical-raft/tree/main/not_read_committed

@cole-miller
Contributor

Thanks for the reports! Are you able to share any information about the Jepsen setup that produced these results? We'd love to incorporate any changes into https://github.com/canonical/jepsen.dqlite that are needed to reproduce them in our CI.

@jerrytesting
Author

Sorry, our Jepsen setup is not open source yet, but it will be soon. For now, I can only share bug reports to help you figure out the issues.

@cole-miller
Contributor

@jerrytesting No problem, we appreciate your work on this. Does your setup include an "assertion checker", like the one in jepsen.dqlite?

@jerrytesting
Author

jerrytesting commented Mar 28, 2023

Yes, our setup includes the assertion checker, and we reported two assertion failures in ISSUE-291. We are still running tests to see whether we can capture other issues.

@MathieuBordere MathieuBordere self-assigned this Mar 28, 2023
@MathieuBordere
Contributor

MathieuBordere commented Mar 28, 2023

Hey @jerrytesting, do you have the app.log files for the nodes in not_read_committed, like you have for the other bug in the repo (although the app.log for n3 is missing there)?

@MathieuBordere
Contributor

Also, which version of raft are you using? Based on the log messages, you are not on v0.17.1, which contains quite a few bug fixes.

@MathieuBordere
Contributor

MathieuBordere commented Mar 28, 2023

What is also strange: in n1's app.log for not_read_uncommitted, the node starts around log line 88690 as a single node in a cluster and accepts requests from Jepsen (timestamp 2023/02/13 14:58:56.631530).

An example of an inconsistent read is key 55 around timestamp 2023-02-13 14:59:41. At that time n1 is accepting requests as the single member and leader of its own cluster, while n3 is the leader of another cluster with n2, n4 and n5. So you effectively have 2 clusters running.

Is it possible you are not setting up the cluster correctly in your tests?
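
The two-cluster situation described above can be checked mechanically once each node's leadership periods have been extracted from its app.log. Below is a minimal Python sketch, assuming the periods are already parsed into (node, start, end) tuples. The tuple layout, the function name `overlapping_leaders`, and the interval end times are illustrative assumptions, not the actual log format or data from the logs.

```python
from datetime import datetime
from itertools import combinations


def overlapping_leaders(intervals):
    """Return (node_a, node_b, overlap_start, overlap_end) for every pair of
    distinct nodes whose leadership intervals overlap.

    `intervals` is a list of (node, start, end) tuples (hypothetical layout).
    """
    overlaps = []
    for (n1, s1, e1), (n2, s2, e2) in combinations(intervals, 2):
        # Two half-open intervals overlap when each starts before the other ends.
        if n1 != n2 and s1 < e2 and s2 < e1:
            overlaps.append((n1, n2, max(s1, s2), min(e1, e2)))
    return overlaps


FMT = "%Y/%m/%d %H:%M:%S"

# Start of n1's interval is taken from the comment above; every other
# timestamp here is made up for illustration.
intervals = [
    ("n1", datetime.strptime("2023/02/13 14:58:56", FMT),
           datetime.strptime("2023/02/13 15:00:00", FMT)),
    ("n3", datetime.strptime("2023/02/13 14:59:00", FMT),
           datetime.strptime("2023/02/13 15:00:00", FMT)),
]

for a, b, start, end in overlapping_leaders(intervals):
    print(f"{a} and {b} both claim leadership from {start} to {end}")
```

Overlapping leadership in wall-clock time is not by itself a Raft violation, since a stale leader from an older term can briefly coexist with a newer one. The sketch only flags periods worth inspecting, such as a node serving requests as the sole member of its own cluster.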

@jerrytesting
Author

Hey, we are running v0.14.0. I don't think it is a setup issue, since we use your Jepsen harness.

We will upload the full logs soon (some logs are very large, which is why they are missing).

@nurturenature

Looking at the logs from a Jepsen test perspective, it's hard to make sense of what's happening.
My interpretation is that this is most likely a synthetic (i.e. test environment) failure.

Judging from the log messages, the jepsen.dqlite tests in use are quite old, at least 8 months.

The test framework mediator wrappers that were added to the core of Jepsen break nemesis behavior, impact, and intent.

jepsen.mediator.wrapper.MediatorGen is not behaving like a generator:

  • some nemesis operations are not invoke!d
  • operations are emitted duplicated and out of order
  • dispatches some net operations to mediator.wrapper net and some straight to Jepsen's net
  • repeats some single operation generators indefinitely
  • doesn't wrap :final-generator for healing

Running the current jepsen.dqlite with the same test opts did not reproduce the issue.

Note: Jepsen provides extension mechanisms, e.g. the net protocol, gen/map, etc., that can do what's visible in the logs without the need for a mediator.
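
Two of the generator symptoms listed above, duplicated and out-of-order operations, can be looked for in an exported history. Below is a minimal Python sketch, assuming the nemesis operations have been dumped as a list of dicts whose "f", "value" and "time" keys mirror the corresponding Jepsen op fields. The export format, the function name `audit_nemesis_ops`, and the sample data are assumptions for illustration. Treating an op identical to its predecessor as a duplicate is only a heuristic.

```python
def audit_nemesis_ops(ops):
    """Flag suspicious nemesis ops in emitted order.

    `ops` is a list of dicts with "f", "value" and "time" keys
    (hypothetical export of a Jepsen history).
    Returns a list of (position, problem) tuples.
    """
    problems = []
    for i in range(1, len(ops)):
        prev, cur = ops[i - 1], ops[i]
        # Timestamps should never go backwards in emitted order.
        if cur["time"] < prev["time"]:
            problems.append((i, "out-of-order"))
        # Heuristic: the same op repeated back to back looks like a duplicate.
        if cur["f"] == prev["f"] and cur.get("value") == prev.get("value"):
            problems.append((i, "duplicate"))
    return problems


# Made-up sample data, not taken from the logs in this issue.
sample_ops = [
    {"f": "start-partition", "value": "majority", "time": 10},
    {"f": "start-partition", "value": "majority", "time": 20},
    {"f": "stop-partition", "value": None, "time": 15},
]

for position, problem in audit_nemesis_ops(sample_ops):
    print(position, problem)
```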

@jerrytesting
Author

Yes, there was a bug in our tool: the stop-partition nemesis was not invoked and therefore never executed. Thanks for finding it. However, the other points you raised are not issues. Our tool only decides how to dispatch nemesis operations; the actual generators still come from Jepsen.

I think this consistency violation really exists but only occurs in an extreme environment, for example when a network partition is not healed for a long time.
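
Whether a run ended with a partition still in place, as in the missing stop-partition case above, can be read off the nemesis operations. Below is a minimal Python sketch, assuming the ops are available as dicts with "f" and "time" keys and that partitions are opened by an op named "start-partition". Both the export format and that op name are assumptions; only "stop-partition" is named in this thread.

```python
def last_unhealed_partition(ops):
    """Return the time of the last partition start that has no later
    stop-partition, or None if every partition was healed.

    `ops` is a list of dicts with "f" and "time" keys, in emitted order
    (hypothetical export of the nemesis history).
    """
    pending = None
    for op in ops:
        if op["f"] == "start-partition":   # assumed op name
            pending = op["time"]
        elif op["f"] == "stop-partition":  # heals any pending partition
            pending = None
    return pending


# Made-up sample data: the second partition is never healed.
sample_ops = [
    {"f": "start-partition", "time": 1},
    {"f": "stop-partition", "time": 2},
    {"f": "start-partition", "time": 3},
]

print(last_unhealed_partition(sample_ops))
```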

@MathieuBordere MathieuBordere added the Bug (Confirmed to be a bug) label Jun 12, 2023
@cole-miller cole-miller transferred this issue from canonical/raft Feb 13, 2024

4 participants