Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add script to quickly reproduce issue 14370 #14912

Closed
wants to merge 1 commit into from
Closed
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
21 changes: 21 additions & 0 deletions Makefile
Original file line number Diff line number Diff line change
Expand Up @@ -129,6 +129,27 @@ gofail-disable: install-gofail
install-gofail:
cd tools/mod; go install go.etcd.io/gofail@${GOFAIL_VERSION}

# Reproduce historical issues

.PHONY: reproduce-issue14370
reproduce-issue14370: ./bin/etcd-v3.5.4-failpoints
cp ./bin/etcd-v3.5.4-failpoints ./bin/etcd
GO_TEST_FLAGS='-v --run=TestLinearizability/Issue14370 --count 100 --failfast' make test-linearizability
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two comments:

  1. After we remove the package from the failpoint name, this will not work any more. Because you always checkout & build v3.5.4.
  2. Makefile seems not like a good place to just demonstrate the ability of linearizability test. Can we just add a script file under ./script/ directory? @ptabor @spzala WDYT?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ad 1. It will work as both gofail binary and libraries are from same version taken from main branch GOFAIL_VERSION.

Ad 2. I can agree that we shouldn't put everything into root scripts, however I'm against adding this as a script. Makefile ensures that we keep code simple, which we will loose when we migrate everything to bash scripts. See 700 line ./scripts/tests.sh.

I think we our makefile is still pretty simple and readable. Splitting things too early is also a bad pattern.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ad 1. It will work as both gofail binary and libraries are from same version taken from main branch GOFAIL_VERSION

Sooner or later, eventually we will bump new gofail version, and update all the existing failpoint names (e.g. beforeCommit --> backendBeforeCommit), and also update linearizability test (e.g. backend/beforeCommit --> backendBeforeCommit). Will the script in this PR be still working at that time?

FYI. I just created a tag v0.1.0 for gofail: https://github.com/etcd-io/gofail/releases/tag/v0.1.0

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with Benjamin:
- Makefile shouldn't cary project's history. It should contains up-to-date receipts to work with the current code.

We should answer what's the purpose of the code:
a) Just documentation - historical note how the problem was reproductionable in 3.5.4 ?
b) Potential regression test - that can be applied to up-to-date sources to make sure the similar scenario does not happens (or even can be used with bisect).

I prefer the latter.

I'm thinking about putting this code to /tests/repros that would lead to:

  • In /tests/repros/Issue14370.sh it would make sense to code only i--local mode to cover the current code.
  • We would need to make a new-tag v3.5.4-repro with cherry picked /tests/repros/Issue14370.sh to have a historical repro with all the toolkit compatible with v3.5.4 (assuming toolkit longterm will change and we want to keep it working)

Testing any revision since v3.5.4-repro forward is as simple as:

  git checkout `rev`
  /tests/repros/Issue14370.sh

And this mitigates risks at changes at HEAD to the new toolkit will break the script working for the older revisions.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Purpose of this code is to ensure that during development of linearizability testing, we don't loose ability to reproduce historical issues. We are testing 1to1 same scenarios on main branch, however due to nature of failure injection testing it's hard to guarantee those scenarios will maintain ability to reproduce those issues. Imagine that we have a issue that can be only reproduce under high volume of PUT requests, if during development of linearizability tests we change request time proportions we might no longer be able to reproduce this.

I use those scripts as a sanity checks during linearizability test development. Unfortunately we cannot automatically test this as issues don't have 100% reproducability as they are based on race condition. I hope that in future improvements to gofail will allow us to do that by using sleeps to always reproduce a race, but we are far from it.

In /tests/repros/Issue14370.sh it would make sense to code only i--local mode to cover the current code.

Those scenarios are already running on main branch. No need for local mode.

We would need to make a new-tag v3.5.4-repro with cherry picked /tests/repros/Issue14370.sh to have a historical repro with all the toolkit compatible with v3.5.4 (assuming toolkit longterm will change and we want to keep it working)

Really don't like idea of maintaining large number of repro only tags.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those scenarios are already running on main branch. No need for local mode.

Yes - I wrote local to stress it doesn't have a code to to git checkout of another branch in tmp dir

We would need to make a new-tag v3.5.4-repro with cherry picked /tests/repros/Issue14370.sh to have a historical repro with all the toolkit compatible with v3.5.4 (assuming toolkit longterm will change and we want to keep it working)

Really don't like idea of maintaining large number of repro only tags.

I hope this will be not a large number. It's worth doing only for big issues.

Thinking what's the reliable path to replay on 3.5.4 when the main will be 3.7.x and making sure all the scripts from 3.7.x work well both: on 3.5.4 and on 3.7.4 ?
And if we don't want the repro tag... the only reliable alternative I see is:

    git checkout v3.5.5 (the first version with the repro code)
    ./tests/repros/repro_at.sh Issue14370.sh -ref=v3.5.4

Where ./tests/repros/repro_at.sh have on orchestration of checkout of ref but finally calls Issues14370.sh from current branch on the tmp checked out code.


./bin/etcd-v3.5.4-failpoints:
rm -rf /tmp/etcd-release-v3.5.4/
mkdir -p /tmp/etcd-release-v3.5.4/
cd /tmp/etcd-release-v3.5.4/; \
git clone https://github.com/etcd-io/etcd.git .; \
git checkout v3.5.4; \
go get go.etcd.io/gofail@${GOFAIL_VERSION}; \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that lines L145-L149,L151 are generic code that could have it's 'makefile'

it's

make build-etcd-failpoints

(cd server; go get go.etcd.io/gofail@${GOFAIL_VERSION}); \
(cd etcdctl; go get go.etcd.io/gofail@${GOFAIL_VERSION}); \
(cd etcdutl; go get go.etcd.io/gofail@${GOFAIL_VERSION}); \
FAILPOINTS=true ./build;
mkdir -p ./bin
cp /tmp/etcd-release-v3.5.4/bin/etcd ./bin/etcd-v3.5.4-failpoints

# Cleanup

clean:
Expand Down