Skip to content
This repository has been archived by the owner on Aug 2, 2021. It is now read-only.

fix intermittent failures in incentives simulation tests #1899

Closed
mortelli opened this issue Oct 22, 2019 · 8 comments · Fixed by #1930
Closed

fix intermittent failures in incentives simulation tests #1899

mortelli opened this issue Oct 22, 2019 · 8 comments · Fixed by #1930

Comments

@mortelli
Copy link
Contributor

there is an occasional, hard-to-reproduce bug occurring in swap/simulations_tests.go.

it looks like this:

--- FAIL: TestPingPongChequeSimulation (1.94s)
    simulations_test.go:327: expected cumulative payout to be 412298906469954, but is 431932187730428

it's currently unknown whether this is a proper bug, or something wrong with the sim tests themselves. things are pointing towards the former.

tests known to be affected by this bug:

  • TestPingPongChequeSimulation
  • TestMultiChequeSimulation

high count values when running swap tests can sometimes trigger the bug, e.g.:

go test ./swap -count=30

hence the bug is suspected to have something to do with concurrency/synchronization.

the incentives-simulations-bug branch is available for repeating CI tests under PR #1893.

@mortelli
Copy link
Contributor Author

tagging @janos and @holisticode (feel free to edit the issue if you have additional relevant info or corrections)

@acud
Copy link
Member

acud commented Oct 23, 2019

@mortelli does this actually flake on travis or only when running locally with high test count? if it is the case of the latter, then it is unfortunate but i'd say not very critical. we have other tests with the same symptoms. does it also occur when you run it multiple times enclosed by a for loop in the shell? i.e. for i in {0..100}; do; go test -v -count 1 . -run TestPingPongChequeSimulation; va=$?; if [[ $va -ne 0 ]] ;then break;else;echo $i; fi; sleep 0.1s; done;

@janos
Copy link
Member

janos commented Oct 23, 2019

@acud Failures are more frequently occurring on Ubuntu 19.10 running as a guest host in VirtualBox. This is how I was able to reproduce them for debugging.

The only symptom that I managed to find is that there are stored balances with double amount of payout threshold only stored when the error happens. But I did not manage to fix the problem with that information.

@mortelli
Copy link
Contributor Author

mortelli commented Oct 23, 2019

@mortelli does this actually flake on travis or only when running locally with high test count?

both. i was actually made aware of this error through the AppVeyor failed builds, and then managed to reproduce it locally with a high test count. it blocked multiple PRs and this is why i looked into it in the first place.

does it also occur when you run it multiple times enclosed by a for loop in the shell? i.e. for i in {0..100}; do; go test -v -count 1 . -run TestPingPongChequeSimulation; va=$?; if [[ $va -ne 0 ]] ;then break;else;echo $i; fi; sleep 0.1s; done;

it does not.

@ralph-pichler
Copy link
Member

it looks like pending cheques introduced a new flaky error in the TestMultiChequeSimulation test. I can only reproduce it inside a vm so far:

=== RUN   TestMultiChequeSimulation
WARN [11-05|13:11:24.556|github.com/ethersphere/swarm/swap/swap.go:450]              cheque bounced                           swaplog=* base=a5126ccf56b605e7 tx=0e0000…947e6f
WARN [11-05|13:11:24.571|github.com/ethersphere/swarm/swap/swap.go:450]              cheque bounced                           swaplog=* base=a5126ccf56b605e7 tx=568d95…ff9a5c
WARN [11-05|13:11:24.581|github.com/ethersphere/swarm/swap/swap.go:450]              cheque bounced                           swaplog=* base=a5126ccf56b605e7 tx=37e346…6cc0f7
WARN [11-05|13:11:24.596|github.com/ethersphere/swarm/swap/swap.go:450]              cheque bounced                           swaplog=* base=a5126ccf56b605e7 tx=918b39…83cdd2
WARN [11-05|13:11:24.608|github.com/ethersphere/swarm/swap/swap.go:450]              cheque bounced                           swaplog=* base=a5126ccf56b605e7 tx=ce0e24…413c86
WARN [11-05|13:11:24.620|github.com/ethersphere/swarm/swap/swap.go:450]              cheque bounced                           swaplog=* base=a5126ccf56b605e7 tx=18146b…4133c6
WARN [11-05|13:11:24.621|github.com/ethersphere/swarm/swap/swap.go:353]              cheque sent by peer has already been received in the past swaplog=* base=d9b5d8bd215a7f98 peer=bb40ff67e0811dfa cumulativePayout=176699531344266
WARN [11-05|13:11:24.621|github.com/ethersphere/swarm/swap/swap.go:415]              ignoring confirm msg, no pending cheque  swaplog=* base=d9b5d8bd215a7f98 peer=4ac7527c49bc60c6 confirm message cheque="Contract: 6c60ee4660ee385f658345b6bed000b513757ff4 Beneficiary: 467bcb061b0199d906a306874500cd796d119ebf CumulativePayout: 176699531344266 Honey: 19633281260474"
ERROR[11-05|13:11:24.981|accounts/usbwallet/hub.go:170]                              Failed to enumerate USB devices          hub=trezor vendor=4617  failcount=3 err="failed to initialize libusb: libusb: unknown error [code -99]"
ERROR[11-05|13:11:24.981|accounts/usbwallet/hub.go:170]                              Failed to enumerate USB devices          hub=ledger vendor=11415 failcount=3 err="failed to initialize libusb: libusb: unknown error [code -99]"
ERROR[11-05|13:11:24.981|accounts/usbwallet/hub.go:170]                              Failed to enumerate USB devices          hub=trezor vendor=21324 failcount=3 err="failed to initialize libusb: libusb: unknown error [code -99]"
ERROR[11-05|13:11:44.622|github.com/ethersphere/swarm/p2p/protocols/protocol.go:249] dropping peer with DiscSubprotocolError  peer=4ac7527c49bc60c6 reason="hive stopping"
--- FAIL: TestMultiChequeSimulation (20.15s)
    simulations_test.go:430: timed out waiting for cheque to be processed

@ralph-pichler
Copy link
Member

Probably unrelated but I noticed that the tests recognise received cheques by the cashDone channel which is sent to by overriding the defaultCashCheque function but since Devcon defaultCashCheque is only called if (cheque.CumulativePayout - paidOut.Uint64()) > 2*transactionCosts. This appears to be always true in the sim tests for now but it means that cashDone is not a reliable indicator as to wether a cheque was received.

@ralph-pichler
Copy link
Member

The expected cumulative payout issue (in the ping pong test) seems to be caused by the receiving node receiving the test message too late. It receives and processes the cheque first which puts it in debt (because it hasn't accounted the profit from the message it receives yet). Since the cheque was processed the loop continues and sends another message. This causes the node to send a cheque twice the payment threshold. From that moment forward the cumulative payout is no longer what we expect after each iteration.

@ralph-pichler
Copy link
Member

ralph-pichler commented Nov 5, 2019

Here is some log output. (Don't worry about the ERROR loglevel, I just logged everything with it so I can easier separate the relevant logs from the rest). Since those are mostly custom logs I put in there for testing I also added some comments to the log.

// for loop iteration begins (expected is the value we would expect for the cumulative payout of the cheque sent after the message)
// it's p2Peers turn to send
ERROR[11-05|18:47:41.006|github.com/ethersphere/swarm/swap/simulations_test.go:313] iteration                                i=26 expected=274865937646636

// p2Peer sends a testMsgBigPrice, putting it into debt and causing a cheque
ERROR[11-05|18:47:41.006|github.com/ethersphere/swarm/swap/swap.go:313]             add balance adjustment                   swaplog=* base=69a563da59b58161 peer=f2bd3fa67882d3b5 from=0               to=-19633281260474
ERROR[11-05|18:47:41.006|github.com/ethersphere/swarm/swap/peer.go:210]             send balance adjustment                  swaplog=* base=69a563da59b58161 peer=f2bd3fa67882d3b5 from=-19633281260474 to=0
ERROR[11-05|18:47:41.006|github.com/ethersphere/swarm/swap/peer.go:212]             sending cheque to peer                   swaplog=* base=69a563da59b58161 peer=f2bd3fa67882d3b5 cheque="Contract: b6f7dfdb23e3e5863e27db10f6a731f23cc77c44 Beneficiary: 3461ffad9adad7d0714257cb56f938b4bc474f6c CumulativePayout: 274865937646636 Honey: 19633281260474"

// p1Peer receives the cheque and adjusts the balance
ERROR[11-05|18:47:41.006|github.com/ethersphere/swarm/swap/swap.go:352]             received cheque from peer                swaplog=* base=69a563da59b58161 peer=cf313b579ef0c053 cheque="Contract: b6f7dfdb23e3e5863e27db10f6a731f23cc77c44 Beneficiary: 3461ffad9adad7d0714257cb56f938b4bc474f6c CumulativePayout: 274865937646636 Honey: 19633281260474"
ERROR[11-05|18:47:41.006|github.com/ethersphere/swarm/swap/swap.go:379]             recv balance adjustment                  swaplog=* base=69a563da59b58161 peer=cf313b579ef0c053 from=0               to=-19633281260474

// the testMsgBigPrice arrives at p2Peer putting the balance back to 0
ERROR[11-05|18:47:41.007|github.com/ethersphere/swarm/swap/swap.go:313]             add balance adjustment                   swaplog=* base=69a563da59b58161 peer=cf313b579ef0c053 from=-19633281260474 to=0

// next iteration, this is where things start to go wrong
// it's p1Peers turn to send
ERROR[11-05|18:47:41.020|github.com/ethersphere/swarm/swap/simulations_test.go:313] iteration                                i=27 expected=274865937646636

// p1Peer sends a testMsgBigPrice, putting it into debt and causing a cheque
ERROR[11-05|18:47:41.020|github.com/ethersphere/swarm/swap/swap.go:313]             add balance adjustment                   swaplog=* base=69a563da59b58161 peer=cf313b579ef0c053 from=0               to=-19633281260474
ERROR[11-05|18:47:41.021|github.com/ethersphere/swarm/swap/peer.go:210]             send balance adjustment                  swaplog=* base=69a563da59b58161 peer=cf313b579ef0c053 from=-19633281260474 to=0
ERROR[11-05|18:47:41.021|github.com/ethersphere/swarm/swap/peer.go:212]             sending cheque to peer                   swaplog=* base=69a563da59b58161 peer=cf313b579ef0c053 cheque="Contract: c04c30e68db7fe333beab11882cb75f734fe8c53 Beneficiary: c805f3407d595cac89ad7928de38dc693af121aa CumulativePayout: 274865937646636 Honey: 19633281260474"

// p2Peer receives the cheque and adjusts the balance putting it into debt
ERROR[11-05|18:47:41.021|github.com/ethersphere/swarm/swap/swap.go:352]             received cheque from peer                swaplog=* base=69a563da59b58161 peer=f2bd3fa67882d3b5 cheque="Contract: c04c30e68db7fe333beab11882cb75f734fe8c53 Beneficiary: c805f3407d595cac89ad7928de38dc693af121aa CumulativePayout: 274865937646636 Honey: 19633281260474"
ERROR[11-05|18:47:41.021|github.com/ethersphere/swarm/swap/swap.go:379]             recv balance adjustment                  swaplog=* base=69a563da59b58161 peer=f2bd3fa67882d3b5 from=0               to=-19633281260474

// next iteration
// it's p2Peers turn to send
ERROR[11-05|18:47:41.030|github.com/ethersphere/swarm/swap/simulations_test.go:313] iteration                                i=28 expected=294499218907110

// p2Peer sends a testMsgBigPrice, putting it further into debt (the balance has not been reset yet because the other testMsgBigPrice has not arrived yet)
ERROR[11-05|18:47:41.030|github.com/ethersphere/swarm/swap/swap.go:313]             add balance adjustment                   swaplog=* base=69a563da59b58161 peer=f2bd3fa67882d3b5 from=-19633281260474 to=-39266562520948

// this causes p2Peer to send a cheque worth 39266562520948 honey (instead of the 19633281260474 we want)
ERROR[11-05|18:47:41.031|github.com/ethersphere/swarm/swap/peer.go:210]             send balance adjustment                  swaplog=* base=69a563da59b58161 peer=f2bd3fa67882d3b5 from=-39266562520948 to=0
ERROR[11-05|18:47:41.031|github.com/ethersphere/swarm/swap/peer.go:212]             sending cheque to peer                   swaplog=* base=69a563da59b58161 peer=f2bd3fa67882d3b5 cheque="Contract: b6f7dfdb23e3e5863e27db10f6a731f23cc77c44 Beneficiary: 3461ffad9adad7d0714257cb56f938b4bc474f6c CumulativePayout: 314132500167584 Honey: 39266562520948"

// the testMsgBigPrice from p2Peer arrives at p1Peer
ERROR[11-05|18:47:41.031|github.com/ethersphere/swarm/swap/swap.go:313]             add balance adjustment                   swaplog=* base=69a563da59b58161 peer=cf313b579ef0c053 from=0               to=19633281260474

// the cheque from p2Peer arrives at p1Peer
ERROR[11-05|18:47:41.031|github.com/ethersphere/swarm/swap/swap.go:352]             received cheque from peer                swaplog=* base=69a563da59b58161 peer=cf313b579ef0c053 cheque="Contract: b6f7dfdb23e3e5863e27db10f6a731f23cc77c44 Beneficiary: 3461ffad9adad7d0714257cb56f938b4bc474f6c CumulativePayout: 314132500167584 Honey: 39266562520948"
ERROR[11-05|18:47:41.031|github.com/ethersphere/swarm/swap/swap.go:379]             recv balance adjustment                  swaplog=* base=69a563da59b58161 peer=cf313b579ef0c053 from=19633281260474  to=-19633281260474

// finally the delayed testMsgBigPrice from p1Peer arrives at p2Peer
ERROR[11-05|18:47:41.031|github.com/ethersphere/swarm/swap/swap.go:313]             add balance adjustment                   swaplog=* base=69a563da59b58161 peer=f2bd3fa67882d3b5 from=0               to=19633281260474

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants