
Dealing with flakes so far #5281

Merged
32 commits merged into ElementsProject:master from guilt/no-flaky on Jun 27, 2022

Conversation

rustyrussell
Contributor

@rustyrussell rustyrussell commented May 22, 2022

This removes the flaky pytest plugin and fixes the resulting issues (so far).

Changelog-None

@rustyrussell rustyrussell added the testing and voodoo ("Suspect user, issue or code has ancient curse") labels May 22, 2022
@rustyrussell rustyrussell marked this pull request as draft May 23, 2022 00:40
@rustyrussell rustyrussell force-pushed the guilt/no-flaky branch 2 times, most recently from ddc9390 to 25eb4c8 Compare May 23, 2022 04:52
@rustyrussell rustyrussell force-pushed the guilt/no-flaky branch 12 times, most recently from b7205ed to 1390b0b Compare May 27, 2022 01:06
@rustyrussell rustyrussell changed the title Remove python-flaky, start dealing with flakes. Dealing with flakes so far Jun 5, 2022
@rustyrussell rustyrussell marked this pull request as ready for review June 5, 2022 03:04
@rustyrussell rustyrussell force-pushed the guilt/no-flaky branch 3 times, most recently from 83edeb9 to b16e118 Compare June 7, 2022 05:19
@rustyrussell rustyrussell force-pushed the guilt/no-flaky branch 4 times, most recently from 13605cd to a240f2b Compare June 17, 2022 00:47
…at_closes

Split harder!

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

```
        # This can timeout, so do it in four easy stages.
        for i in range(4):
>           bitcoind.generate_block(4032 // 4)

tests/test_closing.py:971: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
contrib/pyln-testing/pyln/testing/utils.py:475: in generate_block
    return self.rpc.generatetoaddress(numblocks, to_addr)
contrib/pyln-testing/pyln/testing/utils.py:372: in f
    res = proxy._call(name, *args)
/opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/bitcoin/rpc.py:233: in _call
    response = self._get_response()
/opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/bitcoin/rpc.py:263: in _get_response
    http_response = self.__conn.getresponse()
/opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/http/client.py:1373: in getresponse
    response.begin()
/opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/http/client.py:319: in begin
    version, status, reason = self._read_status()
/opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/http/client.py:280: in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <socket.SocketIO object at 0x7f1a4b3702d0>
b = <memory at 0x7f1a4ac51600>

    def readinto(self, b):
        """Read up to len(b) bytes into the writable buffer *b* and return
        the number of bytes read.  If the socket is non-blocking and no bytes
        are available, None is returned.
    
        If *b* is non-empty, a 0 return value indicates that the connection
        was shutdown at the other end.
        """
        self._checkClosed()
        self._checkReadable()
        if self._timeout_occurred:
            raise OSError("cannot read from timed out object")
        while True:
            try:
>               return self._sock.recv_into(b)
E               socket.timeout: timed out
```
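The shape of the split, as a sketch only (it reuses the `bitcoind.generate_block` helper from the traceback above; the batch size is an assumed value, chosen small enough for slow CI runners):

```
# "Split harder": keep each generatetoaddress call small so the
# python-bitcoinlib HTTP client never hits its socket timeout.
blocks = 4032
step = 100   # assumed batch size
for _ in range(blocks // step):
    bitcoind.generate_block(step)
if blocks % step:
    bitcoind.generate_block(blocks % step)
```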
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We can fail to use the larger channel if it's not ready yet:

```
2022-05-23T01:20:05.5325600Z         # Check it used the larger channel!
2022-05-23T01:20:05.5326376Z >       assert before[chan23a_idx]['to_us_msat'] == after[chan23a_idx]['to_us_msat']
2022-05-23T01:20:05.5326961Z E       assert 1000000000msat == 900000000msat
2022-05-23T01:20:05.5327240Z 
2022-05-23T01:20:05.5327621Z tests/test_connection.py:3896: AssertionError
```

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Under valgrind, at least, this test fails regularly, and this sleep
fixes it.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Two almost identical assertion fails under CI.  Try waiting.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Usually we won't see this, since the private channel is deleted.  But we could
have already read the private channel before that.  Handle it properly.

(Tested by removing the gossip_store deletion code and making sure
this worked).

We have to fix up the test, which announces a channel twice!

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
When upgrading a channel from private to public, we would delete the
private channel then add the public one.  However, this is visible in
the gossmap!  In particular, the node may be removed from the gossmap
and then re-added, so it may temporarily vanish!

I was seeing an occasional assertion inside node_factory.line_graph:

```
@pytest.mark.developer
    def test_reconnect_remote_sends_no_sigs(node_factory):
        """We re-announce, even when remote node doesn't send its announcement_signatures on reconnect.
        """
>       l1, l2 = node_factory.line_graph(2, wait_for_announce=True, opts={'may_reconnect': True})

tests/test_connection.py:870: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
contrib/pyln-testing/pyln/testing/utils.py:1467: in line_graph
    self.join_nodes(nodes, fundchannel, fundamount, wait_for_announce, announce_channels)
contrib/pyln-testing/pyln/testing/utils.py:1460: in join_nodes
    wait_for(lambda: 'alias' in only_one(end.rpc.listnodes(n.info['id'])['nodes']))
contrib/pyln-testing/pyln/testing/utils.py:88: in wait_for
    while not success():
contrib/pyln-testing/pyln/testing/utils.py:1460: in <lambda>
    wait_for(lambda: 'alias' in only_one(end.rpc.listnodes(n.info['id'])['nodes']))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

arr = []

    def only_one(arr):
        """Many JSON RPC calls return an array; often we only expect a single entry
        """
>       assert len(arr) == 1
E       AssertionError
```
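The fix here is on the gossip side, but for illustration only, a test-side wait that tolerates the node momentarily vanishing could look like this (a sketch built from the `wait_for` and `listnodes` helpers visible in the traceback; `l1`/`l2` are the `line_graph` nodes):

```
from pyln.testing.utils import wait_for

def node_has_alias(viewer, node_id):
    # Tolerate an empty listnodes result while the channel is being
    # re-announced, instead of asserting only_one() on whatever it
    # returns right now.
    nodes = viewer.rpc.listnodes(node_id)['nodes']
    return len(nodes) == 1 and 'alias' in nodes[0]

wait_for(lambda: node_has_alias(l2, l1.info['id']))
```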

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We assumed that if the outgoing HTLC has failed, we should always have
failed the incoming one.  This is not true, as this test case demonstrates:

```
lightningd-2: 2022-05-25T04:38:31.449Z DEBUG   035d2b1192dfba134e10e540875d366ebc8bc353d5aa766b80c090b39c3a5d885d-onchaind-chan#2: Sending 1 missing htlc messages
lightningd-2: 2022-05-25T04:38:31.449Z DEBUG   035d2b1192dfba134e10e540875d366ebc8bc353d5aa766b80c090b39c3a5d885d-chan#2: onchain_failed_our_htlc
lightningd-2: 2022-05-25T04:38:31.449Z DEBUG   035d2b1192dfba134e10e540875d366ebc8bc353d5aa766b80c090b39c3a5d885d-chan#2: HTLC id 0 failonion = 0x55cb8bc045d8, failmsg = (nil), preimage = (nil)
lightningd-2: 2022-05-25T04:38:31.449Z **BROKEN** 035d2b1192dfba134e10e540875d366ebc8bc353d5aa766b80c090b39c3a5d885d-chan#2: HTLC id 0 already complete, but ->in not resolved! failonion = 206816a391d3a7666ddd5213914cbb68f5da1fc4a0937e729de5a1990c94d26312caa5f2778e8da0c6bdefc68dd1a3bc28b5b5650fc0bdb3c2247ecca94ed0bbb224c8448c2c71eb1656a8740cadb301bd1ee1c1e774a8fef817352f502e4217b11e93aa6877b88b37afab0e4d4e49ed0385be9ab9a1ab1ac0e3460e41cfafb30ed896cea96e346041919a6c524ce56c3e5f27c7cd78a36b6df221e90a1c6e048c72b4146a5a51885fb70649037fe7ace77a016ae3ec8aee97960d0e5f0582713f671df79d8dee11b82708b6d882ee5adbb328db1938e824110f57ead1b27410bc6f775c7bb4ae40c1768d77a166c9bfda8f634ba0ac4f8a9fe199894dd3754c5ce41c9694544c805ffc177517661f11221dd8dffac60ce1c8c5bf54cda8e5ea44d8ec6b, failmsg = (null), preimage = (null)
lightningd-2: 2022-05-25T04:38:31.449Z **BROKEN** 035d2b1192dfba134e10e540875d366ebc8bc353d5aa766b80c090b39c3a5d885d-chan#2: MISSING incoming fail for 0: failing incoming now
lightningd-2: 2022-05-25T04:38:31.449Z **BROKEN** 0266e4598d1d3c415f572a8488830b60f7e744ed9235eb0b1ba93283b315c03518-chan#1: MISSED incoming fail for 0: failing now
lightningd-2: 2022-05-25T04:38:31.449Z DEBUG   0266e4598d1d3c415f572a8488830b60f7e744ed9235eb0b1ba93283b315c03518-chan#1: HTLC in 0 RCVD_ADD_ACK_REVOCATION->SENT_REMOVE_HTLC
```

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
…ailed HTLC

1. We set an outgoing htlc's `failonion` when we get a commitment_signed.
2. We don't transfer it to the corresponding incoming HTLC until we send
   commitment_signed and receive revoke_and_ack (meaning, outgoing htlc is
   completely dead).
3. If between these steps we go onchain, onchaind (after 3 blocks) tells us
   to fail the HTLC.
4. hout->failonion is set, but hout->hin has not been failed yet.  We
   do a sanity check and print a nasty message, and fail it with
   WIRE_PERMANENT_CHANNEL_FAILURE instead of relaying the error.

So handle this case explicitly.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We can hang up due to a ping while waiting for the channel to become active.
But we don't even need an active channel for this test, so simplify.

```
______________________________ test_ping_timeout _______________________________
[gw1] linux -- Python 3.7.13 /opt/hostedtoolcache/Python/3.7.13/x64/bin/python3

node_factory = <pyln.testing.utils.NodeFactory object at 0x7f132cca8a90>

    @pytest.mark.developer("dev-disconnect required")
    def test_ping_timeout(node_factory):
        # Disconnects after this, but doesn't know it.
        l1_disconnects = ['xWIRE_PING']
    
        l1, l2 = node_factory.line_graph(2, opts=[{'dev-no-reconnect': None,
                                                   'disconnect': l1_disconnects},
>                                                 {}])

tests/test_connection.py:3826: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
contrib/pyln-testing/pyln/testing/utils.py:1493: in line_graph
    self.join_nodes(nodes, fundchannel, fundamount, wait_for_announce, announce_channels)
contrib/pyln-testing/pyln/testing/utils.py:1470: in join_nodes
    nodes[i].wait_channel_active(scids[i])
contrib/pyln-testing/pyln/testing/utils.py:1003: in wait_channel_active
    wait_for(lambda: self.is_channel_active(chanid))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

success = <function LightningNode.wait_channel_active.<locals>.<lambda> at 0x7f132c3ef830>
timeout = 900

    def wait_for(success, timeout=TIMEOUT):
        start_time = time.time()
        interval = 0.25
        while not success():
            time_left = start_time + timeout - time.time()
            if time_left <= 0:
>               raise ValueError("Timeout while waiting for {}", success)
E               ValueError: ('Timeout while waiting for {}', <function LightningNode.wait_channel_active.<locals>.<lambda> at 0x7f132c3ef830>)
```
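A sketch of the simplification (illustrative only, using `node_factory.get_nodes` and `rpc.connect` as they appear elsewhere in this PR; the options mirror the excerpt above):

```
# No channel needed: the ping timeout only requires a connection, so
# skip line_graph()'s wait for an active channel.
l1, l2 = node_factory.get_nodes(2, opts=[{'dev-no-reconnect': None,
                                          'disconnect': ['xWIRE_PING']},
                                         {}])
l1.rpc.connect(l2.info['id'], 'localhost', l2.port)
```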
Some flakes are caused by weird races in this code.  Plus, if we
get things to write straight to files, we might see things in
there on post-mortem which happen after the python runner exits.

It's a bit less efficient, but much simpler.  Let's see if it helps!

Some tests need a rework now, since we don't get a failure (except
eventual timeout), but they're simpler.
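For illustration, writing a subprocess's output straight to a file looks roughly like this (a generic sketch, not the actual pyln-testing change; the command and path are hypothetical):

```
import subprocess

logpath = "/tmp/test-daemon.log"   # hypothetical location

# Write straight to the file: the OS buffers it, no reader thread is
# needed, and the log survives for post-mortem inspection even if the
# python runner exits first.
with open(logpath, "w") as logfile:
    proc = subprocess.Popen(["echo", "daemon output goes here"],  # stand-in command
                            stdout=logfile, stderr=subprocess.STDOUT)
    proc.wait()
```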

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Get it to log directly to stdout, so we see what's happening *as it
happens* rather than as we read it.  We could restore the thread we
were using before, but that added more problems than it solved.

This means that we need the hsm password prompts in the log though.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Valgrind in CI is slow:

```
    def test_bitcoin_failure(node_factory, bitcoind):
...
        # Ignore BROKEN log message about blocksonly mode.
        l2 = node_factory.get_node(start=False, expect_fail=True,
                                   allow_broken_log=True)
        l2.daemon.start(wait_for_initialized=False)
        # Will exit with failure code.
>       assert l2.daemon.wait() == 1

tests/test_misc.py:114: 
```

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We were waiting for the start to time out while waiting for the "public key"
message.  Instead, start manually.

Before, this took 97 seconds (with TIMEOUT=30); after, it took 8 seconds.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
It shouldn't return nonsense, but it did, and we segfaulted.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
It actually timed out with the default 60 seconds, just before it
saw the block:

```
2022-06-07T02:16:05.2557049Z         bitcoind.generate_block(1)
2022-06-07T02:16:05.2557300Z         sync_blockheight(bitcoind, [node])
2022-06-07T02:16:05.2557594Z         fut1.result(5)
2022-06-07T02:16:05.2557912Z         assert not fut2.done()
2022-06-07T02:16:05.2558121Z     
2022-06-07T02:16:05.2558370Z         # Trigger two blocks.
2022-06-07T02:16:05.2558689Z         bitcoind.generate_block(1)
2022-06-07T02:16:05.2558941Z         sync_blockheight(bitcoind, [node])
2022-06-07T02:16:05.2559219Z >       fut2.result(5)
2022-06-07T02:16:05.2559350Z 
2022-06-07T02:16:05.2559508Z tests/test_misc.py:2138: 
...
2022-06-07T02:16:05.2586947Z         elif "error" in resp:
2022-06-07T02:16:05.2587398Z >           raise RpcError(method, payload, resp['error'])
2022-06-07T02:16:05.2588026Z E           pyln.client.lightning.RpcError: RPC call failed: method: waitblockheight, payload: {'blockheight': 103}, error: {'code': 2000, 'message': 'Timed out.'}
2022-06-07T02:16:05.2588325Z 
2022-06-07T02:16:05.2588563Z contrib/pyln-client/pyln/client/lightning.py:387: RpcError
```
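The shape of a fix, as a sketch inside the existing test (assuming the `node` fixture from the excerpt, an `executor` fixture for the futures, and `TIMEOUT` from pyln.testing.utils): give waitblockheight an explicit timeout instead of relying on its 60-second default.

```
from pyln.testing.utils import TIMEOUT   # suite-wide timeout (900s above)

# Give the RPC itself room to wait, and the future at least as much.
fut2 = executor.submit(node.rpc.waitblockheight, 103, TIMEOUT)
fut2.result(TIMEOUT)
```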

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
```
>           raise ValueError(str(errors))
E           ValueError: 
E           Node errors:
E           Global errors:
E            - Node /tmp/ltests-x5sfpiwp/test_openchannel_hook_chaining_1/lightning-2/ has memory leaks: [
E               {
E                   "backtrace": [
E                       "ccan/ccan/tal/tal.c:442 (tal_alloc_)",
E                       "ccan/ccan/io/io.c:91 (io_new_conn_)",
E                       "lightningd/subd.c:773 (new_subd)",
E                       "lightningd/subd.c:827 (new_channel_subd_)",
E                       "lightningd/opening_control.c:870 (peer_start_openingd)",
E                       "lightningd/peer_control.c:1307 (peer_active)",
E                       "lightningd/connect_control.c:457 (connectd_msg)",
E                       "lightningd/subd.c:556 (sd_msg_read)",
E                       "lightningd/subd.c:357 (read_fds)",
E                       "ccan/ccan/io/io.c:59 (next_plan)",
E                       "ccan/ccan/io/io.c:407 (do_plan)",
E                       "ccan/ccan/io/io.c:417 (io_ready)",
E                       "ccan/ccan/io/poll.c:453 (io_loop)",
E                       "lightningd/io_loop_with_timers.c:22 (io_loop_with_timers)",
E                       "lightningd/lightningd.c:1181 (main)",
E                       "../csu/libc-start.c:308 (__libc_start_main)"
E                   ],
E                   "label": "ccan/ccan/io/io.c:91:struct io_conn",
E                   "parents": [
E                       "lightningd/lightningd.c:107:struct lightningd"
E                   ],
E                   "value": "0x2b5a898"
E               }
E           ]
```

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Make sure bitcoind gets the tx before mining blocks!

```
# l1<->l2 mutual close should work
        chan = l1.get_channel_scid(l2)
        l2.rpc.connect(l1.info['id'], 'localhost', l1.port)
        l1.rpc.close(chan)
        l2.daemon.wait_for_log('State changed from CLOSINGD_SIGEXCHANGE to CLOSINGD_COMPLETE')
        bitcoind.generate_block(2)
        sync_blockheight(bitcoind, [l1, l2])
>       l1.daemon.wait_for_log('Resolved FUNDING_TRANSACTION/FUNDING_OUTPUT by MUTUAL_CLOSE')
tests/test_closing.py:851:
```
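A sketch of the guard, using the `wait_for` helper from pyln.testing.utils and bitcoind's standard `getrawmempool` RPC around the quoted test lines:

```
from pyln.testing.utils import wait_for

l1.rpc.close(chan)
l2.daemon.wait_for_log('State changed from CLOSINGD_SIGEXCHANGE to CLOSINGD_COMPLETE')
# Don't mine until the mutual-close tx has actually hit the mempool.
wait_for(lambda: len(bitcoind.rpc.getrawmempool()) == 1)
bitcoind.generate_block(2)
```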

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
I saw another "only_one()" failure on alias checking: with the more
aggressive sending of our own gossip, it's not entirely clear to me
that we necessarily process it in order, so we might not have actually
seen all channels just because we saw the farthest one.
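For illustration, a wait that requires every channel (both directions) rather than inferring completeness from the farthest one; `expected_scids` is a hypothetical list of the short channel IDs the test set up:

```
from pyln.testing.utils import wait_for

# listchannels returns one entry per direction, hence the factor of two.
wait_for(lambda: len(l1.rpc.listchannels()['channels']) == 2 * len(expected_scids))
```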

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Looks like we woke one of the startup io_loops early, and thus we
thought we'd finished connectd_activate when we hadn't.  This
caused us to use an uninitialized ld->announceable array, and
finally caused an assertion failure in the main loop.

Make *every* loop assert that it was exited for the correct reason,
so if it happens again, we can maybe figure out what part of
the code to look at.

```
lightningd: lightningd/lightningd.c:1186: main: Assertion `io_loop_ret == ld' failed.
lightningd: FATAL SIGNAL 6 (version 4df66fa)
...
------------------------------- Valgrind errors --------------------------------
Valgrind error file: valgrind-errors.895509
==895509== Conditional jump or move depends on uninitialised value(s)
==895509==    at 0x22C58E: to_tal_hdr_or_null (tal.c:184)
==895509==    by 0x22D531: tal_bytelen (tal.c:637)
==895509==    by 0x1F10B6: towire_gossipd_init (gossipd_wiregen.c:100)
==895509==    by 0x13AC6E: gossip_init (gossip_control.c:254)
==895509==    by 0x1497EC: main (lightningd.c:1090)
==895509== 
```

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
I saw this once, but could not track it down.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Even though we generally wait until a node has seen the gossip,
that doesn't mean that connectd has processed it!  This means that when
we connect, it may still send us "old" gossip.

So we set the OPT_GOSSIP_QUERIES bit, which means don't send until we
ask.  But now it sends us WIRE_QUERY_CHANNEL_RANGE, so everyone needs
to filter that out.
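For illustration only: query_channel_range is wire type 263 (0x0107), so a test that collects raw gossip messages as hex strings could drop it like this (`msgs` is hypothetical):

```
# Ignore the peer's own query_channel_range (type 263 == 0x0107).
msgs = [m for m in msgs if not m.startswith('0107')]
```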

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Again, our new behaviour of sending our own gossip even before
they ask can confuse our gossip query tests.

Create a new node, attach it, and perform queries on it.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Again, our new behaviour of sending our own gossip even before
they ask can confuse our gossip query tests.

In this case, simply eliminate duplicates.
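A minimal sketch of the de-duplication, preserving message order (`msgs` is hypothetical):

```
# dict.fromkeys() keeps insertion order (Python 3.7+), so this drops
# repeats without reordering the gossip messages.
msgs = list(dict.fromkeys(msgs))
```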

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
There's a 1 in 256 chance that our signature on the transaction is 70,
not 71, bytes long.  This changes the feerate.  So fix up the weight in
this case to be the expected weight.

```
    @unittest.skipIf(TEST_NETWORK == 'liquid-regtest', "Fees on elements are different")
    @pytest.mark.developer("uses dev-fail")
    @pytest.mark.openchannel('v1')  # v2 the weight calculation is off by 3
    def test_multifunding_feerates(node_factory, bitcoind):
        '''
        Test feerate parameters for multifundchannel
        '''
        funding_tx_feerate = '10000perkw'
        commitment_tx_feerate_int = 2000
        commitment_tx_feerate = str(commitment_tx_feerate_int) + 'perkw'
        l1, l2, l3 = node_factory.get_nodes(3, opts={'log-level': 'debug'})
        l1.fundwallet(1 << 26)
        def _connect_str(node):
            return '{}@localhost:{}'.format(node.info['id'], node.port)
        destinations = [{"id": _connect_str(l2), 'amount': 50000}]
        res = l1.rpc.multifundchannel(destinations, feerate=funding_tx_feerate,
                                      commitment_feerate=commitment_tx_feerate)
        entry = bitcoind.rpc.getmempoolentry(res['txid'])
        weight = entry['weight']
        expected_fee = int(funding_tx_feerate[:-5]) * weight // 1000
>       assert expected_fee == entry['fees']['base'] * 10 ** 8
E       AssertionError: assert 7000 == (Decimal('0.00007010') * (10 ** 8))
tests/test_connection.py:1982: AssertionError
```
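A sketch of the weight fix-up inside the test, assuming a 70-byte signature makes the witness one weight unit lighter than the 71-byte case the fee was computed for (variables as in the excerpt above):

```
weight = entry['weight']
feerate = int(funding_tx_feerate[:-5])
actual_fee = entry['fees']['base'] * 10 ** 8
# 1-in-256 case: a 70-byte signature shaves one weight unit off the tx,
# so normalize back to the weight the fee was actually computed for.
if feerate * (weight + 1) // 1000 == actual_fee:
    weight += 1
expected_fee = feerate * weight // 1000
assert expected_fee == actual_fee
```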

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
@rustyrussell
Contributor Author

Rebase again, and two more gossip flakes...

@rustyrussell rustyrussell merged commit 9152b8c into ElementsProject:master Jun 27, 2022
@vincenzopalazzo
Collaborator

Good job! Thanks for doing that!
