Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Itest: proper connect/disconnect miner #4589

Merged
merged 4 commits into from
Sep 25, 2020

Conversation

yyforyongyu
Copy link
Member

@yyforyongyu yyforyongyu commented Sep 3, 2020

This PR tries to make it easier to catch flakes in itest. It,

  • catches error when miner tears down.
  • caches error when chain backend cleans up.
  • disables chain backend retrying connection to the miner.

The last one is somewhat tricky. Before this fix, the btcd chain backend automatically retries connecting to the miner on several occasions,

  • during the test setup, when miner was not ready, it tried to connect and failed, then it would retry it in 5s, as shown in the log,
    2020-08-29 16:24:04.980 [DBG] CMGR: Retrying connection to 127.0.0.1:11160 (reqid 1) in 5s
  • in the test testOpenChannelAfterReorg, we manually disconnected the chain backend and the miner. Still, the chain backend retried in 5s and succeeded.
  • and other occasions, which can be found in the logs.

The miner ended with multiple connections with the chain backend, which makes debugging difficult.

And a recent change in btcd validates the not found message and disconnects the peer when needed.

When this InvTypeWitnessTx is receieved by the chainbackend, it's considered invalid and the miner will be disconnected. Currently, due to the permanent connection, the chainbackend will retry connection in 5 seconds. This is more clear in the following log, notice that the miner and chainbackend logs are combined and ordered by time,

# miner started processing the transaction.
2020-09-08 09:50:42.155 ["miner"] TXMP: Processing transaction 2d7ae05ad86da2585de25e9ab274dbc673088415e553133d974de7287419c003
2020-09-08 09:50:42.301 ["miner"] TXMP: Accepted transaction 2d7ae05ad86da2585de25e9ab274dbc673088415e553133d974de7287419c003 (pool size: 4)
2020-09-08 09:50:42.368 ["miner"] PEER: Sending inv (tx 2d7ae05ad86da2585de25e9ab274dbc673088415e553133d974de7287419c003) to 127.0.0.1:59724-"chainbackend" (inbound)

# miner was asked to generate 10 more blocks, this would clean the mempool.
2020-09-08 09:50:43.533 ["miner"] TXMP: Accepted transaction d0fc68e689ba5eb76a82b19fc3642605072af24cfe390bce40d9dc781924ff32 (pool size: 20)
2020-09-08 09:50:43.536 ["miner"] RPCS: Received command <generate> from 127.0.0.1:46164-"itest"
2020-09-08 09:50:43.537 ["miner"] MINR: Generating 10 blocks
2020-09-08 09:50:43.538 ["miner"] MINR: Considering 20 transactions for inclusion to new block

# the chainbackend tried to fetch the transaction, since it was not in the mempool, the miner would send
# a notfound message.
2020-09-08 09:50:44.318 ["chainbackend"] PEER: Received inv (tx 2d7ae05ad86da2585de25e9ab274dbc673088415e553133d974de7287419c003) from 127.0.0.1:23460-"miner" (outbound)
2020-09-08 09:50:44.348 ["miner"] MINR: Adding tx 2d7ae05ad86da2585de25e9ab274dbc673088415e553133d974de7287419c003 (priority 6166666666.67, feePerKB %!f(int64=6418918))
2020-09-08 09:50:44.430 ["chainbackend"] PEER: Sending getdata (witness tx 2d7ae05ad86da2585de25e9ab274dbc673088415e553133d974de7287419c003) to 127.0.0.1:23460-"miner" (outbound)
2020-09-08 09:50:45.214 ["miner"] PEER: Received getdata (witness tx 2d7ae05ad86da2585de25e9ab274dbc673088415e553133d974de7287419c003) from 127.0.0.1:59724-"chainbackend" (inbound)
2020-09-08 09:50:45.215 ["miner"] PEER: Unable to fetch tx 2d7ae05ad86da2585de25e9ab274dbc673088415e553133d974de7287419c003 from transaction pool: transaction is not in the pool
2020-09-08 09:50:45.217 ["miner"] PEER: Sending notfound (witness tx 2d7ae05ad86da2585de25e9ab274dbc673088415e553133d974de7287419c003) to 127.0.0.1:59724-"chainbackend" (inbound)
2020-09-08 09:50:45.224 ["chainbackend"] PEER: Received notfound (witness tx 2d7ae05ad86da2585de25e9ab274dbc673088415e553133d974de7287419c003) from 127.0.0.1:23460-"miner" (outbound)
2020-09-08 09:50:45.225 ["chainbackend"] PEER: Invalid inv type '1073741825' in notfound message from 127.0.0.1:23460-"miner" (outbound)
2020-09-08 09:50:45.226 ["chainbackend"] PEER: Disconnecting 127.0.0.1:23460-"miner" (outbound)
2020-09-08 09:50:45.226 ["chainbackend"] SYNC: Lost peer 127.0.0.1:23460-"miner" (outbound)

The ending 1073741825 is a type,

InvTypeWitnessTx     InvType = InvTypeTx | InvWitnessFlag

The above log happened during test setup, right after this line. After generating the blocks, we would wait for the balance to be synced. It wouldn't happen though, as the miner was disconnected and later the sync would time out.
Detailed logs can be found here.

Current itest won't pass, blocked by this fix in btcd btcsuite/btcd#1625

Related #4553, #4561 and #4563.

@halseth halseth self-requested a review September 3, 2020 11:53
@halseth halseth modified the milestone: 0.12.0 Sep 3, 2020
lntest/bitcoind.go Outdated Show resolved Hide resolved
lntest/bitcoind.go Outdated Show resolved Hide resolved
lntest/btcd.go Outdated Show resolved Hide resolved

// After shutting down the chain backend, we'll make a copy of
// the log file before deleting the temporary log dir.
logFile := logDir + "/" + netParams.Name + "/btcd.log"
err := CopyFile("./output_btcd_chainbackend.log", logFile)
if err != nil {
fmt.Printf("unable to copy file: %v\n", err)
errStr += fmt.Sprintf("unable to copy file: %v\n", err)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd add \n if errStr != ""

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. The newline is added above when the first error appears.

Copy link
Contributor

@halseth halseth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR, I definitely agree that having more control over the miner connection will ease debugging.

Left a few small comments. Could you add a more detailed explanation to the PR description how the linked btcd PR is needed for the tests to pass?

lntest/btcd.go Show resolved Hide resolved
lntest/btcd.go Show resolved Hide resolved
@@ -14458,6 +14460,14 @@ func TestLightningNetworkDaemon(t *testing.T) {
ht.Fatalf("unable to request transaction notifications: %v", err)
}

// Connect chainbackend to miner.
if chainBackend.Name() != "neutrino" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could instead make the neutrino ConnectMiner a noop return

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep that's cleaner, changed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still not changed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, my bad, now fixed😂

Copy link
Member Author

@yyforyongyu yyforyongyu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR, I definitely agree that having more control over the miner connection will ease debugging.

Left a few small comments. Could you add a more detailed explanation to the PR description how the linked btcd PR is needed for the tests to pass?

The description is updated, previous investigation can be found in this issue #4553.

lntest/btcd.go Show resolved Hide resolved
lntest/btcd.go Show resolved Hide resolved
@@ -14458,6 +14460,14 @@ func TestLightningNetworkDaemon(t *testing.T) {
ht.Fatalf("unable to request transaction notifications: %v", err)
}

// Connect chainbackend to miner.
if chainBackend.Name() != "neutrino" {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep that's cleaner, changed.

@@ -69,4 +69,4 @@ endif
# Construct the integration test command with the added build flags.
ITEST_TAGS := $(DEV_TAGS) $(RPC_TAGS) rpctest $(backend)

ITEST := rm lntest/itest/*.log; date; $(GOTEST) -v ./lntest/itest -tags="$(ITEST_TAGS)" $(TEST_FLAGS) -logoutput -goroutinedump
ITEST := rm lntest/itest/*.log; rm -r lntest/itest/.minerlogs; rm -r lntest/itest/.backendlogs; date; $(GOTEST) -v ./lntest/itest -tags="$(ITEST_TAGS)" $(TEST_FLAGS) -logoutput -goroutinedump
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this really needed? Won't they be deleted on shutdown, or alternatively overwritten on next run?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have doubts too. Normally it would be fine until there are .gz files, which won't be cleaned if a sequence of itest fails consecutively.
It might be more related to a different PR that focuses on fixing .gz files?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I would make that a separate PR if it is a problem.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now removed.

@halseth
Copy link
Contributor

halseth commented Sep 9, 2020

Thanks for the PR, I definitely agree that having more control over the miner connection will ease debugging.
Left a few small comments. Could you add a more detailed explanation to the PR description how the linked btcd PR is needed for the tests to pass?

The description is updated, previous investigation can be found in this issue #4553.

Thanks for the detailed explanation, it makes a lot of sense now 👍 We should update our btcd dependency to check whether that will get rid of some of the flakes, but when that is in I think this PR is more or less ready :)

@yyforyongyu yyforyongyu force-pushed the itest-proper-cleanup branch 2 times, most recently from c90653e to 2812dc2 Compare September 10, 2020 12:13
@cfromknecht
Copy link
Contributor

@yyforyongyu changes look great! can you add a commit that pulls in the btcd dep so we can get a clean run of the itests?

Previously the chainbackend connected to the miner using a permanent
connection, which would retry the connection when it’s disconnected.
It would leave multiple connections between the chainbackend and the
miner, causing difficulties in debugging. Currently, there are two
occasions when disconnection happens. One happens when running the open
channel reorg test, the miner is connected/disconnected multiple times
on purpose. The other happens when the chainbackend receives a
MSG_WITNESS_TX type from the miner, which would be considered as an
invalid type and disconnects the miner. With the latter one being fixed
in btcd, the chainbackend will still be disconnected from the miner if
it reaches the ban score by requesting too many notfound messages in a
short time which is why the `--nobanning` flag is added.
@yyforyongyu
Copy link
Member Author

@yyforyongyu changes look great! can you add a commit that pulls in the btcd dep so we can get a clean run of the itests?

The btcd version is bumped to github.com/btcsuite/btcd v0.20.1-beta.0.20200903105316-61634447e719.

Initially, I was using github.com/btcsuite/btcd v0.20.1-beta.0.20200908235819-9ef973c282e6, which caused a failure in testSweepAllCoins when running with bitcoind backend.

--- FAIL: TestLightningNetworkDaemon (31.06s)
    lnd_test.go:14526: Running 66 integration tests
    --- FAIL: TestLightningNetworkDaemon/sweep_coins (2.81s)
        lnd_test.go:112: Failed: (sweep coins): exited with error: 
            *errors.errorString unable to sweep coins: rpc error: code = Unknown desc = unable to broadcast sweep transaction: unmatched backend error: -1: sendrawtransaction "hexstring" ( maxfeerate )

Specifically, in this line.

I'm suspecting the change in this commit is causing the problem of using sendrawtransaction in bitcoind. From my observation, reverting the function makeParams to the old version would make the error disappear in lnd's itest.

It's unclear if the bug is in lnd, btcwallet, or btcd. I will dig into it later, currently, the btcd version in this commit seems to have problems with the windows machine which I'm still investigating.

Copy link
Contributor

@cfromknecht cfromknecht left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 💫

@yyforyongyu do you still want to investigate the windows build? i restarted travis and it passed first time. otherwise i think this is ready to merge

@yyforyongyu
Copy link
Member Author

LGTM 💫

@yyforyongyu do you still want to investigate the windows build? i restarted travis and it passed first time. otherwise i think this is ready to merge

Yes, I'm still investigating the test flake. Don't know if it's the best approach, but I've run the itest 10 times for each backend, a total of 30 times. The tests are still flaky, and I'm still investigating. That being said, this PR was meant to fix the miners so feel free to merge. I will open a new PR if I could find anything interesting.

@cfromknecht cfromknecht requested a review from halseth September 24, 2020 19:09
Copy link
Contributor

@halseth halseth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great change, thanks again! 🌟

@yyforyongyu yyforyongyu deleted the itest-proper-cleanup branch January 6, 2021 00:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants