
xud is unable to shutdown gracefully #1668

Closed
ghost opened this issue Jun 16, 2020 · 43 comments

@ghost

ghost commented Jun 16, 2020

Attached are xud logs from the start of the application run that resulted in memory leaks:
xud-not-able-to-gracefully-shutdown.log

I suspect it's because we're forgetting to clean up some listeners after a swap failure/timeout.
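
For illustration, here is a minimal sketch of the kind of leak being suspected (hypothetical event and function names, not actual xud code): a per-swap listener that is only removed on success, so a failed or timed-out swap leaves it attached for the lifetime of the process.

```
import { EventEmitter } from 'events';

// Hypothetical swap event bus, purely for illustration.
const swapEvents = new EventEmitter();

const watchSwap = (swapId: string) => {
  const onPaid = (id: string) => {
    if (id === swapId) {
      // cleaned up on the happy path...
      swapEvents.removeListener('swap.paid', onPaid);
    }
  };
  swapEvents.on('swap.paid', onPaid);

  setTimeout(() => {
    // ...but on failure/timeout nothing removes the listener, so it (and
    // anything it closes over) stays registered until the process exits.
    // The fix would be: swapEvents.removeListener('swap.paid', onPaid);
  }, 30000);
};
```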

@ghost ghost added the bug Something isn't working label Jun 16, 2020
@ghost ghost self-assigned this Jun 16, 2020
@ghost ghost added the P3 low priority label Jun 16, 2020
@ghost
Author

ghost commented Jun 17, 2020

I also noticed that in our security tests Alice is unable to shut down gracefully:

=== RUN   TestSecurity
2020/06/17 07:49:42 xud: creating network
2020/06/17 07:49:45 ltcd: launching node...
2020/06/17 07:49:45 ltcd: 200 blocks generated
2020/06/17 07:49:45 lnd-ltc: launching network...
2020/06/17 07:49:53 btcd: launching node...
2020/06/17 07:49:54 btcd: 200 blocks generated
2020/06/17 07:49:54 lnd-btc: launching network...
2020/06/17 07:49:59 xud: launching network...
2020/06/17 07:50:01 Running 4 security tests
=== RUN   TestSecurity/network_initialization
=== RUN   TestSecurity/taker_stalling_on_swapAccepted
&{0xc0006fb260 <nil>} is<nil> e.errorder_id:"284e73f0-b06f-11ea-9dc1-4b0fea2b3766" pair_id:"LTC/BTC" quantity:1000000 peer_pub_key:"02c72be41e310427b60e726d10e9704b252c089da9cde76350888c5c8db696ff29" failure_reason:"DealTimedOut"  swapFailure=== RUN   TestSecurity/maker_stalling_after_1st_htlc
=== RUN   TestSecurity/taker_stalling_after_swap_succeeded
2020/06/17 07:50:58 lnd-btc: network harness teared down
2020/06/17 07:50:58 btcd: harness teared down
2020/06/17 07:50:58 lnd-ltc: network harness teared down
2020/06/17 07:50:58 ltcd: harness teared down
2020/06/17 07:50:58 xud network harness teared down
--- PASS: TestSecurity (76.47s)
    --- PASS: TestSecurity/network_initialization (0.13s)
    --- PASS: TestSecurity/taker_stalling_on_swapAccepted (11.21s)
    --- PASS: TestSecurity/maker_stalling_after_1st_htlc (36.26s)
    --- PASS: TestSecurity/taker_stalling_after_swap_succeeded (1.58s)
=== RUN   TestSecurityUnsettledChannels
2020/06/17 07:50:58 xud: creating network
2020/06/17 07:50:58 ltcd: launching node...
2020/06/17 07:50:59 ltcd: 200 blocks generated
2020/06/17 07:50:59 lnd-ltc: launching network...
2020/06/17 07:51:07 btcd: launching node...
2020/06/17 07:51:07 btcd: 200 blocks generated
2020/06/17 07:51:07 lnd-btc: launching network...
2020/06/17 07:51:13 xud: launching network...
2020/06/17 07:51:15 Running 4 unsettled-channels security tests
=== RUN   TestSecurityUnsettledChannels/network_initialization
=== RUN   TestSecurityUnsettledChannels/maker_shutdown_after_1st_htlc
=== RUN   TestSecurityUnsettledChannels/taker_stalling_after_2nd_htlc
=== RUN   TestSecurityUnsettledChannels/taker_shutdown_after_2nd_htlc
2020/06/17 07:56:04 xud process (13-Alice) did not shutdown gracefully. process (6759) killed
2020/06/17 07:58:56 lnd-btc: network harness teared down
2020/06/17 07:58:56 btcd: harness teared down
2020/06/17 07:58:56 lnd-ltc: network harness teared down
2020/06/17 07:58:56 ltcd: harness teared down
2020/06/17 07:58:56 xud network harness teared down
--- PASS: TestSecurityUnsettledChannels (478.01s)
    --- PASS: TestSecurityUnsettledChannels/network_initialization (0.14s)
    --- PASS: TestSecurityUnsettledChannels/maker_shutdown_after_1st_htlc (108.08s)
    --- PASS: TestSecurityUnsettledChannels/taker_stalling_after_2nd_htlc (170.64s)
    --- PASS: TestSecurityUnsettledChannels/taker_shutdown_after_2nd_htlc (180.98s)
PASS
ok      github.com/ExchangeUnion/xud-simulation 554.514s

@kilrau
Contributor

kilrau commented Jun 17, 2020

In another non-graceful shutdown (which was eventually killed by the 3-minute Docker timeout) I observed xud continuing to connect to peers:

17/06/2020 08:31:44.886 [GLOBAL] info: XUD is shutting down
17/06/2020 08:31:44.888 [LND-BTC] info: new status: Disconnected
17/06/2020 08:31:44.889 [LND-LTC] info: new status: Disconnected
17/06/2020 08:31:44.890 [CONNEXT] info: new status: Disconnected
17/06/2020 08:31:44.893 [HTTP] info: http server has closed
17/06/2020 08:31:44.895 [P2P] debug: Peer 025443023c0303eb5671522cda0d49acb0e74c77f1371bf86262bad637835eb9e4 (LaundryBrush): closing socket. reason: Shutdown
17/06/2020 08:31:44.897 [LND-BTC] error: failed to update total outbound capacity: Error: 2 UNKNOWN: Channel Destroyed
    at Object.exports.createStatusError (/app/node_modules/grpc/src/common.js:91:15)
    at Object.onReceiveStatus (/app/node_modules/grpc/src/client_interceptors.js:1209:28)
    at InterceptingListener._callNext (/app/node_modules/grpc/src/client_interceptors.js:568:42)
    at InterceptingListener.onReceiveStatus (/app/node_modules/grpc/src/client_interceptors.js:618:8)
    at callback (/app/node_modules/grpc/src/client_interceptors.js:847:24)
17/06/2020 08:31:44.898 [LND-LTC] error: failed to update total outbound capacity: Error: 2 UNKNOWN: Channel Destroyed
    at Object.exports.createStatusError (/app/node_modules/grpc/src/common.js:91:15)
    at Object.onReceiveStatus (/app/node_modules/grpc/src/client_interceptors.js:1209:28)
    at InterceptingListener._callNext (/app/node_modules/grpc/src/client_interceptors.js:568:42)
    at InterceptingListener.onReceiveStatus (/app/node_modules/grpc/src/client_interceptors.js:618:8)
    at callback (/app/node_modules/grpc/src/client_interceptors.js:847:24)
17/06/2020 08:31:44.899 [RPC] info: gRPC server completed shutdown
17/06/2020 08:31:44.901 [ORDERBOOK] debug: removed all orders for peer 025443023c0303eb5671522cda0d49acb0e74c77f1371bf86262bad637835eb9e4 (LaundryBrush)
17/06/2020 08:31:44.905 [GLOBAL] info: XUD shutdown gracefully
17/06/2020 08:31:44.906 [P2P] info: Peer 025443023c0303eb5671522cda0d49acb0e74c77f1371bf86262bad637835eb9e4 (LaundryBrush) socket closed
17/06/2020 08:31:45.060 [P2P] debug: Connection attempt #1 to 032f570ca0fa0f94564fa7d911f185b509357cbc06258946900eb107e78aaab251@evnyc7w6lyqlh4735g2yssvw4pw4rl75ubtphpyf3mwx2zhshd57p4id.onion:28885 failed: Socks5 proxy rejected connection - Failure. retrying in 5 sec...
17/06/2020 08:31:50.318 [P2P] debug: Connection attempt #2 to 032f570ca0fa0f94564fa7d911f185b509357cbc06258946900eb107e78aaab251@evnyc7w6lyqlh4735g2yssvw4pw4rl75ubtphpyf3mwx2zhshd57p4id.onion:28885 failed: Socks5 proxy rejected connection - Failure. retrying in 10 sec...
17/06/2020 08:32:00.591 [P2P] debug: Connection attempt #3 to 032f570ca0fa0f94564fa7d911f185b509357cbc06258946900eb107e78aaab251@evnyc7w6lyqlh4735g2yssvw4pw4rl75ubtphpyf3mwx2zhshd57p4id.onion:28885 failed: Socks5 proxy rejected connection - Failure. retrying in 20 sec...
17/06/2020 08:32:23.865 [P2P] debug: Connection attempt #4 to 032f570ca0fa0f94564fa7d911f185b509357cbc06258946900eb107e78aaab251@evnyc7w6lyqlh4735g2yssvw4pw4rl75ubtphpyf3mwx2zhshd57p4id.onion:28885 failed: Socks5 proxy rejected connection - HostUnreachable. retrying in 40 sec...
17/06/2020 08:33:04.363 [P2P] debug: Connection attempt #5 to 032f570ca0fa0f94564fa7d911f185b509357cbc06258946900eb107e78aaab251@evnyc7w6lyqlh4735g2yssvw4pw4rl75ubtphpyf3mwx2zhshd57p4id.onion:28885 failed: Socks5 proxy rejected connection - HostUnreachable. retrying in 80 sec...
17/06/2020 08:34:24.706 [P2P] debug: Connection attempt #6 to 032f570ca0fa0f94564fa7d911f185b509357cbc06258946900eb107e78aaab251@evnyc7w6lyqlh4735g2yssvw4pw4rl75ubtphpyf3mwx2zhshd57p4id.onion:28885 failed: Socks5 proxy rejected connection - HostUnreachable. retrying in 160 sec...

@kilrau kilrau added P2 mid priority and removed P3 low priority labels Jun 17, 2020
@kilrau
Contributor

kilrau commented Jun 17, 2020

Saw the same today:

=== RUN   TestSecurityUnsettledChannels/taker_shutdown_after_2nd_htlc
2020/06/17 10:29:53 xud process (25-Alice) did not shutdown gracefully. process (6333) killed

That definitely increases Travis runtime.

@raladev
Contributor

raladev commented Jul 9, 2020

This can be reproduced easily with the combined setup and a buy ETH/BTC trade:

  1. Perform the combined setup.
  2. Send a buy ETH/BTC order -> the swap fails (because of current Connext client behaviour).
  3. Shut down xud -> it fails.

@kilrau kilrau assigned sangaman and unassigned ghost Jul 9, 2020
@kilrau kilrau added P3 low priority P2 mid priority and removed P2 mid priority P3 low priority labels Jul 9, 2020
@kilrau kilrau added P1 top priority and removed P2 mid priority labels Aug 3, 2020
@rsercano
Collaborator

I would like to take a look at this, if that's okay with you guys.

@kilrau kilrau assigned rsercano and unassigned sangaman Aug 14, 2020
@rsercano
Collaborator

I tried a few scenarios (stopping/starting simnet while xud is running, not unlocking, not waiting for full initialization) but I couldn't reproduce this. Could you please explain step by step how to reproduce it?

➜  xud git:(master) ✗ bin/xud
16/08/2020 08:55:14.253 [GLOBAL] info: config file loaded
16/08/2020 08:55:14.258 [GLOBAL] info: CUSTOM-XUD
16/08/2020 08:55:14.274 [RPC] info: gRPC server listening on localhost:28886
16/08/2020 08:55:14.370 [DB] info: connected to database /Users/sercanozdemir/.xud/xud-simnet.db
16/08/2020 08:55:14.395 [LND-LTC] debug: loaded tls cert from /Users/sercanozdemir/lndltc/tls.cert
16/08/2020 08:55:14.396 [LND-BTC] debug: loaded tls cert from /Users/sercanozdemir/lndbtc/tls.cert
16/08/2020 08:55:14.399 [LND-LTC] debug: loaded macaroon from /Users/sercanozdemir/lndltc/data/chain/litecoin/simnet/admin.macaroon
16/08/2020 08:55:14.399 [LND-LTC] info: new status: Initialized
16/08/2020 08:55:14.401 [LND-LTC] info: trying to verify connection to lnd at localhost:31009
16/08/2020 08:55:14.403 [LND-BTC] debug: loaded macaroon from /Users/sercanozdemir/lndbtc/data/chain/bitcoin/simnet/admin.macaroon
16/08/2020 08:55:14.404 [LND-BTC] info: new status: Initialized
16/08/2020 08:55:14.404 [LND-BTC] info: trying to verify connection to lnd at localhost:30009
16/08/2020 08:55:14.433 [LND-BTC] debug: pubkey is 0316725962eaea39eb70286eb0597f626af28091dff8944ec81b5dd23e8b2f55f2
16/08/2020 08:55:14.433 [LND-BTC] debug: uris are 0316725962eaea39eb70286eb0597f626af28091dff8944ec81b5dd23e8b2f55f2@hknsjxzibu53ecdstnjn4mqexv7agucdktm4ddra2quqi7sa2jqqv2yd.onion:29735
16/08/2020 08:55:14.434 [LND-BTC] debug: chain is bitcoin-simnet
16/08/2020 08:55:14.437 [LND-LTC] debug: pubkey is 0316725962eaea39eb70286eb0597f626af28091dff8944ec81b5dd23e8b2f55f2
16/08/2020 08:55:14.438 [LND-LTC] debug: uris are 0316725962eaea39eb70286eb0597f626af28091dff8944ec81b5dd23e8b2f55f2@ymjxg7hvxjy5beaq4hdr2242uzzwnjchzryoklyzdqgf6k4ayatqzhad.onion:30735
16/08/2020 08:55:14.438 [LND-LTC] debug: chain is litecoin-simnet
16/08/2020 08:55:14.449 [LND-LTC] debug: new channel maximum outbound capacity: 490000000
16/08/2020 08:55:14.449 [LND-LTC] debug: new channel inbound capacity: 489963800
16/08/2020 08:55:14.450 [LND-LTC] debug: new channel total outbound capacity: 490000000
16/08/2020 08:55:14.450 [LND-LTC] info: new status: ConnectionVerified
16/08/2020 08:55:14.451 [LND-BTC] debug: new channel maximum outbound capacity: 245000000
16/08/2020 08:55:14.452 [LND-BTC] debug: new channel inbound capacity: 244990950
16/08/2020 08:55:14.452 [LND-BTC] debug: new channel total outbound capacity: 245000000
16/08/2020 08:55:14.452 [LND-BTC] info: new status: ConnectionVerified
16/08/2020 08:55:14.453 [GLOBAL] info: Node key is encrypted, unlock with 'xucli unlock', 'xucli create', or 'xucli restore'
^C16/08/2020 08:55:18.775 [GLOBAL] info: XUD is shutting down
16/08/2020 08:55:18.775 [LND-BTC] info: new status: Disconnected
16/08/2020 08:55:18.776 [LND-LTC] info: new status: Disconnected
16/08/2020 08:55:18.776 [CONNEXT] error: cannot set status to Disconnected from NotInitialized
16/08/2020 08:55:18.777 [CONNEXT] error: cannot set status to Disconnected from NotInitialized
16/08/2020 08:55:18.777 [RPC] info: gRPC server completed shutdown
16/08/2020 08:55:18.778 [GLOBAL] info: XUD shutdown gracefully
➜  xud git:(master) ✗ 

@ghost
Author

ghost commented Aug 16, 2020

> I tried a few scenarios (stopping/starting simnet while xud is running, not unlocking, not waiting for full initialization) but I couldn't reproduce this. Could you please explain step by step how to reproduce it?

➜  xud git:(master) ✗ bin/xud
16/08/2020 08:55:14.253 [GLOBAL] info: config file loaded
16/08/2020 08:55:14.258 [GLOBAL] info: CUSTOM-XUD
16/08/2020 08:55:14.274 [RPC] info: gRPC server listening on localhost:28886
16/08/2020 08:55:14.370 [DB] info: connected to database /Users/sercanozdemir/.xud/xud-simnet.db
16/08/2020 08:55:14.395 [LND-LTC] debug: loaded tls cert from /Users/sercanozdemir/lndltc/tls.cert
16/08/2020 08:55:14.396 [LND-BTC] debug: loaded tls cert from /Users/sercanozdemir/lndbtc/tls.cert
16/08/2020 08:55:14.399 [LND-LTC] debug: loaded macaroon from /Users/sercanozdemir/lndltc/data/chain/litecoin/simnet/admin.macaroon
16/08/2020 08:55:14.399 [LND-LTC] info: new status: Initialized
16/08/2020 08:55:14.401 [LND-LTC] info: trying to verify connection to lnd at localhost:31009
16/08/2020 08:55:14.403 [LND-BTC] debug: loaded macaroon from /Users/sercanozdemir/lndbtc/data/chain/bitcoin/simnet/admin.macaroon
16/08/2020 08:55:14.404 [LND-BTC] info: new status: Initialized
16/08/2020 08:55:14.404 [LND-BTC] info: trying to verify connection to lnd at localhost:30009
16/08/2020 08:55:14.433 [LND-BTC] debug: pubkey is 0316725962eaea39eb70286eb0597f626af28091dff8944ec81b5dd23e8b2f55f2
16/08/2020 08:55:14.433 [LND-BTC] debug: uris are 0316725962eaea39eb70286eb0597f626af28091dff8944ec81b5dd23e8b2f55f2@hknsjxzibu53ecdstnjn4mqexv7agucdktm4ddra2quqi7sa2jqqv2yd.onion:29735
16/08/2020 08:55:14.434 [LND-BTC] debug: chain is bitcoin-simnet
16/08/2020 08:55:14.437 [LND-LTC] debug: pubkey is 0316725962eaea39eb70286eb0597f626af28091dff8944ec81b5dd23e8b2f55f2
16/08/2020 08:55:14.438 [LND-LTC] debug: uris are 0316725962eaea39eb70286eb0597f626af28091dff8944ec81b5dd23e8b2f55f2@ymjxg7hvxjy5beaq4hdr2242uzzwnjchzryoklyzdqgf6k4ayatqzhad.onion:30735
16/08/2020 08:55:14.438 [LND-LTC] debug: chain is litecoin-simnet
16/08/2020 08:55:14.449 [LND-LTC] debug: new channel maximum outbound capacity: 490000000
16/08/2020 08:55:14.449 [LND-LTC] debug: new channel inbound capacity: 489963800
16/08/2020 08:55:14.450 [LND-LTC] debug: new channel total outbound capacity: 490000000
16/08/2020 08:55:14.450 [LND-LTC] info: new status: ConnectionVerified
16/08/2020 08:55:14.451 [LND-BTC] debug: new channel maximum outbound capacity: 245000000
16/08/2020 08:55:14.452 [LND-BTC] debug: new channel inbound capacity: 244990950
16/08/2020 08:55:14.452 [LND-BTC] debug: new channel total outbound capacity: 245000000
16/08/2020 08:55:14.452 [LND-BTC] info: new status: ConnectionVerified
16/08/2020 08:55:14.453 [GLOBAL] info: Node key is encrypted, unlock with 'xucli unlock', 'xucli create', or 'xucli restore'
^C16/08/2020 08:55:18.775 [GLOBAL] info: XUD is shutting down
16/08/2020 08:55:18.775 [LND-BTC] info: new status: Disconnected
16/08/2020 08:55:18.776 [LND-LTC] info: new status: Disconnected
16/08/2020 08:55:18.776 [CONNEXT] error: cannot set status to Disconnected from NotInitialized
16/08/2020 08:55:18.777 [CONNEXT] error: cannot set status to Disconnected from NotInitialized
16/08/2020 08:55:18.777 [RPC] info: gRPC server completed shutdown
16/08/2020 08:55:18.778 [GLOBAL] info: XUD shutdown gracefully
➜  xud git:(master) ✗ 

Try a swap that fails/times out.

@rsercano
Collaborator

Okay, at first glance it looks like the lnd gRPC client hangs when a swap fails/times out, or if we shut down xud during a swap (as soon as the swap starts, for example). Xud hangs in these situations because of a hung socket (stuck in the ESTABLISHED state).

At first I tried to find a way to listen for the gRPC status and implement complicated logic to shut down gracefully, but in the end we're already shutting down gracefully; the only problem is a TCP socket stuck in ESTABLISHED that nobody is actually listening to.

So I suggest a very simple solution: can we call process.exit() at the end of our shutdown method? I tried it and it worked like a charm for me! Let me know if there are any downsides I haven't thought of.
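
A minimal sketch of what that could look like, using hypothetical component names rather than the actual xud shutdown code:

```
// Sketch: a shutdown routine that closes everything and then forces the exit.
const shutdown = async (components: { name: string; close: () => Promise<void> }[]) => {
  console.log('shutting down');
  for (const component of components) {
    await component.close();
    console.log(`${component.name} closed`);
  }
  console.log('shutdown complete');

  // Even after closing everything, a lingering handle (such as an ESTABLISHED
  // gRPC socket) can keep the Node.js event loop alive, so force the exit.
  process.exit(0);
};

process.on('SIGINT', () => {
  void shutdown([
    { name: 'p2p pool', close: async () => { /* pool.disconnect() */ } },
    { name: 'grpc server', close: async () => { /* rpcServer.close() */ } },
  ]);
});
```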

@ghost
Author

ghost commented Aug 17, 2020

> At first I tried to find a way to listen for the gRPC status and implement complicated logic to shut down gracefully, but in the end we're already shutting down gracefully; the only problem is a TCP socket stuck in ESTABLISHED that nobody is actually listening to.

Can you link to the code you believe is preventing the shutdown? Perhaps we can clean up the listener?

@rsercano
Collaborator

rsercano commented Aug 18, 2020

> At first I tried to find a way to listen for the gRPC status and implement complicated logic to shut down gracefully, but in the end we're already shutting down gracefully; the only problem is a TCP socket stuck in ESTABLISHED that nobody is actually listening to.
>
> Can you link to the code you believe is preventing the shutdown? Perhaps we can clean up the listener?

This actually happens on the underlying socket. When we call

this.lightning.close();

we think everything is okay (and it should be), but somehow gRPC fails to move the outgoing TCP connection to CLOSED and it hangs in ESTABLISHED if there's a failed or ongoing swap.

You can try it yourself by getting xud into the stuck state and checking which ports are in use by node. On macOS:

sudo lsof -i -n -P | grep TCP

You'll see that a random local port is ESTABLISHED with 30009 (in my case, lndbtc). As soon as you stop lndbtc, xud shuts down too.
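
Another quick way to see what is still holding the process open is Node's undocumented process._getActiveHandles() (a diagnostic hack for debugging only, not part of xud):

```
// Diagnostic hack: on SIGUSR2, dump the handles still keeping the event loop alive.
process.on('SIGUSR2', () => {
  // _getActiveHandles() is undocumented but has long existed in Node.js.
  const handles: any[] = (process as any)._getActiveHandles();
  for (const handle of handles) {
    const remote = handle.remoteAddress ? ` -> ${handle.remoteAddress}:${handle.remotePort}` : '';
    console.log(`${handle.constructor.name}${remote}`);
  }
});
```

Sending SIGUSR2 to the stuck process (kill -USR2 <pid>) would then list the lingering sockets.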

@rsercano
Collaborator

What do you guys think?

@rsercano
Collaborator

rsercano commented Sep 3, 2020

That wasn't the case in my own trials, so there seems to be another case involving P2P that prevents shutdown. Is there a way to reproduce it easily?

@kilrau
Contributor

kilrau commented Sep 3, 2020

Start an older xud with tons of previously connected peers (ideally an old simnet xud), then shut down immediately.

@rsercano
Collaborator

rsercano commented Sep 3, 2020

Unfortunately I don't understand what you mean by "older xud" with the current code base. Can you give me an environment where I can reproduce this and see what's happening?

@kilrau
Contributor

kilrau commented Sep 3, 2020

Sure, my simnet environment is in that state. I bet if you keep your environment online for some days while @raladev does his testing with tons of new simnet environments, you'll get many dead peers like I did above, and problems shutting down too.

@rsercano
Collaborator

rsercano commented Sep 3, 2020

Oh gotcha, will keep my xud open :)

@rsercano
Collaborator

After a few days my xud failed to shut down, but there were no logs at all after the graceful-shutdown log message. Just the open ports, and it seems both the Connext and lnd connections failed to close. I'll try a few things.

@rsercano
Collaborator

rsercano commented Sep 21, 2020

I tried to kill the Docker containers while xud was stuck on shutdown, but it still failed to shut down 🤔

21/09/2020 09:08:10.203 [GLOBAL] info: XUD shutdown gracefully
21/09/2020 09:08:37.335 [CONNEXT] error: Error: socket hang up
    at connResetException (internal/errors.js:570:14)
    at Socket.socketOnEnd (_http_client.js:440:23)
    at Socket.emit (events.js:215:7)
    at endReadableNT (_stream_readable.js:1183:12)
    at processTicksAndRejections (internal/process/task_queues.js:80:21)
21/09/2020 09:08:37.335 [CONNEXT] error: failed to update total outbound capacity: Error: socket hang up
    at connResetException (internal/errors.js:570:14)
    at Socket.socketOnEnd (_http_client.js:440:23)
    at Socket.emit (events.js:215:7)
    at endReadableNT (_stream_readable.js:1183:12)
    at processTicksAndRejections (internal/process/task_queues.js:80:21)
21/09/2020 09:08:37.335 [CONNEXT] error: failed to update total outbound capacity: Error: socket hang up
    at connResetException (internal/errors.js:570:14)
    at Socket.socketOnEnd (_http_client.js:440:23)
    at Socket.emit (events.js:215:7)
    at endReadableNT (_stream_readable.js:1183:12)
    at processTicksAndRejections (internal/process/task_queues.js:80:21)
21/09/2020 09:08:37.335 [CONNEXT] error: failed to update total outbound capacity: Error: socket hang up
    at connResetException (internal/errors.js:570:14)
    at Socket.socketOnEnd (_http_client.js:440:23)
    at Socket.emit (events.js:215:7)
    at endReadableNT (_stream_readable.js:1183:12)
    at processTicksAndRejections (internal/process/task_queues.js:80:21)

Just a quick question: was xud shutting down successfully before the Connext implementation, does anyone remember?

@kilrau
Contributor

kilrau commented Sep 21, 2020

That's really hard to tell. But why not just disable Connext and see where that gets you?

@rsercano
Collaborator

Actually, getting stuck on shutdown happens randomly, so it's really hard to find out by disabling Connext now. I checked the ports as soon as I closed the Docker containers and there were no bound ports. :/ I'll check how we integrate Connext and which listeners could cause such an issue in my spare time.

@rsercano
Collaborator

rsercano commented Sep 21, 2020

Update: this can definitely be related to Connext. Just now Connext started throwing 500s and I tried to stop xud, but it didn't shut down gracefully, so I killed my Docker process and suddenly xud shut down too, with the following logs:

[screenshot of the shutdown logs]

@rsercano
Collaborator

rsercano commented Oct 4, 2020

Update: with a very small test that uses the Node.js EventEmitter the same way we do in xud, I tried sending CTRL+C and the program shuts down gracefully, so this can't be happening because of our internal listeners:

const { EventEmitter } = require('events');

const timerEventEmitter = new EventEmitter();

let currentTime = 0;

// This will trigger the update event each passing second
setInterval(() => {
    currentTime++;
    timerEventEmitter.emit('update', currentTime);
}, 1000);

const listener = (time) => {
    console.log('Message Received from publisher');
    console.log(`${time} seconds passed since the program started`);
};

timerEventEmitter.on('update', listener);

// Remove the listener after 10 seconds (removeListener needs the event name too)
setTimeout(() => {
    timerEventEmitter.removeListener('update', listener);
}, 10000);

Output:

➜  test node test.js   
Message Received from publisher
1 seconds passed since the program started
Message Received from publisher
2 seconds passed since the program started
Message Received from publisher
3 seconds passed since the program started
^C
➜  test    

@kilrau
Contributor

kilrau commented Oct 23, 2020

Since I do not have a single clean shutdown these days, could it be as simple as xud continuing with p2p connection attempts even though the shutdown procedure should prevent that? All my xud shutdown logs look like this: https://paste.ubuntu.com/p/VnrZPBXcGH/ @LePremierHomme @sangaman @erkarl @rsercano
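
If that is the cause, the fix would presumably be a guard along these lines (a sketch with hypothetical names, not the actual xud pool code): once shutdown begins, pending reconnect timers are cancelled and no new attempts are scheduled.

```
// Sketch of a p2p pool that stops scheduling reconnects once it is closing.
class Pool {
  private closing = false;
  private reconnectTimers = new Set<NodeJS.Timeout>();

  public disconnect = () => {
    this.closing = true;
    // cancel any retries that are already scheduled
    for (const timer of this.reconnectTimers) clearTimeout(timer);
    this.reconnectTimers.clear();
  };

  private handleConnectionFailure = (peer: string, delayMs: number) => {
    if (this.closing) return; // never retry peers while shutting down
    const timer = setTimeout(() => {
      this.reconnectTimers.delete(timer);
      console.log(`retrying connection to ${peer}`);
    }, delayMs);
    this.reconnectTimers.add(timer);
  };
}
```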

@kilrau kilrau assigned LePremierHomme and unassigned rsercano Oct 23, 2020
@rsercano
Collaborator

Well, after @sangaman's suggestion I checked once again, and it seems there are still multiple listeners attached while closing. I'm creating a PR to remove them for the server (p2p) and httpserver; let's discuss there.
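
For reference, a minimal sketch of that kind of cleanup (hypothetical helper, not the actual PR): detach the listeners we registered before closing the http server, so nothing keeps holding references after close().

```
import { Server } from 'http';

// Sketch: remove our listeners when closing the http server.
const closeHttpServer = (server: Server): Promise<void> => {
  return new Promise((resolve) => {
    server.removeAllListeners('request');
    server.removeAllListeners('error');
    server.close(() => resolve());
  });
};
```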

@kilrau
Contributor

kilrau commented Oct 23, 2020

Can you take over looking into this in parallel? @LePremierHomme

raladev pushed a commit that referenced this issue Oct 28, 2020
This ensures that we don't attempt to reconnect to peers that have
disconnected from us after we have started closing the p2p pool. This
may help prevent scenarios where we unintentionally attempt to
reconnect to peers after shutting down xud.

> Should be tested against [#1668 (comment)](#1668 (comment)) @raladev

Re-connection after shutdown has disappeared, but my xud still cannot be terminated gracefully; it is waiting for something:

```
28/10/2020 05:17:43.164 [CONNEXT] trace: sending request to /balance/0x69C3d485623bA3f382Fc0FB6756c4574d43C1618
^C28/10/2020 05:17:44.087 [GLOBAL] info: XUD is shutting down
28/10/2020 05:17:44.088 [LND-BTC] info: new status: Disconnected
28/10/2020 05:17:44.089 [LND-LTC] info: new status: Disconnected
28/10/2020 05:17:44.090 [CONNEXT] info: new status: Disconnected
28/10/2020 05:17:44.093 [P2P] debug: Peer 03ece33a30db1dbce4b62fa96a5e9541138a24997ef5672eebed2d332270e39542 (OzoneYellow): closing socket. reason: Shutdown
28/10/2020 05:17:44.095 [HTTP] info: http server has closed
28/10/2020 05:17:44.096 [RPC] info: gRPC server completed shutdown
28/10/2020 05:17:44.097 [P2P] trace: Sent Disconnecting packet to 03ece33a30db1dbce4b62fa96a5e9541138a24997ef5672eebed2d332270e39542 (OzoneYellow): "{\"body\":{\"reason\":9},\"header\":{\"id\":\"95133be0-1917-11eb-b75b-73d0f0278756\"}}"
28/10/2020 05:17:44.109 [ORDERBOOK] debug: removed all orders for peer 03ece33a30db1dbce4b62fa96a5e9541138a24997ef5672eebed2d332270e39542 (OzoneYellow)
28/10/2020 05:17:44.118 [GLOBAL] info: XUD shutdown gracefully
```
@kilrau kilrau added this to the 1.3.1 milestone Nov 6, 2020
@LePremierHomme
Contributor

Closed by #1994.

raladev added a commit that referenced this issue Nov 19, 2020
* feat: removeorder output changed to a more meaningful message (#1526)

* fix(p2p): don't reconnect peers when pool closed (#1965)


* feat(lnd): change gRPC client options

* fix(connext): not enough balance for closechannel (#1963)

This introduces better error handling for Connext when using
`closeChannel` to remove funds from a Connext channel and specifying an
amount to remove that is greater than the available balance.

* feat: reserved capacity checks on PlaceOrder (#1949)

This rejects orders that would put our total reserved balance over our
total capacity for either the outbound or inbound currency. The sum of
the inbound & outbound amounts for a newly placed order are added to
the amounts reserved by open orders, and if either of these amounts
exceed the corresponding capacity then the request to place the order is
rejected.

An exception to this are inbound limits for Connext currencies, since we
have the ability to dynamically request additional inbound collateral
via our "lazy collateral" approach.

It is still possible for market orders to cause our open orders to
exceed our capacity. This is a difficult problem to avoid entirely, as
the price that market orders will execute at is unknown until the
execution is complete. Even if we simulate the matching routine, we
won't know which matches will succeed until we attempt a swap.

Instead, we generously assume that market orders will execute at the
best quoted price for purposes of these capacity checks. For users that
simultaneously place limit orders and market orders for the same
currencies, it should be made clear that market orders may use up their
available balance needed for their limit orders to succeed.

Closes #1947.

* fix(cli): openchannel assertion error for string amount (#1950)

Fixes #1643.

* feat(swapclient): auto init wallets on xud unlock (#1973)

This adds a new feature to xud to automatically attempt to create a
wallet for any new swap client configured after an xud node has been
created. Effectively this only changes the behavior for lnd clients, as
this is already the existing behavior for Connext. The process for
initializing has now been standardized instead of the ad hoc approach
used previously.

If xud tries to unlock an lnd node and gets an error message indicating
that the wallet has not been created, then it will generate a client &
currency specific seed mnemonic using seedutil and call `InitWallet`
with that seed and the existing xud password, such that the wallet
funds and node identity for the new lnd client can be unlocked and
restored along with the rest of lnd.

Closes #1929.

* feat(rpc): runtime addcurrency for lnd & connext (#1746)

Co-authored-by: Le Premier Homme <interjoint1@gmail.com>

* refactor(rpc): rename reserved TradingLimits fields

This renames the `reservedOutbound` and `reservedInbound` fields on the
`TradingLimits` call to `reservedSell` and `reservedBuy` respectively.

* fix(rpc): no success if no channels to close (#1689) (#1942)

Co-authored-by: rsercano <ozdemirsercan27@gmail.com>
Co-authored-by: Daniel McNally <mcnallydp@gmail.com>

* fix: tls certificate check on startup (#1510)

* fix: alias missing in streamorders (#1725) (#1962)

* fix(lnd): handling hold invoice check errors (#1969)

This adds better error handling for when the test calls to verify lnd
hold invoices are available fail due to connectivity reasons. Previously
any error that occurred at this step would cause us to set lnd's status
to `NoHoldInvoiceSupport` including connection issues. There was also a
bug that caused us to try to set the status to connected even when a
hold invoice status check failed.

This could result in the unusual behavior of status going to
`Disconnected` upon a call failing due to the grpc `UNAVAILABLE` error
status, then being set to `NoHoldInvoiceSupport` and then to
`ConnectionVerified`. Now we only set `NoHoldInvoiceSupport` when the
test calls fail for a reason other than `UNAVAILABLE`, and we only set
the status to `ConnectionVerified` when the hold invoice calls succeed.

Closes #1968.

* feat(lnd): SendPaymentV2

This migrates the call we use to send payments with lnd from the
deprecated `SendPaymentSync` to `SendPaymentV2` which allows for multi
path payments, among other improvements. As part of this change, the
lnd proto files have been updated to their v0.11.x versions and the
version of lnd used in simulation tests has been updated to v0.11.1 as
well.

Closes #1590.

* Revert "feat: removeorder output changed to a more meaningful message (#1526)"

* fix: use regtest instead of regnet arg

This fixes a bug where the xud flag to set the network was incorrectly
configured as `regnet` when it should be `regtest` to match what xud
expects.

* fix(lnd): don't calculate negative capacities

This fixes a bug in the logic for calculating the inbound & outbound
amount/capacity totals. We subtract the channel reserve amounts from the
balances when determining how large a payment the channel could support,
however we should not end up with a negative number.

* feat: new grpc call for subscribring alerts such as low balance (#864)

* fix: changes removeorder output to a more meaningful message (#1526) (#1986)

Co-authored-by: rsercano <ozdemirsercan27@gmail.com>
Co-authored-by: Daniel McNally <mcnallydp@gmail.com>
Co-authored-by: Karl Ranna <karl@karlranna.com>
Co-authored-by: Le Premier Homme <interjoint1@gmail.com>