
fix(transport): various tcp transport races #1095

Merged: 6 commits from tcptrans-cleanup into master on May 14, 2024

Conversation

arnetheduck (Contributor):

This PR is more or less a rewrite of the TCP transport to fix several race conditions and leaks that can happen during startup / shutdown as well as spurious exceptions leaking out of it

  • simplify close monitoring since ChronosStream already closes the underlying chronos client of a Connection
  • fix dialling cancellation
  • stop accepting incoming connections before closing existing ones (race)
  • close in-flight connections during TcpTransport.stop
  • make sure only Transport-related exceptions leak out of it

@@ -81,16 +81,17 @@ proc dialAndUpgrade(
    if dialed.dir != dir:
      dialed.dir = dir
    await transport.upgrade(dialed, peerId)
  except CancelledError as exc:
diegomrsantos (Contributor):

Could you please elaborate on why this is necessary and why we don't need to call close?

arnetheduck (Contributor Author):

oops, close should stay (ed6be85) - the point was to not return nil which makes the dialler try the next address instead of aborting the dialling
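
For illustration, a minimal sketch of that behaviour (hypothetical stand-in types, not the actual libp2p code): cancellation still closes the half-open connection, but re-raises so the dial loop aborts instead of moving on to the next address.

```nim
import chronos

type Connection = ref object # stand-in for libp2p's Connection

proc close(conn: Connection) {.async.} =
  discard # release the underlying stream

proc upgrade(conn: Connection): Future[Connection] {.async.} =
  return conn

proc tryUpgrade(dialed: Connection): Future[Connection] {.async.} =
  try:
    return await upgrade(dialed)
  except CancelledError as exc:
    await dialed.close() # still clean up the half-open connection
    raise exc # returning nil instead would make the dialler try the next address
```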

diegomrsantos (Contributor):

Could we catch only LPError instead?

CancelledError, LPError], raw: true).} =

arnetheduck (Contributor Author) on May 9, 2024:

we could, but for that to be safe, we would need to annotate dialAndUpgrade with raises annotations as well, so that changes in upgrade would be caught by the compiler - this would significantly increase the scope of this PR
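
A hedged sketch of the point about annotations (chronos v4 `raises` syntax, hypothetical procs): the callee's `raises` list is only checked against something if the caller declares one too.

```nim
import chronos

type LPError = object of CatchableError # stand-in for libp2p's LPError

proc upgrade() {.async: (raises: [CancelledError, LPError]).} =
  discard

# Without a raises annotation here, the compiler would accept any
# change to `upgrade`'s raises list without complaint; with it,
# widening that list becomes a compile error at the call site.
proc dialAndUpgrade() {.async: (raises: [CancelledError, LPError]).} =
  await upgrade()
```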

diegomrsantos (Contributor) on May 9, 2024:

The code catches and raises CancelledError, so I thought this would be equivalent to only catching LPError instead of CatchableError. upgrade - the proc in the try/except - only raises those two errors. But thinking more about it, you said we still want to call `close` when catching CancelledError, so my suggestion probably doesn't improve anything. Btw, did you forget to call `close`?

diegomrsantos (Contributor):

Ah, I see what you mean now, didn't think about it. So even if the called proc is annotated, it isn't safe to trust it if the caller isn't annotated as well.

diegomrsantos (Contributor):

But thinking more about it, all errors in this project should inherit from LPError, shouldn't they? In theory, it should be fine then.

diegomrsantos (Contributor):

Maybe it's even more correct, as LPError should represent all the expected errors we thought about and are fine ignoring in this case. All the others probably shouldn't be swallowed here.

arnetheduck (Contributor Author):

Broadly, it's best when every abstraction has its own hierarchy - ie failure to decode a multihash is not related to a transport being closed in any meaningful way so having both derived from lperror doesn't really have any underlying motivation except that they happen to be implemented in the same codebase.

As such, it's usually best if exceptions are mapped to the abstraction layer that they're operating at, and each layer translates the exceptions coming from other layers to their own level - ie "socket closed" becomes "transport closed" becomes "peer disconnected" as it travels through the layers.
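
As a rough sketch of that layering idea (hypothetical error types, plain procs for brevity): each layer catches the errors of the layer below and re-raises them in its own vocabulary.

```nim
type
  TransportError = object of CatchableError # transport-level failures
  PeerError = object of CatchableError      # peer/session-level failures

proc readFromSocket() =
  raise newException(IOError, "socket closed")

proc transportRead() =
  try:
    readFromSocket()
  except IOError as exc:
    # "socket closed" becomes "transport closed"
    raise newException(TransportError, "transport closed: " & exc.msg)

proc sessionRead() =
  try:
    transportRead()
  except TransportError as exc:
    # "transport closed" becomes "peer disconnected"
    raise newException(PeerError, "peer disconnected: " & exc.msg)
```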

arnetheduck (Contributor Author):

as noted though, changing exception types is usually a breaking change, so it needs to be done carefully - in this particular case, it needs to be done across the entire transport hierarchy at the same time.


trace "Listening on", address = ma
try:
await procCall Transport(self).start(supported)
diegomrsantos (Contributor):

You're probably trying to set "running" only after all the code above ran without issues, but it seems strange to call a parent method only at the end. It will be confusing to see the start log only after the start has happened and its associated logs.

arnetheduck (Contributor Author):

this is a weakness of the inheritance / method structure of the code - ie the semantics of running are such that we should not mark the transport as "running" until it has been fully started (all transports are running), but the way transport in general is implemented is that calling the base method is the way to "set" running.

The previous way of setting running to true before actually starting the listeners is part of the buggy behavior that led to race conditions.

refactoring this flow would require refactoring all transport implementations which is beyond the scope of this PR.
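
A simplified sketch of the constraint being described (synchronous, hypothetical fields; the real transports are async): the base method is the only thing that flips `running`, so the override calls it last, once the listeners are actually up.

```nim
type
  Transport = ref object of RootObj
    running: bool
  TcpTransport = ref object of Transport

method start(self: Transport) {.base.} =
  self.running = true # calling the base is "the way to set running"

method start(self: TcpTransport) =
  # start the listeners first ...
  # ... and only then mark the transport as running:
  procCall Transport(self).start()
```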

diegomrsantos (Contributor):

If nothing important is being done in the parent and we don't plan to change it, perhaps we can stop calling it and set the flag manually.

arnetheduck (Contributor Author):

that requires refactoring all transports (ws/tor too) - something beyond the scope here

diegomrsantos (Contributor):

why refactoring? It's just a matter of not calling the parent and setting the flag in the child.

arnetheduck (Contributor Author):

because to do it properly, one would remove the code from the base call (or really, change the structure so that the base contains functionality for this that doesn't involve inheritance) - then the other transports need to be updated - notably though, the base method needs refactoring anyway for raises, so it can be done together then.

it would be confusing to have some implementations call the base and some not

if not self.handles(ma):
  trace "Invalid address detected, skipping!", address = ma
  continue

trace "Starting TCP transport"
diegomrsantos (Contributor):

I just realized there's a double log here, as the parent start also logs this. I also believe this kind of log should be info, wdyt?

arnetheduck (Contributor Author):

broadly, we use any logging above trace level to provide a "side channel" when the API is too limited to transport the information to the caller.

the more a library sits at the core infrastructure level of an application, the less it should log, because logging is broadly part of the application's responsibilities - instead of logging, it's better for libraries to provide API where the application can choose to log certain events or not.

thus, any non-trace log in libp2p is really to be seen as a TODO to provide a better error reporting API to the consumer of libp2p - whether it is "correct" to log depends on use case and libp2p can provide greater flexibility to applications by providing the data needed to write good logs instead of doing the logging

diegomrsantos (Contributor):

But couldn't the lib do both? The lib user can choose what level to use depending on their needs. Something like what's described here: https://www.baeldung.com/hibernate-logging-levels.

arnetheduck (Contributor Author):

doing both is usually not needed - ie if an API has a rich error handling system "built in", there's not so much value in logs because the information that you would log is available to the consumer of the library. this is not a hard rule though, only a note that different applications will have different logging needs.

it is of course possible to make everything dynamic and configurable, but such options come at the cost of performance and complexity in usage and implementation

diegomrsantos (Contributor) commented on May 8, 2024:

What do you think about splitting this PR into smaller ones, perhaps one for each TCP proc? There are too many changes in the TCP transport file, GitHub doesn't even load them by default. I'd argue it's too risky to make all those modifications at once. If there's an issue, it might be difficult to understand the cause.

Comment on lines 199 to 200
doAssert self.clients[Direction.In].len == 0
doAssert self.clients[Direction.Out].len == 0
diegomrsantos (Contributor):

Do we want to crash the application if this condition isn't true?

arnetheduck (Contributor Author):

yes - it would indicate a logic fault in the flow (indeed, the fault that this PR seeks to solve, namely that of lingering connection leaks).

in an ideal world, we'd actually remove clients entirely - broadly, it's not needed but its removal is a much more significant change

diegomrsantos (Contributor):

But the application has been used so far and this condition isn't always true, i.e., this doesn't put the application in an unusable state. I wonder if it'd be better to just log a warning or even an error, but let it continue running until it's fixed.

raise newTransportClosedError()

let ta = initTAddress(address).valueOr:
  raise (ref TcpTransportError)(msg: "Unsupported address: " & $address)

arnetheduck (Contributor Author):

more typing, and you can't set custom fields - so `(ref Xxx)` works in more cases when exceptions carry rich error information
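
A small illustration (hypothetical `address` field) of the trade-off, presumably versus `newException`:

```nim
type TcpTransportError = object of CatchableError
  address: string # hypothetical rich-error field

# newException only takes a message; extra fields need a second step
let exc = newException(TcpTransportError, "Unsupported address")
exc.address = "/ip4/0.0.0.0/tcp/0"

# object construction populates everything in one expression,
# ready to be used as `raise (ref TcpTransportError)(...)`
let exc2 = (ref TcpTransportError)(
  msg: "Unsupported address", address: "/ip4/0.0.0.0/tcp/0")
```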

# end up with a race condition where `stop` returns but not all connections
# have been closed - we drop connections in this case in order not to leak
# them
if self.stopping:
diegomrsantos (Contributor):

This pattern is quite tricky and it's potentially unreliable to delegate to devs the responsibility to remember about those cases. Is there a way to automate it? Something happens when stop is called that cancels all pending awaited futures and the necessary cleanups happen in an except block.

diegomrsantos (Contributor):

We could add those futures in a seq in the transport when they are created, and make them remove themselves when they complete. When we stop the transport we cancel all the futures in the seq.
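
A minimal sketch of that suggestion (hypothetical `Tracker` type, assuming chronos): futures register on creation, deregister via a completion callback, and `stop` cancels whatever is left.

```nim
import chronos

type Tracker = ref object
  pending: seq[Future[void]]

proc track(t: Tracker, fut: Future[void]) =
  # register on creation, deregister once the future completes
  t.pending.add fut
  fut.addCallback proc(arg: pointer) =
    let idx = t.pending.find(fut)
    if idx >= 0:
      t.pending.delete(idx)

proc stop(t: Tracker) {.async.} =
  # cancel everything still in flight when the transport stops
  let inFlight = t.pending
  t.pending = @[]
  for fut in inFlight:
    await fut.cancelAndWait()
```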

arnetheduck (Contributor Author):

broadly, when we transition to 2.0 and the new gc, it will be possible to make this more ergonomic - until then, it's simply a pattern to learn and repeat

# The connection had errors / was closed before `await` returned control
await transp.closeWait()
debug "Cannot read remote address", exc = exc.msg
return nil
diegomrsantos (Contributor):

why do we start returning nil here?

arnetheduck (Contributor Author):

it was always returning nil, by not returning anything (implicitly)
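
A tiny illustration of that implicit behaviour: in Nim, `result` starts out as the return type's default value, which for a ref type is nil.

```nim
type Conn = ref object

proc accept(ok: bool): Conn =
  if ok:
    result = Conn()
  # when `ok` is false, nothing is assigned and the proc
  # implicitly returns the default for `Conn`: nil

assert accept(false) == nil
```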

@diegomrsantos diegomrsantos changed the title fix various tcp transport races fix: various tcp transport races May 9, 2024
@diegomrsantos diegomrsantos changed the title fix: various tcp transport races fix(transport): various tcp transport races May 9, 2024
codecov-commenter:

Codecov Report

Attention: Patch coverage is 84.69945%, with 28 lines in your changes missing coverage. Please review.

Project coverage is 84.51%. Comparing base (02c96fc) to head (1c65f04).
Report is 4 commits behind head on master.


@@            Coverage Diff             @@
##           master    #1095      +/-   ##
==========================================
- Coverage   84.53%   84.51%   -0.03%     
==========================================
  Files          91       91              
  Lines       15517    15534      +17     
==========================================
+ Hits        13118    13129      +11     
- Misses       2399     2405       +6     
Files Coverage Δ
libp2p/dialer.nim 91.56% <100.00%> (+0.84%) ⬆️
libp2p/errors.nim 66.66% <ø> (ø)
libp2p/switch.nim 90.52% <100.00%> (-1.19%) ⬇️
libp2p/transports/tortransport.nim 83.23% <100.00%> (-0.10%) ⬇️
libp2p/transports/transport.nim 92.30% <100.00%> (+3.84%) ⬆️
libp2p/wire.nim 62.66% <66.66%> (+0.76%) ⬆️
libp2p/transports/tcptransport.nim 85.27% <84.02%> (+4.70%) ⬆️

... and 7 files with indirect coverage changes

diegomrsantos (Contributor):

This is flaky on mac https://github.com/vacp2p/nim-libp2p/actions/runs/9029804098/job/24812857646?pr=1095.

It says an unhandled error was raised by testdaemon, but all tests passed. Do you know the proper way to fix that?

arnetheduck (Contributor Author):

> It says an unhandled error was raised by testdaemon, but all tests passed. Do you know the proper way to fix that?

no, but it would be good if this was fixed.

@arnetheduck arnetheduck enabled auto-merge (squash) May 12, 2024 07:54
@arnetheduck arnetheduck merged commit 3ca49a2 into master May 14, 2024
9 checks passed
@arnetheduck arnetheduck deleted the tcptrans-cleanup branch May 14, 2024 05:10
diegomrsantos (Contributor):

> It says an unhandled error was raised by testdaemon, but all tests passed. Do you know the proper way to fix that?
>
> no, but it would be good if this was fixed.

It has been fixed here #1123
