Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tls-eio: allow cancelling reads #459

Closed
wants to merge 2 commits into from
Closed

Conversation

talex5
Copy link
Contributor

@talex5 talex5 commented Dec 15, 2022

Previously, Tls would treat any exception from a read as a permanent error, causing all future operations to fail. However, it is reasonable to cancel a read and then try again later. I extended the fuzz tests to check for this case.

Note that the Lwt version appears to have the same problem. Possibly we should only be catching Eio.Io exceptions in the first place, or maybe none at all.

The first commit cleans up the fuzz testing a bit (see the commit comment for details) and the second one adds support for cancellation.

One place I'm not sure about is when drain_handshake is called from reneg. Is it OK to abort there? I don't know what the function does and it's already marked with XXX. In the normal case (when drain_handshake is called during flow creation) it's safe because the caller never gets the Tls_eio.t.

I'm not sure whether we really want to encourage users to cancel reads and resume later, but this should at least fix the confusing Cancelled: Eio__core__Fiber.Not_first errors reported by @bikallem in #458.

- The mock_socket has changed slightly. Previously, it would request a
  size to read, do the read, and then wait for another size for the next
  read, which would typically just be the EOF for that request. This
  made it harder for the fuzzer to reach useful states. Now you can do a
  single transmit with a single test action.

- Simplified the write in Tls_eio slightly by using the new `Flow.write`
  operation.

- Only log a send when we actually send some data.

- Allow the fuzzer to perform all the actions needed for the Tls
  handshake in one go instead of requiring it to find them by searching.
  This provides it a way to get to the interesting cases quickly.
Previously, Tls would treat any exception from a read as a permanent
error, causing all future operations to fail. However, it is reasonable
to cancel a read and then try again later.

Note that the Lwt version appears to have the same problem. Possibly we
should only be catching `Eio.Io` exceptions in the first place, or maybe
none at all.

I also extended the fuzz tests to check for this case.
@talex5 talex5 marked this pull request as draft December 15, 2022 11:49
@talex5
Copy link
Contributor Author

talex5 commented Dec 15, 2022

afl-fuzz finds that this can fail, because a TLS read operation may need to perform a write. So when the application cancels a TLS read, it may actually end up cancelled a write, which we can't recover from.

Comment on lines 28 to 32
| exn ->
(match t.state with
| `Error _ | `Eof -> ()
| `Active _ -> t.state <- `Error exn) ;
raise exn
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't it make sense to completely remove this bit instead? I believe this is trapping other exceptions in addition to Cancelled, such as 'Poisioned' and intermittently this https://github.com/ocaml/ocaml/blob/trunk/stdlib/effect.ml#L59

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume the idea is that the flow should stop working on any unexpected error. Otherwise, you might fail to read a packet, retry, and then hang (because the sender thinks you already have the data). Possibly we should call Printexc.get_raw_backtrace here and store that in t.state too. Then you'd know where the original error came from.

But note that the original exception is still raised, with the original backtrace, the first time. So the first error you get should include the correct location in all cases. It's just if you keep trying to use it after an error then future errors don't give the exact location.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But note that the original exception is still raised, with the original backtrace, the first time. So the first error you get should include the correct location in all cases. It's just if you keep trying to use it after an error then future errors don't give the exact location.

Indeed, that's what I expected. However, I was getting other exceptions being reported - as mentioned in the comment above - as being the cause of error. As a user of tls-eio, that led me to unproductive debugging sessions since the actual error was quite something else. I have a feeling that the interaction of Eio.Fiber.any/first, tls_eio exception handling here and the like is giving unpredictable exceptions. I was only able to move forward with identifying the correct issue after I removed this catch all exception handling. Removing the catch all exception handling also don't seem to impact any observable tls-eio behaviours, i.e. the tests pass and End_of_file exception is correctly reported to users of tls-eio.

@talex5
Copy link
Contributor Author

talex5 commented Feb 16, 2023

Closing, as we decided not to do this (#464 (comment)):

I agree, we can stick (at least for now) to "cancelling an operation on a flow cancels that flow". Again, I don't see a concrete use case to allow cancelling a read/write and afterwards using the flow for a different (or the same) purpose.

@talex5 talex5 closed this Feb 16, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

2 participants