
Dns client eio #458

Closed
wants to merge 7 commits into from

Conversation

bikallem (Contributor):

Fixes #457

Summary of Changes:

  1. Simplified t.state to just be Tls.Engine.state. Previously, handling and re-raising exceptions produced obfuscated errors in tls-eio, such as Cancelled: Eio__core__Fiber.Not_first or mutex poisoned and so forth. Re-raising these while also using Eio.Fiber.any/first seemingly gave wrong, unhelpful and unproductive errors. Therefore, the code deviates from the tls_lwt style of state management. I believe the new state handling is much more direct and straightforward, as preferred by eio and OCaml effects.

  2. I was also getting weird intermittent errors such as MACMismatch or RecordOverflow. From investigation, I concluded that these errors were due to the interleaving of read/write operations from concurrent fibers: a read on the TLS flow may perform both reads and writes on the underlying flow t.flow. Therefore, the read/write on t.flow should be protected with a mutex so that each such compound operation is atomic.

  3. I have also updated usages of deprecated functions, e.g. Eio.Flow.read to Eio.Flow.single_read.
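Point 2 above could be sketched roughly as follows. This is a minimal illustration only, not the PR's actual code: the field names (io_mutex), the shape of t, and the exact Eio.Mutex/Eio.Flow signatures are assumptions.

```ocaml
(* Illustrative sketch: serialise all I/O on the underlying flow, since a
   single TLS read may both read ciphertext from t.flow and write a
   response back to it (e.g. during a handshake), and those two steps
   must not interleave with another fibre's I/O on the same flow. *)
type t = {
  flow : Eio.Flow.two_way;
  io_mutex : Eio.Mutex.t;           (* protects read/write on [flow] *)
  mutable state : Tls.Engine.state;
}

let read_underlying t buf =
  (* the whole read-then-maybe-write step is one critical section *)
  Eio.Mutex.use_rw ~protect:true t.io_mutex (fun () ->
    let got = Eio.Flow.single_read t.flow buf in
    (* ... feed [got] bytes to Tls.Engine, write any response ... *)
    got)
```

Without the mutex, two fibres calling read concurrently could each read half of the other's TLS record, which is one plausible explanation for the intermittent MACMismatch.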

/cc @talex5 @hannesm

write_t t resp ;
raise (Tls_failure failure)
with
| ( End_of_file | Tls_alert _ | Tls_failure _ ) as ex ->
bikallem (Contributor, Author):

I am not sure, but it seems that when handling exceptions under effects/eio we need to be more explicit about what we are handling. At one point, exceptions such as this https://github.com/ocaml/ocaml/blob/trunk/stdlib/effect.ml#L59 were being trapped here. Perhaps this should go in the eio error-handling guidance?
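A small self-contained illustration of the point (the exception name and handler are hypothetical, not code from this PR):

```ocaml
(* With effects in the OCaml 5 runtime, a catch-all [with ex -> ...] can
   accidentally trap internal exceptions such as Effect.Unhandled.
   Matching only the exceptions we actually mean to handle lets
   everything else propagate to whoever can deal with it. *)
exception Tls_failure of string

let guarded f =
  try f () with
  | (End_of_file | Tls_failure _) as ex ->
      prerr_endline ("handled: " ^ Printexc.to_string ex)
  (* any other exception, e.g. Effect.Unhandled, is NOT swallowed here *)
```

The explicit or-pattern is the whole trick: it keeps the handler from masking runtime-internal control flow.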


let write t cs = writev t [cs]

(*
bikallem (Contributor, Author):

This is no longer applicable since the only place we need to worry about interleaving read/write is when reading from the tls_eio flow.

blit res 0 buf 0 n ;
let rec read t flow_buf : int =

let write_application_data data =
Member:

from the name of this function, this sounds like "only application data will ever be written here", but AFAICT there could as well be handshake data that is being written. how did you choose the name?

eio/tls_eio.ml Outdated
Comment on lines 105 to 107
| `Send -> close_tls t ; Flow.shutdown t.flow `Send
| `All -> close_tls t ; Flow.shutdown t.flow `All
| `Receive -> close_tls t ; Flow.shutdown t.flow `Receive
bikallem (Contributor, Author):

Not sure about this one, but I have added a corresponding close on t.flow, believing this version to be more semantically correct than the original. Please advise if that is not the case.

push_linger t cs; drain_handshake t
if not (Tls.Engine.handshake_in_progress t.tls) then t
else
let application_data = read_t t in
Member:

Again, I don't think the naming is correct here. The buffer returned by read_t is some encrypted TLS data, no? And this may be handshake, application data, alerts, heartbeat, ...

Member:

oops, sorry, read_t now indeed returns application data.

eio/tls_eio.ml Outdated

let copy_from t src =
try
while true do
let buf = Cstruct.create 4096 in
let got = Flow.read src buf in        (* before *)
let got = Flow.single_read src buf in (* after *)
bikallem (Contributor, Author):

This just replaces the deprecated Flow.read.

@hannesm (Member) commented Dec 12, 2022

Thanks for your PR. Unfortunately this is pretty complex, and since I don't know any of eio's guarantees, I'm not sure when I'll find time to review it properly (unlikely this year). But maybe @talex5 has an opinion about this PR, and if he approves the code, I can merge and release.

In general:

  • the other side-effecting pieces do not need a mutex due to careful design
  • in normal TLS operations, once a handshake is completed, the read and write are independent and can be done concurrently -- they also use different cipher states
  • only when a key update / renegotiation is processed, the cipher state needs to be mutated (and there may be need for a synchronisation point, but I'm not sure atm -- would need some further thoughts and review of RFCs / other implementations)

@talex5 (Contributor) commented Dec 12, 2022

I think it would be good to merge the fuzz tests (#456) before making any further changes to tls-eio. It would also be good to modify the fuzz tests (in this PR) to show the problem being fixed (e.g. by having the fuzz tester randomly cancel reads).

Previously, handling and re-raising exceptions produced obfuscated errors in tls-eio, such as Cancelled: Eio__core__Fiber.Not_first or mutex poisoned and so forth.

Cancelled can probably be ignored (I was planning to test that once the fuzz tests were merged), but I think breaking the Tls flow on other errors is reasonable.

From investigation, I concluded that these errors were due to the interleaving of read/write operations from concurrent fibers: a read on the TLS flow may perform both reads and writes on the underlying flow t.flow.

That's interesting. Since the fuzz tests don't detect this, I guess it works as long as the underlying flow orders writes correctly, which perhaps the real flow does not. Possibly we should fix it there instead.

@hannesm (Member) commented Dec 12, 2022

What @talex5 says, but even more convincing would be a regression test (rather than a fuzz test) of the bad-record-MAC behaviour -- if I understand both of you correctly, this should be easy, since with eio everything can be mocked (including the scheduler).
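A hedged sketch of what such a regression test could look like, assuming eio's mock backend. All identifiers here (Eio_mock.*, Tls_eio.client_of_flow, client_config, record1, record2) are assumptions from memory, not code from this PR or repository:

```ocaml
(* Illustrative sketch only: force two fibres to read from the same TLS
   flow concurrently, so their reads/writes on the underlying mock flow
   can interleave deterministically under the mock scheduler. *)
let test_interleaved_reads () =
  Eio_mock.Backend.run @@ fun () ->
  let socket = Eio_mock.Flow.make "socket" in
  (* queue up two TLS records so both fibres have something to read;
     [record1]/[record2] would be pre-encrypted test vectors *)
  Eio_mock.Flow.on_read socket [ `Return record1; `Return record2 ];
  let flow = Tls_eio.client_of_flow client_config socket in
  let buf1 = Cstruct.create 4096 and buf2 = Cstruct.create 4096 in
  Eio.Fiber.both
    (fun () -> ignore (Eio.Flow.single_read flow buf1))
    (fun () -> ignore (Eio.Flow.single_read flow buf2))
  (* without a mutex around the underlying I/O, this can fail with
     Tls_failure (Fatal MACMismatch) *)
```

Because the mock backend runs fibres deterministically, a failure here would be a stable reproduction rather than an intermittent one.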

@bikallem (Contributor, Author):

What @talex5 says, but even more convincing would be a regression test (rather than a fuzz test) of the bad record mac behaviour -- if I understand both of you correctly, this should be easy since with eio everything can be mocked (including the scheduler)

Yes, a test case would indeed be nice. However, I am not too familiar with crowbar or property-based testing in general, so it will probably be a while before I get to it. Alternatively, I was thinking about using qcheck-lin - https://ocaml-multicore.github.io/multicoretests/0.1/qcheck-lin/. If I remember correctly, @jmid mentioned (at the 2022 Tarides retreat) that the tool has some advantages for testing effects (eio) based concurrency algorithms.

As an aside, do either of you know whether TLA+ (https://learntla.com/) is a good candidate for uncovering concurrency issues such as the one this PR is trying to address?

@hannesm (Member) commented Dec 12, 2022

Dear @bikallem, I think the underlying issue is some sort of interleaving of fibers. I don't think the hammer of property-based testing or fuzz-testing is needed for reproduction. A normal unit test should suffice; it may need a custom "scheduler" (but here I don't know enough of eio and OCaml 5 to guide how the test could be set up).

As mentioned earlier, I'm not too convinced that "let's acquire a mutex for each flow, and use it for each read and write" is the way to go. I'd take a step back and ask: why would a tls flow be shared across fibers in the first place? But as mentioned, I tried to read the API docs of eio and couldn't find any guarantees / mentions of what is safe to do in parallel.

@hannesm (Member) commented Dec 15, 2022

Dear @bikallem, thanks again for this PR. It would really benefit from focusing on a single issue (ideally providing a test case and a minimal change that fixes it), rather than bundling lots of other changes into the same PR. In particular, the close_notify tracking is nothing I'm keen to have in eio only; as mentioned in #452, it is also TLS-protocol-version dependent (and should properly be done in the core tls library, not in one of the effectful layers).

I'd appreciate first seeing some discussion about "why a mutex is needed" and what the desired parallelism story of eio is.

@hannesm (Member) commented Dec 15, 2022

Also, please feel free to reach out via email if you'd like to schedule a more immediate chat (i.e. a video meeting), where we can discuss your goals and the road towards them. :)

@bikallem bikallem marked this pull request as draft December 15, 2022 12:26
@bikallem (Contributor, Author):

Dear @bikallem, thanks again for this PR. It would really benefit if you could focus on a single issue (and best option to provide a test case and a minimal change that fixes it), and not have lots of other changes in the same PR - esp. the close_notify tracking is nothing that I'm keen to have in eio only, as mentioned in #452 this is also TLS protocol version dependent (and should properly be done in the core tls library, not in one of the effectful layers).

Thanks for the feedback. Indeed, this PR was supposed to fix only issue #457. However, while attempting to write a test for that issue, I noticed another problem with my PR: close_notify TLS messages were not being handled correctly. This had to be fixed in order to pass CI. That said, I agree close_notify is better handled by the engine; I will comment my thinking on that issue there rather than here to reduce noise. I am still working on reproducing the issue I faced with ocaml-dns-client, and will ping you once I am able to do it.

I have converted the PR into draft for now.

Many thanks to @talex5 for the fuzz test. I feel much more productive with such a test harness.

bikallem and others added 7 commits December 23, 2022 10:25
Re-implement `read_into` to make the read/write of tls an atomic
operation where required. A read_into call on the tls flow then can't be
interleaved with other reads/writes on the underlying flow.

This commit also cleans up error/exception handling. Without this
commit, the exceptions raised by tls-eio were obfuscated and unhelpful.
For example, I was getting an `Eio_core.Not_first` error. This error is
purely internal to `eio` and should never leak to users of the library.
- The mock_socket has changed slightly. Previously, it would request a
  size to read, do the read, and then wait for another size for the next
  read, which would typically just be the EOF for that request. This
  made it harder for the fuzzer to reach useful states. Now you can do a
  single transmit with a single test action.

- Simplified the write in Tls_eio slightly by using the new `Flow.write`
  operation.

- Only log a send when we actually send some data.

- Allow the fuzzer to perform all the actions needed for the Tls
  handshake in one go instead of requiring it to find them by searching.
  This provides it a way to get to the interesting cases quickly.
@bikallem (Contributor, Author):

Closing this PR for now in favour of #464

@bikallem bikallem closed this Feb 21, 2023

Successfully merging this pull request may close these issues.

EIO: Getting `Tls Failure (Fatal MACMismatch)`