
BasicExternalCluster interruption fails in some situations #483

Open
jd0-sag opened this issue Feb 9, 2017 · 4 comments
jd0-sag commented Feb 9, 2017

We recently saw a test hang and we suspect it is due to how BasicExternalCluster handles asynchronous failures of a server.

Galvan reported that it called forceShutdown but the test client continued (waiting to reconnect to the server).

Since BasicExternalCluster uses an in-process client, it relies on thread interruption to shut down the client when a server crashes unexpectedly. The interrupt is delivered to the client thread but we suspect that this thread is either in a path which doesn't expect interruption (and drops it) or is in an operation which doesn't expose interruption.

We need to construct a test which kills a server at an unexpected time to verify that we correctly handle this case (our existing tests which rely on this do so earlier in the run, prior to creating a connection). From there, we can find the actual underlying bug.

jd0-sag commented Feb 9, 2017

Looking at a thread dump, the hang appears to be in InFlightMessage.java:95. The loop there catches the interrupt but keeps waiting, deferring the re-raise of the interrupt until after the wait completes; it doesn't recognize that the interrupt was intended to break out of that very loop.

This probably means the method needs to either propagate the InterruptedException or convert it into a RuntimeException. Since we don't want InterruptedException permeating the entire API, we will need to find the right layer at which to convert it into a RuntimeException.
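To illustrate the pattern described above, here is a minimal sketch of the two wait-loop shapes. This is not the actual InFlightMessage code; AckWaiter and its method names are hypothetical, and the main method simulates Galvan interrupting the in-process client thread:

```java
public class AckWaiter {
    private boolean acked = false;

    // Buggy shape: the interrupt is noted, but the loop keeps waiting,
    // so an interrupted thread stays blocked until the ack finally arrives.
    public synchronized void awaitAckSwallowing() {
        boolean interrupted = false;
        while (!acked) {
            try {
                wait();
            } catch (InterruptedException e) {
                interrupted = true; // remembered, but the wait continues
            }
        }
        if (interrupted) {
            Thread.currentThread().interrupt(); // re-raised only after the ack
        }
    }

    // Fixed shape: let the interrupt break the wait immediately so the
    // caller can decide how to unwind.
    public synchronized void awaitAck() throws InterruptedException {
        while (!acked) {
            wait();
        }
    }

    public synchronized void ack() {
        acked = true;
        notifyAll();
    }

    public static void main(String[] args) throws Exception {
        AckWaiter waiter = new AckWaiter();
        Thread client = new Thread(() -> {
            try {
                waiter.awaitAck();
                System.out.println("acked");
            } catch (InterruptedException e) {
                System.out.println("interrupted");
            }
        });
        client.start();
        Thread.sleep(100);   // let the client block in awaitAck()
        client.interrupt();  // simulate the framework interrupting the client
        client.join(5000);
        System.out.println("client alive=" + client.isAlive());
    }
}
```

With the fixed shape the interrupted thread escapes immediately; the buggy awaitAckSwallowing() would hang forever if ack() never arrives, which matches the observed test hang.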

@jd0-sag jd0-sag self-assigned this Feb 13, 2017
jd0-sag commented Feb 21, 2017

This change requires more consideration than we can give this close to release. We can re-evaluate later.

jd0-sag commented Apr 11, 2017

This actually represents a problem in how we handle the acknowledgements around sent messages. Even if we break out of this loop and throw the InterruptedException or a RuntimeException, we are still left in a state where the caller can't determine how to interpret what just happened (the message was probably sent but we don't know what the server did with it).

We may need to create a checked exception which contains the intermediary state of the message, allowing the caller to resume the wait for acks before eventually receiving the InvokeFuture<byte[]> which would have been returned after the ack loop exited.
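A rough sketch of what such a checked exception could look like. Everything here is hypothetical: MessageInterruptedException and the invoke() shape are illustrative names, and CompletableFuture stands in for the real ack machinery and the InvokeFuture<byte[]> return type:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

public class ResumableInvoke {
    // Checked exception carrying the in-flight state so the caller can
    // resume the ack wait after handling the interrupt.
    static class MessageInterruptedException extends Exception {
        final CompletableFuture<byte[]> inFlight;
        MessageInterruptedException(CompletableFuture<byte[]> inFlight) {
            super("interrupted while waiting for acknowledgements");
            this.inFlight = inFlight;
        }
    }

    // Stand-in for the invoke path: block for acks, but if interrupted,
    // hand the pending state back to the caller instead of losing it.
    static Future<byte[]> invoke(CompletableFuture<byte[]> ackLoop) throws MessageInterruptedException {
        try {
            ackLoop.get();
            return ackLoop;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new MessageInterruptedException(ackLoop);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) throws Exception {
        CompletableFuture<byte[]> ackLoop = new CompletableFuture<>();
        Thread caller = new Thread(() -> {
            try {
                invoke(ackLoop);
            } catch (MessageInterruptedException e) {
                Thread.interrupted(); // clear the flag, then resume the wait
                try {
                    System.out.println("resumed: " + new String(e.inFlight.get()));
                } catch (Exception ex) {
                    System.out.println("failed");
                }
            }
        });
        caller.start();
        Thread.sleep(100);
        caller.interrupt();                // interruption arrives mid-wait
        Thread.sleep(100);
        ackLoop.complete("ok".getBytes()); // acks eventually arrive
        caller.join(5000);
    }
}
```

The point of the sketch is only the hand-off: the exception carries the pending future, so the caller can honor the interrupt and still recover the result later.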

Ultimately, looking at some of the other parts of the API, I am not sure we can fully do this while still honoring interruption without adding substantial complexity to the exception flow. Cases like create() or destroy() are examples where we would need to turn a synchronous call-and-return operation into something which can be resumed asynchronously and cooperatively.

@ramsai1729 ramsai1729 assigned ramsai1729 and unassigned jd0-sag Jul 17, 2017
ramsai1729 commented
Adding notes about some options discussed in the last sync-up.

  1. Don't loop; just throw InterruptedException. But if the user retries, the retry becomes a duplicate request.
  2. invoke() just returns a handle, and the user calls waitForAcks() on it explicitly. But the user has to make one more explicit call.
  3. Call waitForAcks() inside get(). But fire-and-forget requests would now have to call get() explicitly.
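Option 2 above can be sketched roughly as follows. The InvokeHandle type and its internals are illustrative assumptions, not the real API; CompletableFuture stands in for the actual ack and response plumbing:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;

public class HandleInvoke {
    // Sketch of option 2: invoke() returns a handle immediately; the
    // caller decides when (and whether) to block on acks.
    static class InvokeHandle {
        private final CompletableFuture<Void> acks;
        private final CompletableFuture<byte[]> response;
        InvokeHandle(CompletableFuture<Void> acks, CompletableFuture<byte[]> response) {
            this.acks = acks;
            this.response = response;
        }
        // The extra explicit call option 2 requires; separately interruptible.
        void waitForAcks() throws InterruptedException, ExecutionException {
            acks.get();
        }
        byte[] get() throws InterruptedException, ExecutionException {
            return response.get();
        }
    }

    public static void main(String[] args) throws Exception {
        CompletableFuture<Void> acks = new CompletableFuture<>();
        CompletableFuture<byte[]> response = new CompletableFuture<>();
        InvokeHandle handle = new InvokeHandle(acks, response);

        acks.complete(null);                  // server acknowledged receipt
        handle.waitForAcks();                 // explicit; fire-and-forget skips this
        response.complete("done".getBytes()); // result arrives later
        System.out.println(new String(handle.get()));
    }
}
```

Because the ack wait is a separate call, an interrupt during waitForAcks() leaves the handle intact and the caller can simply wait again, which sidesteps the lost-state problem described earlier.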
