-
Notifications
You must be signed in to change notification settings - Fork 892
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Race condition between MediaDriver start up and CommonContext.isDriverActive #385
Comments
I've pushed a fix for this. There was a race. |
Two separate issues here. One is the volatile read. That clearly is needed. The |
Maybe it should return false rather than throw the exception? |
Well, My initial thoughts are:
Am I overthinking it? |
You almost want a trinary return. |
I'm thinking to be safe you spin like currently and then timeout. This is likely to be quick unless a real issue exists then you want to wait for the timeout anyway. |
I think a spin is fine. Just not sure how long. DriverTimeoutMs is a long one.... maybe @lbradstreet has a view on this. |
Spin and timeout is fine for us too. We initially delay our health checks, and our health checks include a retry parameter, so the timeout you choose probably won't affect us at all. |
Why iterate at all? The semantics of the call changes to "wait for driver to be active" rather than "is driver active". Assuming 0 was never used as CnC version, just return false if this value is returned as it means the driver is not active yet as the version number hasn't been written. |
@tom-smalls The issue here is that then another driver could assume no driver is active and delete the CnC file and start itself. |
Ahh, yes. I forgot the MediaDriver uses the same call during construction |
We are seeing one of our services fail to start up in our CI environment several times a day due to what we believe is a race condition between the MediaDriver starting, and writing it's contents to CncFileDescriptor (the service that fails is the first one we start after the MediaDriver).
CommonContext.isDriverActive is throwing an IllegalStateException because the CnC version it expects does not match the version in CncFileDescriptor; the version in the CncFileDescriptor is coming back as 0.
Looking at the code in MediaDriver.conclude(), we believe the issue lies here:
https://github.com/real-logic/aeron/blob/master/aeron-driver/src/main/java/io/aeron/driver/MediaDriver.java#L580
The file is being created and then populated, and we believe the MappedByteBuffer being passed into CommonContext.isDriverActive is not null but the data has not been written to it, hence the 0 CnC version.
We are looking at creating a PR where a temporary file is used for creation and population, and then copy to where the CncFileDescriptor should be.
We are running the Java (Client and MediaDriver) 1.3.0 version.
The text was updated successfully, but these errors were encountered: