Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Race condition between MediaDriver start up and CommonContext.isDriverActive #385

Closed
tom-smalls opened this issue Aug 4, 2017 · 11 comments

Comments

@tom-smalls
Copy link
Contributor

We are seeing one of our services fail to start up in our CI environment several times a day due to what we believe is a race condition between the MediaDriver starting, and writing it's contents to CncFileDescriptor (the service that fails is the first one we start after the MediaDriver).

CommonContext.isDriverActive is throwing an IllegalStateException because the CnC version it expects does not match the version in CncFileDescriptor; the version in the CncFileDescriptor is coming back as 0.

Looking at the code in MediaDriver.conclude(), we believe the issue lies here:

https://github.com/real-logic/aeron/blob/master/aeron-driver/src/main/java/io/aeron/driver/MediaDriver.java#L580

The file is being created and then populated, and we believe the MappedByteBuffer being passed into CommonContext.isDriverActive is not null but the data has not been written to it, hence the 0 CnC version.

We are looking at creating a PR where a temporary file is used for creation and population, and then copy to where the CncFileDescriptor should be.

We are running the Java (Client and MediaDriver) 1.3.0 version.

mjpt777 added a commit that referenced this issue Aug 4, 2017
@mjpt777
Copy link
Contributor

mjpt777 commented Aug 4, 2017

I've pushed a fix for this. There was a race.

@tmontgomery
Copy link
Contributor

Two separate issues here. One is the volatile read. That clearly is needed.

The isDriverActive check itself purposefully did a single check so it did not block and could be called in a loop (although it needed a volatile read just like the other). Also, driver startup would immediately return on this case as well. Now it is blocking for a period based solely on driverTimeoutMs. I'm not sure that is not going to cause some issues with existing code.

@mjpt777
Copy link
Contributor

mjpt777 commented Aug 4, 2017

Maybe it should return false rather than throw the exception?

@tmontgomery
Copy link
Contributor

Well, false means no driver active. Which could be a disaster as it kinda means go ahead and create one.

My initial thoughts are:

  • add a timeout param to spin/wait for. This should be low.
  • keep IllegalStateException for when version > 0, but not == current.
  • if still 0 after the timeout, then return 'true' and optionally logProgress "version 0 after timeout. Possibly retry?"
  • if (timeout >= driverTimeoutMs), then can throw DriverTimeoutException

Am I overthinking it?

@tmontgomery
Copy link
Contributor

You almost want a trinary return. ALIVE, DEAD, and Schrödinger

@mjpt777
Copy link
Contributor

mjpt777 commented Aug 4, 2017

I'm thinking to be safe you spin like currently and then timeout. This is likely to be quick unless a real issue exists then you want to wait for the timeout anyway.

@tmontgomery
Copy link
Contributor

I think a spin is fine. Just not sure how long. DriverTimeoutMs is a long one.... maybe @lbradstreet has a view on this.

@lbradstreet
Copy link
Contributor

Spin and timeout is fine for us too. We initially delay our health checks, and our health checks include a retry parameter, so the timeout you choose probably won't affect us at all.

@tom-smalls
Copy link
Contributor Author

Why iterate at all? The semantics of the call changes to "wait for driver to be active" rather than "is driver active".

Assuming 0 was never used as CnC version, just return false if this value is returned as it means the driver is not active yet as the version number hasn't been written.

@mjpt777
Copy link
Contributor

mjpt777 commented Aug 7, 2017

@tom-smalls The issue here is that then another driver could assume no driver is active and delete the CnC file and start itself.

@tom-smalls
Copy link
Contributor Author

Ahh, yes. I forgot the MediaDriver uses the same call during construction

@mjpt777 mjpt777 closed this as completed Aug 22, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants