-
-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Wait robustness #504
Wait robustness #504
Conversation
@@ -337,6 +342,32 @@ protected Integer getLivenessCheckPort() { | |||
} | |||
} | |||
|
|||
/** | |||
* @return the port on which to check if the container is ready |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should be plural after this change
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
/** | ||
* @return the port on which to check if the container is ready | ||
*/ | ||
protected List<Integer> getLivenessCheckPorts() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since the implementation uses Set as a type of collection, why don't we return Set here as well? Would make more sense, also List indicates the order
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
.filter(container -> container.getNames()[0].endsWith(linkableContainer.getContainerName())) | ||
.map(container -> new Link(container.getNames()[0], alias)) | ||
Set<Link> links = dockerClient.listContainersCmd() | ||
.withStatusFilter("running") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes - forgot to mention this in the summary. It seems that we had a redundant step here. While we're planning to kill off links, it seemed like a good fix to make all the same.
if (null == port) { | ||
log.debug("Liveness check port of {} is empty. Not waiting.", container.getContainerName()); | ||
final List<Integer> externalLivenessCheckPorts = getLivenessCheckPorts(); | ||
if (null == externalLivenessCheckPorts || externalLivenessCheckPorts.isEmpty()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since getLivenessCheckPorts
returns a collection, it would be nice to mark it as @NotNull
to avoid such checks and make the contract more strict (nullable collection doesn't make a lot of sense anyway :) )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea!
|
||
private void tryPort(Integer internalPort) { | ||
String[][] commands = { | ||
{"/bin/sh", "-c", format("cat /proc/net/tcp | awk '{print $2}' | grep :%x && echo %s", internalPort, SUCCESS_MARKER)}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
|
||
@Before | ||
public void setUp() { | ||
nginx = new GenericContainer<>("nginx:1.9.4"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not as a rule? :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, I thought this avoided a problem but it doesn't!
@@ -149,9 +149,6 @@ protected void optionallyMapResourceParameterAsVolume(@NotNull String paramName, | |||
} | |||
} | |||
|
|||
@Override | |||
protected abstract Integer getLivenessCheckPort(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
breaks the binary compatibility
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method, with the same signature, is in GenericContainer
, so I was actually thinking that overriding it here makes no sense. (TBH I didn't release you could override a concrete method with an abstract one, so this gives me headaches 🤣 ). Happy to withdraw if you can see the problem more clearly than I can though.
Hmm, based on latest tests on Travis this hasn't improved test stability on CI (actually it's worsened, as I removed the 3x retries for starting VNC containers...). It looks like we have the original issue of the VNC recorder connecting too soon, with a Thinking out loud, maybe we should just do the port checks in-container all the time... |
Just to add, I'm running this apparently flaky test locally 100x (on a Mac) to see if there is any flakiness here. |
I left a single selenium test looping overnight in two different modes:
The Travis build on this branch was also not healthy. So, I suspect that our external check is not good enough, even when run from a linux host. It's not a userland proxy, but I'm wondering if there's still a race (I'm not sure of the mechanics of the iptables routing setup between containers). The latest commit, 9c716a8, makes a change that seems to be solid: Regardless of environment, always run the internal check, followed by the external check. So far, this works 100% of the time in both the modes I mentioned above. |
Also keep in mind, that there might be some bugs/race conditions in our Did you notice any problems with the |
Good point. Not noticed that issue with selenium but then again I’ve not personally used them on windows. Def worth reconsidering if the log wait adds value here or if it’s too risky while we have that bug.
It’s also made me think maybe the ‘wait for all sequentially’ strategy could be used to fix #455 too in some form...
|
We've monitored this behavior on Linux (Ubuntu and Fedora) as well, so it does not seem to be Windows specific. "wait for all sequentially" is a nice pragmatic solution, a shame that we would overshadow the underlying problem. |
496096b
to
9c716a8
Compare
protected Set<Integer> getLivenessCheckPorts() { | ||
final Set<Integer> result = new HashSet<>(); | ||
if (exposedPorts.size() > 0) { | ||
result.addAll(exposedPorts.stream() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we put these blocks into private methods? Something like getExposedPortsAsInt
and getBoundPortsAsInt
.
Set<Link> links = dockerClient.listContainersCmd().exec().stream() | ||
.filter(container -> container.getNames()[0].endsWith(linkableContainer.getContainerName())) | ||
.map(container -> new Link(container.getNames()[0], alias)) | ||
Set<Link> links = dockerClient.listContainersCmd() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would like this in a private method as well (since I really have to look into the stream to get whats happening here). Maybe findLinksForX
(whatever X is)?
|
||
if (shouldCheckWithCommand()) { | ||
List<Integer> exposedPorts = container.getExposedPorts(); | ||
final Set<Integer> internalPorts = exposedPorts.stream() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again I would prefer to see a private method, getInternalPorts()
. Sorry for being so picky about private methods, that's a coding style I prefer, you don't have to follow if you aren't a fan of it, just think it makes it easier to get a high level understanding of what's happening, especially for people more unfamiliar with the codebase 😁
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, that's a good point so thanks for bringing it up (all three!)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will fix :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If only we had Extensions...!
@rnorth Of course, clearly a separate feature and concern (the formatting in your last post is off btw. ;) ). |
4b95228
to
0a313a2
Compare
We've been noticing some occasional random test failures particularly in the Selenium container test suites, in addition to troubling log messages where the VNC recording container has had to be restarted (within its retry budget of 3), and possible corrupt/cut-off video files (#466). The latter problem looks to be a race condition where Selenium started listening OK, but the VNC server was not always available yet.
So, I've done some refactoring. In brief:
We now allow a list of multiple startup liveness check ports to be defined for a container. This helps, e.g. for the browser containers, and lets us wait for both Selenium and VNC ports to be listening. I think this will help eliminate random flapping tests in this area.
Also, we now have a
WaitAllStrategy
that lets more than one wait strategy be used. Again for browser containers, we now wait for (a) a log message, and (b) the listening ports to be available.For cases where we check running state from within the container, I've added one additional command that can identify listening ports.
I've broken out some aspects of the wait strategies/port detection into separate classes and used this to help improve test coverage.