rostest: fix flaky hztests #1661
Conversation
@@ -6,19 +6,6 @@
  <param name="hztest1/hzerror" value="0.5" />
  <param name="hztest1/test_duration" value="5.0" />
  <param name="hztest1/wait_time" value="21.0" />
this is quite a long wait time, and also the default one, iirc - reduce to 5?
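For illustration, a minimal hztest launch file along the lines of that suggestion could look like the sketch below; the topic, rate, and node names are placeholders and not taken from this PR:

<launch>
  <!-- the node whose publishing rate is being checked would be started here -->
  <param name="hztest1/topic" value="chatter" />
  <param name="hztest1/hz" value="10.0" />
  <param name="hztest1/hzerror" value="0.5" />
  <param name="hztest1/test_duration" value="5.0" />
  <!-- shortened wait time, as suggested in the comment above -->
  <param name="hztest1/wait_time" value="5.0" />
  <test test-name="hztest_chatter" pkg="rostest" type="hztest" name="hztest1" />
</launch>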
<!-- Below also works:
pointless comment, yes this looks like it also works, but why should this be in here?
oh the irony ..
it would be nice if we could re-trigger the rostests a couple of times to see whether the changes result in more stable hztests
Alright, out of 19 checks with the same code, I got a couple of failures, but no hztest-related ones.
Are those also some of the evergreens in here?
The Jenkins internal failures occur from time to time. I'm not aware of any of the other failures occurring regularly. I really appreciate this PR. It would be nice if the other flaky tests could be fixed as well in the future, but I would like this to be merged as soon as possible.
@beetleskin I appreciate your testing effort and contribution. But please limit the number of PR jobs triggered at the same time to a smaller number (<<10), otherwise other jobs have to wait a long time before being processed.
Just for the record: this patch does not fix the flakiness of the tests but retries them multiple times in the hope that they pass once. This patch adds the retry to 19 tests atm. The downside of this approach is that the time to run the tests increases, so this approach is not scalable. Actually fixing the flakiness of the tests would be much better.
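For context, the mechanism in question is the optional retry attribute of the rostest <test> tag, which re-runs a failing test up to the given number of additional times before reporting a failure. A rough sketch of how one of the hztests might carry it; the names and the retry count here are illustrative, not the exact changes in this PR:

<launch>
  <!-- the node whose publishing rate is being checked would be started here -->
  <param name="hztest1/topic" value="chatter" />
  <param name="hztest1/hz" value="10.0" />
  <param name="hztest1/hzerror" value="0.5" />
  <param name="hztest1/test_duration" value="5.0" />
  <!-- retry="2": give the flaky test up to two extra attempts before failing the job -->
  <test test-name="hztest_chatter" pkg="rostest" type="hztest" name="hztest1" retry="2" />
</launch>

The trade-off described above follows directly from this: each retry multiplies the worst-case runtime of that test by the number of attempts.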
No.
I had a look at the queue, there wasn't much going on; and I did it in 3 batches with hours in between. But you're right, I should reduce the number next time.
From the doc, this is the way to go:
.. and due to the nature of ROS, and the (CI) system it runs on, we must expect them to fail.
For the case that a PR introduces changes to a published topic such that it can no longer meet its expected rate, yes. We're trading a little bit of time for a much better ROC statistic. For the case that unrelated PRs fail due to those unstable tests, this will improve the whole CI process, and save developer nerves, reviewer time and CI hardware resources. And therefore: time.
Right, let's rewrite ros_core and ros_comm then ;) And I think this is as scalable as it gets. Given your CI system and ROS, the overhead of work generated by those flaky tests is much bigger than the potential (!) test-time increase.
That's unfortunate.
Updated stats:
I am not saying this is not a feasible temporary workaround. I just mentioned that it would be a better solution to make the test non-flaky instead. E.g. 10min of additional CI time per job costs quite a bit when being done all the time. Obviously developer nerves and reviewer time are more valuable though.
The answer is probably different for each case. So I can't suggest something generic.
The test itself could accept more relaxed timings, but obviously that is also what it aims to catch. It is hard to write non-flaky performance tests which pass in a variety of environments.
It is simply a huge endeavor with very little benefit - so as long as nobody requests this feature (and is also willing to cover the necessary development time) I doubt that is going to happen. Contributions in that direction are always welcome though 😉
So .. anything missing for merging this?
I don't think that this PR can be merged. The 50 'trigger ci hook' commits should not go into master.
According to @dirk-thomas, they're going to be squashed into one commit.
However, @dirk-thomas might want to have a look at the generated merge commit message: those commits might be part of the commit message body, similar to 40d3ca4
Thank you for these improvements!
* rostest: fix flaky hztests
* add retry to all hztests
* fix concerns
* fix more wrong retry-attributes
Let's hope this relieves some of the CI evergreen failures.