
rostest: fix flaky hztests #1661

Merged: 58 commits merged into ros:melodic-devel on Mar 27, 2019

Conversation

beetleskin (Contributor)

Let's hope this relieves some of the evergreen CI failures.

@@ -6,19 +6,6 @@
<param name="hztest1/hzerror" value="0.5" />
<param name="hztest1/test_duration" value="5.0" />
<param name="hztest1/wait_time" value="21.0" />
beetleskin (Contributor, Author):

this is quite a long wait time, and also the default one, iirc - reduce to 5?


<!-- Below also works:
beetleskin (Contributor, Author):

pointless comment, yes this looks like it also works, but why should this be in here?

@beetleskin (Contributor, Author)

oh the irony ..

@beetleskin (Contributor, Author)

it would be nice if we could re-trigger the rostests a couple of times to see whether the changes result in more stable hztests

@beetleskin (Contributor, Author) commented Mar 17, 2019

Alright, out of 19 checks with the same code, I got a couple of fails, but no hztest-related ones.

Are those also some evergreens in here?

@cwecht (Contributor) commented Mar 18, 2019

The jenkins internal failures occur from time to time. I'm not aware of one of the other failures occurring regularly. I really appreciate this PR. It would be nice if the other flaky tests could be fixed as well in the future, but I would like this to be merged as soon as possible.

@dirk-thomas (Member) commented Mar 18, 2019

@beetleskin I appreciate your testing effort and contribution. But please limit the amount of triggered PR jobs at the same time to a smaller number (<<10) otherwise this lets other jobs wait a long time before being processed.

It would be nice if the other flaky tests could be fixed as well in the future

Just for the record: this patch does not fix the flakiness of the test but retries them multiple times in the hope that they pass once. This patch adds the retry to 19 tests atm. The downside of this approach is that the time to run the tests increases. So this approach is not scalable. Actually fixing the flakiness of the test would be much better.

is there some effort going on to use the pipeline plugin for jenkins?

No.

@beetleskin (Contributor, Author) commented Mar 18, 2019

please limit the amount of triggered PR jobs at the same time to a smaller number (<<10)

I had a look at the queue, there wasn't much going on; and I did it in 3 batches with hours in between. But you're right, I should reduce the number next time.

Just for the record: this patch does not fix the flakyness of the test but retries them multiple times in the hope that they pass once.

From the doc, this is the way to go:

Number of times to retry the test before it is considered a failure. Default is 0. This option is useful for stochastic processes that are sometimes expected to fail.

.. and due to the nature of ROS, and the CI system it runs on, we must expect them to fail.

This patch adds the retry to 19 tests atm. The downside of this approach is that the time to run the tests increases.

For the case that a PR changes a published topic such that it can no longer meet its expected rate, yes. We're trading a little bit of time for a much better ROC statistic. For the case that unrelated PRs fail due to those unstable tests, this will improve the whole CI process and save developer nerves, reviewer time and CI hardware resources. And therefore: time.
Besides that, I set some wait_time params to lower values (the default is the ridiculous value of 21, which, I assume, was chosen that high because of CI in the first place?). Especially when expecting a zero rate, the node was waiting 21 seconds just to confirm that...
I could reduce this param for other tests as well. Or reduce the default to something reasonable. Or both. What do you think?
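
For illustration, a minimal hztest launch-file sketch that combines both knobs discussed here, the retry attribute and a reduced wait_time; the publisher node, topic and values are hypothetical and not taken from this PR:

<launch>
  <!-- hypothetical publisher; rospy_tutorials' talker publishes chatter at roughly 10 Hz -->
  <node pkg="rospy_tutorials" type="talker" name="talker" />

  <!-- hztest reads its parameters from its private namespace, here /hztest1 -->
  <param name="hztest1/topic" value="chatter" />
  <param name="hztest1/hz" value="10.0" />
  <param name="hztest1/hzerror" value="0.5" />
  <param name="hztest1/test_duration" value="5.0" />
  <!-- reduced from the 21 s default discussed above -->
  <param name="hztest1/wait_time" value="5.0" />

  <!-- retry: rostest re-runs the test up to two more times before marking it as failed -->
  <test test-name="hztest_talker" pkg="rostest" type="hztest" name="hztest1" retry="2" />
</launch>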

So this approach is not scalable. Actually fixing the flakiness of the test would be much better.

Right, let's rewrite ros_core and ros_comm then ;)
Honestly, I don't see any way of guaranteeing those tests, even with dedicated hardware, a realtime OS, and whatever else is needed to make asynchronous inter-process communication deterministic. ROS itself makes it pointless to spend much thought on this. Or am I missing something? Do you think there is a better way, i.e. improving hztest itself?

And I think this is as scalable as it gets. Given your CI system and ROS, the overhead of work generated by those flaky tests is much bigger than the potential (!) increase in test time.

is there some effort going on to use the pipeline plugin for jenkins?
No.

That's unfortunate.

@beetleskin (Contributor, Author) commented Mar 18, 2019

Updated stats:
Total tests: 54

@dirk-thomas (Member)

We're trading a little bit of time for a much better ROC statistic. For the case that unrelated PRs fail due to those unstable tests, this will improve the whole CI process and save developer nerves, reviewer time and CI hardware resources. And therefore: time.

I am not saying this is not a feasible temporary workaround. I just mentioned that it would be a better solution to make the test non-flaky instead. E.g. 10min of additional CI time per job costs quite a bit when being done all the time. Obviously developer nerves and reviewer time are more valuable though.

I could reduce this param for other tests as well. Or reduce the default to something reasonable. Or both. What do you think?

The answer is probably different for each case. So I can't suggest something generic.

Do you think there is a better way, i.e. improve hztest itself?

The test itself could accept more relaxed timings, but obviously that is also what it aims to catch. It is hard to write non-flaky performance tests which pass in a variety of environments.
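
For a rough sense of what "more relaxed" would look like, widening the tolerance instead of retrying would amount to something like the following; the values are purely hypothetical and assume a nominal rate of 10 Hz:

  <!-- accept anything between 8 and 12 Hz instead of 9.5 to 10.5 Hz -->
  <param name="hztest1/hzerror" value="2.0" />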

is there some effort going on to use the pipeline plugin for jenkins?
No.
That's unfortunate.

It is simply a huge endeavor with very little benefit - so as long as nobody requests this feature (and is also willing to cover for the necessary development time) I doubt that is going to happen. Contributions towards that direction are always welcome though 😉

@beetleskin (Contributor, Author)

So .. anything missing for merging this?

@cwecht (Contributor) commented Mar 26, 2019

I don't think that this PR can be merged as it is. The 50 'trigger ci hook' commits should not go into master.

@beetleskin (Contributor, Author)

I don't think that this PR can be merged as it is. The 50 'trigger ci hook' commits should not go into master.

According to @dirk-thomas, they're going to be squashed into one commit.

@beetleskin (Contributor, Author)

I don't think that this PR can be merged as it is. The 50 'trigger ci hook' commits should not go into master.

However, @dirk-thomas might want to have a look at the generated merge-commit-messages: those commits might be part of the commit-msg body, similar to 40d3ca4

@dirk-thomas (Member)

Thank you for these improvements!

@dirk-thomas dirk-thomas merged commit a2876a1 into ros:melodic-devel Mar 27, 2019
@beetleskin beetleskin deleted the fix_flaky_hztests branch March 27, 2019 17:03
tahsinkose pushed a commit to tahsinkose/ros_comm that referenced this pull request Apr 15, 2019
* rostest: fix flaky hztests

* add retry to all hztests

* fix concerns

* fix more wrong retry-attributes