tests: new sanitycheck tag "emu_time"; increase some timeouts #17107
Conversation
All checks are passing now. Review the history of this comment for details about the previous failed status.
Quoting bug "Configure QEMU to run independent of the host clock zephyrproject-rtos#14173" We have been struggling for years with issues related to how QEMU attempts to synchronize guest timer interrupts with the host clock, for example zephyrproject-rtos#12553. The symptom is that heavily loaded sanitycheck runs have tests spuriously failing due to timing related issues. This creates noise in our CI runs which masks true bugs in our system which manifest only intermittently, causing real issues that will happen all the time at scale to be 'swept under the rug'; right now any time a test fails sanitycheck retries it a few times and only consecutive failures produce an error. There's also a lot of relevant information and more links in: "List of tests that keep failing sporadically" zephyrproject-rtos#12553 This new "emu_time" tag helps by letting users either select or exclude the tests that really need accurate time to pass and have a high chance to actually be impacted by this emulation issue. As an example, it's only with 'sanitycheck --exclude emu_time' that I could spot and file intermittent but non-emu_time issue zephyrproject-rtos#16915. As Andrew predicted above, it was drown in emu_time noise before that. Conversely, "--tag emu_time" can be used by developers focusing on fixing qemu's -icount feature, for instance zephyrproject-rtos#14173 or others. Even before qemu's -icount is fixed, Continuous Integration could be split in two separate runs: A. --tag emu_time and B. --exclude emu_time. Only A tests would be allowed retries which would stop hiding other, unrelated intermittent issues affecting B tests. This initial commit does not pretend to exhaustively tag all affected tests. However it's an already functional and useful start of 14 tests collected from and tested over weeks of local sanitycheck runs and _partially_ reviewed by qemu clock expert Andy Ross. This commit also increases the individual timeout of 7 tests that have been observed to consistently take much longer than the median (2-3s). This flags how unusually long these are, lets users temporarily reduce the very long 60s default timeout in their local workspace and finally should reduce the chance of them timing out on a heavily loaded system. Set their timeout to 3-4 times the duration observed in CI and locally. Signed-off-by: Marc Herbert <marc.herbert@intel.com>
Force-pushed from 0e8918a to e730517.
na, we should fix the problem for real, not hide it under the rug.
This new tag:
@@ -4,4 +4,4 @@ common:
 tests:
   net.trickle:
     min_ram: 12
-    tags: net trickle
+    tags: net trickle emu_time
There were some issues with the Trickle test earlier (like months ago), but I thought I had changed the tests slightly to "fix" the issues, and I have not seen (at least in networking PRs) that the Trickle tests are failing. Do you have a link to a sanitycheck log about the failure case?
@jukkar are you saying these tests do not require a reliable clock, or no longer do?
are you saying these tests do not require a reliable clock, or no longer do?
I was not saying that (at least not yet :). I was just wondering whether this is needed, because personally I have not seen issues with the Trickle tests lately.
This emu_time tag is purely about marking tests that require a reliable clock. Have you observed clock issues in general? For instance, have you seen any other test in this list failing sometimes with timing-related errors?
- If not, then your system somehow provides tests with a reliable clock. See the references in the commit message.
- If you do observe random timing failures spread across various tests on your system, then I may have mistaken a past failure of this trickle test for an emu_time issue.
I removed this emu_time tag on tests/net/trickle, and it does fail intermittently at fairly recent commit 8f1780f. Caught this one just now:
starting test - test_trickle_2_wait
Trickle 1 0x00400040 callback called
Assertion failed at ZEPHYR_BASE/tests/net/trickle/src/main.c:171:
test_trickle_2_wait: (cb_2_called is false)
Trickle 2 no timeout
FAIL - test_trickle_2_wait
I don't know these tests. @andyross, @jukkar, anyone: does this assert look like a clock-related / emu_time issue or something else?
I don't know these tests. @andyross, @jukkar, anyone: does this assert look like a clock-related / emu_time issue or something else?
The test is clock related. While running the test on mps2_an385, I noticed that the test emits an error print. I sent #17289 which fixes the error printing, but it does not touch the clock issues.
I could not reproduce the issue you mentioned after running sanitycheck or doing "make run" quite a few times on my host.
I kept running this particular test locally and regularly, and to be fair it seems to fail rarely. Now maybe that's just because the couple of workarounds recently merged in sanitycheck made all emu_time tests fail more rarely...
I could not reproduce the issue you mentioned after running sanitycheck or doing "make run" quite a few times on my host.
Unless we're not talking about the same "make run", this runs only one test at a time AFAIK. For sanitycheck it depends on many factors; see @andyross's analysis referenced above.
Only A tests would be allowed retries, which would stop hiding other, unrelated intermittent issues affecting B tests.
Great!! Is there related logic implemented in CI to make this happen?
@nashif I think we should merge this; @wentongwu is of course working on fixing QEMU, but that may be some way off, and we may not be able to get it working for all the QEMU architecture variants for 2.0. We can always remove these tags once the final fix is in.
I don't know. I've used sanitycheck a million times but I'm not familiar with the CI-only code. I mention this in the commit message because 1. there's already some (indiscriminate) retry logic, so CI is hopefully "half-way there", and 2. isolating flaky tests is a well-known practice: https://martinfowler.com/articles/nonDeterminism.html#Quarantine
There's plenty of literature about flaky tests; here's another one: https://testing.googleblog.com/2016/05/flaky-tests-at-google-and-how-we.html
I still think this is a bad idea, and I'm not sure who is going to be the user of this. Tags are actually not great in this case, because you are going to exclude these tests for everyone, even when we have some architectures that have no issues with timing. The qemu timing issue happens mostly in CI; when running locally, if something fails, you always have the option to rerun the tests and get them all to pass on the second run, so again, not sure who is going to be the user of this. I'd rather exclude tests selectively than use such a tag.
Definitely approve of the tag. Ideally we would augment the "--only-failed" feature to be able to use it (maybe that works already; not sure what the precedence is of filters vs. --only-failed). Long term, yes, we want deterministic tests. But realistically we're going to be struggling with issues like this repeatedly, and having a framework for handling it can't be a bad thing.
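For instance, assuming --only-failed re-runs just the failures of the previous run (as its name suggests; its exact semantics and its precedence relative to tag filters are unverified here), a two-pass workflow might look like:

# First pass over the whole suite; emu_time tests may fail under heavy load.
sanitycheck
# Second pass: retry only what failed, instead of retrying indiscriminately.
sanitycheck --only-failed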
Fair point. Which architectures? The failures I observed seemed spread across all qemus, but I didn't look closely.
Some of my runs last a day (I don't have a powerful system for every OS), so waiting a second day and manually comparing the sets of failures isn't really an option. Unless you're referring to... In any case, an indiscriminate retry drops on the floor other intermittent failures with other, unrelated causes, known as the "broken window" effect in CI. Even with --exclude emu_time I'm still seeing a couple dozen other tests failing intermittently. I filed a couple of bugs already, but I'm not going to keep bothering unless a little bit is done to help with the categorization; I'll just keep a big fat "flaky" tag for myself and be done. Once PR #14593 reaches some level of completion I will hopefully not need wide and frequent test runs any more, so it'll be even easier for me to ignore intermittent failures.
I fail to see the conceptual difference between a list of tests and a tag (and no one has to use either).
Anyone or anything interested in accurate test reports.
I found this interesting: https://testing.googleblog.com/2016/05/flaky-tests-at-google-and-how-we.html
Maybe it's because engineers have more time to write more (stress) tests when they run out of bugs to fix. The number of engineers is finite; the number of bugs, not so sure. Especially not when concurrency's involved.
The addition of the tag alone doesn't hurt anyone and might let CI make intelligent decisions about when to retry and when not to. However, adjusting the timeout values in this PR might need to be debated. That's an active change to the tests regardless of whether the tag is used or not.
However, adjusting the timeout values in this PR might need to be debated.
I'll split it out; it's a mostly unrelated change. I put them together because of the proximity and the version-control inconvenience, but there aren't that many tests where I changed both, after all.
That's an active change to the tests regardless of whether the tag is used or not.
Not sure what you mean by "active". As the current default is 60s, it seemed pretty minor to me.
@@ -2,7 +2,8 @@ sample:
   name: CMSIS_RTOS_V2 Synchronization
 tests:
   sample.portability.cmsis_rtos_v2.timer_synchronization:
     tags: cmsis_rtos_v2_synchronization
+    timeout: 60
If the default is 60, why would we need it expressly set here?
Because I don't use 60 seconds as a default; it's way too long. Most tests complete in 5-10 seconds, so a 15 second default is enough.
@marc-hb why are you doing that? sanitycheck detects when the tests complete; all that changing the default to a shorter time is gonna do is increase the likelihood your run will fail.
why are you doing that?
Simple: when some random test that normally takes 5 seconds hangs forever, I prefer to wait 15 seconds rather than 60.
all that changing the default to a shorter time is gonna do is increase the likelihood your run will fail
E pur si muove ("and yet it moves"). I've been using a shorter default, and the very few tests in this PR are the only ones that needed a longer timeout. Does this mean the 60s default is too conservative? I think so.
I thought it would be uncontroversial to flag the very few long-duration tests in the entire suite, but forget it: this is not the point of this PR and, as I said above, I'll remove them (when I get back from vacation).
(Now I just remembered the existing "slow" tag; I haven't looked at that yet.)
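For what it's worth, the "temporarily reduce the default timeout in a local workspace" idea from the commit message could be as crude as the following (purely hypothetical; the exact spelling and location of the 60s default inside scripts/sanitycheck are assumptions):

# Hypothetical local tweak: shrink the default per-test timeout from 60s to 15s.
sed -i 's/timeout = 60/timeout = 15/' scripts/sanitycheck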
@@ -1,3 +1,5 @@
 tests:
   kernel.memory_pool:
     tags: kernel mem_pool
+    timeout: 60
Same, with the default of 60 set here.
@@ -1,4 +1,5 @@
 tests:
   kernel.semaphore:
+    timeout: 60
ditto.
I find the description of the issue here detailed and the argumentation convincing, so +1.
And vice versa, I find comments like:
na, we should fix the problem for real, not hide it under the rug.
rather unproductive. Translated into plain language, it says: "Go and fix all the issues which we carefully accumulated here for ages, at once. If not, we'd rather continue to cultivate those issues further."
Does not seem to be going anywhere, closing. Rebase and reopen with new data if you think this still needs to go in.
Quoting bug "Configure QEMU to run independent of the host clock #14173"
There's also a lot of relevant information and more links in:
"List of tests that keep failing sporadically" #12553
This new "emu_time" tag helps by letting users either select or exclude
the tests that really need accurate time to pass and have a high chance
to actually be impacted by this emulation issue. As an example, it's
only with 'sanitycheck --exclude emu_time' that I could spot and file
intermittent but non-emu_time issue #16915. As Andrew predicted above,
it was drown in emu_time noise before that.
Conversely, "--tag emu_time" can be used by developers focusing on
fixing qemu's -icount feature, for instance #14173 or others.
Even before qemu's -icount is fixed, Continuous Integration could be
split in two separate runs: A. --tag emu_time and B. --exclude
emu_time. Only A tests would be allowed retries which would stop hiding
other, unrelated intermittent issues affecting B tests.
This initial commit does not pretend to exhaustively tag all affected
tests. However it's an already functional and useful start of 14 tests
collected from and tested over weeks of local sanitycheck runs and
partially reviewed by qemu clock expert Andy Ross.
This commit also increases the individual timeout of 7 tests that have
been observed to consistently take much longer than the
median (2-3s). This flags how unusually long these are, lets users
temporarily reduce the very long 60s default timeout in their local
workspace and finally should reduce the chance of them timing out on a
heavily loaded system. Set their timeout to 3-4 times the duration
observed in CI and locally.
Signed-off-by: Marc Herbert marc.herbert@intel.com