Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tests: new sanitycheck tag "emu_time"; increase some timeouts #17107

Closed

Conversation

marc-hb
Copy link
Collaborator

@marc-hb marc-hb commented Jun 27, 2019

Quoting bug "Configure QEMU to run independent of the host clock #14173"

We have been struggling for years with issues related to how QEMU
attempts to synchronize guest timer interrupts with the host clock,
for example #12553. The symptom is that heavily loaded sanitycheck
runs have tests spuriously failing due to timing related issues.

This creates noise in our CI runs which masks true bugs in our system
which manifest only intermittently, causing real issues that will
happen all the time at scale to be 'swept under the rug'; right now
any time a test fails sanitycheck retries it a few times and only
consecutive failures produce an error.

There's also a lot of relevant information and more links in:
"List of tests that keep failing sporadically" #12553

This new "emu_time" tag helps by letting users either select or exclude
the tests that really need accurate time to pass and have a high chance
to actually be impacted by this emulation issue. As an example, it's
only with 'sanitycheck --exclude emu_time' that I could spot and file
intermittent but non-emu_time issue #16915. As Andrew predicted above,
it was drown in emu_time noise before that.

Conversely, "--tag emu_time" can be used by developers focusing on
fixing qemu's -icount feature, for instance #14173 or others.

Even before qemu's -icount is fixed, Continuous Integration could be
split in two separate runs: A. --tag emu_time and B. --exclude
emu_time. Only A tests would be allowed retries which would stop hiding
other, unrelated intermittent issues affecting B tests.

This initial commit does not pretend to exhaustively tag all affected
tests. However it's an already functional and useful start of 14 tests
collected from and tested over weeks of local sanitycheck runs and
partially reviewed by qemu clock expert Andy Ross.

This commit also increases the individual timeout of 7 tests that have
been observed to consistently take much longer than the
median (2-3s). This flags how unusually long these are, lets users
temporarily reduce the very long 60s default timeout in their local
workspace and finally should reduce the chance of them timing out on a
heavily loaded system. Set their timeout to 3-4 times the duration
observed in CI and locally.

Signed-off-by: Marc Herbert marc.herbert@intel.com

@zephyrbot zephyrbot added area: Networking area: Samples Samples area: Tests Issues related to a particular existing or missing test area: Kernel labels Jun 27, 2019
@zephyrbot
Copy link
Collaborator

zephyrbot commented Jun 27, 2019

All checks are passing now.

Review history of this comment for details about previous failed status.
Note that some checks might have not completed yet.

Quoting bug "Configure QEMU to run independent of the host clock zephyrproject-rtos#14173"

  We have been struggling for years with issues related to how QEMU
  attempts to synchronize guest timer interrupts with the host clock,
  for example zephyrproject-rtos#12553. The symptom is that heavily loaded sanitycheck
  runs have tests spuriously failing due to timing related issues.

  This creates noise in our CI runs which masks true bugs in our system
  which manifest only intermittently, causing real issues that will
  happen all the time at scale to be 'swept under the rug'; right now
  any time a test fails sanitycheck retries it a few times and only
  consecutive failures produce an error.

There's also a lot of relevant information and more links in:
"List of tests that keep failing sporadically" zephyrproject-rtos#12553

This new "emu_time" tag helps by letting users either select or exclude
the tests that really need accurate time to pass and have a high chance
to actually be impacted by this emulation issue. As an example, it's
only with 'sanitycheck --exclude emu_time' that I could spot and file
intermittent but non-emu_time issue zephyrproject-rtos#16915. As Andrew predicted above,
it was drown in emu_time noise before that.

Conversely, "--tag emu_time" can be used by developers focusing on
fixing qemu's -icount feature, for instance zephyrproject-rtos#14173 or others.

Even before qemu's -icount is fixed, Continuous Integration could be
split in two separate runs: A. --tag emu_time and B. --exclude
emu_time. Only A tests would be allowed retries which would stop hiding
other, unrelated intermittent issues affecting B tests.

This initial commit does not pretend to exhaustively tag all affected
tests. However it's an already functional and useful start of 14 tests
collected from and tested over weeks of local sanitycheck runs and
_partially_ reviewed by qemu clock expert Andy Ross.

This commit also increases the individual timeout of 7 tests that have
been observed to consistently take much longer than the
median (2-3s). This flags how unusually long these are, lets users
temporarily reduce the very long 60s default timeout in their local
workspace and finally should reduce the chance of them timing out on a
heavily loaded system. Set their timeout to 3-4 times the duration
observed in CI and locally.

Signed-off-by: Marc Herbert <marc.herbert@intel.com>
@marc-hb marc-hb force-pushed the sanitycheck-emu-time branch from 0e8918a to e730517 Compare June 27, 2019 02:13
@marc-hb marc-hb added area: Sanitycheck Sanitycheck has been renamed to Twister Enhancement Changes/Updates/Additions to existing features labels Jun 27, 2019
Copy link
Member

@nashif nashif left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

na, we should fix the problem for real, not hide it under the rug.

@marc-hb
Copy link
Collaborator Author

marc-hb commented Jun 27, 2019

This new tag:

  1. doesn't hide anything,
  2. exposes non-qemu issues that have been hidden for years and are still routinely being hidden now. See commit message.

@marc-hb marc-hb requested review from wentongwu and ceolin June 27, 2019 02:22
@@ -4,4 +4,4 @@ common:
tests:
net.trickle:
min_ram: 12
tags: net trickle
tags: net trickle emu_time
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was some issues with Trickle test earlier (like months ago) but I though I had changed the tests slightly to "fix" the issues and I have not seen (at least in networking PRs) that the Trickle tests are failing. Do you have a link to sanitycheck log about the failure case?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jukkar are you saying these tests do not require a reliable clock or not any more?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are you saying these tests do not require a reliable clock or not any more?

I was not saying that (at least not yet :), I was just wondering if this is needed because personally I have not seen issues with Trickle tests lately.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This emu_time tag is purely about marking tests that require a reliable clock. Have you observed clock issues in general? For instance have you seen any other test in this list failing sometimes with timing related errors?

  • If not then your system somehow provides tests a reliable clock. See references in the commit message.
  • If you do observe on your system random timing failures spread across various tests then I may have mistaken a past failure of this trickle test for an emu_time issue.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed this emu_time tag on tests/net/trickle and it does fail intermittently at fairly recent commit 8f1780f. Caught this one just now:

starting test - test_trickle_2_wait
Trickle 1 0x00400040 callback called

    Assertion failed at ZEPHYR_BASE/tests/net/trickle/src/main.c:171:
                 test_trickle_2_wait: (cb_2_called is false)

Trickle 2 no timeout
FAIL - test_trickle_2_wait

I don't know these tests. @andyross , @jukkar, anyone: does this assert look like a clock-related / emu_time issue or something else?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know these tests. @andyross , @jukkar, anyone: does this assert look like a clock-related / emu_time issue or something else?

The test is clock related. While running the test in mps2_an385, I noticed that the test emits error print. I sent a #17289 that fixes the error printing but it does not touch the clock issues.
I could not reproduce the issue you mentioned after running the sanitychecker or doing "make run" quite many times in my host.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kept running this particular test locally and regularly and to be fair it seems to fail rarely. Now maybe it's just because the couple workarounds recently merged in sanitycheck made all emu_time tests fail more rarely...

I could not reproduce the issue you mentioned after running the sanitychecker or doing "make run" quite many times in my host.

Unless we're not talking about the same "make run", this runs only one test at a time AFAIK. For sanitycheck it depends on many factors, see @andyross analysis referenced above.

Copy link
Contributor

@andrewboie andrewboie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only A tests would be allowed retries which would stop hiding
other, unrelated intermittent issues affecting B tests.

Great!! Is there related logic implemented in CI to make this happen?

@nashif I think we should merge this, @wentongwu is of course working on fixing QEMU but that may be some ways off, and we may not be able to get it working for all the QEMU architecture variants for 2.0. We can always remove these tags once the final fix is in.

@marc-hb marc-hb requested review from mbolivar and aescolar June 28, 2019 17:40
@marc-hb
Copy link
Collaborator Author

marc-hb commented Jun 28, 2019

Only A tests would be allowed retries which would stop hiding
other, unrelated intermittent issues affecting B tests.

Great!! Is there related logic implemented in CI to make this happen?

I don't know. I've used sanitycheck a million times but I'm not familiar with the CI-only code. I mention this in the commit message because 1. there's already some (indiscriminate) retry logic, so CI is hopefully "half-way there", 2. Isolating flaky tests is the well known practice: https://martinfowler.com/articles/nonDeterminism.html#Quarantine

Initially people will look at the failure report and notice that the failures are in non-deterministic tests, but soon they'll lose the discipline to do that. Once that discipline is lost, then a failure in the healthy deterministic tests will get ignored too. At that point you've lost the whole game and might as well get rid of all the tests.

There's plenty of literature about flaky tests, here's other ones: https://testing.googleblog.com/2016/05/flaky-tests-at-google-and-how-we.html
https://www.thoughtworks.com/insights/blog/no-more-flaky-tests-go-team

@nashif
Copy link
Member

nashif commented Jun 28, 2019

still think this is a bad idea and not sure who is going to be the user of this. Tags actually are not great in this case, because you are going to exclude everyone, when we have some architecture that have no issues with timing.

The qemu timing issue happens mostly in CI, when running locally, if something fails, you always have the option to rerun the tests and get them all pass on second run, so again, not sure who is going to be the user of this. I'd rather exclude tests selectively than use such a tag.

@andyross
Copy link
Contributor

Definitely approve of the tag. Ideally we would augment the "--only-failed" feature to be able to use it (maybe that works already; not sure what the precedence is of filters vs. --only-failed). Long term, yes, we want deterministic tests. But realistically we're going to be struggling with issues like this repeatedly, and having a framework for handling it can't be a bad thing.

@marc-hb
Copy link
Collaborator Author

marc-hb commented Jun 29, 2019

you are going to exclude everyone, when we have some architecture that have no issues with timing.

Fair point. Which architectures? The failures I observed seemed spread across all qemus but I didn't look closely.
(This tag should obviously not be used when testing hardware, it's pretty clear from the name.)

when running locally, if something fails, you always have the option to rerun the tests and get them all pass on second run,

Some of my runs last a day (I don't have a powerful system for every OS) so waiting a second day and manually comparing the sets of failures isn't really an option. Unless you're referring to --only-failed too? I haven't looked into it yet, I will.

In any case an indiscriminate retry is dropping on the floor other intermittent failures for other, unrelated causes. Known as the "broken window" in CI. Even with --exclude emu_time I'm still seeing a couple dozens of other tests failing intermittently. I filed a couple bugs already but I'm not going to keep bothering unless a little bit is done to help with the categorization. I'll just keep a big fat "flaky" tag for myself and done. Once PR #14593 reaches some level of completion I will hopefully not need wide and frequent test runs any more so it'll be even easier for me to ignore intermittent failures.

I'd rather exclude tests selectively than use such a tag.

I miss the conceptual difference between a list of tests and a tag (and no one has to use either)

not sure who is going to be the user of this

Anyone or anything interested in accurate test reports.

Long term, yes, we want deterministic tests.

I found this interesting: https://testing.googleblog.com/2016/05/flaky-tests-at-google-and-how-we.html

We have invested a lot of effort in removing flakiness from tests, but overall the insertion rate is about the same as the fix rate, meaning we are stuck with a certain rate of tests that provide value, but occasionally produce a flaky result.

Maybe it's because engineers have more time to write more (stress) tests when they run out of bugs to fix. The number of engineers is finite, the number of bugs not so sure. Especially not when concurency's involved.

@marc-hb marc-hb requested a review from mike-scott July 9, 2019 21:04
@mike-scott
Copy link
Contributor

The addition of the tag alone doesn't hurt anyone and might let CI make intelligent decisions about when to retry and when not to. However, adjusting the timeout values in this PR might need to be debated. That's an active change to the tests regardless of whether the tag is used or not.

Copy link
Collaborator Author

@marc-hb marc-hb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, adjusting the timeout values in this PR might need to be debated.

I'll split it out, it is a mostly unrelated change. I put them together because of the proximity and version control inconvenience but there aren't that many tests where I changed both after all.

That's an active change to the tests regardless of whether the tag is used or not.

Not sure what you mean by "active". As the current default is 60 it seemed pretty minor to me.

@@ -4,4 +4,4 @@ common:
tests:
net.trickle:
min_ram: 12
tags: net trickle
tags: net trickle emu_time
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kept running this particular test locally and regularly and to be fair it seems to fail rarely. Now maybe it's just because the couple workarounds recently merged in sanitycheck made all emu_time tests fail more rarely...

I could not reproduce the issue you mentioned after running the sanitychecker or doing "make run" quite many times in my host.

Unless we're not talking about the same "make run", this runs only one test at a time AFAIK. For sanitycheck it depends on many factors, see @andyross analysis referenced above.

@@ -2,7 +2,8 @@ sample:
name: CMSIS_RTOS_V2 Synchronization
tests:
sample.portability.cmsis_rtos_v2.timer_synchronization:
tags: cmsis_rtos_v2_synchronization
timeout: 60
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the default is 60, why would we need it expressly set here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because I don't use 60 seconds as a default, it's way too long. Most tests complete in 5-10 seconds, 15 seconds default is enough.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@marc-hb why are you doing that? sanitycheck detects when the tests complets, all changing the default to a shorter time is gonna do is increase the likelihood your run will fail

Copy link
Collaborator Author

@marc-hb marc-hb Jul 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are you doing that?

Simple: when some random test that normally takes 5 seconds hangs forever, I prefer to wait 15 seconds than 60.

all changing the default to a shorter time is gonna do is increase the likelihood your run will fail

E pur si muove. I've been using a shorter default and the very few tests in this PR are the only ones that needed a longer timeout. Does this mean the 60s default is too conservative? I think so.

I thought it would be uncontroversial to flag the very few, long duration tests in the entire suite but forget it, this is not the point of this PR and as I said above I'll remove them (when I get back from vacation).

(Now I just remembered about the existing "slow" tag, didn't look at that yet)

@@ -1,3 +1,5 @@
tests:
kernel.memory_pool:
tags: kernel mem_pool
timeout: 60
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same with default of 60 set here.

@@ -1,4 +1,5 @@
tests:
kernel.semaphore:
timeout: 60
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto.

Copy link
Contributor

@pfalcon pfalcon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find description of the issue here detailed and argumentation convincing, so +1.

And vice-versa, I find comments like:

na, we should fix the problem for real, not hide it under the rug.

to be rather unproductive. Translated into plain language, it's "Go and fix all the issues which we carefully accumulated here for ages - at once. If not, we'd rather continue to cultivate those issues further."

@mbolivar mbolivar removed their request for review August 17, 2019 03:29
@marc-hb marc-hb mentioned this pull request Oct 3, 2019
@nashif
Copy link
Member

nashif commented Dec 18, 2019

Does not seem to be going anywhere, closing. Rebase and reopen with new data if you think this still needs to go in.

@nashif nashif closed this Dec 18, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: Kernel area: Networking area: Samples Samples area: Sanitycheck Sanitycheck has been renamed to Twister area: Tests Issues related to a particular existing or missing test Enhancement Changes/Updates/Additions to existing features
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants