Revert scheduler changes - restore BYT/CHT functionality #3245

plbossart · 2020-07-28T15:24:11Z

Baytrail + Cherrytrail have been plagued by instability issues for several weeks, and the trace is no longer functional.

By bisecting to restore the trace functionality, I came across this report pointing at e003afe / PR #3057

e003afe582e41c4a8a52b21d45dbc83c166021a5 is the first bad commit
commit e003afe582e41c4a8a52b21d45dbc83c166021a5
Author: Janusz Jankowski <janusz.jankowski@linux.intel.com>
Date:   Fri Jun 5 20:33:47 2020 +0200

    schedule: edf: cycles budget
    
    Add possibility to set budget for EDF tasks.
    It's needed to ensure that some long running task will
    not stall other tasks.
    
    Signed-off-by: Janusz Jankowski <janusz.jankowski@linux.intel.com>

 src/include/sof/schedule/edf_schedule.h | 10 ++++++
 src/schedule/edf_schedule.c             | 56 ++++++++++++++++++++++++++++++++-
 2 files changed, 65 insertions(+), 1 deletion(-)

Reverting this series from @jajanusz restores the DMA trace on Baytrail/Cherrytrail, and in addition I cannot reproduce the stability issues with -EIO errors.

I won't pretend I know what these patches do but they certainly have an effect on these devices, so pushing these reverts to test my results with CI.

This reverts commit 443319a. Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>

This reverts commit e8dc076. Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>

This reverts commit e003afe. Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>

plbossart · 2020-07-28T16:29:12Z

The CI results speak for themselves, it's the first time we have reliable results, with only the known issues with Cyan kmod load/unload tests.

plbossart · 2020-07-29T15:10:14Z

Gah, this is really annoying.
No comments from firmware folks, and a broken unrelated test for APL / DMIC with logs that I cannot see (spinning circle with no results).

If I don't get any feedback by 3pm my time I will merge this, and file a bug to track the APL/DMIC regression. We need to stabilize our CI ASAP.

plbossart · 2020-07-29T20:50:32Z

I may break all processes here but I need to merge this to get the CI on track. Thanks for your understanding if I end-up breaking something else. The APL DMIC issue is filed here: #3258

jajanusz · 2020-07-30T10:33:19Z

@plbossart to get CI on track maybe fix it in the fisrt place, how can I verify if it breaks CI if it's broken and useless at all for any testing, there are false negatives/positives, missing logs all the time ...

mwasko · 2020-07-30T12:05:48Z

@plbossart BYT/CHT are not officially supported by the SOF. If you want to sustain the SOF functionality for that targets then please assign dedicated support that will validate the PRs on BYT/CHT and provide feedback before the changes get merged. We cannot allow for situation like this, where PR is under review for a month and problem with BYT/CHT is raised after the merge. This cause unnecessary tension and rapid reverts.

plbossart · 2020-07-30T15:13:44Z

@mwasko @jajanusz I understand your frustration, this situation is not acceptable indeed. CI has been broken for too many weeks - maybe months even, and it makes the job of developers and maintainers harder. Merging code because of 'usual issues' or no reports at all is not a good practice. So yes we can vent as much as you want, the reality is that our development process for merges to the mainline is plain broken.

We are currently doing a massive effort to harden CI tests and make sure we have a clear pass/fail result for all platforms. Unfortunately, in this case this PR removes the trace functionality so we cannot even debug what the problem might be. This also shows that we didn't have a formal test that checks the trace was functional, which is another miss.

Put differently, look at the half-full glass: we now have all tests passing for kernel CI, and only a limited number of daily tests that need attention. It should now be much better to submit code and get validation results.

To improve things, I will also kindly ask that you work on your own firmware CI, which indicated a regression on this PR but with logs that cannot be seen, so that's not helping either. See https://sof-ci.01.org/sof-pr-viewer/#/build/PR3245/build5301944

mwasko · 2020-08-03T08:27:53Z

@plbossart I agree that we need to focus on hardening and improving CI, but the revert you have performed has nothing to do with it. The CI test failure you are referring to is just a test that was not started at all due to network connectivity problem at validation environment. It can happen sporadically in any CI test execution. We will work to improve it and the progress will be tracked under https://github.com/thesofproject/sof-test/issues/294

The real reason for the revert was the BYT/CHT functionality restore so I will repeat my question, how are you planning @plbossart to cover CI for these targets?

We do not have BYT/CHT in FW CI (not supported) and we cannot guarantee the compatibility of any PR with these targets. Are you planning to revert each patch that will cause regression on BYT/CHT?

plbossart · 2020-08-03T17:41:12Z

We do not have BYT/CHT in FW CI (not supported) and we cannot guarantee the compatibility of any PR with these targets. Are you planning to revert each patch that will cause regression on BYT/CHT?

The kernel CI covers BYT/CHT so we should know ahead of time when things are broken.
But we also need to change the maintainers' review level so that 'usual CI errors' are no longer an acceptable reason for merging. We've been too lax in the last weeks and that is the point that creates instability.

mwasko · 2020-08-04T07:43:25Z

But we also need to change the maintainers' review level so that 'usual CI errors' are no longer an acceptable reason for merging.

To do so we need to first make sure that we got repeatable successful CI test results.

I have checked https://github.com/thesofproject/sof/issues/3170 and the problem was reported before the scheduler changes https://github.com/thesofproject/sof/pull/3057 got integrated. What is more I see in comments that the issue continue to reproduce after your revert and is still not resolved - can trigger random failures in CI test results.

The BYT check-alsabat-headset-playback.sh CI test continue to fail randomly as well.

This looks to me that the revert you have performed was unnecessary and actually didn't fix anything. I will recommend to restore the PR on master unless we got technical evidence that the proposed changes cause the regression.

plbossart · 2020-08-04T14:55:02Z

But we also need to change the maintainers' review level so that 'usual CI errors' are no longer an acceptable reason for merging.

To do so we need to first make sure that we got repeatable successful CI test results.

I have checked https://github.com/thesofproject/sof/issues/3170 and the problem was reported before the scheduler changes https://github.com/thesofproject/sof/pull/3057 got integrated. What is more I see in comments that the issue continue to reproduce after your revert and is still not resolved - can trigger random failures in CI test results.

The BYT check-alsabat-headset-playback.sh CI test continue to fail randomly as well.

This looks to me that the revert you have performed was unnecessary and actually didn't fix anything. I will recommend to restore the PR on master unless we got technical evidence that the proposed changes cause the regression.

OMG, you are being really obtuse here. Scheduling changes -> no trace, it's been reported by multiple people and trace is restored now. The errors with alsa-bat are a separate issue due to underflows.

lgirdwood · 2020-08-04T20:55:32Z

@mwasko fwiw, there is a new jenkins CI feature that spits out a weekly email showing which CI tests/DUTs are occasionally failing (due to either FW bugs, SW bugs, CI bugs or DUT stability) and the fail rates. I think it's ready to roll out now. @mengdonglin can we add @mwasko and @slawblauciak to the CI weekly email so they can be kept informed of the tests/DUT that do ocasionally fail. A consistent test failure will lead to a BUG being reported.

mwasko · 2020-08-05T07:25:38Z

OMG, you are being really obtuse here. Scheduling changes -> no trace, it's been reported by multiple people and trace is restored now. The errors with alsa-bat are a separate issue due to underflows.

@plbossart if you would be more precise at a first place then we could avoid at least part of the conversation. Along the thread you are pointing on DMA tracing, instability issues, CI failures, FW CI missing logs and you admit that you even do not understand what the submitted PR (reverted) is actually doing.

I am just clarifying that the only valid point is a missing DMA tracing on BYT which is actually not reported as a CI test failure. The reverted PR has just revealed weakness of BYT timers implementation which should be addressed by BYT maintainers.

Lets track the current status and continue discussion at https://github.com/thesofproject/sof/pull/3268

@plbossart please also watch your language and stick to the technical discussion

plbossart · 2020-08-05T15:18:33Z

@plbossart please also watch your language and stick to the technical discussion

I apologize if you were offended by my wording. I am glad we are aligned on a common understanding of the issue and fixing CI.

plbossart added 3 commits July 28, 2020 10:12

Revert "idc: set budget for IDC task"

8709745

This reverts commit 443319a. Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>

Revert "drivers: cavs: set budget for IPC task"

8ae9689

This reverts commit e8dc076. Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>

Revert "schedule: edf: cycles budget"

e946a13

This reverts commit e003afe. Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>

plbossart requested review from dbaluta, jajanusz, lbetlej, lgirdwood, mmaka1, mrajwa and tlauda as code owners July 28, 2020 15:24

marc-hb mentioned this pull request Jul 28, 2020

cavs: budgeting for EDF tasks #3057

Merged

plbossart mentioned this pull request Jul 29, 2020

[BUG] CI regression with APL DMIC 16kHz #3258

Closed

plbossart merged commit 1b49c6a into thesofproject:master Jul 29, 2020

plbossart mentioned this pull request Jul 30, 2020

Missing FW logs on CI failures thesofproject/sof-test#294

Closed

marc-hb mentioned this pull request Jul 31, 2020

[BUG] error trace debugfs entry (etrace) is always empty #3265

Closed

jajanusz mentioned this pull request Aug 4, 2020

cavs: budgeting for EDF tasks #3268

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Revert scheduler changes - restore BYT/CHT functionality #3245

Revert scheduler changes - restore BYT/CHT functionality #3245

plbossart commented Jul 28, 2020 •

edited by marc-hb

Loading

plbossart commented Jul 28, 2020

plbossart commented Jul 29, 2020

plbossart commented Jul 29, 2020

jajanusz commented Jul 30, 2020 •

edited

Loading

mwasko commented Jul 30, 2020

plbossart commented Jul 30, 2020

mwasko commented Aug 3, 2020

plbossart commented Aug 3, 2020

mwasko commented Aug 4, 2020

plbossart commented Aug 4, 2020 •

edited

Loading

lgirdwood commented Aug 4, 2020

mwasko commented Aug 5, 2020

plbossart commented Aug 5, 2020

Revert scheduler changes - restore BYT/CHT functionality #3245

Revert scheduler changes - restore BYT/CHT functionality #3245

Conversation

plbossart commented Jul 28, 2020 • edited by marc-hb Loading

plbossart commented Jul 28, 2020

plbossart commented Jul 29, 2020

plbossart commented Jul 29, 2020

jajanusz commented Jul 30, 2020 • edited Loading

mwasko commented Jul 30, 2020

plbossart commented Jul 30, 2020

mwasko commented Aug 3, 2020

plbossart commented Aug 3, 2020

mwasko commented Aug 4, 2020

plbossart commented Aug 4, 2020 • edited Loading

lgirdwood commented Aug 4, 2020

mwasko commented Aug 5, 2020

plbossart commented Aug 5, 2020

plbossart commented Jul 28, 2020 •

edited by marc-hb

Loading

jajanusz commented Jul 30, 2020 •

edited

Loading

plbossart commented Aug 4, 2020 •

edited

Loading