-
Notifications
You must be signed in to change notification settings - Fork 324
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Revert scheduler changes - restore BYT/CHT functionality #3245
Revert scheduler changes - restore BYT/CHT functionality #3245
Conversation
This reverts commit 443319a. Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
This reverts commit e8dc076. Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
This reverts commit e003afe. Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
The CI results speak for themselves, it's the first time we have reliable results, with only the known issues with Cyan kmod load/unload tests. |
Gah, this is really annoying. If I don't get any feedback by 3pm my time I will merge this, and file a bug to track the APL/DMIC regression. We need to stabilize our CI ASAP. |
I may break all processes here but I need to merge this to get the CI on track. Thanks for your understanding if I end-up breaking something else. The APL DMIC issue is filed here: #3258 |
@plbossart to get CI on track maybe fix it in the fisrt place, how can I verify if it breaks CI if it's broken and useless at all for any testing, there are false negatives/positives, missing logs all the time ... |
@plbossart BYT/CHT are not officially supported by the SOF. If you want to sustain the SOF functionality for that targets then please assign dedicated support that will validate the PRs on BYT/CHT and provide feedback before the changes get merged. We cannot allow for situation like this, where PR is under review for a month and problem with BYT/CHT is raised after the merge. This cause unnecessary tension and rapid reverts. |
@mwasko @jajanusz I understand your frustration, this situation is not acceptable indeed. CI has been broken for too many weeks - maybe months even, and it makes the job of developers and maintainers harder. Merging code because of 'usual issues' or no reports at all is not a good practice. So yes we can vent as much as you want, the reality is that our development process for merges to the mainline is plain broken. We are currently doing a massive effort to harden CI tests and make sure we have a clear pass/fail result for all platforms. Unfortunately, in this case this PR removes the trace functionality so we cannot even debug what the problem might be. This also shows that we didn't have a formal test that checks the trace was functional, which is another miss. Put differently, look at the half-full glass: we now have all tests passing for kernel CI, and only a limited number of daily tests that need attention. It should now be much better to submit code and get validation results. To improve things, I will also kindly ask that you work on your own firmware CI, which indicated a regression on this PR but with logs that cannot be seen, so that's not helping either. See https://sof-ci.01.org/sof-pr-viewer/#/build/PR3245/build5301944 |
@plbossart I agree that we need to focus on hardening and improving CI, but the revert you have performed has nothing to do with it. The CI test failure you are referring to is just a test that was not started at all due to network connectivity problem at validation environment. It can happen sporadically in any CI test execution. We will work to improve it and the progress will be tracked under https://github.com/thesofproject/sof-test/issues/294 The real reason for the revert was the BYT/CHT functionality restore so I will repeat my question, how are you planning @plbossart to cover CI for these targets? We do not have BYT/CHT in FW CI (not supported) and we cannot guarantee the compatibility of any PR with these targets. Are you planning to revert each patch that will cause regression on BYT/CHT? |
The kernel CI covers BYT/CHT so we should know ahead of time when things are broken. |
To do so we need to first make sure that we got repeatable successful CI test results. I have checked https://github.com/thesofproject/sof/issues/3170 and the problem was reported before the scheduler changes https://github.com/thesofproject/sof/pull/3057 got integrated. What is more I see in comments that the issue continue to reproduce after your revert and is still not resolved - can trigger random failures in CI test results. The BYT check-alsabat-headset-playback.sh CI test continue to fail randomly as well. This looks to me that the revert you have performed was unnecessary and actually didn't fix anything. I will recommend to restore the PR on master unless we got technical evidence that the proposed changes cause the regression. |
OMG, you are being really obtuse here. Scheduling changes -> no trace, it's been reported by multiple people and trace is restored now. The errors with alsa-bat are a separate issue due to underflows. |
@mwasko fwiw, there is a new jenkins CI feature that spits out a weekly email showing which CI tests/DUTs are occasionally failing (due to either FW bugs, SW bugs, CI bugs or DUT stability) and the fail rates. I think it's ready to roll out now. @mengdonglin can we add @mwasko and @slawblauciak to the CI weekly email so they can be kept informed of the tests/DUT that do ocasionally fail. A consistent test failure will lead to a BUG being reported. |
@plbossart if you would be more precise at a first place then we could avoid at least part of the conversation. Along the thread you are pointing on DMA tracing, instability issues, CI failures, FW CI missing logs and you admit that you even do not understand what the submitted PR (reverted) is actually doing. I am just clarifying that the only valid point is a missing DMA tracing on BYT which is actually not reported as a CI test failure. The reverted PR has just revealed weakness of BYT timers implementation which should be addressed by BYT maintainers. Lets track the current status and continue discussion at https://github.com/thesofproject/sof/pull/3268 @plbossart please also watch your language and stick to the technical discussion |
I apologize if you were offended by my wording. I am glad we are aligned on a common understanding of the issue and fixing CI. |
Baytrail + Cherrytrail have been plagued by instability issues for several weeks, and the trace is no longer functional.
By bisecting to restore the trace functionality, I came across this report pointing at e003afe / PR #3057
Reverting this series from @jajanusz restores the DMA trace on Baytrail/Cherrytrail, and in addition I cannot reproduce the stability issues with -EIO errors.
I won't pretend I know what these patches do but they certainly have an effect on these devices, so pushing these reverts to test my results with CI.