-
Notifications
You must be signed in to change notification settings - Fork 131
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ASoC: SOF: loader: release_firmware() on failures to avoid batching #2954
ASoC: SOF: loader: release_firmware() on failures to avoid batching #2954
Conversation
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would extend the patch with:
diff --git a/sound/soc/sof/core.c b/sound/soc/sof/core.c
index 6fcbcd8cb8d8..82eaa2d197dd 100644
--- a/sound/soc/sof/core.c
+++ b/sound/soc/sof/core.c
@@ -450,7 +450,6 @@ int snd_sof_device_remove(struct device *dev)
ret);
snd_sof_ipc_unregister_clients(sdev);
- snd_sof_fw_unload(sdev);
snd_sof_ipc_free(sdev);
snd_sof_free_debug(sdev);
snd_sof_free_trace(sdev);
@@ -473,8 +472,7 @@ int snd_sof_device_remove(struct device *dev)
snd_sof_remove(sdev);
/* release firmware */
- release_firmware(pdata->fw);
- pdata->fw = NULL;
+ snd_sof_fw_unload(sdev);
return 0;
}
sound/soc/sof/loader.c
Outdated
@@ -714,6 +714,7 @@ int snd_sof_load_firmware_raw(struct snd_sof_dev *sdev) | |||
ret = request_firmware(&plat_data->fw, fw_filename, sdev->dev); | |||
|
|||
if (ret < 0) { | |||
plat_data->fw = NULL; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure you need this?
the sdev->pdata is devm_kzalloced during module loading. Furthermore if we are here the plat_data->fw must have been NULL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and on request_firmware() failure the plat_data->fw is set to NULL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was too lazy and paranoid to trust request_firmware
but now I see it's even documented.
sound/soc/sof/loader.c
Outdated
/* Don't cache a firmware that failed to boot. TODO: fully | ||
* support module unloading at runtime too. | ||
*/ | ||
if (sdev->pdata->fw) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to check, release_firmware() is doing it internally.
@marc-hb, I can not reproduce the issue you are fixing or the description is not correct. |
@marc-hb, this is what I get on tgl-h:
It fails as it should when the firmware is not present. |
Thanks for testing this. I thoroughly tested this commit and the description is exactly what I'm experiencing. You can see the loader code has some platform-specific code, maybe that's why? I need to test again with your changes anyway, then I will provide an exact SOF commit and |
@marc-hb, I have tested without this commit in hope to reproduce the issue it is fixing |
Sorry, I meant: I thoroughly tested with and without this commit and the description is exactly what I'm experiencing on my APL Up^2 |
@marc-hb, I also have an up^2 around but I need to set it up to test on that as well. |
Done, see updated commit message. I've also been using the modprobe scripts at https://github.com/thesofproject/sof-test/tree/1e95bb742af/tools/kmod, please make sure you do the same. |
The No APL_UP2_HDA was apparently available at the time of this run. Everything else in that run is green.
Do you have I'm not overly surprised by differences in behavior because as I mentioned above the loader code for APL is slightly different and because Now does this "caching" also imply "ignore any new firmware file"? Maybe, maybe not. Maybe there is some SOF-independent error handling bug there in request_firmware() but if releasing firmware on failure avoids that bug then so be it - it's the right thing to do anyway. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I understand what this is doing.
- "Unload and reload modules to try loading it again." - and you did try unloading all SOF modules? Where can the old firmware be kept cached then?
- As far as I understand when you add
panic()
to firmware startup code, the driver still probes fine, only itssof_probe_continue()
fails, right? So, then when you unload the driver itssnd_sof_device_remove()
is called and it callsrelease_firmware()
. After your patchsnd_sof_fw_unload()
gets called twice: first time whensof_probe_continue()
jumps torun_err
, second time when you unload the driver. And nowsof_probe_continue()
is no longer empty, it callsrelease_firmware()
both times, second time with aNULL
argument.
Am I missing anything?
@lyakh, @marc-hb, that is the case if CONFIG_SND_SOC_SOF_PROBE_WORK_QUEUE=y, otherwise the When the work is used and the firmware fails to boot then there is no point of keeping it loaded, it can be released. The module will remain loaded, yes, we can not trigger it again to try to load again. The only way to try with a new firmware to rmmod and modprobe the sound modules. I think the change is OK but I still fail to see how the |
I think we need to rootcause why the behaviour is different before we can proceed with this. I'm also getting "error: request firmware intel/sof/sof-tgl.ri failed err: -2" if I rename the FW file after error. I have "CONFIG_FW_CACHE=y" as well. |
@marc-hb If you set CONFIG_FW_CACHE=n , does the problem disappear? |
@marc-hb are you still working on this or is this still open? more than 2 weeks since last update... |
This fix is still saving my life regularly so investigating further and answering review comments is still relatively high on my TODO list. However as you've noticed it's not at the very top either. Feel free to close if that helps you managing PRs in this repo, I'm not managing my TODO list in github and I can easily re-open before submitting a new version. Or how about this middle ground: downgrading this to a draft? Drafts are very easy to exclude for you: |
No, I reproduced with and without CONFIG_FW_CACHE
I got to the bottom of this: it's not because of caching but because of "batching", an undocumented feature implemented in The only checkpatch warning at https://sof-ci.01.org/linuxpr/PR2954/build5971/checkpatch/ is The only thing I changed in this force-push is the commit message, the code is exactly the same as last time. I re-used the same old base to avoid range-diff noise (another thing not supported by github) however this was retested again on a much newer base, more specifically base commit 132b2ee https://sof-ci.01.org/linuxpr/PR2954/build5971/devicetest/?model=TGLH_RVP_HDA&testcase=check-suspend-resume-5 is the usual rtcwake TIMEOUT https://sof-ci.01.org/linuxpr/PR2954/build5971/devicetest/?model=APL_UP2_NOCODEC&testcase=check-alsabat-headset-playback is some unrelated alsabat failure Everything else is green. |
@marc-hb, thank you for the thorough investigation and explanation! I think there is a reason why
We load the firmware and then we hold on to it as long as the driver is loaded. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @marc-hb, looks good.
* TODO: fully support module unloading at runtime too. | ||
*/ | ||
release_firmware(sdev->pdata->fw); | ||
sdev->pdata->fw = NULL; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in terms of what it achieves today , this is fine. But when and if we do start supporting module unload at runtime, won't this need to be moved before snd_sof_remove()?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ranj063 I don't understand your comment sorry. Do you mean release_firmware()
should be invoked before snd_sof_remove()
in the future? I don't see why. As @ujfalusi wrote, the only reason we need the struct firmware
again is to resume after suspend (especially on platforms without IMR resume), even for that it shouldn't be needed eventually. So releasing struct firmware
before or after some releasing other things does not matter, does it? Of course the sooner the better but I don't see any required order or dependency.
I asked the compiler to list all places where we use struct firmware
, here is the complete list as of commit 132b2ee
- loader.c
snd_sof_load_firmware_raw()
snd_sof_load_firmware_memcpy()
snd_sof_device_remove() or snd_sof_fw_unload() # this PR
- hda-loader.c
hda_dsp_cl_boot_firmware()
hda_dsp_cl_boot_firmware_iccmax()
It's not used anywhere else with my current .config
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure why the order of release_firmware() would matter, in snd_sof_device_remove()
the firmware is not needed, we could have drop it first thing or last thing. It does not really matter.
@ranj063, what does "module unload at runtime" means? Do you want to be able to remove the modules while audio is playing or something else?
Are we preparing for hot-plugable DSPs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from what I understood, I think runtime module unload means you load/unload FW modules based on topology. But I could be wrong. And in any cae, we're a bit far from supporting this at the moment. So this PR is good to go for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No objections on the patch proper but if @marc-hb can clean-up the commit message and comments that would be good. they are a bit too verbose for upstream reviews. All you need to mention if the 'request batching' and why it results in the wrong file being used, it's not necessary to include reproduction information outside of the "SOF team".
sound/soc/sof/loader.c
Outdated
/* TODO: support module unloading at runtime */ | ||
/* Don't cache a firmware that failed to boot. | ||
* TODO: fully support module unloading at runtime too. | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we don't have such plans at the moment, let's remove this to avoid issues when upstreaming.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm confused by this existing TODO, should I leave it as is? Even though I'm adding code below it? Or you asking to remove the comment? But you wrote "we don't have such plans".
I'm confused because we're testing this TODO regularly and it is what this PR fixes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By editing an existing comment, you make it unclear what exactly Intel will do.
This patch only releases firmware on errors. You can either
a) remove the TODO in a separate patch, or
b) keep it as is for now.
My point was: don't edit something that you didn't modify, and the 'fully' statement is not useful.
Invoke release_firmware() when the firmware fails to boot in sof_probe_continue(). The request_firmware() framework must be informed of failures in sof_probe_continue() otherwise its internal "batching" feature (different from caching) cached the firmware image forever. Attempts to correct the file in /lib/firmware/ were then silently and confusingly ignored until the next reboot. Unloading the drivers did not help because from their disconnected perspective the firmware had failed so there was nothing to release. Also leverage the new snd_sof_fw_unload() function to simplify the snd_sof_device_remove() function. Signed-off-by: Marc Herbert <marc.herbert@intel.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @marc-hb, LGTM. This was a nasty issue to look into.
Yes of course. See below.
In "batched requests".
The drivers appear in
No, when I unload the drivers after a panic then
No, only the first time because there's no device, see above.
I don't understand this bit. |
let's give @ranj063 a chance to comment, if no objections we can merge. |
Just discovered this older GitHub issue: |
OLD, very verbose description below. For the much shorter commit message look at the commit itself.
While this does not seem enough to recover from any sort of DSP crash,
it does help recover from certain ones in my testing and the ability to
try a different firmware file is obviously a minimum that saves
considerable confusion and loss of time.
A simple reproduction of the issue addressed:
below. Try and fail to load that firmware.
SOF firmware version is printed instead!
Trying and failing to alternate between a failing and a working firmware
is another reproduction.
Note this issue is not related to CONFIG_FW_CACHE and happens with and
without caching. It it caused by an undocumented "request batching"
feature implemented in function alloc_lookup_fw_priv(). In other words,
the firmware loader does not know that the first request failed due
to the lack of release_firmware() call. As it assumes the first attempt
was successful, any further attempt is "batched" and re-uses the same
file. This is very clear from these logs:
Note the old
fw_priv
pointer anddata
despite the new firmware file.v2: leverage the now non-empty snd_sof_fw_unload() function to simplify
the code in snd_sof_device_remove(); no functional change in
snd_sof_device_remove(). Thanks Péter Ujfalusi.
To reproduce, make the following change on top of SOF commit
5d9d2dd or a0f789d and test with an Up Squared board:
Signed-off-by: Marc Herbert marc.herbert@intel.com