-
Notifications
You must be signed in to change notification settings - Fork 131
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ASoC:topology: bug fix oops cause by pointer dereference. #72
Conversation
when the tplg setting is incorrect in some parameters, the snd_kcontrol{} instance will not be allocated. when we do the module reload, the panic will be hit. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, needs a small comment update.
sound/soc/soc-topology.c
Outdated
int j; | ||
|
||
/* in some case:tplg incorrect configuration. | ||
* the kcontrol{} is not allocated successfully |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would change both comments to
/* kcontrol not guaranteed to be created so validate */
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. I will update it. thanks
Maybe a better option would be to report an error in the first place when the widget parameters are not correct, e.g. by something like the (untested) patch below. dapm/topology don't use return values that are available. diff --git a/sound/soc/soc-dapm.c b/sound/soc/soc-dapm.c
index 6d54128e44b4..d23234a4406c 100644
--- a/sound/soc/soc-dapm.c
+++ b/sound/soc/soc-dapm.c
@@ -3048,6 +3048,7 @@ int snd_soc_dapm_new_widgets(struct snd_soc_card *card)
{
struct snd_soc_dapm_widget *w;
unsigned int val;
+ int ret = -EINVAL;
mutex_lock_nested(&card->dapm_mutex, SND_SOC_DAPM_CLASS_INIT);
@@ -3070,23 +3071,26 @@ int snd_soc_dapm_new_widgets(struct snd_soc_card *card)
case snd_soc_dapm_switch:
case snd_soc_dapm_mixer:
case snd_soc_dapm_mixer_named_ctl:
- dapm_new_mixer(w);
+ ret = dapm_new_mixer(w);
break;
case snd_soc_dapm_mux:
case snd_soc_dapm_demux:
- dapm_new_mux(w);
+ ret = dapm_new_mux(w);
break;
case snd_soc_dapm_pga:
case snd_soc_dapm_out_drv:
- dapm_new_pga(w);
+ ret = dapm_new_pga(w);
break;
case snd_soc_dapm_dai_link:
- dapm_new_dai_link(w);
+ ret = dapm_new_dai_link(w);
break;
default:
break;
}
+ if (ret < 0)
+ return ret;
+
/* Read the initial power state from the device */
if (w->reg >= 0) {
soc_dapm_read(w->dapm, w->reg, &val);
diff --git a/sound/soc/soc-topology.c b/sound/soc/soc-topology.c
index ac3bbc142432..3d89189c1c29 100644
--- a/sound/soc/soc-topology.c
+++ b/sound/soc/soc-topology.c
@@ -1719,9 +1719,11 @@ static int soc_tplg_dapm_complete(struct soc_tplg *tplg)
}
ret = snd_soc_dapm_new_widgets(card);
- if (ret < 0)
+ if (ret < 0) {
dev_err(tplg->dev, "ASoC: failed to create new widgets %d\n",
ret);
+ return ret;
+ }
return 0;
} |
This solution seems better than mine. Can you send this PR? I will close mine.
|
Can we just print the error info in the snd_soc_dapm_new_widgets() function? |
If you agree not return directly. we had better stay in snd_soc_dapm_new_widgets() to finish all of the widget configuration. even there is an error in several widgets. we just print the error info in the function when we detect it to remind the developer. |
@zhigang-wu I don't have time to test this and I don't have a rt5651 board, can you try and resubmit a proposal. From my perspective, it's better to reject a bad topology upfront, and we need to make sure there is no memory leak on any error handling path. Thanks! |
@plbossart |
@plbossart the code to log: int snd_soc_dapm_new_widgets(struct snd_soc_card *card)
this is the log: ug 11 19:06:22 wzg-byt kernel: [ 13.810223] bytcr_rt5651 bytcr_rt5651: ==wzg:err create widget:ret=-22, id=12 |
@zhigang-wu I am not sure I understand your point. Are you saying that with our current solution there were tons of unreported errors? If yes, that's not good and needs to be root-caused. The suggestion that we need to avoid this error handling because it "breaks" our solution is not quite right, it just means our solution is already broken and only works by accident, not by design. |
@plbossart this means in the bycr_rt5651.c file. in the pre-defined widget table below, these widget is fail to allocate the widget. static const struct snd_soc_dapm_widget byt_rt5651_widgets[] = { |
@plbossart Aug 11 19:06:22 wzg-byt kernel: [ 13.810342] rt5651 i2c-10EC5651:00: ==wzg:err create widget:ret=-22, id=19 this means: in the rt5651.c file. the pre-defined widget in the rt5651_dapm_widgets[] table are failure during the widget allocation. this is the root cause. |
I have a different understanding of the word 'root cause', you need to dig deeper. It's unclear to me why standard widgets that are not SOF specific can't be added? It could very well be that you've discovered a problem that we've had forever, but these sort of widgets are pretty standard in all machine drivers... |
@plbossart |
@plbossart |
@plbossart I checked the history of the patch about the byt_rt5651_widgets[] in the bytcr_rt5651.c Aug 11 19:06:21 wzg-byt kernel: [ 11.171477] bytcr_rt5651 bytcr_rt5651: ==wzg:err create widget:ret=-22, id=12, num-kctl=0, name=Headphone |
@zhigang-wu This is a red herring (a wrong lead) The error that you are reporting does not mean that these widgets are useless, there's something else you haven't looked into. -22 means -EINVAL, and that's the value I initialized in the example code. Change it to zero so that the default: case will not report any issues... These widgets are used in ALL machine drivers, and you can see for yourself that the platform_clock_control() function is invoked when a stream is started/stopped. |
@plbossart Yes, the pre-defined widget in byt_rt5651_widgets[] table are useful. but they did not get the kcontrol{} in the snd_soc_dapm_new_widgets() function, for the widget->id is not supported. |
@zhigang-wu please change the example code I provided to this: int snd_soc_dapm_new_widgets(struct snd_soc_card *card)
{
struct snd_soc_dapm_widget *w;
unsigned int val;
int ret = 0;
mutex_lock_nested(&card->dapm_mutex, SND_SOC_DAPM_CLASS_INIT);
list_for_each_entry(w, &card->widgets, list)
{
if (w->new)
continue;
if (w->num_kcontrols) {
w->kcontrols = kzalloc(w->num_kcontrols *
sizeof(struct snd_kcontrol *),
GFP_KERNEL);
if (!w->kcontrols) {
mutex_unlock(&card->dapm_mutex);
return -ENOMEM;
}
}
switch(w->id) {
case snd_soc_dapm_switch:
case snd_soc_dapm_mixer:
case snd_soc_dapm_mixer_named_ctl:
ret = dapm_new_mixer(w);
break;
case snd_soc_dapm_mux:
case snd_soc_dapm_demux:
ret = dapm_new_mux(w);
break;
case snd_soc_dapm_pga:
case snd_soc_dapm_out_drv:
ret = dapm_new_pga(w);
break;
case snd_soc_dapm_dai_link:
ret = dapm_new_dai_link(w);
break;
default:
break;
}
if (ret < 0)
dev_err(w->dapm->dev, "==wzg:err create widget:ret=%d, id=%d, num-kctl=%d\n",
ret, w->id, w->num_kcontrols); |
I changed the code like this below, if the w->id is not supported, the ret value will be set with -65536. int snd_soc_dapm_new_widgets(struct snd_soc_card *card)
you can find the log it printed: [ 14.910574] sof-audio sof-audio: ==wzg:err create widget:ret=-65535, id=28, num-kctl=0, name=ssp0 Rx |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code looks fine, but change commit message to
ASoC: topology: fix pointer dereference during topology free
Some kcontrols may not be properly created during topology load. Check these during remove() prior to usage.
SoB
@zhigang-wu The dereference fix code looks ok, but are we missing some other patch for checking and reporting kcontrol results during topology load as discussed with @plbossart. Will this be in a new PR ? |
@lgirdwood |
@zhigang-wu you should report any errors where they are discovered. So if they are discovered in snd_soc_dapm_new_widgets() then you should report errors here too. |
@zhigang-wu please try without setting an error value for the default case (as I suggested but you changed my code...). It's not clear to me why you want this to be an error when all the widgets listed are perfectly legit - their definition can be found in drivers or topology. |
c616d4a
to
c465461
Compare
@plbossart |
@plbossart
|
@lgirdwood |
@plbossart
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not clear on the problem statement still and if the solution can be accepted upstream.
sound/soc/soc-dapm.c
Outdated
break; | ||
default: | ||
need_free = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this needed? Why do you need to free the kcontrols in that case? You will need to add a strong explanation to explain why this is needed - since it's going to affect a lot of platforms using this code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I will add the comments here to explain it.
there are several cases:
|
If we free the pointer in new_widgets() function, we did not need to add more pointer check in remove_widget() to avoid the oops issue. |
@zhigang-wu I don't understand what 'supported' widget means. There is no such concept, we use widgets that are part of an existing ASoC list, so your explanations are impossible to follow. I believe you are treating the default case as an error, but I don't think it's always the case; the code , so your systematic free is likely to have negative side effects. |
@plbossart in this oops case: the "siggen" widget is not processed, the kcontrol for this widget is not allocated. There is some unclear things, I need you to confirm:
|
@plbossart @lgirdwood
in this situation, the "siggen" widget will not be allocated the kcontrol instance. we did not need to |
@@ -3068,23 +3069,42 @@ int snd_soc_dapm_new_widgets(struct snd_soc_card *card) | |||
case snd_soc_dapm_switch: | |||
case snd_soc_dapm_mixer: | |||
case snd_soc_dapm_mixer_named_ctl: | |||
dapm_new_mixer(w); | |||
ret = dapm_new_mixer(w); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Checking the return is good, but dapm_new_mixer, dapm_new_mux etc should free any resources they allocate on failure. likewise dapm_create_or_share_kcontrol() will also need to be check that it frees any resource it allocates on failure. So please
- Check return values and propagate the return values up the call stack.
- Free any resources in the function that allocates them on any failure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- I checked the code in the dapm_create_or_share_kcontrol(), when it detected the error and before return, it has already finished free operation. in this functin, it has only two allocation operation for long_name and kcontrol.
like the snd_soc_dapm_add_path(), it also do this job already when return with error. - we only care about the w->kcontrols itself. when the sub-function return error. it will free the allocated w->kcontrols.
- But the updated PR also has the problem: when w->num_kcontrols is not equal to 0. but the code goto the default case. it will cause the w->kcontrols will be allocated. but never be processed further.
I think we have to free it when enter the default case.
@zhigang-wu my general comment is that we check return values and send them up the call stack and free any resources in the functions that allocate them. |
@lgirdwood |
@lgirdwood
But this semms still has some problems. I will do more research on this PR. |
I tested it with the v4.18 kernel. I am not sure whether it is the new issue or not. |
sound/soc/soc-dapm.c
Outdated
@@ -3054,37 +3110,32 @@ int snd_soc_dapm_new_widgets(struct snd_soc_card *card) | |||
if (w->new) | |||
continue; | |||
|
|||
if (w->num_kcontrols) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Keep allocation here for w->kcontrols,
sound/soc/soc-dapm.c
Outdated
break; | ||
default: | ||
break; | ||
} | ||
|
||
if (ret < 0) { | ||
mutex_unlock(&card->dapm_mutex); | ||
return ret; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
free w->kcontrols here.
check the return value to free the kcontrols instance to avoid oops caused by the pointer dereference. Signed-off-by: Wu Zhigang <zhigang.wu@linux.intel.com>
break; | ||
case snd_soc_dapm_dai_link: | ||
dapm_new_dai_link(w); | ||
ret = dapm_new_dai_link(w); | ||
break; | ||
default: | ||
break; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should the code in the default case:
if (w->num_kcontrols) {
kfree(w->kcontrols);
w->kcontrols = NULL;
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, do it outside default case otherwise we wont kfree() all resources.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lgirdwood
what should i do if we are in this case?
the w->num_kcontrols != 0, the default case will be entered.
in this case, the w->kcontrols will be allocated. but it is not processed by sub-function, because the default case is entered. then question comes: should we kfree this allocated w->kcontrols at this time?
If yes, we have to kfree it. then I will roll back to the previous version: adding the flag in the default case. when the flag is set in the default. we will do the kfree.
if no, I think we can not cover the "siggen" case, which cause our panic when in tplg free stage.
its w->num_kcontrols = 1, but "snd_soc_dapm_siggen" is not in this switch{}, it will enter the default case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zhigang-wu I need you to improve the current patch to consider all error paths in the function. default case is a valid for other widget types so cant be used as a means of freeing resources. Just walk through this function line by line and see where things could fail and then ask where do I recover this ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lgirdwood that function adds new widgets by walking a linked list of them. If adding one of them fails, looks like previously successfully added widgets are kept. Without this patch in such a case the function could return an error, if allocation failed, or 0, if initialisation failed. I can see the following possibilities in case of a partial success:
- free all so far successfully added widgets, using snd_soc_dapm_free_widget() and return an error
- keep all successfully added widgets and return 0
- keep and return an error seems illogical to me
Which one would you prefer? 1 seems the most consistent to me. Also note, that most users of the function don't even check its return code...
break; | ||
case snd_soc_dapm_dai_link: | ||
dapm_new_dai_link(w); | ||
ret = dapm_new_dai_link(w); | ||
break; | ||
default: | ||
break; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zhigang-wu I need you to improve the current patch to consider all error paths in the function. default case is a valid for other widget types so cant be used as a means of freeing resources. Just walk through this function line by line and see where things could fail and then ask where do I recover this ?
BugLink: https://bugs.launchpad.net/bugs/1821607 commit baef1c9 upstream. Using the batch API from the interconnect driver sometimes leads to a KASAN error due to an access to freed memory. This is easier to trigger with threadirqs on the kernel commandline. BUG: KASAN: use-after-free in rpmh_tx_done+0x114/0x12c Read of size 1 at addr fffffff51414ad84 by task irq/110-apps_rs/57 CPU: 0 PID: 57 Comm: irq/110-apps_rs Tainted: G W 4.19.10 #72 Call trace: dump_backtrace+0x0/0x2f8 show_stack+0x20/0x2c __dump_stack+0x20/0x28 dump_stack+0xcc/0x10c print_address_description+0x74/0x240 kasan_report+0x250/0x26c __asan_report_load1_noabort+0x20/0x2c rpmh_tx_done+0x114/0x12c tcs_tx_done+0x450/0x768 irq_forced_thread_fn+0x58/0x9c irq_thread+0x120/0x1dc kthread+0x248/0x260 ret_from_fork+0x10/0x18 Allocated by task 385: kasan_kmalloc+0xac/0x148 __kmalloc+0x170/0x1e4 rpmh_write_batch+0x174/0x540 qcom_icc_set+0x8dc/0x9ac icc_set+0x288/0x2e8 a6xx_gmu_stop+0x320/0x3c0 a6xx_pm_suspend+0x108/0x124 adreno_suspend+0x50/0x60 pm_generic_runtime_suspend+0x60/0x78 __rpm_callback+0x214/0x32c rpm_callback+0x54/0x184 rpm_suspend+0x3f8/0xa90 pm_runtime_work+0xb4/0x178 process_one_work+0x544/0xbc0 worker_thread+0x514/0x7d0 kthread+0x248/0x260 ret_from_fork+0x10/0x18 Freed by task 385: __kasan_slab_free+0x12c/0x1e0 kasan_slab_free+0x10/0x1c kfree+0x134/0x588 rpmh_write_batch+0x49c/0x540 qcom_icc_set+0x8dc/0x9ac icc_set+0x288/0x2e8 a6xx_gmu_stop+0x320/0x3c0 a6xx_pm_suspend+0x108/0x124 adreno_suspend+0x50/0x60 cr50_spi spi5.0: SPI transfer timed out pm_generic_runtime_suspend+0x60/0x78 __rpm_callback+0x214/0x32c rpm_callback+0x54/0x184 rpm_suspend+0x3f8/0xa90 pm_runtime_work+0xb4/0x178 process_one_work+0x544/0xbc0 worker_thread+0x514/0x7d0 kthread+0x248/0x260 ret_from_fork+0x10/0x18 The buggy address belongs to the object at fffffff51414ac80 which belongs to the cache kmalloc-512 of size 512 The buggy address is located 260 bytes inside of 512-byte region [fffffff51414ac80, fffffff51414ae80) The buggy address belongs to the page: page:ffffffbfd4505200 count:1 mapcount:0 mapping:fffffff51e00c680 index:0x0 compound_mapcount: 0 flags: 0x4000000000008100(slab|head) raw: 4000000000008100 ffffffbfd4529008 ffffffbfd44f9208 fffffff51e00c680 raw: 0000000000000000 0000000000200020 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: fffffff51414ac80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fffffff51414ad00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >fffffff51414ad80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ fffffff51414ae00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fffffff51414ae80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc The batch API sets the same completion for each rpmh message that's sent and then loops through all the messages and waits for that single completion declared on the stack to be completed before returning from the function and freeing the message structures. Unfortunately, some messages may still be in process and 'stuck' in the TCS. At some later point, the tcs_tx_done() interrupt will run and try to process messages that have already been freed at the end of rpmh_write_batch(). This will in turn access the 'needs_free' member of the rpmh_request structure and cause KASAN to complain. Furthermore, if there's a message that's completed in rpmh_tx_done() and freed immediately after the complete() call is made we'll be racing with potentially freed memory when accessing the 'needs_free' member: CPU0 CPU1 ---- ---- rpmh_tx_done() complete(&compl) wait_for_completion(&compl) kfree(rpm_msg) if (rpm_msg->needs_free) <KASAN warning splat> Let's fix this by allocating a chunk of completions for each message and waiting for all of them to be completed before returning from the batch API. Alternatively, we could wait for the last message in the batch, but that may be a more complicated change because it looks like tcs_tx_done() just iterates through the indices of the queue and completes each message instead of tracking the last inserted message and completing that first. Fixes: c8790cb ("drivers: qcom: rpmh: add support for batch RPMH request") Cc: Lina Iyer <ilina@codeaurora.org> Cc: "Raju P.L.S.S.S.N" <rplsssn@codeaurora.org> Cc: Matthias Kaehlcke <mka@chromium.org> Cc: Evan Green <evgreen@chromium.org> Cc: stable@vger.kernel.org Reviewed-by: Lina Iyer <ilina@codeaurora.org> Reviewed-by: Evan Green <evgreen@chromium.org> Signed-off-by: Stephen Boyd <swboyd@chromium.org> Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org> Signed-off-by: Andy Gross <andy.gross@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
free it when error happend in the sub-function.
the two steps can avoid the oops caused by pointer dereference.
Signed-off-by: Wu Zhigang zhigang.wu@linux.intel.com