ASoC: SOF: pm: don't return error when CTX_SAVE ipc fails #460

ranj063 · 2018-12-17T22:25:21Z

CTX_SAVE ipc is sent right before the DSP is powered off.
Returning an error if this ipc fails causes the resume
sequence to fail as well. Instead just report the error
and continue to power off the DSP. This is only a
temporary workaround until the real issue with
CTX_SAVE ipc's timing out is root caused.

Signed-off-by: Ranjani Sridharan ranjani.sridharan@linux.intel.com

plbossart

that sounds either too good to be true or fishy?

plbossart · 2018-12-17T23:46:13Z

sound/soc/sof/pm.c

@@ -318,10 +318,13 @@ static int sof_suspend(struct device *dev, bool runtime_suspend)
 /* notify DSP of upcoming power down */
 ret = sof_send_pm_ipc(sdev, SOF_IPC_PM_CTX_SAVE);
 if (ret < 0) {
+ /*
+ * DSP will be powered off right after.
+ * So just report the error and continue.


That sounds like kicking the can down the read. If the IPC is not successful, doesn't it indicate trouble ahead on resume?

@plbossart this PR is by no means meant to address the underlying problem. I am only avoiding the problem that occurs when this IPC timesout for reasons not known so far. So it is only a workaround until the root cause is found. We're seeing repeated IPC timeouts on this particular iPC and if it times out, we reach a point of no recovery today.

plbossart · 2018-12-18T00:23:00Z

On 12/17/18 6:19 PM, Ranjani Sridharan wrote: ***@***.**** commented on this pull request. ------------------------------------------------------------------------ In sound/soc/sof/pm.c <#460 (comment)>: > @@ -318,10 +318,13 @@ static int sof_suspend(struct device *dev, bool runtime_suspend) /* notify DSP of upcoming power down */ ret = sof_send_pm_ipc(sdev, SOF_IPC_PM_CTX_SAVE); if (ret < 0) { + /* + * DSP will be powered off right after. + * So just report the error and continue. @plbossart <https://github.com/plbossart> this PR is by no means meant to address the underlying problem. I am only avoiding the problem that occurs when this IPC timesout for reasons not known so far. So it is only a workaround until the root cause is found. We're seeing repeated IPC timeouts on this particular iPC and if it times out, we reach a point of no recovery today

I have no issues merging it then, but please update the PR to make it clear in the commit message and comment that this is a FIXME and that future improvements are needed to solve this for real.

…

CTX_SAVE ipc is sent right before the DSP is powered off. Returning an error if this ipc fails causes the resume sequence to fail as well. Instead just report the error and continue to power off the DSP. This is only a temporary workaround until the real issue with CTX_SAVE ipc's timing out is root caused. Signed-off-by: Ranjani Sridharan <ranjani.sridharan@linux.intel.com>

ranj063 · 2018-12-18T01:33:31Z

@plbossart fixed the comment and commit message now.

keyonjie

I think this is a good workaround, our suspending is a special case, power off can recover kinds of errors we met and make driver more robust.

Furthermore, we can even add more ADSP reset(suspend/resume) at critical issues happening, e.g. DSP panic, IPC timeout, ...

keyonjie · 2018-12-18T01:28:59Z

sound/soc/sof/pm.c

+ /*
+ * DSP will be powered off right after.
+ * So just report the error and continue.
+ */
 dev_err(sdev->dev,
 "error: ctx_save ipc error during suspend %d\n",


"but we will continue power off Audio DSP..."

plbossart · 2018-12-18T02:04:29Z

@keyonjie "Furthermore, we can even add more ADSP reset(suspend/resume) at critical issues happening, e.g. DSP panic, IPC timeout, ..." -> this is what I meant in #452. If we are stuck, timed-out, etc, we reset and restart the firmware

The port->hsr is used in the hsr_handle_frame(), which is a callback of rx_handler. hsr master and slaves are initialized in hsr_add_port(). This function initializes several pointers, which includes port->hsr after registering rx_handler. So, in the rx_handler routine, un-initialized pointer would be used. In order to fix this, pointers should be initialized before registering rx_handler. Test commands: ip netns del left ip netns del right modprobe -rv veth modprobe -rv hsr killall ping modprobe hsr ip netns add left ip netns add right ip link add veth0 type veth peer name veth1 ip link add veth2 type veth peer name veth3 ip link add veth4 type veth peer name veth5 ip link set veth1 netns left ip link set veth3 netns right ip link set veth4 netns left ip link set veth5 netns right ip link set veth0 up ip link set veth2 up ip link set veth0 address fc:00:00:00:00:01 ip link set veth2 address fc:00:00:00:00:02 ip netns exec left ip link set veth1 up ip netns exec left ip link set veth4 up ip netns exec right ip link set veth3 up ip netns exec right ip link set veth5 up ip link add hsr0 type hsr slave1 veth0 slave2 veth2 ip a a 192.168.100.1/24 dev hsr0 ip link set hsr0 up ip netns exec left ip link add hsr1 type hsr slave1 veth1 slave2 veth4 ip netns exec left ip a a 192.168.100.2/24 dev hsr1 ip netns exec left ip link set hsr1 up ip netns exec left ip n a 192.168.100.1 dev hsr1 lladdr \ fc:00:00:00:00:01 nud permanent ip netns exec left ip n r 192.168.100.1 dev hsr1 lladdr \ fc:00:00:00:00:01 nud permanent for i in {1..100} do ip netns exec left ping 192.168.100.1 & done ip netns exec left hping3 192.168.100.1 -2 --flood & ip netns exec right ip link add hsr2 type hsr slave1 veth3 slave2 veth5 ip netns exec right ip a a 192.168.100.3/24 dev hsr2 ip netns exec right ip link set hsr2 up ip netns exec right ip n a 192.168.100.1 dev hsr2 lladdr \ fc:00:00:00:00:02 nud permanent ip netns exec right ip n r 192.168.100.1 dev hsr2 lladdr \ fc:00:00:00:00:02 nud permanent for i in {1..100} do ip netns exec right ping 192.168.100.1 & done ip netns exec right hping3 192.168.100.1 -2 --flood & while : do ip link add hsr0 type hsr slave1 veth0 slave2 veth2 ip a a 192.168.100.1/24 dev hsr0 ip link set hsr0 up ip link del hsr0 done Splat looks like: [ 120.954938][ C0] general protection fault, probably for non-canonical address 0xdffffc0000000006: 0000 [thesofproject#1]I [ 120.957761][ C0] KASAN: null-ptr-deref in range [0x0000000000000030-0x0000000000000037] [ 120.959064][ C0] CPU: 0 PID: 1511 Comm: hping3 Not tainted 5.6.0-rc5+ thesofproject#460 [ 120.960054][ C0] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 [ 120.962261][ C0] RIP: 0010:hsr_addr_is_self+0x65/0x2a0 [hsr] [ 120.963149][ C0] Code: 44 24 18 70 73 2f c0 48 c1 eb 03 48 8d 04 13 c7 00 f1 f1 f1 f1 c7 40 04 00 f2 f2 f2 4 [ 120.966277][ C0] RSP: 0018:ffff8880d9c09af0 EFLAGS: 00010206 [ 120.967293][ C0] RAX: 0000000000000006 RBX: 1ffff1101b38135f RCX: 0000000000000000 [ 120.968516][ C0] RDX: dffffc0000000000 RSI: ffff8880d17cb208 RDI: 0000000000000000 [ 120.969718][ C0] RBP: 0000000000000030 R08: ffffed101b3c0e3c R09: 0000000000000001 [ 120.972203][ C0] R10: 0000000000000001 R11: ffffed101b3c0e3b R12: 0000000000000000 [ 120.973379][ C0] R13: ffff8880aaf80100 R14: ffff8880aaf800f2 R15: ffff8880aaf80040 [ 120.974410][ C0] FS: 00007f58e693f740(0000) GS:ffff8880d9c00000(0000) knlGS:0000000000000000 [ 120.979794][ C0] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 120.980773][ C0] CR2: 00007ffcb8b38f29 CR3: 00000000afe8e001 CR4: 00000000000606f0 [ 120.981945][ C0] Call Trace: [ 120.982411][ C0] <IRQ> [ 120.982848][ C0] ? hsr_add_node+0x8c0/0x8c0 [hsr] [ 120.983522][ C0] ? rcu_read_lock_held+0x90/0xa0 [ 120.984159][ C0] ? rcu_read_lock_sched_held+0xc0/0xc0 [ 120.984944][ C0] hsr_handle_frame+0x1db/0x4e0 [hsr] [ 120.985597][ C0] ? hsr_nl_nodedown+0x2b0/0x2b0 [hsr] [ 120.986289][ C0] __netif_receive_skb_core+0x6bf/0x3170 [ 120.992513][ C0] ? check_chain_key+0x236/0x5d0 [ 120.993223][ C0] ? do_xdp_generic+0x1460/0x1460 [ 120.993875][ C0] ? register_lock_class+0x14d0/0x14d0 [ 120.994609][ C0] ? __netif_receive_skb_one_core+0x8d/0x160 [ 120.995377][ C0] __netif_receive_skb_one_core+0x8d/0x160 [ 120.996204][ C0] ? __netif_receive_skb_core+0x3170/0x3170 [ ... ] Reported-by: syzbot+fcf5dd39282ceb27108d@syzkaller.appspotmail.com Fixes: c5a7591 ("net/hsr: Use list_head (and rcu) instead of array for slave devices.") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>

ranj063 requested review from bardliao, keyonjie, lgirdwood and xiulipan December 17, 2018 22:25

ranj063 force-pushed the fix/ctx_save_ipc branch from f1a88e9 to b333a9d Compare December 17, 2018 22:56

plbossart reviewed Dec 17, 2018

View reviewed changes

ranj063 force-pushed the fix/ctx_save_ipc branch from b333a9d to 9ae9ada Compare December 18, 2018 01:32

keyonjie reviewed Dec 18, 2018

View reviewed changes

plbossart approved these changes Dec 18, 2018

View reviewed changes

plbossart merged commit 007f8fe into thesofproject:topic/sof-dev Dec 18, 2018

ranj063 deleted the fix/ctx_save_ipc branch March 22, 2019 17:08

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ASoC: SOF: pm: don't return error when CTX_SAVE ipc fails #460

ASoC: SOF: pm: don't return error when CTX_SAVE ipc fails #460

ranj063 commented Dec 17, 2018 •

edited

Loading

plbossart left a comment

plbossart Dec 17, 2018

ranj063 Dec 18, 2018

plbossart commented Dec 18, 2018 via email

ranj063 commented Dec 18, 2018

keyonjie left a comment

keyonjie Dec 18, 2018

plbossart commented Dec 18, 2018

ASoC: SOF: pm: don't return error when CTX_SAVE ipc fails #460

ASoC: SOF: pm: don't return error when CTX_SAVE ipc fails #460

Conversation

ranj063 commented Dec 17, 2018 • edited Loading

plbossart left a comment

Choose a reason for hiding this comment

plbossart Dec 17, 2018

Choose a reason for hiding this comment

ranj063 Dec 18, 2018

Choose a reason for hiding this comment

plbossart commented Dec 18, 2018 via email

ranj063 commented Dec 18, 2018

keyonjie left a comment

Choose a reason for hiding this comment

keyonjie Dec 18, 2018

Choose a reason for hiding this comment

plbossart commented Dec 18, 2018

ranj063 commented Dec 17, 2018 •

edited

Loading