-
Notifications
You must be signed in to change notification settings - Fork 177
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[201911][patch] mlxsw: i2c: Prevent transaction execution for special chip st… #278
[201911][patch] mlxsw: i2c: Prevent transaction execution for special chip st… #278
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please document the problem, and why it’s needed only for the 201911 branch?
execution for special chip states | ||
|
||
Do not run transaction in cases chip is in reset or in-filed update states. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is that commit upstream? Please explain the TMP.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is not and it cannot be upstream. this is a way for bug fix and we wont be able to upstream it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not understand. Please elaborate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Upstream mlxsw driver doesn't support ISSU, so we can't submit ISSU related filtering to upstream.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only 4.19(?) or all versions? The policy of SONiC is that long term all patches have to be upstream to lessen the maintenance burden especially when updating to newer Linux kernel releases.
Can you please add that to the patch description?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the repository, I did git grep -i issu
, and nothing showed up regarding ISSU. Where can I find more information?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Within SONiC, ISSU may be called warm reboot or warmboot. Some more details are here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you: ISSU = In-Service-Software-Upgrade
mutex_unlock(&mlxsw_i2c->cmd.lock); | ||
+ if (*status == MLXSW_CMD_STATUS_FW_ISSU || | ||
+ *status == MLXSW_CMD_STATUS_RUNNING_RESET) { | ||
+ mlxsw_i2c->status = *status; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it assumed that after it gets into either MLXSW_CMD_STATUS_FW_ISSU
or MLXSW_CMD_STATUS_RUNNING_RESET
state, it won't get out of this state or there will be a kexec to a new kernel/image? Otherwise, unless I'm missing something, future i2c commands won't get executed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The intention is to not run any I2C transaction ASIC in cases chip reset or in-service update states is detected.
Firmware is not accessible and will reject transaction with the relevant status "RUNNING_RESET" or "FW_ISSU_ONGOING".
In case transaction is failed do to one of these reasons, stop sending transactions.
In such state driver is about to be removed since it cannot continue running after reset or in-service update.
And re-probed again after reset or in-service update is completed.
For ISSU system will be rebooted by kexec.
For chip reset ASIC I2C driver is handled in user space by system udev event (removing/probing).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also want to make sure that this is also covering the non-reboot scenarios - config-reload, cold restart of syncd/swss etc.
Please confirm that after these non-reboot actions, the i2c commands still get executed.
case MLXSW_CMD_STATUS_RUNNING_RESET: | ||
return "RUNNING_RESET"; | ||
+ case MLXSW_CMD_STATUS_FW_ISSU: | ||
+ return "FW_ISSUE_ONGOING"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like a typo here? FW_ISSUE_ONGOING
vs FW_ISSU_ONGOING
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be fixed.
|
||
cmd_fail: | ||
mutex_unlock(&mlxsw_i2c->cmd.lock); | ||
+ if (*status == MLXSW_CMD_STATUS_FW_ISSU || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, in what scenario will the status be MLXSW_CMD_STATUS_RUNNING_RESET
?
I understand why ISSU (warmboot) case is covered here, but don't understand why RESET is also covered.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In case of reset I2C chip driver should be removed and re-probed. Since state check for ISSU is added it's logical alos to handle reset state as well.
Please dont merge yet. |
Perhaps also add config-reload, load_minigraph or any other scenario (syncd cold reboot) which might trigger this. |
@roysr-nv , @stepanblyschak once all tests are done and we are good to go with the fix please move it to ready for review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please squash both commits, and add more information for the ignorant, that ISSU is not supported upstream (all versions), and when upstreaming is going to happen.
18c351e
to
2475799
Compare
@paulmenzel Updated per your request |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for updating the merge/pull request description. Please update also the commit message in the patch file.
+ if (status == MLXSW_CMD_STATUS_FW_ISSU || | ||
+ status == MLXSW_CMD_STATUS_RUNNING_RESET) { | ||
+ mlxsw_i2c->status = status; | ||
+ dev_info(dev, "FW status=%x(%s))\n", status, mlxsw_cmd_status_str(status)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks more like a debug message instead of an informational message. Could you please elaborate/improve the wording?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why? It will print something like
mlxsw_minimal 2-0048: FW status=27(FW_ISSU_ONGOING)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is a user (non-developer) supposed to do with that message?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you prefer something like:
mlxsw_minimal 2-0048: FW status=27(FW_ISSU_ONGOING): Access to device is not allowed in this state
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I would consider that an improvement.
From 392127480c91a26dae2449288229d4957213dcc1 Mon Sep 17 00:00:00 2001 | ||
From: Vadim Pasternak <vadimp@nvidia.com> | ||
Date: Tue, 31 May 2022 10:27:38 +0300 | ||
Subject: [PATCH v4.9 ISSU WA 1/1] TMP: mlxsw: i2c: Prevent transaction |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why the TMP tag?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the TMP tag needed (as in, will this bug fix be removed in the future), or is it expected to be permanent?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is expected to be permanent.
We want to mark it as downstream patch.
There are a lot of not up-streamed patches in Sonic kernel, But there is no convention in Sonic to tag such patches.
"TMP" in patch name could be replaced with "DS" or "Downstream" or with something like that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would rather have "Sonic-specific" or "Sonic-only" in the patch rather than TMP.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Sonic-specific" sounds better to me. Can we close with this name?
I suggest to use this tag for all new-coming patches (not only from Nvidia), which are not going to be up-streamed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe such a tag is unnecessary. The README suggests, that patches are not wanted, that are not going to be upstreamed. The only exception is:
Platform specific kernel modules which are impossible or very difficult to be built out of kernel tree.
Why were the patches accepted to be in SONiC in the first place. I really thought these were only backports, and it’s not made easy for reviewers to find information about the feature (ISSU), and why it’s not going to be upstreamed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no ISSU feature in up-stream driver, so this patch can't be accepted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that there are a number of patches here that should be upstreamed. However, there can still be patches that are not useful to be upstreamed, such as patches that would break the general usage of the Linux kernel, or some hack that really shouldn't be upstreamed.
In this case, since this is functionality that is not supported by the upstream driver, it wouldn't make sense to commit this bug fix for functionality that's not even present in the upstream Linux kernel. If this changes (i.e. ISSU/warmboot/equivalent functionality) becomes available in the upstream kernel, then this should be upstreamed as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, but I still do not understand. Does SONiC add a special module/patch to support ISSU in it’s Linux kernel build? Where are those patches? I could apply this patch here (for current SONiC branch) to the upstream Linux kernel source just fine.
$ git am ~/src/sonic-linux-kernel/patch/0064-TMP-mlxsw-i2c-Prevent-transaction-execution-for.patch # nvidia-stepan/202106-prevent-transaction
Wende an: TMP: mlxsw: i2c: Prevent transaction execution for special chip states
MLXSW_CMD_STATUS_FW_ISSU
sounds to me like it’s support in the firmware and independent from the Linux kernel, so should be submitted upstream.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since upstream doesn’t support ISSU, it means nobody can set FW to ISSU and FW will never get MLXSW_CMD_STATUS_FW_ISSU. That’s we it can’t be up-streamed.
…ates This patch fixes an issue encountered during SONiC warm-reboot: during ISSU start the kernel reports that CPU stall is detected: ``` INFO: rcu_sched detected stalls on CPUs/tasks ``` This patch won't be upstreamed due to the upstream driver not supporting ISSU. Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
2475799
to
59eea9f
Compare
Updated with latest patch from @vadimp-nvidia |
@lguohan is there anything else open related this PR or we can approve and signoff? |
Is the TMP tag needed in the patch name and the title (as in, will this bug fix be removed in the future), or is it expected to be permanent? |
execution for special chip states | ||
|
||
Do not run transaction in cases chip is in reset or in-filed update states. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the repository, I did git grep -i issu
, and nothing showed up regarding ISSU. Where can I find more information?
From 392127480c91a26dae2449288229d4957213dcc1 Mon Sep 17 00:00:00 2001 | ||
From: Vadim Pasternak <vadimp@nvidia.com> | ||
Date: Tue, 31 May 2022 10:27:38 +0300 | ||
Subject: [PATCH v4.9 ISSU WA 1/1] TMP: mlxsw: i2c: Prevent transaction |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe such a tag is unnecessary. The README suggests, that patches are not wanted, that are not going to be upstreamed. The only exception is:
Platform specific kernel modules which are impossible or very difficult to be built out of kernel tree.
Why were the patches accepted to be in SONiC in the first place. I really thought these were only backports, and it’s not made easy for reviewers to find information about the feature (ISSU), and why it’s not going to be upstreamed.
Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
@paulmenzel @saiarcot895 @vadimp-nvidia I removed the TMP tag from the commit message |
…ates
Do not run transaction in cases chip is in reset or in-filed update states.
This patch fixes an issue encountered during SONiC warm-reboot: during
ISSU start the kernel reports that CPU stall is detected:
This patch won't be upstreamed due to the upstream driver not supporting
ISSU.
Signed-off-by: Stepan Blyschak stepanb@nvidia.com