Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[201911][patch] mlxsw: i2c: Prevent transaction execution for special chip st… #278

Merged

Conversation

stepanblyschak
Copy link
Contributor

@stepanblyschak stepanblyschak commented Jun 2, 2022

…ates

Do not run transaction in cases chip is in reset or in-filed update states.

This patch fixes an issue encountered during SONiC warm-reboot: during
ISSU start the kernel reports that CPU stall is detected:

INFO: rcu_sched detected stalls on CPUs/tasks

This patch won't be upstreamed due to the upstream driver not supporting
ISSU.

Signed-off-by: Stepan Blyschak stepanb@nvidia.com

@stepanblyschak stepanblyschak marked this pull request as draft June 2, 2022 10:08
@stepanblyschak stepanblyschak marked this pull request as ready for review June 2, 2022 13:07
@prsunny prsunny requested a review from lguohan June 2, 2022 15:33
Copy link
Contributor

@paulmenzel paulmenzel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please document the problem, and why it’s needed only for the 201911 branch?

execution for special chip states

Do not run transaction in cases chip is in reset or in-filed update states.

Copy link
Contributor

@paulmenzel paulmenzel Jun 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that commit upstream? Please explain the TMP.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is not and it cannot be upstream. this is a way for bug fix and we wont be able to upstream it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not understand. Please elaborate.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Upstream mlxsw driver doesn't support ISSU, so we can't submit ISSU related filtering to upstream.

Copy link
Contributor

@paulmenzel paulmenzel Jun 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only 4.19(?) or all versions? The policy of SONiC is that long term all patches have to be upstream to lessen the maintenance burden especially when updating to newer Linux kernel releases.

Can you please add that to the patch description?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the repository, I did git grep -i issu, and nothing showed up regarding ISSU. Where can I find more information?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Within SONiC, ISSU may be called warm reboot or warmboot. Some more details are here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you: ISSU = In-Service-Software-Upgrade

mutex_unlock(&mlxsw_i2c->cmd.lock);
+ if (*status == MLXSW_CMD_STATUS_FW_ISSU ||
+ *status == MLXSW_CMD_STATUS_RUNNING_RESET) {
+ mlxsw_i2c->status = *status;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it assumed that after it gets into either MLXSW_CMD_STATUS_FW_ISSU or MLXSW_CMD_STATUS_RUNNING_RESET state, it won't get out of this state or there will be a kexec to a new kernel/image? Otherwise, unless I'm missing something, future i2c commands won't get executed.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intention is to not run any I2C transaction ASIC in cases chip reset or in-service update states is detected.
Firmware is not accessible and will reject transaction with the relevant status "RUNNING_RESET" or "FW_ISSU_ONGOING".
In case transaction is failed do to one of these reasons, stop sending transactions.
In such state driver is about to be removed since it cannot continue running after reset or in-service update.
And re-probed again after reset or in-service update is completed.

For ISSU system will be rebooted by kexec.
For chip reset ASIC I2C driver is handled in user space by system udev event (removing/probing).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also want to make sure that this is also covering the non-reboot scenarios - config-reload, cold restart of syncd/swss etc.
Please confirm that after these non-reboot actions, the i2c commands still get executed.

case MLXSW_CMD_STATUS_RUNNING_RESET:
return "RUNNING_RESET";
+ case MLXSW_CMD_STATUS_FW_ISSU:
+ return "FW_ISSUE_ONGOING";

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a typo here? FW_ISSUE_ONGOING vs FW_ISSU_ONGOING?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be fixed.


cmd_fail:
mutex_unlock(&mlxsw_i2c->cmd.lock);
+ if (*status == MLXSW_CMD_STATUS_FW_ISSU ||

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, in what scenario will the status be MLXSW_CMD_STATUS_RUNNING_RESET?
I understand why ISSU (warmboot) case is covered here, but don't understand why RESET is also covered.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case of reset I2C chip driver should be removed and re-probed. Since state check for ISSU is added it's logical alos to handle reset state as well.

@liat-grozovik
Copy link
Collaborator

Please dont merge yet.
We should run also fast and cold reboot tests to ensure these reboots are also clean with the fix.
Please wait for Nvidia/me confirmation on it.

@vaibhavhd
Copy link

Please dont merge yet. We should run also fast and cold reboot tests to ensure these reboots are also clean with the fix. Please wait for Nvidia/me confirmation on it.

Perhaps also add config-reload, load_minigraph or any other scenario (syncd cold reboot) which might trigger this.

@vaibhavhd vaibhavhd marked this pull request as draft June 14, 2022 01:16
@liat-grozovik
Copy link
Collaborator

@roysr-nv , @stepanblyschak once all tests are done and we are good to go with the fix please move it to ready for review.

Copy link
Contributor

@paulmenzel paulmenzel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please squash both commits, and add more information for the ignorant, that ISSU is not supported upstream (all versions), and when upstreaming is going to happen.

@stepanblyschak stepanblyschak marked this pull request as ready for review June 16, 2022 11:27
@stepanblyschak stepanblyschak force-pushed the 201911-prevent-transaction branch 2 times, most recently from 18c351e to 2475799 Compare June 17, 2022 10:04
@stepanblyschak
Copy link
Contributor Author

@paulmenzel Updated per your request

Copy link
Contributor

@paulmenzel paulmenzel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for updating the merge/pull request description. Please update also the commit message in the patch file.

+ if (status == MLXSW_CMD_STATUS_FW_ISSU ||
+ status == MLXSW_CMD_STATUS_RUNNING_RESET) {
+ mlxsw_i2c->status = status;
+ dev_info(dev, "FW status=%x(%s))\n", status, mlxsw_cmd_status_str(status));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks more like a debug message instead of an informational message. Could you please elaborate/improve the wording?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why? It will print something like
mlxsw_minimal 2-0048: FW status=27(FW_ISSU_ONGOING)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is a user (non-developer) supposed to do with that message?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you prefer something like:
mlxsw_minimal 2-0048: FW status=27(FW_ISSU_ONGOING): Access to device is not allowed in this state

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I would consider that an improvement.

From 392127480c91a26dae2449288229d4957213dcc1 Mon Sep 17 00:00:00 2001
From: Vadim Pasternak <vadimp@nvidia.com>
Date: Tue, 31 May 2022 10:27:38 +0300
Subject: [PATCH v4.9 ISSU WA 1/1] TMP: mlxsw: i2c: Prevent transaction
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the TMP tag?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the TMP tag needed (as in, will this bug fix be removed in the future), or is it expected to be permanent?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is expected to be permanent.
We want to mark it as downstream patch.
There are a lot of not up-streamed patches in Sonic kernel, But there is no convention in Sonic to tag such patches.
"TMP" in patch name could be replaced with "DS" or "Downstream" or with something like that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rather have "Sonic-specific" or "Sonic-only" in the patch rather than TMP.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Sonic-specific" sounds better to me. Can we close with this name?
I suggest to use this tag for all new-coming patches (not only from Nvidia), which are not going to be up-streamed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe such a tag is unnecessary. The README suggests, that patches are not wanted, that are not going to be upstreamed. The only exception is:

Platform specific kernel modules which are impossible or very difficult to be built out of kernel tree.

Why were the patches accepted to be in SONiC in the first place. I really thought these were only backports, and it’s not made easy for reviewers to find information about the feature (ISSU), and why it’s not going to be upstreamed.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no ISSU feature in up-stream driver, so this patch can't be accepted.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that there are a number of patches here that should be upstreamed. However, there can still be patches that are not useful to be upstreamed, such as patches that would break the general usage of the Linux kernel, or some hack that really shouldn't be upstreamed.

In this case, since this is functionality that is not supported by the upstream driver, it wouldn't make sense to commit this bug fix for functionality that's not even present in the upstream Linux kernel. If this changes (i.e. ISSU/warmboot/equivalent functionality) becomes available in the upstream kernel, then this should be upstreamed as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, but I still do not understand. Does SONiC add a special module/patch to support ISSU in it’s Linux kernel build? Where are those patches? I could apply this patch here (for current SONiC branch) to the upstream Linux kernel source just fine.

$ git am ~/src/sonic-linux-kernel/patch/0064-TMP-mlxsw-i2c-Prevent-transaction-execution-for.patch # nvidia-stepan/202106-prevent-transaction
Wende an: TMP: mlxsw: i2c: Prevent transaction execution for special chip states

MLXSW_CMD_STATUS_FW_ISSU sounds to me like it’s support in the firmware and independent from the Linux kernel, so should be submitted upstream.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since upstream doesn’t support ISSU, it means nobody can set FW to ISSU and FW will never get MLXSW_CMD_STATUS_FW_ISSU. That’s we it can’t be up-streamed.

…ates

This patch fixes an issue encountered during SONiC warm-reboot: during
ISSU start the kernel reports that CPU stall is detected:

```
INFO: rcu_sched detected stalls on CPUs/tasks
```

This patch won't be upstreamed due to the upstream driver not supporting
ISSU.

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
@stepanblyschak stepanblyschak force-pushed the 201911-prevent-transaction branch from 2475799 to 59eea9f Compare June 20, 2022 11:57
@stepanblyschak
Copy link
Contributor Author

Updated with latest patch from @vadimp-nvidia

@liat-grozovik
Copy link
Collaborator

@lguohan is there anything else open related this PR or we can approve and signoff?
once this is done we need to have it on all other branches. PRs are already avaialble.

@saiarcot895
Copy link
Contributor

Is the TMP tag needed in the patch name and the title (as in, will this bug fix be removed in the future), or is it expected to be permanent?

execution for special chip states

Do not run transaction in cases chip is in reset or in-filed update states.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the repository, I did git grep -i issu, and nothing showed up regarding ISSU. Where can I find more information?

From 392127480c91a26dae2449288229d4957213dcc1 Mon Sep 17 00:00:00 2001
From: Vadim Pasternak <vadimp@nvidia.com>
Date: Tue, 31 May 2022 10:27:38 +0300
Subject: [PATCH v4.9 ISSU WA 1/1] TMP: mlxsw: i2c: Prevent transaction
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe such a tag is unnecessary. The README suggests, that patches are not wanted, that are not going to be upstreamed. The only exception is:

Platform specific kernel modules which are impossible or very difficult to be built out of kernel tree.

Why were the patches accepted to be in SONiC in the first place. I really thought these were only backports, and it’s not made easy for reviewers to find information about the feature (ISSU), and why it’s not going to be upstreamed.

@stepanblyschak stepanblyschak changed the title [patch] mlxsw: i2c: Prevent transaction execution for special chip st… [201911][patch] mlxsw: i2c: Prevent transaction execution for special chip st… Jun 30, 2022
Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
@stepanblyschak
Copy link
Contributor Author

@paulmenzel @saiarcot895 @vadimp-nvidia I removed the TMP tag from the commit message

@saiarcot895 saiarcot895 merged commit 23fc702 into sonic-net:201911 Jul 5, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants