
Fix issue: sometimes PFC WD unable to create zero buffer pool #2164

Merged
stephenxs merged 7 commits into sonic-net:master from fix-pfcwd on Mar 16, 2022

Conversation

@stephenxs (Collaborator) commented Mar 2, 2022

What I did

Fix issue: sometimes PFC WD is unable to create the zero buffer pool.
On some platforms, an ingress/egress zero buffer profile is applied to the PGs and queues that are under PFC storm. The zero buffer profile is created on top of a zero buffer pool. However, creating the zero buffer pool sometimes fails because too many buffer pools already exist in the system.
Sometimes a zero buffer pool already exists in the system for reclaiming buffer. In that case, we can leverage it to create the zero buffer profile for PFC WD.

Why I did it

Fix the issue by sharing the zero buffer pool between PFC WD and the buffer orchagent.

How I verified it

Manual test
Run the PFC WD test and the PFC WD warm reboot test
Run unit tests

Details if related
The detailed flow is as follows.

PFC storm detected (a sketch of this lookup order follows the list):

  1. If there is a zero pool in PFC WD's cache, just create the zero buffer profile based on it.
  2. Otherwise, fetch the zero pool from the buffer orchagent:
    • If one is found, create the zero buffer profile based on it.
    • Otherwise,
      • create a zero buffer pool, and
      • notify the buffer orchagent about the new zero buffer pool.
    • In both cases, PFC WD notifies the buffer orchagent to increase the reference count of the zero buffer pool.
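
A minimal, self-contained C++ sketch of that lookup order, using toy types and made-up names (`BufferOrchModel`, `PfcWdModel`, `getOrCreateZeroPool` are illustrative, not the actual PfcWdZeroBufferHandler/BufferOrch API):

```cpp
#include <cstdint>

// Toy stand-ins; 0 plays the role of SAI_NULL_OBJECT_ID.
using Oid = std::uint64_t;
constexpr Oid NULL_OID = 0;

// Models only the part of the buffer orchagent that PFC WD talks to.
struct BufferOrchModel
{
    Oid zeroPool = NULL_OID;
    int pfcWdRefCount = 0;

    Oid  getZeroPool() const  { return zeroPool; }
    void setZeroPool(Oid oid) { zeroPool = oid; }   // "notify buffer orch about the pool"
    void lockZeroPool()       { ++pfcWdRefCount; }  // "increase the reference count"
};

struct PfcWdModel
{
    Oid cachedZeroPool = NULL_OID;

    // Called when a PFC storm is detected and a zero profile is needed.
    Oid getOrCreateZeroPool(BufferOrchModel &bufferOrch)
    {
        if (cachedZeroPool == NULL_OID)                 // 1. nothing cached yet
        {
            cachedZeroPool = bufferOrch.getZeroPool();  // 2. reuse buffer orch's pool if it has one
            if (cachedZeroPool == NULL_OID)
            {
                cachedZeroPool = 0xABCD;                //    stand-in for SAI create_buffer_pool
                bufferOrch.setZeroPool(cachedZeroPool); //    share the new pool with buffer orch
            }
            bufferOrch.lockZeroPool();                  // in both cases, bump the shared ref count
        }
        return cachedZeroPool;                          // the zero profile is built on this pool
    }
};
```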

Buffer orchagent (a sketch of the reference counting follows the list):

  • When creating the zero buffer pool,
    • check whether one already exists; if yes, skip the SAI API create_buffer_pool,
    • and increase the reference count.
  • Before removing the zero buffer pool, decrease the reference count and check it; call the SAI API remove_buffer_pool only when the count drops to zero, and skip it otherwise.
  • When PFC WD decreases the reference count, remove the zero buffer pool if the count becomes zero.
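
A similarly hedged sketch of the buffer orchagent side of the reference counting, again with toy names rather than the real BufferOrch methods:

```cpp
#include <cstdint>

using Oid = std::uint64_t;
constexpr Oid NULL_OID = 0;

// Toy model of buffer orch's bookkeeping for the shared zero pool.
struct SharedZeroPool
{
    Oid oid = NULL_OID;
    int refCount = 0;   // held by buffer orch (for reclaiming buffer) and/or by PFC WD

    // Buffer orch needs the zero pool, or PFC WD registers the pool it just created.
    Oid acquire()
    {
        if (oid == NULL_OID)
        {
            oid = 0xABCD;   // stand-in for SAI create_buffer_pool; skipped if the pool exists
        }
        ++refCount;
        return oid;
    }

    // Buffer orch removes its pool, or PFC WD drops its reference after the storm ends.
    void release()
    {
        if (--refCount == 0)
        {
            oid = NULL_OID; // stand-in for SAI remove_buffer_pool; skipped while references remain
        }
    }
};
```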

Notes
We do not leverage the object_reference_map infrastructure to track this dependency because:

  • it assumes the dependency will eventually be removed when an object is removed, which is NOT true in this scenario: a PFC storm can last a relatively long time and even persist across a warm reboot;
  • the interfaces differ.

Sometimes PFC WD failed to apply the zero buffer profile
because creating the zero buffer pool failed.
In fact, we do not need to create a zero buffer pool, because
dynamic_th -8 indicates no shared buffer can be used by the profile,
which guarantees the PG/queue will NOT occupy any buffer.
Solution: try fetching an existing buffer pool and create the zero buffer
profile on top of it.
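
For context, a rough worked reading of the dynamic_th claim above, assuming the common SAI convention that the dynamic threshold maps to a share factor alpha = 2^dynamic_th (this mapping is an assumption for illustration, not something stated in the commit):

```math
\text{shared limit} = \alpha \cdot \text{free shared buffer}, \qquad
\alpha = 2^{\text{dynamic\_th}} = 2^{-8} = \tfrac{1}{256} \approx 0
```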

Signed-off-by: Stephen Sun <stephens@nvidia.com>
Signed-off-by: Stephen Sun <stephens@nvidia.com>
But they do not share the zero profile

Signed-off-by: Stephen Sun <stephens@nvidia.com>
Signed-off-by: Stephen Sun <stephens@nvidia.com>
@stephenxs
Collaborator Author

/azpw run

@mssonicbld
Collaborator

/AzurePipelines run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@stephenxs stephenxs marked this pull request as ready for review March 3, 2022 22:30
@stephenxs stephenxs requested a review from prsunny as a code owner March 3, 2022 22:30
@stephenxs
Collaborator Author

/azpw run

@mssonicbld
Collaborator

/AzurePipelines run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Review comment on orchagent/bufferorch.cpp (outdated, resolved)
Signed-off-by: Stephen Sun <stephens@nvidia.com>
@liat-grozovik
Collaborator

@neethajohn a kind reminder to review and sign off

@stephenxs
Collaborator Author

It can be cherry-picked to 202111 smoothly (tested on 91d7558).

@stephenxs
Collaborator Author

/azpw run

@mssonicbld
Collaborator

/AzurePipelines run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@liat-grozovik
Collaborator

@neethajohn a kind reminder to sign off on this PR. This is an issue for the 202111 release.

Signed-off-by: Stephen Sun <stephens@nvidia.com>
@liat-grozovik
Collaborator

/azp run Azure.sonic-swss

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@neethajohn neethajohn merged commit ad65b0a into sonic-net:master Mar 16, 2022
@stephenxs stephenxs deleted the fix-pfcwd branch March 16, 2022 23:16
judyjoseph pushed a commit that referenced this pull request Mar 20, 2022
liat-grozovik pushed a commit that referenced this pull request Apr 2, 2022
This is to backport PR #2164 to 202012 branch.

neethajohn pushed a commit that referenced this pull request Jun 24, 2022
…storm is detected (#2304)

What I did
Avoid dropping traffic that is ingressing a port/PG that is in storm. The code changes in this PR avoid creating the ingress zero pool and profile, and do not attach any zero profile to the ingress PG when PFC WD is triggered.

Revert changes related to #1480, where a retry mechanism was added to BufferOrch to cache task retries while the PG is locked by PfcWdZeroBufferHandler.

Revert changes related to #2164 in PfcWdZeroBufferHandler, ZeroBufferProfile, and BufferOrch.

Updated UTs accordingly.

How I verified it
UTs.
Ran the sonic-mgmt tests with these changes (sonic-net/sonic-mgmt#5665) and verified that they passed.

Signed-off-by: Vivek Reddy Karri <vkarri@nvidia.com>
yxieca pushed a commit that referenced this pull request Jun 25, 2022
…storm is detected (#2304)

vivekrnv added a commit to vivekrnv/sonic-swss that referenced this pull request Aug 1, 2022
…storm is detected (sonic-net#2304)

preetham-singh pushed a commit to preetham-singh/sonic-swss that referenced this pull request Aug 6, 2022
…net#2164)

preetham-singh pushed a commit to preetham-singh/sonic-swss that referenced this pull request Aug 6, 2022
…storm is detected (sonic-net#2304)
