Fix issue: sometimes PFC WD is unable to create zero buffer pool #2164
Merged
Conversation
Sometimes PFC WD failed to apply the zero buffer profile because creating the zero buffer pool failed. In fact, we do not need to create a zero buffer pool at all: dynamic_th -8 indicates that no shared buffer can be used by the profile, which guarantees the PG/queue will NOT occupy any buffer. Solution: try fetching an existing buffer pool and create the zero buffer profile on top of it. Signed-off-by: Stephen Sun <stephens@nvidia.com>
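For illustration, a minimal sketch of creating such a profile on top of an existing pool via the SAI buffer API follows; the helper name and the exact attribute set beyond dynamic_th are assumptions for this sketch, not the PR's actual code:

```cpp
#include <sai.h>
#include <vector>

extern sai_buffer_api_t *sai_buffer_api; // bound by orchagent at startup
extern sai_object_id_t gSwitchId;        // switch object id

// Sketch: create a zero buffer profile on top of an EXISTING pool.
// dynamic_th = -8 maps to alpha = 2^-8, i.e. effectively no shared buffer,
// so the pool's own size is irrelevant and no dedicated zero pool is needed.
sai_status_t createZeroProfileOnPool(sai_object_id_t existingPoolOid,
                                     sai_object_id_t &profileOid)
{
    std::vector<sai_attribute_t> attrs;
    sai_attribute_t attr;

    attr.id = SAI_BUFFER_PROFILE_ATTR_POOL_ID;
    attr.value.oid = existingPoolOid;        // reuse the pool, do not create one
    attrs.push_back(attr);

    attr.id = SAI_BUFFER_PROFILE_ATTR_THRESHOLD_MODE;
    attr.value.s32 = SAI_BUFFER_PROFILE_THRESHOLD_MODE_DYNAMIC;
    attrs.push_back(attr);

    attr.id = SAI_BUFFER_PROFILE_ATTR_SHARED_DYNAMIC_TH;
    attr.value.s8 = -8;                      // no shared buffer usable
    attrs.push_back(attr);

    attr.id = SAI_BUFFER_PROFILE_ATTR_RESERVED_BUFFER_SIZE;
    attr.value.u64 = 0;                      // no reserved buffer either
    attrs.push_back(attr);

    return sai_buffer_api->create_buffer_profile(
        &profileOid, gSwitchId,
        static_cast<uint32_t>(attrs.size()), attrs.data());
}
```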
Signed-off-by: Stephen Sun <stephens@nvidia.com>
But they do not share the zero profile. Signed-off-by: Stephen Sun <stephens@nvidia.com>
Signed-off-by: Stephen Sun <stephens@nvidia.com>
/azpw run
/AzurePipelines run
Azure Pipelines successfully started running 1 pipeline(s).
/azpw run
/AzurePipelines run
Azure Pipelines successfully started running 1 pipeline(s).
stephenxs commented on Mar 8, 2022
Signed-off-by: Stephen Sun <stephens@nvidia.com>
@neethajohn kindly reminder to review and sign off.
It can be cherry-picked to 202111 smoothly (tested on 91d7558).
/azpw run
/AzurePipelines run
Azure Pipelines successfully started running 1 pipeline(s).
@neethajohn kindly reminder to sign off this PR. This is an issue for 202111.
neethajohn reviewed on Mar 15, 2022
Signed-off-by: Stephen Sun <stephens@nvidia.com>
neethajohn approved these changes on Mar 16, 2022
/azp run Azure.sonic-swss
Azure Pipelines successfully started running 1 pipeline(s).
judyjoseph pushed a commit that referenced this pull request on Mar 20, 2022
What I did
Fix issue: sometimes PFC WD is unable to create the zero buffer pool.
On some platforms, an ingress/egress zero buffer profile is applied to the PG and queue that are under PFC storm. The zero buffer profile is created on top of a zero buffer pool. However, creating the zero buffer pool sometimes fails because too many buffer pools already exist in the system. Meanwhile, a zero buffer pool may already exist in the system for reclaiming buffer; in that case, we can leverage it to create the zero buffer profile for PFC WD.
Why I did it
Fix the issue by sharing the zero buffer pool between PFC WD and the buffer orchagent.
How I verified it
Manual test
Ran the PFC WD test and the PFC WD warm reboot test
Ran unit tests
Details if related
The detailed flow is as follows.
PFC storm detected:
1. If there is a zero pool in PFC WD's cache, create the zero buffer profile based on it.
2. Otherwise, fetch the zero pool from the buffer orchagent:
- If one is fetched, create the zero buffer profile based on it.
- Otherwise, create a zero buffer pool and notify the buffer orchagent about it.
In both cases, PFC WD notifies the buffer orchagent to increase the reference count of the zero buffer pool.
Buffer orchagent:
- When creating the zero buffer pool, check whether one already exists; if yes, skip the SAI API create_buffer_pool. Increase the reference count.
- Before removing the zero buffer pool, decrease and check the reference count; call the SAI API to destroy the pool only if the count has dropped to zero, and skip it otherwise.
- When PFC WD decreases the reference count, remove the zero buffer pool if the count becomes zero.
Notes
We do not leverage the object_reference_map infrastructure to track the dependency because:
- it assumes the dependency will eventually be removed once an object is removed, which is NOT true in this scenario: a PFC storm can last a relatively long time and even cross a warm reboot;
- the interfaces differ.
Signed-off-by: Stephen Sun <stephens@nvidia.com>
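A compact sketch of the reference-count bookkeeping described above, as it could look on the buffer orchagent side (class, member, and method names here are illustrative, not the PR's actual identifiers):

```cpp
#include <sai.h>

// Sketch: buffer orchagent shares one zero pool with PFC WD via refcounting.
// Names (acquireZeroPool, m_zeroPoolOid, ...) are illustrative only.
class BufferOrchSketch
{
public:
    // Called whenever a new user (e.g. a PFC WD handler) needs the pool.
    sai_object_id_t acquireZeroPool()
    {
        if (m_zeroPoolOid == SAI_NULL_OBJECT_ID)
        {
            // Only the first user actually invokes SAI create_buffer_pool;
            // every later user just bumps the counter.
            m_zeroPoolOid = createZeroPoolViaSai();
        }
        ++m_zeroPoolRefCount;
        return m_zeroPoolOid;
    }

    // Called when a user (PFC WD handler or config removal) drops its reference.
    void releaseZeroPool()
    {
        if (m_zeroPoolRefCount > 0 && --m_zeroPoolRefCount == 0)
        {
            // Last user gone: only now call SAI remove_buffer_pool.
            removePoolViaSai(m_zeroPoolOid);
            m_zeroPoolOid = SAI_NULL_OBJECT_ID;
        }
    }

private:
    sai_object_id_t m_zeroPoolOid = SAI_NULL_OBJECT_ID;
    unsigned int m_zeroPoolRefCount = 0;

    sai_object_id_t createZeroPoolViaSai(); // assumed wrapper over create_buffer_pool
    void removePoolViaSai(sai_object_id_t); // assumed wrapper over remove_buffer_pool
};
```

The point of the counter is that create_buffer_pool and remove_buffer_pool each run at most once no matter how many PGs/queues enter or leave storm, which is exactly what avoids the "too many buffer pools" failure.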
liat-grozovik pushed a commit that referenced this pull request on Apr 2, 2022
This is to backport PR #2164 to the 202012 branch. (The rest of the commit message is identical to the merge commit above.)
neethajohn pushed a commit that referenced this pull request on Jun 24, 2022
…storm is detected (#2304)
What I did
Avoid dropping traffic that is ingressing the port/PG that is in storm. The code changes in this PR avoid creating the ingress zero pool and profile, and do not attach any zero profile to the ingress PG when PFC WD is triggered.
Revert changes related to #1480, where a retry mechanism that caches task retries while the PG is locked by PfcWdZeroBufferHandler was added to BufferOrch.
Revert changes related to #2164 in PfcWdZeroBufferHandler, ZeroBufferProfile, and BufferOrch.
Updated UTs accordingly.
How I verified it
UTs. Ran the sonic-mgmt tests with these changes (sonic-net/sonic-mgmt#5665) and verified that they passed.
Signed-off-by: Vivek Reddy Karri <vkarri@nvidia.com>
yxieca pushed a commit that referenced this pull request on Jun 25, 2022
…storm is detected (#2304) (same commit message as above)
vivekrnv added a commit to vivekrnv/sonic-swss that referenced this pull request on Aug 1, 2022
…storm is detected (sonic-net#2304) (same commit message as above)
preetham-singh pushed a commit to preetham-singh/sonic-swss that referenced this pull request on Aug 6, 2022
…net#2164) (same commit message as the merge commit above)
preetham-singh pushed a commit to preetham-singh/sonic-swss that referenced this pull request on Aug 6, 2022
…storm is detected (sonic-net#2304) (same commit message as above)
What I did
Fix issue: sometimes PFC WD is unable to create zero buffer pool.
On some platforms, an ingress/egress zero buffer profile is applied to the PG and queue that are under PFC storm. The zero buffer profile is created on top of a zero buffer pool; however, creating the zero buffer pool sometimes fails because too many buffer pools already exist in the system.
Meanwhile, a zero buffer pool may already exist in the system for reclaiming buffer. In that case, we can leverage it to create the zero buffer profile for PFC WD.
Why I did it
Fix the issue by sharing the zero buffer pool between PFC WD and the buffer orchagent.
How I verified it
Manual test
Ran the PFC WD test and the PFC WD warm reboot test
Ran unit tests
Details if related
The detailed flow is as follows.
PFC storm detected:
1. If there is a zero pool in PFC WD's cache, create the zero buffer profile based on it.
2. Otherwise, fetch the zero pool from the buffer orchagent:
- If one is fetched, create the zero buffer profile based on it.
- Otherwise, create a zero buffer pool and notify the buffer orchagent about it.
In both cases, PFC WD notifies the buffer orchagent to increase the reference count of the zero buffer pool. A sketch of this flow follows the Notes below.
Buffer orchagent:
- When creating the zero buffer pool, check whether one already exists; if yes, skip the SAI API create_buffer_pool. Increase the reference count.
- Before removing the zero buffer pool, decrease and check the reference count; call the SAI API to destroy the pool only if the count has dropped to zero, and skip it otherwise.
- When PFC WD decreases the reference count, remove the zero buffer pool if the count becomes zero.
Notes
We do not leverage the object_reference_map infrastructure to track the dependency because:
- it assumes the dependency will eventually be removed once an object is removed, which is NOT true in this scenario: a PFC storm can last a relatively long time and even cross a warm reboot;
- the interfaces differ.
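For illustration, the PFC-WD-side fetch-or-create flow above could be sketched like this (every function name here is a hypothetical stand-in, not the handler's real method):

```cpp
#include <sai.h>

// Hypothetical helpers, stand-ins for the handler's real methods.
sai_object_id_t getCachedZeroPool(bool ingress);
sai_object_id_t fetchZeroPoolFromBufferOrch(bool ingress);
sai_object_id_t createZeroPool(bool ingress);
void notifyBufferOrchOfZeroPool(bool ingress, sai_object_id_t pool);
void cacheZeroPool(bool ingress, sai_object_id_t pool);
void increaseZeroPoolRefCount(bool ingress);

// Fetch-or-create flow executed when a PFC storm is detected.
sai_object_id_t getZeroPoolForStorm(bool ingress)
{
    sai_object_id_t pool = getCachedZeroPool(ingress);   // 1. PFC WD's own cache
    if (pool == SAI_NULL_OBJECT_ID)
    {
        pool = fetchZeroPoolFromBufferOrch(ingress);     // 2. ask buffer orchagent
        if (pool == SAI_NULL_OBJECT_ID)
        {
            pool = createZeroPool(ingress);              //    create a new one...
            notifyBufferOrchOfZeroPool(ingress, pool);   //    ...and share it
        }
        cacheZeroPool(ingress, pool);
    }
    // In all cases the buffer orchagent must account for this user of the pool.
    increaseZeroPoolRefCount(ingress);
    return pool;
}
```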