Orchagent received notifications in an order different from the order in which other daemons sent them during dynamic buffer test #5157
Labels
Triaged
this issue has been triaged
Comments
stephenxs
changed the title
Orchagent received notifications in an order different from the order in which other daemons sent them
Orchagent received notifications in an order different from the order in which other daemons sent them during dynamic buffer test
Aug 12, 2020
I am not clear on what is the dependency logic you are planning to add. Is it retry logic (task_need_retry)?
Yes.
daall
added a commit
that referenced
this issue
Oct 16, 2020
[swss]
- [acl] Replace IP_PROTOCOL with NEXT_HEADER for IPv6 ACL tables (#1458)
- [acl] Refactor port OID retrieval into aclorch (#1462)
- Fix issue #5157 by identifying the dependency among objects and avoiding releasing an object still being referenced (#1440)
- [mock tests] Update MockDBConnector to match new swsscommon interface (#1465)
[swss-common]
- netlink: Setting nl_socket buffer size to 3M from 2M (#391)
- Added support in Swig file to cast Selectable object to Subscriber Table object (#394)
- [warm reboot] Warm Reboot Support for EVPN VXLAN (#350)
- Implement DBInterface/SonicV2Connector in C++ (#387)
- Fix memory leak if a RedisCommand object were to be reused (#392)
Signed-off-by: Danny Allen <daall@microsoft.com>
Fixed by PR #1440
santhosh-kt
pushed a commit
to santhosh-kt/sonic-buildimage
that referenced
this issue
Feb 25, 2021
[swss]
- [acl] Replace IP_PROTOCOL with NEXT_HEADER for IPv6 ACL tables (sonic-net#1458)
- [acl] Refactor port OID retrieval into aclorch (sonic-net#1462)
- Fix issue sonic-net#5157 by identifying the dependency among objects and avoiding releasing an object still being referenced (sonic-net#1440)
- [mock tests] Update MockDBConnector to match new swsscommon interface (sonic-net#1465)
[swss-common]
- netlink: Setting nl_socket buffer size to 3M from 2M (sonic-net#391)
- Added support in Swig file to cast Selectable object to Subscriber Table object (sonic-net#394)
- [warm reboot] Warm Reboot Support for EVPN VXLAN (sonic-net#350)
- Implement DBInterface/SonicV2Connector in C++ (sonic-net#387)
- Fix memory leak if a RedisCommand object were to be reused (sonic-net#392)
Signed-off-by: Danny Allen <daall@microsoft.com>
theasianpianist
pushed a commit
to theasianpianist/sonic-buildimage
that referenced
this issue
Feb 5, 2022
…and avoiding releasing an object still being referenced (sonic-net#1440)
* Fix issue sonic-net#5157 by identifying the dependency among objects and avoiding releasing an object still being referenced
The issue is caused by the OA receiving notifications in an order different from the one in which they were sent. The OA doesn't have any dependency check, so it tries to notify sai-redis to release an object which is still being referenced, which makes sai-redis complain and the object leak.
The idea is to introduce a mechanism to identify the dependency, thus preventing a referenced object from being released.
1. Introduce a new type representing the dependency among various types of objects, including the following fields:
   - m_objsDependingOnMe, a set representing the objects that reference the current object, e.g. BUFFER_PROFILE.ingress_lossless_profile references BUFFER_POOL.ingress_lossless_pool
   - m_objsReferencingByMe, a map from a field of the current object to the name of the object it references.
2. When a field of an object A has been updated to reference another object B,
   - obj[A.m_objsReferencingByMe[field name]].m_objsDependingOnMe.remove(A)
   - A.m_objsReferencingByMe[field name] = B
3. When an object A is about to be removed,
   - if obj.m_objsDependingOnMe isn't an empty set, return task_need_retry; else execute the normal remove flow.
Signed-off-by: Stephen Sun <stephens@nvidia.com>
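The three steps in the commit message above can be sketched in a few lines of Python. This is only an illustration of the bookkeeping, not the orchagent code (which is C++): the `DependencyInfo` class, the `objects` registry, the helper names `set_reference` and `remove_object`, and the object keys are all hypothetical. Only the two field names and `task_need_retry` come from the commit message. The sketch also assumes that A is added to B's `m_objsDependingOnMe` when A starts referencing B, which the message implies but does not spell out.

```python
from dataclasses import dataclass, field

TASK_SUCCESS = "task_success"        # hypothetical status name
TASK_NEED_RETRY = "task_need_retry"  # status named in the commit message


@dataclass
class DependencyInfo:
    # Names of the objects that reference this object.
    m_objsDependingOnMe: set = field(default_factory=set)
    # Field name -> name of the object that field currently references.
    m_objsReferencingByMe: dict = field(default_factory=dict)


# Hypothetical registry: object name -> DependencyInfo.
objects = {}


def set_reference(a, field_name, b):
    """Step 2: field `field_name` of object `a` now references object `b`."""
    info = objects.setdefault(a, DependencyInfo())
    old = info.m_objsReferencingByMe.get(field_name)
    info.m_objsReferencingByMe[field_name] = b
    # Drop `a` from the old target unless another field of `a` still points at it.
    if old is not None and old not in info.m_objsReferencingByMe.values():
        objects[old].m_objsDependingOnMe.discard(a)
    # Assumption: the new target records `a` as a dependent.
    objects.setdefault(b, DependencyInfo()).m_objsDependingOnMe.add(a)


def remove_object(name):
    """Step 3: refuse to remove an object that is still referenced."""
    info = objects.get(name)
    if info is None:
        return TASK_SUCCESS
    if info.m_objsDependingOnMe:
        return TASK_NEED_RETRY
    # Normal remove flow: release the references this object holds, then drop it.
    for target in info.m_objsReferencingByMe.values():
        objects[target].m_objsDependingOnMe.discard(name)
    del objects[name]
    return TASK_SUCCESS


# The PG still points at the old profile, so removing the profile must wait.
set_reference("BUFFER_PG|Ethernet0|3-4", "profile", "BUFFER_PROFILE|old")
first_attempt = remove_object("BUFFER_PROFILE|old")

# Once the PG is switched to the new profile, the removal goes through.
set_reference("BUFFER_PG|Ethernet0|3-4", "profile", "BUFFER_PROFILE|new")
second_attempt = remove_object("BUFFER_PROFILE|old")
```

Here the first removal attempt is deferred and the second one succeeds, which is the behaviour the commit message describes for a referenced object.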
Description
Orchagent received notifications in an order different from the order in which other daemons sent them.
This issue can be reproduced in the dynamic buffer calculation test.
Dynamic buffer calculation is a new feature which will come soon. In this feature, a buffer profile is dynamically created when a new (speed, cable length) tuple occurs and removed when it is no longer referenced.
One of the test cases has the following flow:
1. The port's speed is updated, causing the port to go down and routing entries to be withdrawn. After that, the cable length of the port is updated, causing the buffer profile on the port to be updated.
2. buffermgrd creates a new buffer profile and replaces the port's old profile on the lossless PG with the newly created one.
3. The old profile is removed.
After step 2 has been executed, the old profile should no longer be referenced, which means it's safe to remove it. However, we observed an error from sairedis saying it's still being referenced. This is because orchagent received the two notifications in the reverse of the order in which they were sent.
I suspect that orchagent handles notifications in an order different from the one in which buffermgrd sends them because of its low-level mechanism, which uses I/O multiplexing to receive notifications from redis-db:
When the system isn't busy, everything works well. But when the system is busy and notifications are backlogged in sockets, orchagent handles them in an order such as FD-ascending or FD-descending rather than the order in which they arrive.
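The suspected effect can be reproduced in isolation with a small Python sketch. It is not orchagent's actual select loop: it uses the standard `select` and `socket` modules, two local socket pairs stand in for the per-table notification channels, and the message strings and the helper `handle_ready_in_fd_order` are made up for the illustration. Both messages are already backlogged when the receiver wakes up, and it walks the ready descriptors in FD-ascending order.

```python
import select
import socket


def handle_ready_in_fd_order(receivers):
    """Read one message from each ready socket, lowest file descriptor first."""
    ready, _, _ = select.select(receivers, [], [], 1.0)
    handled = []
    for sock in sorted(ready, key=lambda s: s.fileno()):
        handled.append(sock.recv(1024).decode())
    return handled


# Two independent channels, each a (sender, receiver) pair.
pairs = [socket.socketpair() for _ in range(2)]
# Arrange them so the channel we send on first has the HIGHER receiver fd.
pairs.sort(key=lambda pair: pair[1].fileno(), reverse=True)
(pg_tx, pg_rx), (profile_tx, profile_rx) = pairs

sent = [
    "BUFFER_PG: switch lossless PG to the new profile",
    "BUFFER_PROFILE: remove the old profile",
]
pg_tx.sendall(sent[0].encode())       # sent first
profile_tx.sendall(sent[1].encode())  # sent second

# Both messages are backlogged by now, so fd order decides what is handled first.
handled = handle_ready_in_fd_order([pg_rx, profile_rx])

for tx, rx in pairs:
    tx.close()
    rx.close()
```

In this setup `handled` holds the profile removal before the PG update, the reverse of the send order. When only one message is pending per wake-up, as on an idle system, the same loop preserves the send order.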
My suggestion is to add a dependency between different buffer tables. In detail, task_need_retry should be returned in the following case:
To be more precise, we need to:
- [...] BUFFER_POOL, BUFFER_PROFILE and BUFFER_PG in the orchagent.
- [...] set to the objects. It contains the objects that depend on it; typically they are the objects that reference it.
- [...] set of the object it depends on.
  For example, when a new BUFFER_PG is added, it will be added to the set of the BUFFER_POOL object it references.
  For example, if the profile in a BUFFER_PG is updated from a to b, we need to remove the BUFFER_PG from the set of a and add it to that of b.
- [...] set of the depended object
- [...] set isn't empty.
By doing so, an entry is prevented from being released before all the references are removed.
The drawback of this solution is that a retry blocks all further notifications from that table and keeps orchagent retrying. But this won't introduce further risk as long as:
Steps to reproduce the issue:
See the above summary.
Describe the results you received:
Detailed flow and log message:
Describe the results you expected:
No error should be observed.
Additional information you deem important (e.g. issue happens only occasionally):