[Flex counter] A buffer pool object can be removed before its counter is removed even if orchagent removes the counter first #3004

Closed
stephenxs opened this issue Dec 27, 2023 · 1 comment
stephenxs commented Dec 27, 2023

Description

A buffer pool object can be removed before its counter is removed even if orchagent removes the counter first.
This defect can occur for any object that has a counter attached, because orchagent (OA) notifies sairedis of the object removal and the counter removal via different channels: ASIC_DB for the object and FLEX_COUNTER_DB for the counter. There is no mechanism to preserve the order of operations across these two channels between OA and sairedis.
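The reordering described above can be illustrated with a small model. This is only a sketch under stated assumptions: `Channels`, `Syncd`, `First`, and `replay` are hypothetical names invented for illustration and are not sairedis or orchagent code. Each channel is FIFO on its own, but nothing orders one channel relative to the other.

```cpp
#include <cassert>
#include <deque>
#include <set>
#include <string>
#include <vector>

// Hypothetical model: two independent FIFO channels, one per database.
struct Channels {
    std::deque<std::string> asicDb;         // object removals
    std::deque<std::string> flexCounterDb;  // counter removals
};

// Hypothetical consumer state standing in for syncd.
struct Syncd {
    std::set<std::string> objects;
    std::set<std::string> counters;
    std::vector<std::string> warnings;

    void removeObject(const std::string& oid) { objects.erase(oid); }

    void removeCounter(const std::string& oid) {
        // The counter outlived its object: the situation reported in this issue.
        if (objects.count(oid) == 0) {
            warnings.push_back("counter for " + oid + " removed after its object");
        }
        counters.erase(oid);
    }
};

// Which channel the consumer happens to service first.
enum class First { AsicDb, FlexCounterDb };

std::vector<std::string> replay(First first) {
    const std::string oid = "oid:0x18000000000a3d";
    Syncd syncd;
    syncd.objects.insert(oid);
    syncd.counters.insert(oid);

    // The producer enqueues the counter removal first, then the object removal.
    Channels ch;
    ch.flexCounterDb.push_back(oid);
    ch.asicDb.push_back(oid);

    auto drainAsic = [&] {
        while (!ch.asicDb.empty()) {
            syncd.removeObject(ch.asicDb.front());
            ch.asicDb.pop_front();
        }
    };
    auto drainFlex = [&] {
        while (!ch.flexCounterDb.empty()) {
            syncd.removeCounter(ch.flexCounterDb.front());
            ch.flexCounterDb.pop_front();
        }
    };

    if (first == First::AsicDb) {
        drainAsic();
        drainFlex();
    } else {
        drainFlex();
        drainAsic();
    }

    // Both removals are eventually applied; only their relative order differs.
    assert(syncd.objects.empty() && syncd.counters.empty());
    return syncd.warnings;
}
```

In this model, `replay(First::FlexCounterDb)` yields no warnings, while `replay(First::AsicDb)` yields one, even though the producer enqueued the counter removal first in both cases.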

This issue is very similar to sonic-net/sonic-buildimage#14628 which is for the RIF object.

Steps to reproduce the issue:

This is a rare case for the buffer pool object. We observed it once, just after a system warm reboot, when the zero buffer pool (used for reclaiming buffer) was removed because all ports had started up.
The zero buffer pool was removed:

2023-12-16.04:32:29.911961|BUFFER_POOL_TABLE:ingress_zero_pool|DEL

According to the code, orchagent clears the counter first and then removes the object:

    else if (op == DEL_COMMAND)
    {
...
        if (SAI_NULL_OBJECT_ID != sai_object)
        {
            clearBufferPoolWatermarkCounterIdList(sai_object);
            sai_status = sai_buffer_api->remove_buffer_pool(sai_object);
            if (SAI_STATUS_SUCCESS != sai_status)
            {
                SWSS_LOG_ERROR("Failed to remove buffer pool %s with type %s, rv:%d", object_name.c_str(), map_type_name.c_str(), sai_status);
                task_process_status handle_status = handleSaiRemoveStatus(SAI_API_BUFFER, sai_status);
                if (handle_status != task_process_status::task_success)
                {
                    return handle_status;
                }
            }
            SWSS_LOG_NOTICE("Removed buffer pool %s with type %s", object_name.c_str(), map_type_name.c_str());
        }

clearBufferPoolWatermarkCounterIdList removes the corresponding entry from FLEX_COUNTER_DB:

void BufferOrch::clearBufferPoolWatermarkCounterIdList(const sai_object_id_t object_id)
{
    if (m_isBufferPoolWatermarkCounterIdListGenerated)
    {
        string key = BUFFER_POOL_WATERMARK_STAT_COUNTER_FLEX_COUNTER_GROUP ":" + sai_serialize_object_id(object_id);
        m_flexCounterTable->del(key);
    }
}

However, the log shows that the counter was still accessed, and only then removed, after the buffer pool had already been removed:

Dec 16 06:32:30.469317 r-spider-05 INFO syncd#SDK: :- processSingleEvent: key: SAI_OBJECT_TYPE_BUFFER_POOL:oid:0x18000000000a3d op: remove
Dec 16 06:32:30.469317 r-spider-05 NOTICE syncd#SDK: [SAI_BUFFER.NOTICE] ./src/mlnx_sai_buffer.c[2221]- mlnx_sai_remove_buffer_pool: Remove BUFFER_POOL [OID:0x400000018] [sx_cos_pool_id:4]
Dec 16 06:32:30.470509 r-spider-05 INFO syncd#SDK: :- sendApiResponse: sending response for SAI_COMMON_API_REMOVE api with status: SAI_STATUS_SUCCESS

Dec 16 06:32:30.722009 r-spider-05 INFO syncd#SDK: :- tryTranslateVidToRid: unable to get RID for VID oid:0x18000000000a3d
Dec 16 06:32:30.722061 r-spider-05 WARNING syncd#SDK: :- processFlexCounterEvent: port VID oid:0x18000000000a3d, was not found (probably port was removed/splitted) and will remove from counters now

Describe the results you received:

Describe the results you expected:

Output of show version:

SONiC Software Version: SONiC.202305_RC.51-6416e238c_Internal
SONiC OS Version: 11
Distribution: Debian 11.8
Kernel: 5.10.0-23-2-amd64
Build commit: 6416e238c
Build date: Thu Dec 14 04:28:16 UTC 2023
Built by: sw-r2d2-bot@r-build-sonic-ci02-242

Platform: x86_64-mlnx_msn2410-r0
HwSKU: ACS-MSN2410
ASIC: mellanox
ASIC Count: 1
Serial Number: MT1921X01546
Model Number: MSN2410-CB2FO
Hardware Revision: A2
Uptime: 06:42:44 up 4 min,  1 user,  load average: 2.51, 3.07, 1.58
Date: Sat 16 Dec 2023 06:42:44

Docker images:
REPOSITORY                                         TAG                               IMAGE ID       SIZE
docker-syncd-mlnx                                  202305_RC.51-6416e238c_Internal   a16b904ebab1   838MB
docker-syncd-mlnx                                  latest                            a16b904ebab1   838MB
docker-platform-monitor                            202305_RC.51-6416e238c_Internal   82275a00b244   829MB
docker-platform-monitor                            latest                            82275a00b244   829MB
docker-dhcp-relay                                  latest                            1e4780b04384   308MB
docker-macsec                                      latest                            7987aec2df36   320MB
docker-eventd                                      202305_RC.51-6416e238c_Internal   dcceb37f9932   300MB
docker-eventd                                      latest                            dcceb37f9932   300MB
docker-teamd                                       202305_RC.51-6416e238c_Internal   e41fe22b368e   318MB
docker-teamd                                       latest                            e41fe22b368e   318MB
docker-orchagent                                   202305_RC.51-6416e238c_Internal   f147bf2f5dc4   330MB
docker-orchagent                                   latest                            f147bf2f5dc4   330MB
docker-fpm-frr                                     202305_RC.51-6416e238c_Internal   d699f298db1e   350MB
docker-fpm-frr                                     latest                            d699f298db1e   350MB
docker-nat                                         202305_RC.51-6416e238c_Internal   2b56e4943d9d   321MB
docker-nat                                         latest                            2b56e4943d9d   321MB
docker-sflow                                       202305_RC.51-6416e238c_Internal   8fa0c1ba3454   320MB
docker-sflow                                       latest                            8fa0c1ba3454   320MB
docker-sonic-telemetry                             202305_RC.51-6416e238c_Internal   3a0d24f463e1   387MB
docker-sonic-telemetry                             latest                            3a0d24f463e1   387MB
docker-snmp                                        202305_RC.51-6416e238c_Internal   878593cde6f1   340MB
docker-snmp                                        latest                            878593cde6f1   340MB
docker-lldp                                        202305_RC.51-6416e238c_Internal   964407cc4c7b   343MB
docker-lldp                                        latest                            964407cc4c7b   343MB
docker-database                                    202305_RC.51-6416e238c_Internal   9fdbdd6b2caf   301MB
docker-database                                    latest                            9fdbdd6b2caf   301MB
docker-mux                                         202305_RC.51-6416e238c_Internal   e4f0d9b05c52   349MB
docker-mux                                         latest                            e4f0d9b05c52   349MB
docker-router-advertiser                           202305_RC.51-6416e238c_Internal   409706828615   301MB
docker-router-advertiser                           latest                            409706828615   301MB
docker-sonic-mgmt-framework                        202305_RC.51-6416e238c_Internal   2a36085b813c   416MB
docker-sonic-mgmt-framework                        latest                            2a36085b813c   416MB
urm.nvidia.com/sw-nbu-sws-sonic-docker/sonic-wjh   1.6.0-202305-25                   3e820a00274a   433MB
urm.nvidia.com/sw-nbu-sws-sonic-docker/doai        1.1.0-202305-36                   f3755210d1c0   276MB

Output of show techsupport:

(paste your output here or download and attach the file here )

Additional information you deem important (e.g. issue happens only occasionally):

@stephenxs stephenxs changed the title [Buffer pool] A buffer pool object can be removed before its counter is removed even if orchagent removes the counter first [Flex counter] A buffer pool object can be removed before its counter is removed even if orchagent removes the counter first Dec 27, 2023
@lguohan lguohan transferred this issue from sonic-net/sonic-buildimage Jan 3, 2024
@stephenxs stephenxs self-assigned this May 17, 2024
stephenxs commented

Fixed by #3076 and sonic-net/sonic-sairedis#1362
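The linked changes are not reproduced here. As a general illustration only, the sketch below shows why this class of reordering cannot happen when both removals travel through a single FIFO channel: the consumer necessarily sees them in the order the producer enqueued them. `Op`, `Request`, `enqueueRemoval`, and `replaySingleChannel` are hypothetical names, and this is not claimed to be how the fix is implemented.

```cpp
#include <cassert>
#include <deque>
#include <set>
#include <string>

// Hypothetical request types carried on one shared channel.
enum class Op { RemoveCounter, RemoveObject };

struct Request {
    Op op;
    std::string oid;
};

// Producer side: enqueue the counter removal, then the object removal.
void enqueueRemoval(std::deque<Request>& channel, const std::string& oid) {
    assert(!oid.empty());
    channel.push_back({Op::RemoveCounter, oid});
    channel.push_back({Op::RemoveObject, oid});
}

// Consumer side: process requests strictly in FIFO order and count the
// counter removals that arrive after their object has already been removed.
int replaySingleChannel(const std::deque<Request>& channel,
                        std::set<std::string> objects) {
    int lateCounterRemovals = 0;
    for (const Request& r : channel) {
        if (r.op == Op::RemoveObject) {
            objects.erase(r.oid);
        } else if (objects.count(r.oid) == 0) {
            ++lateCounterRemovals;
        }
    }
    return lateCounterRemovals;
}
```

With a single channel, a removal enqueued through `enqueueRemoval` is always replayed with zero late counter removals, because the consumer has no opportunity to service the object removal ahead of the counter removal.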
