Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

orchangent crash during crm acl_resouce_list polling in armhf environment #801

Closed
rajkumar38 opened this issue Feb 26, 2021 · 42 comments · Fixed by sonic-net/sonic-buildimage#7366

Comments

@rajkumar38
Copy link
Contributor

rajkumar38 commented Feb 26, 2021

Below is the gdb stack trace.
Looks like libboost1.71 pkg causing the crash. Issue is not seen after reverting PR, sonic-net/sonic-buildimage#6532
Looks like basic_json constructor is invoked with value_t::null instead of value_t::array.
gdb) bt
#0 0xb68cb746 in ?? () from /lib/arm-linux-gnueabihf/libc.so.6
#1 0xb68d90ae in raise () from /lib/arm-linux-gnueabihf/libc.so.6
#2 0xb68cb1f2 in abort () from /lib/arm-linux-gnueabihf/libc.so.6
#3 0xb6ac4888 in __gnu_cxx::__verbose_terminate_handler() ()
from /usr/lib/arm-linux-gnueabihf/libstdc++.so.6
#4 0xb6ac31ec in ?? () from /usr/lib/arm-linux-gnueabihf/libstdc++.so.6
#5 0xb6ac3242 in std::terminate() ()
from /usr/lib/arm-linux-gnueabihf/libstdc++.so.6
#6 0xb6ac347c in __cxa_throw ()
from /usr/lib/arm-linux-gnueabihf/libstdc++.so.6
#7 0xb6c20a70 in nlohmann::basic_json<std::map, std::vector, std::__cxx11::basic_string<char, std::char_traits, std::allocator >, bool, long long, unsigned long long, double, std::allocator>::push_back (this=0xbecee3a8,
val=...) at /usr/include/swss/json.hpp:4710
#8 0xb6c16458 in sai_serialize_acl_resource_list[abi:cxx11](_sai_acl_resource_list_t const&, bool) (aclresource=..., countOnly=countOnly@entry=false)
at saiserialize.cpp:1277
#9 0xb6c1688e in sai_serialize_attr_value[abi:cxx11](_sai_attr_metadata_t const&, _sai_attribute_t const&, bool) (meta=..., attr=...,
countOnly=countOnly@entry=false) at saiserialize.cpp:1613
#10 0xb6c0bf1e in saimeta::SaiAttributeList::serialize_attr_list[abi:cxx11](_sai_object_type_t, unsigned int, _sai_attribute_t const*, bool) (
objectType=objectType@entry=SAI_OBJECT_TYPE_SWITCH,
--Type for more, q to quit, c to continue without paging--
attr_count=attr_count@entry=1, attr_list=attr_list@entry=0xbecee808,
countOnly=countOnly@entry=false) at SaiAttributeList.cpp:120
#11 0xb6e72862 in sairedis::RedisRemoteSaiInterface::get (
this=this@entry=0x8a1bd8,
objectType=objectType@entry=SAI_OBJECT_TYPE_SWITCH,
serializedObjectId="oid:0x21", '0' <repeats 12 times>,
attr_count=attr_count@entry=1, attr_list=attr_list@entry=0xbecee808)
at RedisRemoteSaiInterface.cpp:804
#12 0xb6e72aac in sairedis::RedisRemoteSaiInterface::get (this=0x8a1bd8,
objectType=SAI_OBJECT_TYPE_SWITCH, objectId=, attr_count=1,
attr_list=0xbecee808) at RedisRemoteSaiInterface.cpp:566
#13 0xb6c30426 in saimeta::Meta::get (this=0x8a3f0c,
object_type=SAI_OBJECT_TYPE_SWITCH, object_id=,
attr_count=1, attr_list=0xbecee808) at Meta.cpp:1528
#14 0xb6e5ce34 in sairedis::Sai::get (this=0x889c84,
objectType=SAI_OBJECT_TYPE_SWITCH, objectId=, attr_count=1,
attr_list=0xbecee808) at Sai.cpp:240
#15 0xb6e5877c in redis_get_switch_attribute (object_id=,
attr_count=1, attr_list=0xbecee808) at sai_redis_switch.cpp:27
#16 0x00580016 in ?? ()
#17 0x0058032a in ?? ()
#18 0x004e0818 in ?? ()
#19 0x004c3f44 in ?? ()
--Type for more, q to quit, c to continue without paging--
#20 0xb68cb524 in __libc_start_main () from /lib/arm-linux-gnueabihf/libc.so.6
#21 0x004d9d38 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) fr 7
#7 0xb6c20a70 in nlohmann::basic_json<std::map, std::vector, std::__cxx11::basic_string<char, std::char_traits, std::allocator >, bool, long long, unsigned long long, double, std::allocator>::push_back (this=0xbecee3a8,
val=...) at /usr/include/swss/json.hpp:4710
4710 /usr/include/swss/json.hpp: No such file or directory.
(gdb) p* this
$1 = {
m_type = nlohmann::basic_json<std::map, std::vector, std::_cxx11::basic_string<char, std::char_traits, std::allocator >, bool, long long, unsigned long long, double, std::allocator>::value_t::object, m_value = {
object = 0x9d2c70, array = 0x9d2c70, string = 0x9d2c70, boolean = 112,
number_integer = 10300528, number_unsigned = 10300528,
number_float = 5.0891370188258436e-317}}
(gdb) fr 8
#8 0xb6c16458 in sai_serialize_acl_resource_list[abi:cxx11](sai_acl_resource_list_t const&, bool) (aclresource=..., countOnly=countOnly@entry=false)
at saiserialize.cpp:1277
1277 saiserialize.cpp: No such file or directory.
(gdb) info local
item = {
m_type = nlohmann::basic_json<std::map, std::vector, std::__cxx11::basic_string<char, std::char_traits, std::allocator >, bool, long long, unsigned long long, double, std::allocator>::value_t::object, m_value = {
object = 0x9cd930, array = 0x9cd930, string = 0x9cd930, boolean = 48,
number_integer = 10279216, number_unsigned = 10279216,
number_float = 5.0786074917816749e-317}}
i = 0
logger__LINE
= {m_line = 1258,
m_fun = 0xb6c39414 <sai_serialize_acl_resource_list[abi:cxx11](_sai_acl_resource_list_t const&, bool)::FUNCTION> "sai_serialize_acl_resource_list"}
FUNCTION = "sai_serialize_acl_resource_list"
j = {
m_type = nlohmann::basic_json<std::map, std::vector, std::__cxx11::basic_string<char, std::char_traits, std::allocator >, bool, long long, unsigned long long, double, std::allocator>::value_t::object, m_value = {
object = 0x9cb870, array = 0x9cb870, string = 0x9cb870, boolean = 112,
number_integer = 10270832, number_unsigned = 10270832,
number_float = 5.0744652454069419e-317}}
arr = {
m_type = nlohmann::basic_json<std::map, std::vector, std::__cxx11::basic_string<char, std::char_traits, std::allocator >, bool, long long, unsigned long long, double, std::allocator>::value_t::object, m_value = {
--Type for more, q to quit, c to continue without paging--
object = 0x9d2c70, array = 0x9d2c70, string = 0x9d2c70, boolean = 112,
number_integer = 10300528, number_unsigned = 10300528,
number_float = 5.0891370188258436e-317}}

With break-points:

Thread 1 "orchagent" hit Breakpoint 2, nlohmann::basic_json<std::map, std::vector, std::__cxx11::basic_string<char, std::char_traits, std::allocator >, bool, long long, unsigned long long, double, std::allocator>::basic_json (
this=0xbef1a3a8, init=..., type_deduction=false,
manual_type=nlohmann::basic_json<std::map, std::vector, std::__cxx11::basic_string<char, std::char_traits, std::allocator >, bool, long long, unsigned long long, double, std::allocator>::value_t::null) at json.hpp:1520
1520 json.hpp: No such file or directory.
(gdb) bt
#0 nlohmann::basic_json<std::map, std::vector, std::__cxx11::basic_string<char, std::char_traits, std::allocator >, bool, long long, unsigned long long, double, std::allocator>::basic_json (this=0xbef1a3a8, init=...,
type_deduction=false,
manual_type=nlohmann::basic_json<std::map, std::vector, std::__cxx11::basic_string<char, std::char_traits, std::allocator >, bool, long long, unsigned long long, double, std::allocator>::value_t::null) at json.hpp:1520
#1 0xb6c2443e in nlohmann::basic_json<std::map, std::vector, std::__cxx11::basic_string<char, std::char_traits, std::allocator >, bool, long long, unsigned long long, double, std::allocator>::array (init=...)
at /usr/include/swss/json.hpp:1615
#2 sai_serialize_acl_resource_list[abi:cxx11](_sai_acl_resource_list_t const&, bool) (aclresource=..., countOnly=countOnly@entry=false)
at saiserialize.cpp:1271
#3 0xb6c2488e in sai_serialize_attr_value[abi:cxx11](_sai_attr_metadata_t const&, _sai_attribute_t const&, bool) (meta=..., attr=...,
countOnly=countOnly@entry=false) at saiserialize.cpp:1613
#4 0xb6c19f1e in saimeta::SaiAttributeList::serialize_attr_list[abi:cxx11](_sai_object_type_t, unsigned int, _sai_attribute_t const*, bool) (
objectType=objectType@entry=SAI_OBJECT_TYPE_SWITCH,
attr_count=attr_count@entry=1, attr_list=attr_list@entry=0xbef1a808,
countOnly=countOnly@entry=false) at SaiAttributeList.cpp:120
#5 0xb6e80862 in sairedis::RedisRemoteSaiInterface::get (
--Type for more, q to quit, c to continue without paging--
this=this@entry=0x1493c80,
objectType=objectType@entry=SAI_OBJECT_TYPE_SWITCH,
serializedObjectId="oid:0x21", '0' <repeats 12 times>,
attr_count=attr_count@entry=1, attr_list=attr_list@entry=0xbef1a808)
at RedisRemoteSaiInterface.cpp:804
#6 0xb6e80aac in sairedis::RedisRemoteSaiInterface::get (this=0x1493c80,
objectType=SAI_OBJECT_TYPE_SWITCH, objectId=, attr_count=1,
attr_list=0xbef1a808) at RedisRemoteSaiInterface.cpp:566
#7 0xb6c3e426 in saimeta::Meta::get (this=0x1496084,
object_type=SAI_OBJECT_TYPE_SWITCH, object_id=,
attr_count=1, attr_list=0xbef1a808) at Meta.cpp:1528
#8 0xb6e6ae34 in sairedis::Sai::get (this=0x147bc84,
objectType=SAI_OBJECT_TYPE_SWITCH, objectId=, attr_count=1,
attr_list=0xbef1a808) at Sai.cpp:240
#9 0xb6e6677c in redis_get_switch_attribute (object_id=,
attr_count=1, attr_list=0xbef1a808) at sai_redis_switch.cpp:27
#10 0x004f0016 in ?? ()
#11 0x004f032a in ?? ()
#12 0x00450818 in ?? ()
#13 0x00433f44 in ?? ()
#14 0xb68d9524 in __libc_start_main () from /lib/arm-linux-gnueabihf/libc.so.6
#15 0x00449d38 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) p manual_type
$1 = nlohmann::basic_json<std::map, std::vector, std::__cxx11::basic_string<char, std::char_traits, std::allocator >, bool, long long, unsigned long long, double, std::allocator>::value_t::null

(gdb)

@lguohan
Copy link
Contributor

lguohan commented Mar 1, 2021

@taahme, any idea why the regression is caused the boost 1.71 upgrade?

rajkumar38 added a commit to rajkumar38/sonic-swss-common-1 that referenced this issue Mar 11, 2021
…or call.

Default value set to value_t::array in constructor.
Issue: sonic-net/sonic-sairedis#801

Signed-off-by: Rajkumar Pennadam Ramamoorthy <rpennadamram@marvell.com>
rajkumar38 added a commit to rajkumar38/sonic-swss-common-1 that referenced this issue Mar 11, 2021
…or call.

Default value set to value_t::array in constructor.
Issue: sonic-net/sonic-sairedis#801

Signed-off-by: Rajkumar Pennadam Ramamoorthy <rpennadamram@marvell.com>
@rajkumar38
Copy link
Contributor Author

rajkumar38 commented Mar 16, 2021

When setting a hardware watchpoint to the stack memory address, the value is not changing from 2 (value_t::array) to 0 (value_t::null), instead value 0 is directly written to memory.
Looks like kernel is changing this memory.
Note that watch is set much before the line (1271), that is causing the crash.

(gdb) x/256wa $sp
0xbe9b5368: 0xb6f4f320 0xb6f50968 <_stack_chk_guard> 0xbe9b53ac 0xb38
0xbe9b5378: 0xb6c059c0 <ZZ26sai_deserialize_attr_valueRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERK20_sai_attr_metadata_tR16_sai_attribute_tbE12__FUNCTION
> 0xbe9b5420 0x1 0xb6f4f320
0xbe9b5388: 0xb6c212bc _Z31sai_serialize_acl_resource_listB5cxx11RK24_sai_acl_resource_list_tb@got.plt 0xb6f40010 0xb6f404d0 0x0
0xbe9b5398: 0x1 0xbe9b54cc 0xbe9b54e4 0xb6f32fa3
0xbe9b53a8: 0x0 0x1 0x5 0x0
0xbe9b53b8: 0xbe9b5454 0xb6bc6aac 0x21 0x0
0xbe9b53c8: 0xb6c21000 0xbe9b540c 0xb6f50968 <__stack_chk_guard> 0x0
0xbe9b53d8: 0xb6c21000 0xbe9b540c 0xb6f50968 <__stack_chk_guard> 0xbe9b5808
0xbe9b53e8: 0xb6d4d670 <sai_metadata_attr_SAI_SWITCH_ATTR_AVAILABLE_ACL_TABLE> 0xbe9b54cc 0xbe9b54e4 0xb6be288f <sai_serialize_attr_value[abi:cxx11](_sai_attr_metadata_t const&, _sai_attribute_t const&, bool)+462>
0xbe9b53f8: 0xb6f50968 <__stack_chk_guard> 0xb6f50968 <__stack_chk_guard> 0xb6d4d670 <sai_metadata_attr_SAI_SWITCH_ATTR_AVAILABLE_ACL_TABLE> 0xb6a8f867 <Znwj+18>
0xbe9b5408: 0x24 0x5f4 0xb6c0545c <ZZ24sai_serialize_attr_valueB5cxx11RK20_sai_attr_metadata_tRK16_sai_attribute_tbE12__FUNCTION
> 0xb6d17fbc
0xbe9b5418: 0xb6f50968 <__stack_chk_guard> 0x67446900 0x21 0xbe9b54--Type for more, q to quit, c to continue without paging--q
Quit
(gdb) watch *0xbe9b5368
Hardware watchpoint 2: *0xbe9b5368
(gdb) n
1258 in saiserialize.cpp
(gdb)
1260 in saiserialize.cpp
(gdb)
1257 in saiserialize.cpp
(gdb)
1258 in saiserialize.cpp
(gdb)
1257 in saiserialize.cpp
(gdb)
1258 in saiserialize.cpp
(gdb)
1262 in saiserialize.cpp
(gdb)
1260 in saiserialize.cpp
(gdb)
1262 in saiserialize.cpp
(gdb)
1260 in saiserialize.cpp
(gdb)
1262 in saiserialize.cpp
(gdb)
1264 in saiserialize.cpp
(gdb)
1271 in saiserialize.cpp
(gdb)

Thread 1 "orchagent" hit Hardware watchpoint 2: *0xbe9b5368

Old value = -1225460960
New value = 0
0xb6be2436 in nlohmann::basic_json<std::map, std::vector, std::__cxx11::basic_string<char, std::char_traits, std::allocator >, bool, long long, unsigned long long, double, std::allocator>::array (init=...)
at /usr/include/swss/json.hpp:1615
1615 /usr/include/swss/json.hpp: No such file or directory.

Attached full log of gdb.
watchpoint_gdb.txt

@kcudnik
Copy link
Collaborator

kcudnik commented Mar 16, 2021

kernel changing memory ? rather user space

@rajkumar38
Copy link
Contributor Author

kernel changing memory ? rather user space

No other thread in orchagent process context is writing in this memory.
Looks like this memory is modified in kernel space.

@kcudnik
Copy link
Collaborator

kcudnik commented Mar 19, 2021

hmm, this code is executed inside sairedis lib, and all apis are protected by mutex, not sure how other thread could modify this, and why kernel ? that seems really messed up, unless there is some bad allocation in json.hpp
And it's wired, that this is only happening on arm, since it's happening on acl serialization, it should be easy to extract this code just to single line in some main.cpp just to confirm that this is just related to json and not some other OA stuff

@rajkumar38
Copy link
Contributor Author

hmm, this code is executed inside sairedis lib, and all apis are protected by mutex, not sure how other thread could modify this, and why kernel ? that seems really messed up, unless there is some bad allocation in json.hpp
And it's wired, that this is only happening on arm, since it's happening on acl serialization, it should be easy to extract this code just to single line in some main.cpp just to confirm that this is just related to json and not some other OA stuff

Issue is not related to json.hpp. I wrote simple main.cpp (with specific code causing problem), no crash observed.

@kcudnik
Copy link
Collaborator

kcudnik commented Mar 22, 2021

if it's not json.hpp, do you have suspect for a root cause? and if it's not json.hpp what's about then #802?? and then moved this to swss-common to fix json.hpp internally? this suggest this is something wrong with json.hpp, or compiler issue, or perhaps emulation of armrh

@lguohan
Copy link
Contributor

lguohan commented Mar 22, 2021

@rajkumar38, how frequently do you see this issue on arm?

@rajkumar38
Copy link
Contributor Author

@rajkumar38, how frequently do you see this issue on arm?

Issue is consistently reproducible.

@kcudnik
Copy link
Collaborator

kcudnik commented Mar 24, 2021

will you be able to narrow this down by removing more and more components from OA since you said that simple main.cpp worked fine (did you also compiled this with exact same flags that sairdis is compiled? and with the same input for serialize method?). Or is there a possibility to access your device ? i can then try do do it myself.

Also, is this the only place this happens? i mentioned before that array() json is used in many places in serialieze, just want to narrow this if this is strictly related to json or acl_resource_list serialiaze

@rajkumar38
Copy link
Contributor Author

will you be able to narrow this down by removing more and more components from OA since you said that simple main.cpp worked fine (did you also compiled this with exact same flags that sairdis is compiled? and with the same input for serialize method?). Or is there a possibility to access your device ? i can then try do do it myself.

Can you be more specific on statement "will you be able to narrow this down by removing more and more components from OA" ?
Attached sample.cpp
sample_cpp.txt

compiled with flags
g++ -g -O2 -fdebug-prefix-map=/sonic/src/sonic-sairedis=. -fstack-protector-strong -Wformat -Werror=format-security sample.cpp -o samp

I'm checking on the possibility of device access internally. will get back to you on this.

Also, is this the only place this happens? i mentioned before that array() json is used in many places in serialieze, just want to narrow this if this is strictly related to json or acl_resource_list serialiaze

We are consistently see the issue only in array() json in acl_resource_list_serialize.
I don't think issue is with json array() call in acl_resource_list_serialize. Root cause is stack corruption which causing json failure in this flow.

@kcudnik
Copy link
Collaborator

kcudnik commented Mar 25, 2021

thanks for sharing example, but you are missing a lot of flags when sairedis/meta is compiled for example saiserialize.cpp where acl serialize method exists is compiled like this:

libtool: compile:  g++ -DHAVE_CONFIG_H -I. -I.. -g -I../SAI/inc -I../SAI/meta -I../SAI/experimental -I../lib/inc -ansi -fPIC -std=c++11 -Wall -Wcast-align -Wcast-qual -Wconversion -Wdisabled-optimization -Werror -Wextra -Wfloat-equal -Wformat=2 -Wformat-nonliteral -Wformat-security -Wformat-y2k -Wimport -Winit-self -Wno-inline -Winvalid-pch -Wmissing-field-initializers -Wmissing-format-attribute -Wmissing-include-dirs -Wmissing-noreturn -Wno-aggregate-return -Wno-padded -Wno-switch-enum -Wno-unused-parameter -Wpacked -Wpointer-arith -Wredundant-decls -Wshadow -Wstack-protector -Wstrict-aliasing=3 -Wswitch -Wswitch-default -Wunreachable-code -Wunused -Wvariadic-macros -Wwrite-strings -Wno-switch-default -Wconversion -Wno-psabi -Wcast-align=strict -I/home/sonic/include -I/home/sonic/include -O2 -MT libsaimetadata_la-saiserialize.lo -MD -MP -MF .deps/libsaimetadata_la-saiserialize.Tpo -c saiserialize.cpp  -fPIC -DPIC -o .libs/libsaimetadata_la-saiserialize.o
libtool: compile:  g++ -DHAVE_CONFIG_H -I. -I.. -g -I../SAI/inc -I../SAI/meta -I../SAI/experimental -I../lib/inc -ansi -fPIC -std=c++11 -Wall -Wcast-align -Wcast-qual -Wconversion -Wdisabled-optimization -Werror -Wextra -Wfloat-equal -Wformat=2 -Wformat-nonliteral -Wformat-security -Wformat-y2k -Wimport -Winit-self -Wno-inline -Winvalid-pch -Wmissing-field-initializers -Wmissing-format-attribute -Wmissing-include-dirs -Wmissing-noreturn -Wno-aggregate-return -Wno-padded -Wno-switch-enum -Wno-unused-parameter -Wpacked -Wpointer-arith -Wredundant-decls -Wshadow -Wstack-protector -Wstrict-aliasing=3 -Wswitch -Wswitch-default -Wunreachable-code -Wunused -Wvariadic-macros -Wwrite-strings -Wno-switch-default -Wconversion -Wno-psabi -Wcast-align=strict -I/home/sonic/include -I/home/sonic/include -O2 -MT libsaimetadata_la-saiserialize.lo -MD -MP -MF .deps/libsaimetadata_la-saiserialize.Tpo -c saiserialize.cpp -o libsaimetadata_la-saiserialize.o >/dev/null 2>&1

and those options can have different effect on compilation (also this is cmd for x64 not for arm), warnings can be probably skipped but those options: -ansi -fPIC -std=c++11 potentially can generate some other binary code

for test program i was rather thinking about something like this:

#include "meta/sai_serialize.h"

int main()
{
    sai_acl_resource_list_t res = { }; // populate with some values
    sai_serialize_acl_resource_list(res, false); // or use sai_serialize_attr_value instead
}

and compile this like this g++ main.cpp -I sairedis/meta -Lsairedis/meta/.libs -lsaimeta -lsaimetadata
note that those will be shared libraries and not compiled into main itself

and saimeta and saimetadata should be compiled by buildimage, to rule out some other factor, and confirm whether issue still exists, and this will give some more insight, if it exists still, then problem is in json.hpp or compilaton of saimeta/saimetadata, and if error don't exists, then it will seem like it's strictly related to orchagent only and not actual json array issue.

you can clone sairedis repo, do ./autogen.sh; ./configure; cd meta; make
to compile just metadata with that serialize method, all compilation flags should be included then, or you can just take *so libraries directly from docker on device itself and link main program.

if you need some help, maybe it will be faster to setup some meeting and live solve problem, this will be also definitely faster

if this investigation will still show no error, then it would suggest that error is related strictly to OA, and by "narrow this down" i meant removing code from OA, compile it and test if it still hit the issue, function that takes those resources is located in
void CrmOrch::getResAvailableCounters(), so for example call to that function could be moved directly after create_switch() in OA main.cpp without starting any other tasks, to see if there is still some errors there

@rajkumar38
Copy link
Contributor Author

rajkumar38 commented Mar 28, 2021

thanks for sharing example, but you are missing a lot of flags when sairedis/meta is compiled for example saiserialize.cpp where acl serialize method exists is compiled like this:

libtool: compile:  g++ -DHAVE_CONFIG_H -I. -I.. -g -I../SAI/inc -I../SAI/meta -I../SAI/experimental -I../lib/inc -ansi -fPIC -std=c++11 -Wall -Wcast-align -Wcast-qual -Wconversion -Wdisabled-optimization -Werror -Wextra -Wfloat-equal -Wformat=2 -Wformat-nonliteral -Wformat-security -Wformat-y2k -Wimport -Winit-self -Wno-inline -Winvalid-pch -Wmissing-field-initializers -Wmissing-format-attribute -Wmissing-include-dirs -Wmissing-noreturn -Wno-aggregate-return -Wno-padded -Wno-switch-enum -Wno-unused-parameter -Wpacked -Wpointer-arith -Wredundant-decls -Wshadow -Wstack-protector -Wstrict-aliasing=3 -Wswitch -Wswitch-default -Wunreachable-code -Wunused -Wvariadic-macros -Wwrite-strings -Wno-switch-default -Wconversion -Wno-psabi -Wcast-align=strict -I/home/sonic/include -I/home/sonic/include -O2 -MT libsaimetadata_la-saiserialize.lo -MD -MP -MF .deps/libsaimetadata_la-saiserialize.Tpo -c saiserialize.cpp  -fPIC -DPIC -o .libs/libsaimetadata_la-saiserialize.o
libtool: compile:  g++ -DHAVE_CONFIG_H -I. -I.. -g -I../SAI/inc -I../SAI/meta -I../SAI/experimental -I../lib/inc -ansi -fPIC -std=c++11 -Wall -Wcast-align -Wcast-qual -Wconversion -Wdisabled-optimization -Werror -Wextra -Wfloat-equal -Wformat=2 -Wformat-nonliteral -Wformat-security -Wformat-y2k -Wimport -Winit-self -Wno-inline -Winvalid-pch -Wmissing-field-initializers -Wmissing-format-attribute -Wmissing-include-dirs -Wmissing-noreturn -Wno-aggregate-return -Wno-padded -Wno-switch-enum -Wno-unused-parameter -Wpacked -Wpointer-arith -Wredundant-decls -Wshadow -Wstack-protector -Wstrict-aliasing=3 -Wswitch -Wswitch-default -Wunreachable-code -Wunused -Wvariadic-macros -Wwrite-strings -Wno-switch-default -Wconversion -Wno-psabi -Wcast-align=strict -I/home/sonic/include -I/home/sonic/include -O2 -MT libsaimetadata_la-saiserialize.lo -MD -MP -MF .deps/libsaimetadata_la-saiserialize.Tpo -c saiserialize.cpp -o libsaimetadata_la-saiserialize.o >/dev/null 2>&1

and those options can have different effect on compilation (also this is cmd for x64 not for arm), warnings can be probably skipped but those options: -ansi -fPIC -std=c++11 potentially can generate some other binary code

for test program i was rather thinking about something like this:

#include "meta/sai_serialize.h"

int main()
{
    sai_acl_resource_list_t res = { }; // populate with some values
    sai_serialize_acl_resource_list(res, false); // or use sai_serialize_attr_value instead
}

and compile this like this g++ main.cpp -I sairedis/meta -Lsairedis/meta/.libs -lsaimeta -lsaimetadata
note that those will be shared libraries and not compiled into main itself

and saimeta and saimetadata should be compiled by buildimage, to rule out some other factor, and confirm whether issue still exists, and this will give some more insight, if it exists still, then problem is in json.hpp or compilaton of saimeta/saimetadata, and if error don't exists, then it will seem like it's strictly related to orchagent only and not actual json array issue.

you can clone sairedis repo, do ./autogen.sh; ./configure; cd meta; make
to compile just metadata with that serialize method, all compilation flags should be included then, or you can just take *so libraries directly from docker on device itself and link main program.

if you need some help, maybe it will be faster to setup some meeting and live solve problem, this will be also definitely faster

if this investigation will still show no error, then it would suggest that error is related strictly to OA, and by "narrow this down" i meant removing code from OA, compile it and test if it still hit the issue, function that takes those resources is located in
void CrmOrch::getResAvailableCounters(), so for example call to that function could be moved directly after create_switch() in OA main.cpp without starting any other tasks, to see if there is still some errors there

Thanks for the inputs. I followed the procedure and compiled the main.cpp linking shared-libs from the compiled debians.
I'm seeing the crash with main.cpp.
Let me know if you want to try anything more on this.

#include "sai_serialize.h"
#include "saitypes.h"
#include <vector>

#define CRM_ACL_RESOURCE_COUNT 256
extern std::string sai_serialize_acl_resource_list(
                        _In_ const sai_acl_resource_list_t& aclresource,
                                _In_ bool countOnly);
int main()
{
        std::vector<sai_acl_resource_t> resources(CRM_ACL_RESOURCE_COUNT);
        sai_acl_resource_list_t res;
        res.count = CRM_ACL_RESOURCE_COUNT;
        res.list = resources.data();
        sai_serialize_acl_resource_list(res, false);
}
g++ main.cpp -I. -I../ -I../SAI/inc -I../SAI/meta -I../SAI/experimental -I../lib/inc -I ../../sonic-swss-common/debian/libswsscommon-dev/usr/include -ansi -fPIC -std=c++11 -L./libs/usr/lib/arm-linux-gnueabihf -L./libs/lib/arm-linux-gnueabihf -lnl-3 -lnl-genl-3 -lnl-route-3 -lpthread -lsairedis -lswsscommon -lsaimeta -lsaimetadata -lzmq -lhiredis -lnl-nf-3 -o sampProg
root@localhost:/home/admin# docker exec -ti swss bash
root@localhost:/# cd /usr/bin/
root@localhost:/usr/bin# ./sampProg
terminate called after throwing an instance of 'std::domain_error'
  what():  cannot use push_back() with object
Aborted (core dumped)

@kcudnik
Copy link
Collaborator

kcudnik commented Mar 29, 2021

Ok, so now we have a different exception or maybe it's the same ? push_back seems to come from json.hpp, and place is probably similar as in case when OA calls the code. For future root causing this, we can take out function code from sai_serialize_acl_resource_list, and put that function directly into main.cpp, so no libs would be required to compile this. If error will still persist, then you can try to remove compile flags one by one to see if there is any effect on that.

Also you code is missing "-O2" argument, which can be crucial when compiling code inside main.cpp. I'm still betting this some compiler code optimization bug on armhf

@rajkumar38
Copy link
Contributor Author

Ok, so now we have a different exception or maybe it's the same ? push_back seems to come from json.hpp, and place is probably similar as in case when OA calls the code. For future root causing this, we can take out function code from sai_serialize_acl_resource_list, and put that function directly into main.cpp, so no libs would be required to compile this. If error will still persist, then you can try to remove compile flags one by one to see if there is any effect on that.

It is the same exception similar to OA call.
I moved sai_serialize_acl_resource_list API code to main.cpp(
main2_cpp.txt
) and compiled without libs as below. Issue is not seen.

g++ main2.cpp -g -O2 -I. -I../ -I../SAI/inc -I../SAI/meta -I../SAI/experimental -I../lib/inc -I ../../sonic-swss-common/debian/libswsscommon-dev/usr/include -ansi -fPIC -std=c++11  -o sampProgNoLibs

Also you code is missing "-O2" argument, which can be crucial when compiling code inside main.cpp. I'm still betting this some compiler code optimization bug on armhf

Issue is seen with or without O2 flag.

FYI, this was working before in armhf env. So I tried to isolate the commit that was causing this error.
Looks like with sonic-swss-common PR 452 (sonic-net/sonic-swss-common#452), we are seeing the issue. I recompiled sonic-swss-common deb after reverting PR452 and used in linking main.cpp. Issue is not seen. But the code in this PR doesn't look problematic.

@kcudnik
Copy link
Collaborator

kcudnik commented Mar 31, 2021

Code in sonic-net/sonic-swss-common#452 is not related to serialization in any way (It can possibly generate some extra code from json.hpp since everything there is template). So not sure why would error go away if you remove it. You are mentioning also that issue is seen with O2 or without, but for what scenario ? when you compile main with libs ? O2 does not affect already compiled libs, you would need to recompile libs without O2 to potentially see the difference.

ps. if error goes away if you remove code in 452, then this maybe pointing to some issues with compiler. What compiler do you use to compile armhf gcc abi?? what version?

ps. i still suggest to have a quick call if possible, it would definitely speed up finding out the issue.

@rajkumar38
Copy link
Contributor Author

Code in Azure/sonic-swss-common#452 is not related to serialization in any way (It can possibly generate some extra code from json.hpp since everything there is template). So not sure why would error go away if you remove it. You are mentioning also that issue is seen with O2 or without, but for what scenario ? when you compile main with libs ? O2 does not affect already compiled libs, you would need to recompile libs without O2 to potentially see the difference.

ps. if error goes away if you remove code in 452, then this maybe pointing to some issues with compiler. What compiler do you use to compile armhf gcc abi?? what version?

ps. i still suggest to have a quick call if possible, it would definitely speed up finding out the issue.

I didn't recompile libs without O2. Can you suggest on the list of submodules that O2 has to be disabled ?
Problem with armhf env is that it takes more time to finish the compilation.

We can have a call to discuss further. Pls share your email and time zone, will send zoom invite.

@kcudnik
Copy link
Collaborator

kcudnik commented Mar 31, 2021

by default O2 is enabled on all modules, that serialization methods only uses saimeta from sairedis and swsscommon which has it's own dependencies, but since only json.hpp is used, actually no swsscommon should be required at all. and probably to reproduce this, only saiserialize.cpp would be required and json.hpp

im in CET/GMT+1 time zone, you can send invite to my alias at microsoft.com "kcudnik", my available hours are 10am-10pm

@kcudnik
Copy link
Collaborator

kcudnik commented Mar 31, 2021

we spend some time on zoom call, and we find out that when sonic is build using docker the issue is present
but if we manually build swss-common and sairedis the issue is magically gone :( we still have no root cause this, PR swss-common 452 seems totally unrelated to main issue, but yet when PR is removed from sonic, everything magically works

@rajkumar38 one question here, will you be able to test this on different hardware device ? just in case

@rajkumar38
Copy link
Contributor Author

@rajkumar38 one question here, will you be able to test this on different hardware device ? just in case
We have tested 202012 branch (commit 1650777723555a74c7bf9e6a3632b039038f7581) on arm64 and x86_64. No issues.

@kcudnik
Copy link
Collaborator

kcudnik commented Apr 1, 2021

We have tested 202012 branch (commit 1650777723555a74c7bf9e6a3632b039038f7581) on arm64 and x86_64. No issues.

i had in mind test this on second armhf device if you have possible access to 2nd device

@rajkumar38
Copy link
Contributor Author

We have tested 202012 branch (commit 1650777723555a74c7bf9e6a3632b039038f7581) on arm64 and x86_64. No issues.

i had in mind test this on second armhf device if you have possible access to 2nd device

We have seen same crash on different armhf HW.

@kcudnik
Copy link
Collaborator

kcudnik commented Apr 1, 2021

ok, thanks for confirmation

@kcudnik
Copy link
Collaborator

kcudnik commented Apr 2, 2021

I was able to reproduce this issue locally on our lab device:

   92  docker cp swss:./usr/lib/arm-linux-gnueabihf/libswsscommon.so.0.0.0 .
   97  docker cp swss:./usr/lib/arm-linux-gnueabihf/libsaimeta.so.0.0.0 .
   98  docker cp swss:./usr/lib/arm-linux-gnueabihf/libsaimetadata.so.0.0.0 .
  109  docker cp swss:./usr/lib/arm-linux-gnueabihf/libsaimetadata.so.0 .
  110  docker cp swss:./usr/lib/arm-linux-gnueabihf/libsaimeta.so.0 .
  111  docker cp swss:./usr/lib/arm-linux-gnueabihf/libswsscommon.so.0 .
  118  history |grep cp
root@str2-7215-acs-1 /home/admin # make
g++ -Wall -O2 -ansi -fPIC -std=c++11 main.cpp -o main \
        -I sonic-sairedis/meta/ \
        -I sonic-sairedis/SAI/inc/ \
        libswsscommon.so.0.0.0 \
        libsaimeta.so.0.0.0 \
        libsaimetadata.so.0.0.0
root@str2-7215-acs-1 /home/admin # ./main
terminate called after throwing an instance of 'std::domain_error'
  what():  cannot use push_back() with object
Aborted (core dumped)
root@str2-7215-acs-1 /home/admin #

but im' not sure why we don't see any issues with that on our side, if we run orchagent, maybe those available resources are not queried properly, or tests are wrong, no clue. also difference i noticed, that there is a lot of less libraries linked to binary

/home/admin # ldd main
        linux-vdso.so.1 (0xbef89000)
        libswsscommon.so.0 => ./libswsscommon.so.0 (0xb6f1c000)
        libsaimeta.so.0 => ./libsaimeta.so.0 (0xb6dbb000)
        libsaimetadata.so.0 => ./libsaimetadata.so.0 (0xb6d5c000)
        libstdc++.so.6 => /lib/arm-linux-gnueabihf/libstdc++.so.6 (0xb6c51000)
        libm.so.6 => /lib/arm-linux-gnueabihf/libm.so.6 (0xb6bd6000)
        libgcc_s.so.1 => /lib/arm-linux-gnueabihf/libgcc_s.so.1 (0xb6bad000)
        libc.so.6 => /lib/arm-linux-gnueabihf/libc.so.6 (0xb6ab3000)
        libpthread.so.0 => /lib/arm-linux-gnueabihf/libpthread.so.0 (0xb6a8e000)
        libhiredis.so.0.14 => /lib/arm-linux-gnueabihf/libhiredis.so.0.14 (0xb6a74000)
        libnl-genl-3.so.200 => /lib/arm-linux-gnueabihf/libnl-genl-3.so.200 (0xb6a60000)
        libnl-nf-3.so.200 => /lib/arm-linux-gnueabihf/libnl-nf-3.so.200 (0xb6a41000)
        libnl-route-3.so.200 => /lib/arm-linux-gnueabihf/libnl-route-3.so.200 (0xb69de000)
        libnl-3.so.200 => /lib/arm-linux-gnueabihf/libnl-3.so.200 (0xb69b8000)
        /lib/ld-linux-armhf.so.3 (0xb6fa1000)

and that's it, and on your platform this list was like 2x size, but i guess that's not matter now since we have a repro,

 # gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/arm-linux-gnueabihf/8/lto-wrapper
Target: arm-linux-gnueabihf
Configured with: ../src/configure -v --with-pkgversion='Debian 8.3.0-6' ...Thread model: posix
gcc version 8.3.0 (Debian 8.3.0-6)

@kcudnik
Copy link
Collaborator

kcudnik commented Apr 2, 2021

Last night i spent couple hours after midnight to figure this out, and i still have no root cause, but issue is definitely related to armhf g++/linker and linux elf dynamic linker and loader

on our lab device:

$ show ver
Platform: armhf-nokia_ixs7215_52x-r0
HwSKU: Nokia-7215
ASIC: marvell

$ cat /proc/cpuinfo
model name      : ARMv7 Processor rev 1 (v7l)
model name      : ARMv7 Processor rev 1 (v7l)
Hardware        : Marvell Armada 380/385 (Device Tree)

libs looks like this: and are copied directly from swss docker as in previous post

$ ls -l
total 2.1M
lrwxrwxrwx 1 root  root    23 Apr  1 21:20 libsaimetadata.so.0 -> libsaimetadata.so.0.0.0
-rw-r--r-- 1 root  root  316K Apr  1 20:18 libsaimetadata.so.0.0.0
lrwxrwxrwx 1 root  root    19 Apr  1 21:20 libsaimeta.so.0 -> libsaimeta.so.0.0.0
-rw-r--r-- 1 root  root  1.4M Apr  1 20:18 libsaimeta.so.0.0.0
lrwxrwxrwx 1 root  root    22 Apr  1 21:20 libswsscommon.so.0 -> libswsscommon.so.0.0.0
-rw-r--r-- 1 root  root  397K Apr  1 20:18 libswsscommon.so.0.0.0

Makefile

all: main

main: main.cpp Makefile
        g++ main.cpp -o main -I /home/admin/sonic-sairedis/SAI/inc/ \
                libswsscommon.so.0 libsaimeta.so.0 libsaimetadata.so.0

.PHONY: clean

clean:
        rm -f main

here is main.cpp used for test:

#include "saitypes.h"
#include <string>

extern std::string sai_serialize_acl_resource_list(
        _In_ const sai_acl_resource_list_t& aclresource,
        _In_ bool countOnly);

int main()
{
    sai_acl_resource_t res;
    sai_acl_resource_list_t list = { .count = 1, .list = &res };
    sai_serialize_acl_resource_list(list, false);
}

compilation and execution

$ export LD_LIBRARY_PATH=.
$ make && ./main
g++ main.cpp -o main -I /home/admin/sonic-sairedis/SAI/inc/ \
        libswsscommon.so.0 libsaimeta.so.0 libsaimetadata.so.0
terminate called after throwing an instance of 'std::domain_error'
  what():  cannot use push_back() with object
Aborted (core dumped)

that was basic scenario and it's reproducible 100% of the time issue is observed on push_back() method in json.hpp file
it also happens on other serialization methods where json::array() is used like sai_serialize_qos_map_list(), fdb_notification, many places in sonic-sairedis/meta/saiserialize.cpp

playing with only libswsscommon.so.0.0.0

I build setup gcc, libnl (libnl-dev packages will be needed) and other libraries on device itself to test everything on device.

configuration on sonic-swsscommon: (on commit f01fede8500a499ce94f3573f0bb6c3c5991efe7)

$ ./autogen.sh
$ CXXFLAGS="-I/home/admin/include -O2" LDFLAGS="-L/home/adminc/lib" ./configure  --prefix=/home/admin

this will produce common/.libs/libswsscommon.so.0.0.0, if we copy that to our test dir and recompile main, the error will still be present, this commit contains line in dbconnector.cpp: (btw this code is not used in serialization methods)

void DBConnector::hmset(const std::unordered_map<std::string, std::vector<std::pair<std::string, std::string>>>& multiHash)
{
    SWSS_LOG_ENTER();

    // make sure this will be object (not null) when multi hash is empty
    json j = json::object();

if we change that that to:

void DBConnector::hmset(const std::unordered_map<std::string, std::vector<std::pair<std::string, std::string>>>& multiHash)
{
    SWSS_LOG_ENTER();

    // make sure this will be object (not null) when multi hash is empty
    json j ; // = json::object();

then when we recompile swss-common and copy again common/.libs/libswsscommon.so.0.0.0 to our test directory then error is magically gone!, this suggests that there could be some issues with linking/dynamic loading of libraries and some symbols issues

so now we have 2 swsscommon libraries, which one is causing error, and second not. libsaimetadata.so is untouched so what's the difference? i tried to do a "nm -D" and "radelf -a" to compare between good and bad lib

so if we do ** diff -U0 good bad** for command "nm -DC"

--- good        2021-04-02 11:37:11.154726016 +0000
+++ bad 2021-04-02 11:37:15.954742416 +0000
@@ -600,0 +601 @@
+W nlohmann::basic_json<std::map, std::vector, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long long, unsigned long long, double, std::allocator>::basic_json(std::initializer_list<nlohmann::basic_json<std::map, std::vector, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long long, unsigned long long, double, std::allocator> >, bool, nlohmann::basic_json<std::map, std::vector, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long long, unsigned long long, double, std::allocator>::value_t)
@@ -601,0 +603 @@
+W nlohmann::basic_json<std::map, std::vector, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long long, unsigned long long, double, std::allocator>::basic_json(std::initializer_list<nlohmann::basic_json<std::map, std::vector, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long long, unsigned long long, double, std::allocator> >, bool, nlohmann::basic_json<std::map, std::vector, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long long, unsigned long long, double, std::allocator>::value_t)

in bad, there are 2 extra symbols - which are actually the same. "readelf -a" contaisn like 6k differnces, and addresses are differnt, so there is no good way to quickly compare this. i will try different approach:

i stripped entire code from swss-common library that is not used by libsaimetadata.so, (only Logger and tokenizer are used) i also removed all references to json.hpp, and created own swss.cpp which contains minimal interface needed for libsaimetadata.so:

swss.cpp

#include <string>
#include <vector>

namespace swss {

    class Logger
    {
        public:
            enum Priority { SWSS_EMERG, };

            void write(Priority prio, const char *fmt, ...);
            void wthrow(Priority prio, const char *fmt, ...);

            static Logger &getInstance();

            class ScopeLogger
            {
                public:
                    ScopeLogger(int line, const char *fun);
                    ~ScopeLogger();
            };

            class ScopeTimer
            {
                public:
                    ScopeTimer(int line, const char* fun, const char *msg, ...);
                    ~ScopeTimer();
            };
    };

    Logger &Logger::getInstance() { static Logger log; return log; }
    Logger::ScopeLogger::ScopeLogger(int line, const char *fun){}
    Logger::ScopeLogger::~ScopeLogger(){}

    Logger::ScopeTimer::ScopeTimer(int line, const char* fun, const char *msg, ...){}
    Logger::ScopeTimer::~ScopeTimer(){}

    void Logger::write(Priority prio, const char *fmt, ...){}
    void Logger::wthrow(Priority prio, const char *fmt, ...){}

    std::vector<std::string> tokenize(const std::string &str, const char token) { throw; }
}

Makefile

all: main

main: main.cpp Makefile libswsscommon.so.0.0.0
        g++ main.cpp -o main -I /home/admin/sonic-sairedis/SAI/inc/ \
                libswsscommon.so.0 libsaimeta.so.0 libsaimetadata.so.0

libswsscommon.so.0.0.0: swss.cpp
        g++ -shared -fPIC swss.cpp -o libswsscommon.so.0.0.0

.PHONY: clean

clean:
        rm -f main libswsscommon.so.0.0.0

ok, at this point we have original libsaimeta.so libsaimetadata.so and minimalistic libswsscommon.so, which compiles, with main.cpp and everything works, serialization works in main at this point.

at this point we are introducing new library "foo.so" from foo.cpp

foo.cpp

#include "json.hpp"
using json = nlohmann::json;
void bar() { json j = json::object(); }
void foo() { json j = json::array(); }

and compile like this:
Makefile

all: main

main: main.cpp Makefile libswsscommon.so.0.0.0 foo.so
        g++ main.cpp -o main -I /home/admin/sonic-sairedis/SAI/inc/ \
                foo.so libswsscommon.so.0 libsaimeta.so.0 libsaimetadata.so.0

libswsscommon.so.0.0.0: swss.cpp
        g++ -shared -fPIC swss.cpp -o libswsscommon.so.0.0.0

foo.so: foo.cpp Makefile
        g++ -shared -O2 -ansi -fPIC -std=c++11 -Wno-psabi foo.cpp -o foo.so

.PHONY: clean

clean:
        rm -f main libswsscommon.so.0.0.0 foo.so

and error strikes again when we add totally unrelated library foo.so, which is just linked, error comes back !
at this point im really confused what's going on :/

and here is fun part

  • if we have function foo and bar present in foo.cpp, error exists
  • if we remove ether bar or ether foo function, error is gone !
  • if we remove "-O2" flag from foo.cpp compile, error exists in combinations foo | bar | foo+bar, magic ! (this is really warring me that removing O2 always causes issue to happen)
  • if we replace json::object() with json::parse("{}") error is gone

i think we could treat this json::parse("{}") as temporary/workaround solution to replace code in sonic-net/sonic-swss-common#452 but this is not fixing root cause, i think still problem is somewhere on compiler/linker/loader level and may or may not be mitigated with some specific compiler flags, but no guarantee for that.

playing with libsaimetadata.so

if i compile manually libsaimetadata.so or even if i use this:

DEB_BUILD_OPTIONS=nocheck SWSS_COMMON_INC="/home/admin/include" SWSS_COMMON_LIB="/home/admin/lib" fakeroot debian/rules binary-syncd-vs

and get library from there directly, and copy libsaimetadata.so.0.0.0 and keep original libsaimeta and libswsscommon, error goes away, and i don't know why :/ i also noticed that when during fake root build, some extra parameters are pushed to compiler like:

 -fdebug-prefix-map=/sonic/src/sonic-sairedis=. -fstack-protector-strong

especially that stack protector is causing to produce extra code, but even with this, error is gone when compiled locally

so far i could not make scenario that i locally compile libsaimetadata.so and i will still have error present, i will try build using sonic-buildimage, and see if this help to bring back error, and then try to narrow down code from build docker, i tried to get artifacts from azure pipeline like from here: https://dev.azure.com/mssonic/build/_build/results?buildId=7329&view=artifacts&pathAsName=false&type=publishedArtifacts and if i take those debs from there (just libsaimeta, libsaimetadata and libswsscommon) error still exists, and this shows that there are still some differences when building docker and building locally

final thoughts

i also noticed that when i compare libsaimetadata.so compiled locally (no error) vs compiled by azure pipeline (error exists) some symbols are named differently some have [c++11::abi] inside them, but most parts is the same, still i think there is some compatibility issue

but adding that extra foo.so library clearly shows that this is some linker/loader related and narrowing this down would require to have locally build libsaimetadata that also causes this error to show up, and then cutting json.hpp until we get smallest example what's going on, it also could be hard to achieve json.hpp is about 8k lines, and even if we would be able to make 10 lines class that could reproduce that, it still could be cross-compiler issue

unfortunately im not familiar with armhf assembly to figure out the differences between good and bad code and figure out wha't happening there :/

notice also that json.hpp as only file, produces "note" info from compiler (that's why added -Wno-psabi), there were errors in gcc5 that introduces some api interface errors, like functions could have different parameter size because of this, but this is fixed already since then,
all together it could be cross compiler issue that nobody hit so far, and i also think that even if in some scenarios we actually get this to work, by for example using parse() workaround instead of json::objects(), this is like walking on ice, this times works, because maybe data in memory layout was ok, but then some code could be added to swsscommon, totally not related, but could cause compiler and linker to do different stuff and rearrange some symbols and we could get to square 1 again

@kcudnik
Copy link
Collaborator

kcudnik commented Apr 3, 2021

Thanks @lguohan and @yxieca to jump in internal discussion. Hear some more clues:

Guohan asked:

When you compile, do you compile on native armhf environment? Or you are using cross-compiling?

Good observation, I was natively compiled on device, but docker is cross-compiling and maybe that’s the key. This is next on my list to build that in docker and do similar code stripping to find if that could be the case.
But also notice, that this error shows up on armhf at the first place where all of the libs are cross-compiled in docker, but if I compile them all locally, no issue. This would suggest that cross compiling gcc is causing some problems, and it could be possible bug in cross compiler itself.

Ying suggested that this could be related to some symbols incompatibility with the same symbol being located in 2 dynamic libraries and proposed to swap order or *.so libraryies

In our case, even if that would be the case, all symbols should be exactly the same, since entire sonic is build on the same docker with the same version of gcc cross compiler for armhf.

@ying Xie, thanks for analysis, I'm aware of that symbol strip, and there is even a special flag for linker that uses local symbols if symbol is multiple defined in several libraries “-Bsymbolic”, but I does not make effect if I compile foo.cpp with it.

As for changing order of the libraries, I just tried that, and if i put foo.so with (foo and bar functions) the error goes way (yey) even if I remove -O2 from foo compilation. And original issue goes away if we swap order of libswsscommon and libsaimetadata, so great you are right here 😊

But I think this I still not root causing the problem, and changing order not always could be possible due to other dependencies that can be required. I will ask marvell to try this order change on OA, but libswsscommon and libsaimetadata are also in many different places in siredislib at this order like saiplayer.

Attempt to workaround: sonic-net/sonic-swss#1689

As for json::parse("{}") it can still generate json::object inside parse function when expanded but then it could be marked by compiler as internal object, and removed/hidded when -O2 is present also it’s about 3x slower (just measured) than json::object().

What I also observed that even foo.cpp is compiled with foo and bar methods, and -O2, there are still some weak json symbols exported even thou json.hpp is self-contained library and everything in foo.cpp is used internally, if I mark foo and bar as static method, no other json symbols are exported and error goes away, but when skip O2 param, error goes back, since json symbols will be showed up (probably error caused by symbols).

My conclusion here is that there is some issue with symbols that are generated via cross compilation. Question here is, whether we want to dig more into that if lib swap will be good workaround for that.

@rajkumar38 is possible, please cherry pick sonic-net/sonic-swss#1689 and test on your side if this will be possible workaround fix, and if it will move forward a little bit :P

@kcudnik
Copy link
Collaborator

kcudnik commented Apr 3, 2021

One more workaround i found is to make all json.hpp namespace hidden

namespace nlohmann __attribute__ ((visibility ("hidden")))

this will also work with/without -O2 param

Potential fix: sonic-net/sonic-swss-common#471

@rajkumar38 you could also try to cherrypick this one too maybe independently even

Edit: visibility attribute causes compile issue on swss, since GearParserBase exposes json to public

@lguohan
Copy link
Contributor

lguohan commented Apr 3, 2021

@rajkumar38 , can you share the root cause analysis, we need to make sure we fully understand the root cause.

@kcudnik
Copy link
Collaborator

kcudnik commented Apr 3, 2021

There is possible another workaround to add gcc flag

-fvisibility-inlines-hidden

where json.hpp is used, so swss-common, swss and sairedis

@kcudnik
Copy link
Collaborator

kcudnik commented Apr 4, 2021

I tried cross compile but on some older gcc version

 $ arm-linux-gnueabihf-g++ -v
COLLECT_GCC=arm-linux-gnueabihf-g++
COLLECT_LTO_WRAPPER=/usr/lib/gcc-cross/arm-linux-gnueabihf/5/lto-wrapper
Target: arm-linux-gnueabihf
gcc version 5.4.0 20160609 (Ubuntu/Linaro 5.4.0-6ubuntu1~16.04.9)

with swss-common, sairedis and swss, with

 ./configure --host=arm-linux-gnueabihf

everything compiles fine, and when i copy compiled libs and binaries to target device, there is no error, i will need to build entire sonic image locally and keep build docker, and then try to narrow this down

@rajkumar38
Copy link
Contributor Author

@kcudnik
I cherry-picked changes from below PRs (in different iteration) and no crash seen,

sonic-net/sonic-swss#1689 (recompiled and installed swss_1.0.0_armhf.deb)
sonic-net/sonic-swss-common#471 (recompiled and installed libswsscommon_1.0.0_armhf.deb, libsaimetadata_1.0.0_armhf.deb, libsairedis_1.0.0_armhf.deb)

I also tried with sample main.cpp with above PRs and no exception seen.

@kcudnik
Copy link
Collaborator

kcudnik commented Apr 5, 2021

Great!, so each of the PR's worked separetly ?

@rajkumar38
Copy link
Contributor Author

Great!, so each of the PR's worked separetly ?

Yes. I tried separately. It works.

@kcudnik
Copy link
Collaborator

kcudnik commented Apr 5, 2021

Thanks for confirmation! That's good, 1689 is already merged, so we could proceed with that and update sonic-buildimage, just to make things going and possibly hit otter issues on armhf XD. We still probably need to root cause this exactly, i will start playing with docker tomorrow if i will post my progress. Please also note that this is not the only issue I'm working on :)

@kcudnik
Copy link
Collaborator

kcudnik commented Apr 7, 2021

Getting back to "foo.cpp" example with foo and bar functions, when we compile that example with -O2 flag, issue still exists, but there is only 11 exported symbols related to json, and it was easy to check which symbol is causing problem, and if we rename this one in compiled foo.so:

_ZN8nlohmann10basic_jsonISt3mapSt6vectorNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEbxydSaEC1ESt16initializer_listIS9_EbNS9_7value_tE

Nlohmann::basic_json<std::map, std::vector, std::__cxx11::basic_string<char, std::char_traits, std::allocator >, bool, long long, unsigned long long, double, std::allocator>::basic_json(std::initializer_list<Nlohmann::basic_json<std::map, std::vector, std::__cxx11::basic_string<char, std::char_traits, std::allocator >, bool, long long, unsigned long long, double, std::allocator> >, bool, Nlohmann::basic_json<std::map, std::vector, std::__cxx11::basic_string<char, std::char_traits, std::allocator >, bool, long long, unsigned long long, double, std::allocator>::value_t)

then issue goes away, same symbol also exists in libsaimetadata.so, but since it can't be found in foo.so it uses local version of it and "everything" works, i mean there is no crash with this particular example, in full code there still could be other symbols that cause similar issues

playing with single symbols like that would be bad idea, so it's better to go with -fvisibility-inlines-hidden flag

@kcudnik
Copy link
Collaborator

kcudnik commented Apr 8, 2021

ok, i finally found root cause for this issue, i build locally sonic image and debs, and i noticed that by default there are 4 docker images created, 2 for streach and 2 for buster:

$ docker images
REPOSITORY                          TAG                  IMAGE ID            CREATED             SIZE
sonic-slave-buster-armhf-kcudnik    cad7b0a3ed5          486bd506e1d7        11 hours ago        4.76GB
sonic-slave-buster-armhf            a2bd5155010          c7bd27943fee        11 hours ago        4.76GB
sonic-slave-stretch-armhf-kcudnik   927221c9c49          27aaf34c38d0        16 hours ago        4.61GB
sonic-slave-stretch-armhf           d0fe5358ae3          4bf49914b0b6        16 hours ago        4.61GB
multiarch/qemu-user-static          register             ed40092460e6        3 months ago        1.26MB
multiarch/debian-debootstrap        armhf-buster         c429e4647245        6 months ago        101MB
multiarch/debian-debootstrap        armhf-stretch        0a2710137eae        6 months ago        98MB
multiarch/qemu-user-static          x86_64-arm-5.0.0-2   13ea962d6a81        10 months ago       3.92MB

sonic-buildimage makefiles are using those to build deb packages, and you can select whether those are build on stretch or buster (ex using BLDENV or NOSTRECH=1)

I used those 2 images, to locally build libswsscommon and libsaimetadata, but in both cases i didn't see the issue, and i started to think that, this probably is not compiler bug, but rather something else. What i also noticed that gcc version is different for them:

stretch: gcc version 6.3.0 20170516 (Debian 6.3.0-18+deb9u1)
buster: gcc version 8.3.0 (Debian 8.3.0-6)

and if we combine this with gcc NOTE message when compiling (on stretch) swsscommon which includes json.hpp (or libsaimetadata saiserialize.cpp):

# gcc -v
gcc version 6.3.0 20170516 (Debian 6.3.0-18+deb9u1)
...
json.hpp: In member function 'void swss::DBConnector::del(const std::vector<std::__cxx11::basic_string<char> >&)':
json.hpp:1615:54: note: parameter passing for argument of type 'std::initializer_list<nlohmann::basic_json<> >' will change in GCC 7.1
         return basic_json(init, false, value_t::array);
                                                      ^
json.hpp: In member function 'void swss::DBConnector::hmset(const std::unordered_map<std::__cxx11::basic_string<char>, std::vector<std::pair<std::__cxx11::basic_string<char>, std::__cxx11::basic_string<char> > > >&)':
json.hpp:1655:55: note: parameter passing for argument of type 'std::initializer_list<nlohmann::basic_json<> >' will change in GCC 7.1
         return basic_json(init, false, value_t::object);
                                                       ^

and note here: will change in GCC 7.1 - notice WILL

and when compiling on buster:

#gcc -v
gcc version 8.3.0 (Debian 8.3.0-6)
...
json.hpp: In member function 'void swss::DBConnector::del(const std::vector<std::__cxx11::basic_string<char> >&)':
json.hpp:1615:54: note: parameter passing for argument of type 'std::initializer_list<nlohmann::basic_json<> >' changed in GCC 7.1
         return basic_json(init, false, value_t::array);
                                                      ^
json.hpp: In member function 'void swss::DBConnector::hmset(const std::unordered_map<std::__cxx11::basic_string<char>, std::vector<std::pair<std::__cxx11::basic_string<char>, std::__cxx11::basic_string<char> > > >&)':
json.hpp:1655:55: note: parameter passing for argument of type 'std::initializer_list<nlohmann::basic_json<> >' changed in GCC 7.1
         return basic_json(init, false, value_t::object);

and note here: changed in GCC 7.1 - notice CHANGED

And then it hit me, the problem is that somehow libswsscommon is compiled on stretch, and libsaimetdata on buster, even thou buster was selected! gcc was informing us all along, that there will be changes in passing parameter, and if that's true then gcc v6 is NOT compattible with v8 for this particular function, which happens to be json constructor.

Now for confirmation, i took artifacts from here: https://dev.azure.com/mssonic/build/_build/results?buildId=7329&view=artifacts&pathAsName=false&type=publishedArtifacts from sonic-buildimage.marvell-armhf/target/debs/buster

-rw-r--r-- 1 admin admin  360688 Apr  8 13:14 libsaimetadata_1.0.0_armhf.deb
-rw-r--r-- 1 admin admin 1949016 Apr  8 13:14 libsaimetadata-dbg_1.0.0_armhf.deb
-rw-r--r-- 1 admin admin  175452 Apr  8 13:14 libswsscommon_1.0.0_armhf.deb
-rw-r--r-- 1 admin admin 4699612 Apr  8 13:14 libswsscommon-dbg_1.0.0_armhf.deb

since release packages were stripped from all debug info, we need to to use debug symbols to find out compiler:

/home/admin/buster_artifacts #  find -name "*debug*" -type f
./libsaimetadata-dbg_1.0.0_armhf.deb.1/usr/lib/debug/.build-id/e1/5a0647d66ae641b06fc4fdc7b1436c3c6b03d1.debug
./libsaimetadata-dbg_1.0.0_armhf.deb.1/usr/lib/debug/.build-id/5b/e28c0fbff659ff82c6e8a7810a7799477dec59.debug
./libswsscommon-dbg_1.0.0_armhf.deb.1/usr/lib/debug/.build-id/49/c12580e57471e20a86d402ff449b1c9844f2df.debug
./libswsscommon-dbg_1.0.0_armhf.deb.1/usr/lib/debug/.build-id/22/9ab30bdc7faecf22a74292e7f5988a6571c660.debug

/home/admin/buster_artifacts # find -name "*debug*" -type f|xargs strings|grep -i GCC:
GCC: (Debian 6.3.0-18+deb9u1) 6.3.0 20170516
YGCC: (Debian 8.3.0-6) 8.3.0
GCC: (Debian 6.3.0-18+deb9u1) 6.3.0 20170516
GCC: (Debian 8.3.0-6) 8.3.0
`GCC: (Debian 8.3.0-6) 8.3.0

Bingo!, 2 different versions for gcc, so json constructor symbols are incompatible between libswsscommon and libsaimetadatta, and hence we are getting exception on serialization, which is actually lucky that we get exception instead of crash.

now for the good part: why libswsscommon is compiled with different version of gcc? i have no idea :D

I will need help from someone who designed makefiles for sonic-buildimage to find why stretch/gccv6 is used, could be azure pipeline artifact selection bad? - rather no, since locally build also have this issue

@kcudnik
Copy link
Collaborator

kcudnik commented Apr 8, 2021

some more insights from azure pipeline artifacts from libsaimetadata-dbg_1.0.0_armhf.deb:

// which corresponds to libsaimetadata.so
$ strings ./b/usr/lib/debug/.build-id/5b/e28c0fbff659ff82c6e8a7810a7799477dec59.debug|grep GCC: 
YGCC: (Debian 8.3.0-6) 8.3.0
GCC: (Debian 6.3.0-18+deb9u1) 6.3.0 20170516

// which corresponds to libsaimeta.so
$ strings ./b/usr/lib/debug/.build-id/e1/5a0647d66ae641b06fc4fdc7b1436c3c6b03d1.debug|grep GCC:
GCC: (Debian 6.3.0-18+deb9u1) 6.3.0 20170516

this is wired, since first one contains both gcc6 and gcc8 versions, but in that docker which builds libsaimetadata and libsaimeta there is no gccv6 installed

those are steps which i did locally on latest ubuntu 20.04.1-Ubuntu (+ installing dockers etc from apt) "SONIC_IMAGE_VERSION" : "master.0-0b16ca4a" clone commit: sonic-net/sonic-buildimage@0b16ca4

$ sudo apt install docker docker.io gcc make perl git python python3-pip python-pip-whl
$ git clone https://github.com/Azure/sonic-buildimage.git
$ cd sonic-buildimage
$ git submodule update --init --recursive
$ make configure PLATFORM=marvell-armhf PLATFORM_ARCH=armhf
$ make NOSTRETCH=1 target/debs/buster/libswsscommon_1.0.0_armhf.deb
$ make NOSTRETCH=1 target/debs/buster/libsaimetadata_1.0.0_armhf.deb

after this, everything is compiled with gcc version v8.3.0, and there is no issue present when testing with main.cpp

@rajkumar38
Copy link
Contributor Author

rajkumar38 commented Apr 19, 2021

Raised below PR which moves syncd compilation from stretch to buster.
sonic-net/sonic-buildimage#7366

With this libswsscommon,libsaimetadata and libsairedis debs will be compiled only for buster with gccv8x.

marvell@cpss-build4:~/master/sonic-buildimage/target/debs/buster$ mkdir tmp_dir
marvell@cpss-build4:~/master/sonic-buildimage/target/debs/buster$ dpkg-deb -x libswsscommon-dbg_1.0.0_armhf.deb tmp_dir/
marvell@cpss-build4:~/master/sonic-buildimage/target/debs/buster$ dpkg-deb -x libsaimetadata-dbg_1.0.0_armhf.deb tmp_dir/
marvell@cpss-build4:~/master/sonic-buildimage/target/debs/buster$ dpkg-deb -x libsairedis-dbg_1.0.0_armhf.deb tmp_dir/
marvell@cpss-build4:~/master/sonic-buildimage/target/debs/buster$ cd tmp_dir/
marvell@cpss-build4:~/master/sonic-buildimage/target/debs/buster/tmp_dir$ find -name "*debug*" -type f|xargs strings|grep -i GCC:
GCC: (Debian 8.3.0-6) 8.3.0
^GCC: (Debian 8.3.0-6) 8.3.0
`GCC: (Debian 8.3.0-6) 8.3.0
GCC: (Debian 8.3.0-6) 8.3.0
GCC: (Debian 8.3.0-6) 8.3.0

This PR fixes the orchagent json issue.

@kcudnik
Copy link
Collaborator

kcudnik commented Apr 19, 2021

what if someone would like to build it still on stretch ?

@radha-danda
Copy link

what if someone would like to build it still on stretch ?

@kamil, as I understand community is moving towards buster. Is there any reason why do we need to support stretch?
Please clarify

@kcudnik
Copy link
Collaborator

kcudnik commented Apr 20, 2021

not all vendors moved to buster, and they still use stretch, we will be encouraging all vendors to move to buster

@radha-danda
Copy link

not all vendors moved to buster, and they still use stretch, we will be encouraging all vendors to move to buster

We had already moved to Buster on armhf platform, so no need to support stretch anymore.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants