201911 #205

Open
wants to merge 806 commits into
base: 201911
Conversation

bbinxie
Collaborator

@bbinxie bbinxie commented Jul 30, 2020

- What I did

- How I did it

- How to verify it

- Description for the changelog

- A picture of a cute animal (not mandatory but encouraged)

abdosi and others added 29 commits April 8, 2021 18:02
See the error below:

+ sudo https_proxy= LANG=C chroot ./fsroot easy_install pip==20.3.3
Searching for pip==20.3.3
Reading https://pypi.python.org/simple/pip/
Couldn't find index page for 'pip' (maybe misspelled?)
Scanning index of all packages (this may take a while)
Reading https://pypi.python.org/simple/
No local packages or working download links found for pip==20.3.3
error: Could not find suitable distribution for Requirement.parse('pip==20.3.3')

How I fixed it:

Install python-pip via apt-get.
Pin the version to 20.3.3.
Master has the same changes.

Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
209b7ddec109587ddeb90071ca23ae6a288b1442 (HEAD -> 201911, origin/201911) Fixed the possibility of using uninitialized variable in route_check.py (#1551)
e30387cbebaaccbf9385059b1e501955c40be338 route_check: Fix hanging & logging level (#1520)
3c8de6950615a4608a80e3d47ea678f8e8487186 Add self timeout and crash if exceeded. (#1502)

Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
  Fix show interface status Ethernet* (#1559)

Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
Fix Bad Merge

Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
…tainers. (#7340)

Signed-off-by: Yong Zhao <yozhao@microsoft.com>

Why I did it
This PR aims to monitor critical processes in router advertiser and dhcp_relay containers by Monit.

How I did it
The router advertiser container runs only on T0 devices, and the T0 device must have at least one VLAN interface
configured with an IPv6 address. In addition, the router advertiser container does not run on devices whose
deployment type is 8.

As such, I created a service which will dynamically generate Monit configuration file of router advertiser from a
template.

Similarly, the Monit configuration file of dhcp_relay is also generated from a template, since the number of dhcrelay processes in the dhcp_relay container depends on the number of VLANs.
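
The conditions above for running (and therefore monitoring) the router advertiser container can be sketched as a small predicate. This is only an illustration, not the code from the PR: the `"ToRRouter"` device type label, the `"VlanN|prefix"` key layout and the function names are assumptions.

```python
import ipaddress


def has_ipv6_vlan_interface(vlan_interface_keys):
    """Return True if any 'VlanN|prefix' key carries an IPv6 prefix.

    The 'name|prefix' key layout is an assumption made for this sketch.
    """
    for key in vlan_interface_keys:
        name, sep, prefix = key.partition("|")
        if not sep or not name.startswith("Vlan"):
            continue  # entry without an address, or not a VLAN interface
        try:
            if ipaddress.ip_interface(prefix).version == 6:
                return True
        except ValueError:
            continue  # malformed prefix; ignore it
    return False


def should_monitor_radv(device_type, deployment_id, vlan_interface_keys):
    """Combine the three conditions stated in the description."""
    return (
        device_type == "ToRRouter"      # assumed label for a T0 device
        and str(deployment_id) != "8"   # deployment type 8 is excluded
        and has_ipv6_vlan_interface(vlan_interface_keys)
    )
```

A service that generates the Monit configuration could call such a predicate first and write the router advertiser section only when it returns `True`.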

How to verify it
I verified this implementation on a DuT.
…o poll mode (#7334)

#### Why I did it

- An xcvrd crash was seen in the latest 201811 images.
- For Dell S6100, API 2.0 uses poll mode while API 1.0 was still using interrupt mode.

#### How I did it

- Modified get_transceiver_change_event in API 1.0 to use poll mode in all the related branches.

Backport of #7309 to the 201911 branch
…dhcp_relay (#7378)

#### Why I did it
Since there will be multiple `dhcrelay` processes if the `VLAN_INTERFACE` table of `CONFIG_DB` contains different VLANs,
we should use a unique service name for each `dhcrelay` process in the Monit configuration file. Otherwise, the Monit service will fail to work.

#### How I did it
I append the VLAN name to the end of each service name such that they are unique.
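
As a rough sketch of the naming scheme, the function below emits one Monit `check program` entry per VLAN and appends the VLAN name to the service name. The service name prefix, the checker command line and the function name are hypothetical and are not taken from the actual template.

```python
def render_dhcp_relay_monit(vlan_names):
    """Build one Monit-style check per VLAN, each with a unique service name."""
    blocks = []
    for vlan in vlan_names:
        # Appending the VLAN name keeps every service name unique.
        service_name = "dhcp_relay|dhcrelay_{}".format(vlan)
        # Hypothetical checker command; the real template may differ.
        command = "/usr/bin/process_checker dhcp_relay dhcrelay {}".format(vlan)
        blocks.append(
            'check program {name} with path "{cmd}"\n'
            "    if status != 0 then alert".format(name=service_name, cmd=command)
        )
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    print(render_dhcp_relay_monit(["Vlan1000", "Vlan2000"]))
```

Without the suffix, both entries would share one service name, which is the situation the description says makes Monit fail.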

Signed-off-by: Yong Zhao <yozhao@microsoft.com>
a364614 2021-04-22 | [201911][acl] Use a list instead of a comma-separated string for ACL port list (#1576) [Danny Allen]
391e524 2021-04-15 | [201911] Fix Multi-ASIC show specific resursive route (#1563) [gechiang]
[techsupport] Update show ip interface command (#1562)

Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
The issue is that get-pip.py has moved to pip 21.1 (https://github.com/pypa/get-pip/commits/main), which is not compatible with Python 3.6.
The issue in pip itself is fixed in 21.1.1 by the pip community (pypa/pip#9835).
However, get-pip.py has still not been updated to the latest pip. Also, get-pip.py does not explicitly support Python 3.6 (pypa/get-pip#88).

Step 15/29 : RUN curl https://bootstrap.pypa.io/get-pip.py | python3.6
 ---> Running in bece31f49267
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 1891k  100 1891k    0     0  9564k      0 --:--:-- --:--:-- --:--:-- 9600k
Traceback (most recent call last):
  File "<stdin>", line 24298, in <module>
  File "<stdin>", line 139, in main
  File "<stdin>", line 115, in bootstrap
  File "<stdin>", line 96, in monkeypatch_for_cert
  File "/tmp/tmp5fnxrz0a/pip.zip/pip/_internal/commands/__init__.py", line 9, in <module>
  File "/tmp/tmp5fnxrz0a/pip.zip/pip/_internal/cli/base_command.py", line 12, in <module>
  File "/tmp/tmp5fnxrz0a/pip.zip/pip/_internal/cli/cmdoptions.py", line 30, in <module>
  File "/tmp/tmp5fnxrz0a/pip.zip/pip/_internal/utils/hashes.py", line 2, in <module>
ImportError: cannot import name 'NoReturn'
The command '/bin/sh -c curl https://bootstrap.pypa.io/get-pip.py | python3.6' returned a non-zero code: 1
How I did it:

Got the file from https://github.com/pypa/get-pip/tree/21.0 and added it to the buildimage.
Pinned pip to the previous release, 21.0.1. (Something similar is done in other public repos, e.g. grpc/grpc-java#8115.)

Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
New features and fixes in the new SDK/FW:

SN4600C | AN/LT support
SN2700 | AN/LT bug fixes
WJH | FID_MISS support

Signed-off-by: Kebo Liu <kebol@nvidia.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>

Why I did it
This PR aims to monitor the critical processes in PMon container by Monit in 201911 branch.

How I did it
I created a Monit configuration template. The service generate_monit_config.service renders it to generate the Monit configuration file
of the PMon container.

How to verify it
I verified this on a Mellanox device str-msn2700-03 and an Arista device str-a7050-acs-1.

Which release branch to backport (provide reason below if selected)
- [ ] 201811
- [x] 201911
- [ ] 202006
- [ ] 202012
- Fix ACL ANY debug counter to correctly track ACL drops
- Add VXLAN source port hard coded range, controlled by K/V

Signed-off-by: Dror Prital <drorp@nvidia.com>
…ly (#7501)

Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
20e1589 [Mellanox] [201911] backport kernel patches for hw-management 7.0100.2303 (#210)
* Set monitoring VLAN hostif up by default (for VNET ping tool)

Signed-off-by: Volodymyr Samotiy <volodymyrs@nvidia.com>
- Update hw-mgmt pointer
- Remove unused patches
- Fix an existing patch to make sure it applies successfully
…ollect process (#7308)

Recently, we found that on some of our testbeds the entropy collection process finishes more than 60 seconds after the system starts.
As a result, swss sporadically fails to start.
Installing haveged accelerates the entropy collection process.

Signed-off-by: Stephen Sun <stephens@nvidia.com>
Enable VXLAN src port range configuration via SAI profile
 [201911]: add show bgp neigh/network support for multi asic (#1587)

Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
Why I did it
Added soft-reboot plugin support.
Added SSD version s16425cq check
Added error message to display in console/SSH in case reboot is called in faulty/non-upgraded devices.
1f249282e8066a5837f2b34478eb4e0f6b4a654c (HEAD -> 201911, origin/201911) [201911] soft-reboot - support ssd_fw_update  (#1518)
30a3cb3c085a7f208a44b58060ba797e4299214a [route_check] Filter out VNET routes (#1582)

Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
dd01491e4d167993b3a80517f737188151443a75 (HEAD -> 201911, origin/201911) [Monitor Vlan] Fix a typo in hostif (#1722)

Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
Add downstreamsubrole parsing to minigraph.py so that downstreamsubrole values can be used for policies. Backport PR, same as #7193
e438b0db6a8912b50f7acddf93d4dc2157f53ecf (HEAD -> 201911, origin/201911) Increase Syncd operation timeout from 1 min to 6 min. (#828)
17974adb369111b44dd56837547806918ed4b1ed Update syncd_flex_counter.cpp (#798)

Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
d898b03e4ec91f964f0e1fcba535ea33a78c838e (HEAD -> 201911, origin/201911) Create mappings using existing tunnel (#1593)

Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
…7536)

#### Why I did it

MSN4700 A1 and A0 use different sensor chips but keep the existing platform name *x86_64-mlnx_msn4700-r0*. This is a workaround to replace the sensor conf on MSN4700 A1/A0.

#### How I did it

Use a shell script to get the sensor conf path and copy that file to /etc/sensors.d/sensors.conf
Signed-off-by: Yong Zhao <yozhao@microsoft.com>

Why I did it
The service file generate_monit_config.service is used to generate the Monit configuration file from a template. This service file also needs to be installed and enabled.

How I did it
I appended this service file name at the end of /etc/sonic/generated_services.conf.

How to verify it
I verified this on the device str2-7260cx3-acs-1.

Which release branch to backport (provide reason below if selected)
- [ ] 201811
- [x] 201911
- [ ] 202006
- [ ] 202012
liuh-80 and others added 30 commits October 31, 2022 10:25
Upgrade systemd to fix timer elapsed issue.

#### Why I did it
On the 201911 release, snmp.timer enters the elapsed state and snmp.service is no longer triggered by snmp.timer:

● snmp.service - SNMP container
Loaded: loaded (/usr/lib/systemd/system/snmp.service; static; vendor preset: enabled)
Active: inactive (dead)
● snmp.timer - Delays snmp container until SONiC has started
Loaded: loaded (/usr/lib/systemd/system/snmp.timer; enabled; vendor preset: enabled)
Active: active (elapsed) since Wed 2022-08-03 18:12:59 UTC; 2 months 17 days ago

This issue is caused by a systemd bug: https://github.com/systemd/systemd/pull/10778/files

This issue can be reproduced with the following steps:
1. Reboot the system.
2. Continuously run the following commands until the timer is elapsed:
systemctl status snmp.timer
sudo systemctl daemon-reload

#### How I did it
Install the latest version of systemd from the official backport source.

#### How to verify it
All test cases pass.
Manually ran the reproduction steps and verified that the issue is fixed.

#### Which release branch to backport (provide reason below if selected)

<!--
- Note we only backport fixes to a release branch, *not* features!
- Please also provide a reason for the backporting below.
- e.g.
- [x] 202006
-->

- [ ] 201811
- [ ] 201911
- [ ] 202006
- [ ] 202012
- [ ] 202106
- [ ] 202111
- [ ] 202205

#### Description for the changelog
Upgrade systemd to fix timer elapsed issue.

#### Ensure to add label/tag for the feature raised. example - PR#2174 under sonic-utilities repo. where, Generic Config and Update feature has been labelled as GCU.

#### Link to config_db schema for YANG module changes
<!--
Provide a link to config_db schema for the table for which YANG model
is defined
Link should point to correct section on https://github.com/Azure/sonic-buildimage/blob/master/src/sonic-yang-models/doc/Configuration.md
-->

#### A picture of a cute animal (not mandatory but encouraged)
…#12158)

* Add Celestica Silverstone-X platform deb dependency files
* Optimized Celestica Silverstone-X platform deb dependency files indentation
Changed the skip check from an equality comparison to a greater-than-or-equal comparison.
Signed-off-by: Prince George <prgeor@microsoft.com>
* Fix to improve hostname handling
If config_db.json is missing the hostname entry, hostname-config.sh ends
up deleting the existing entry too, and the hostname changes to the default 'localhost'

* Default the hostname to 'sonic' if it is missing from the config file
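
The fallback can be illustrated with a short Python sketch. The actual change is in the shell script hostname-config.sh; the function name and the `DEVICE_METADATA` / `localhost` / `hostname` lookup path used here are assumptions made for illustration.

```python
import json

DEFAULT_HOSTNAME = "sonic"


def resolve_hostname(config_text):
    """Return the configured hostname, or 'sonic' when the entry is missing."""
    try:
        config = json.loads(config_text)
    except ValueError:
        config = {}  # unreadable config: fall through to the default
    if not isinstance(config, dict):
        config = {}
    # Assumed lookup path: DEVICE_METADATA -> localhost -> hostname
    metadata = config.get("DEVICE_METADATA", {}).get("localhost", {})
    hostname = metadata.get("hostname")
    return hostname if hostname else DEFAULT_HOSTNAME


if __name__ == "__main__":
    print(resolve_hostname('{"DEVICE_METADATA": {"localhost": {}}}'))
```

The point of the design is that a missing entry yields a defined name instead of wiping the existing one.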
#### Why I did it

The GPG key used for Jessie's official repos has since expired, which means building 201911 images no longer works.

#### How I did it

Fake the time to be before the expiry date.
* Create Vxlan and Vnet default configs
- Why I did it
Added BIOS upgrade infra

- How I did it
Added new make target

- How to verify it
Copy msn3800_bios.tar.gz to platform/mellanox/bios
make configure PLATFORM=mellanox
make target/files/stretch/msn3800_bios.tar.gz

Signed-off-by: Nazarii Hnydyn <nazariig@nvidia.com>
#### Why I did it

To allow SSH connections from IPv6 addresses

Resolves #7668

#### How I did it

In build_debian.sh, modify sshd_config file so as to enable listening for IPv6 connections
* Fix a typo introduced as part of #13403
Why I did it
docker.com's GPG key is valid starting from 2023-02-23, while debian.org's GPG key expired in 2022-11.
We used a workaround to pass the security check for the Debian GPG keys. Now we need to exclude docker.com's GPG key from that workaround.

How I did it
Update docker.com's gpg key without faketime.
Update others' gpg key with faketime '2022-11'

How to verify it
Change to use the snapshot mirror http://packages.trafficmanager.net/snapshot.

Warning: The Jessie distribution is EOL; please avoid using it if you can. The snapshot mirror will also be removed in the near future.
Why I did it
Some products might experience an occasional I/O failure in the communication between the CPU and the SSD.
Based on some research, it could be attributable to some devices not handling ATA NCQ (Native Command Queuing) properly.

This issue currently affects the following products:

DCS-7170-32C*
DCS-7170-64C
DCS-7060DX4-32
DCS-7260CX3-64
DCS-7050CX3-32S

How I did it
This change disables NCQ on the affected drive for a small set of products.

How to verify it
When the fix is applied, these 2 patterns can be found in the dmesg.
ata[0-9]+.00: FORCE: horkage modified (noncq)
NCQ (not used)
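
A minimal sketch of how the two patterns could be checked programmatically, assuming the dmesg output has already been captured as text. The function name is hypothetical, and the dot in `.00` is escaped here so that it matches literally.

```python
import re

# The two patterns quoted above.
FORCE_PATTERN = re.compile(r"ata[0-9]+\.00: FORCE: horkage modified \(noncq\)")
NCQ_UNUSED_PATTERN = re.compile(r"NCQ \(not used\)")


def ncq_disabled(dmesg_text):
    """Return True only if both patterns appear somewhere in the dmesg text."""
    return (
        FORCE_PATTERN.search(dmesg_text) is not None
        and NCQ_UNUSED_PATTERN.search(dmesg_text) is not None
    )


if __name__ == "__main__":
    sample = (
        "ata1.00: FORCE: horkage modified (noncq)\n"
        "ata1.00: 61865984 sectors, multi 1: LBA48 NCQ (not used)\n"
    )
    print(ncq_disabled(sample))
```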

Test results using: fio --direct=1 --rw=randrw --bs=64k --ioengine=libaio --iodepth=64 --runtime=120 --numjobs=4

with NCQ (ata1.00: 61865984 sectors, multi 1: LBA48 NCQ (depth 32), AA)

   READ: bw=33.9MiB/s (35.6MB/s), 33.9MiB/s-33.9MiB/s (35.6MB/s-35.6MB/s), io=4073MiB (4270MB), run=120078-120078msec
  WRITE: bw=34.1MiB/s (35.8MB/s), 34.1MiB/s-34.1MiB/s (35.8MB/s-35.8MB/s), io=4100MiB (4300MB), run=120078-120078msec
without NCQ (ata1.00: 61865984 sectors, multi 1: LBA48 NCQ (not used))

   READ: bw=31.7MiB/s (33.3MB/s), 31.7MiB/s-31.7MiB/s (33.3MB/s-33.3MB/s), io=3808MiB (3993MB), run=120083-120083msec
  WRITE: bw=31.9MiB/s (33.4MB/s), 31.9MiB/s-31.9MiB/s (33.4MB/s-33.4MB/s), io=3830MiB (4016MB), run=120083-120083msec
Which release branch to backport (provide reason below if selected)
Improve sudo cat command for RO user.
Manually cherry-pick for #14428
… of squashfs (#14270)

202211 and above use a different squashfs compression type that the 201911 kernel cannot handle. Therefore, this change avoids mounting squashfs altogether.
Upgrade BRCM SAI to Debian package SAI 3.7.6.1-3.
[Build] Fix the stretch/jessie mirror removed issue.
ISSU version check fails due to inability to mount squashfs from 202211 on 201911
This PR makes two changes:
    - Store Jinja2 cache in LOGLEVEL DB instead of STATE DB
    - Store bytecode cache encoded in base64

Tested with the following command: "redis-dump -d 3 -k JINJA2_CACHE"
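
The base64 step can be sketched as follows. This is not the implementation from the PR: a plain dict stands in for the database client, and the class name and `JINJA2_CACHE|<template>` key layout are assumptions. Encoding the bytecode means the stored value is plain ASCII text rather than raw bytes.

```python
import base64


class Base64BytecodeStore:
    """Store binary bytecode as base64 text in a dict-like backend."""

    def __init__(self, backend, table="JINJA2_CACHE"):
        self._backend = backend  # dict-like stand-in for a DB client
        self._table = table

    def _key(self, template_key):
        # Hypothetical key layout: "<table>|<template key>"
        return "{}|{}".format(self._table, template_key)

    def dump(self, template_key, bytecode):
        """Encode raw bytes to ASCII text before writing them."""
        encoded = base64.b64encode(bytecode).decode("ascii")
        self._backend[self._key(template_key)] = encoded

    def load(self, template_key):
        """Return the original bytes, or None on a cache miss."""
        encoded = self._backend.get(self._key(template_key))
        if encoded is None:
            return None
        return base64.b64decode(encoded)
```

A real Jinja2 integration would call methods like these from a bytecode cache subclass; that wiring is omitted here.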

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
Why I did it
Fix: #16086
The faketime package URL expired, which breaks the 201911 build.
Update the package URL.

Work item tracking
Microsoft ADO (number only): 24930879
ef2a0cd0 [201911] [multi_asic] Script to monitor errors on internal links (#2971)
1252e31b Changes to separate UT data for internal link monitor (#2976)
3e6654e [[201911] [multi-asic] Unit test fix for internal link monitoring (#2977)
… script (#16393)

Monit changes to enable script to monitor SAI_PORT_STAT_IF_IN_ERRORS & SAI_PORT_STAT_IF_OUT_ERRORS on internal (backend) ports of multi-asic device.
Why I did it
Backport #6478 and #6519 to the 201911 branch.

Work item tracking
Microsoft ADO (number only):
24978836
How I did it
Add a check of the connection between zebra and bgpd during bgpd startup.
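
One way to picture such a check is a bounded wait that polls until a listener accepts connections and gives up after a timeout. The sketch below uses a TCP socket; the host, port, default timeout and function name are all assumptions, and the real check may use a different transport.

```python
import socket
import time


def wait_for_listener(host, port, timeout=30.0, interval=0.1, connect_timeout=1.0):
    """Poll until (host, port) accepts a connection or the timeout expires.

    Returns True once a connection succeeds and False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=connect_timeout):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)  # back off briefly before retrying
```

A start script could log an error when this returns `False` and still continue, so that a slow peer delays startup but does not block it forever.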

How to verify it
Modify start.h, add debug log and check the syslog

  Sep 22 02:41:29.716356 str-a7060cx-acs-10 INFO bgp#root: ####: start zebra
  Sep 22 02:41:30.815341 str-a7060cx-acs-10 INFO bgp#root: ####: start check connection
  Sep 22 02:41:30.868784 str-a7060cx-acs-10 INFO bgp#root: ####: It took 0.029979 seconds to wait for zebra to be ready to accept connections
  Sep 22 02:41:30.873685 str-a7060cx-acs-10 INFO bgp#root: ####: start bgpd
  Sep 22 02:41:35.270569 str-a7060cx-acs-10 INFO bgp#root: ####: done

  Sep 22 03:28:02.423438 str-a7060cx-acs-10 INFO bgp#root: ####: start zebra
  Sep 22 03:28:03.731320 str-a7060cx-acs-10 INFO bgp#root: ####: start check connection
  Sep 22 03:28:33.749152 str-a7060cx-acs-10 INFO bgp#root: ####: Error: zebra is not ready to accept connections
  Sep 22 03:28:33.752490 str-a7060cx-acs-10 INFO bgp#root: ####: start bgpd
  Sep 22 03:28:34.259735 str-a7060cx-acs-10 INFO bgp#root: ####: start bgpd done
  Sep 22 03:28:34.755538 str-a7060cx-acs-10 INFO bgp#root: ####: start bgpcfgd
  Sep 22 03:28:35.800906 str-a7060cx-acs-10 INFO bgp#root: ####: done
…16907)

Fix a Monit false alarm caused by process_checker missing a check for the "disk-sleep" status, which made some 201911 SONiC boxes occasionally report a "pmon|sensord" error.

#### Why I did it
Currently, the psutil library returns the following process statuses:
running: The process is currently running.
sleeping: The process is sleeping or waiting for an event to occur.
disk-sleep: The process is waiting for I/O operations to complete.
stopped: The process has been stopped (e.g. via the SIGSTOP signal).
zombie: The process has terminated but is still listed in the process table.
dead: The process has terminated and has been removed from the process table.

We should treat running, sleeping and disk-sleep as normal and not raise a Monit alert for them.

Currently, when disk-sleep occurs during a Monit cycle, the syslog messages below are logged and trigger a page, so this change gets rid of that syslog output as well.

syslog.2.gz:Feb 24 06:12:17.394619 MEL23-0101-0301-04T1 ERR monit[6040]: 'pmon|sensord' status failed (1) -- '/usr/sbin/sensord -f daemon' is not running in host
syslog.2.gz:Feb 24 06:13:17.932531 MEL23-0101-0301-04T1 ERR monit[6040]: 'pmon|sensord' status failed (1) -- '/usr/sbin/sensord -f daemon' is not running in host
syslog.2.gz:Feb 24 06:14:18.502505 MEL23-0101-0301-04T1 ERR monit[6040]: 'pmon|sensord' status failed (1) -- '/usr/sbin/sensord -f daemon' is not running in host

Then I tried to reproduce the issue by triggering process_checker for sensord frequently and observed it's under "disk-sleep" status once the alert is raised.

##### Work item tracking
- Microsoft ADO **(number only)**:17663589

#### How I did it
Fix the process_checker script by adding handling for the "disk-sleep" case.
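
The status handling can be sketched as below. This is not the actual process_checker code: the function name and the `(cmdline, status)` input shape are hypothetical, and the status strings are the ones listed in the description above.

```python
# Statuses treated as "the process is alive", per the description above.
HEALTHY_STATUSES = frozenset(["running", "sleeping", "disk-sleep"])


def is_process_running(processes, expected_cmdline):
    """Return True if a process with the expected command line is healthy.

    `processes` is an iterable of (cmdline, status) pairs; in a real checker
    these would come from iterating over the process table.
    """
    for cmdline, status in processes:
        if cmdline == expected_cmdline and status in HEALTHY_STATUSES:
            return True
    return False


if __name__ == "__main__":
    table = [("/usr/sbin/sensord -f daemon", "disk-sleep")]
    print(is_process_running(table, "/usr/sbin/sensord -f daemon"))
```

With "disk-sleep" left out of the healthy set, the same input would be reported as not running, which matches the false alarm in the syslog excerpt.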

#### How to verify it
Verified in local DUT.
8b9cab7 2023-10-26 [201911] Fix IfHighSpeed UT issue on 201911 (#299) 
622b771 2023-10-13 | Fix backup port rfc2863 UT to 202012 branch issue (#298) [Hua Liu]
fa94798 2023-10-11 | Add ifhighspeed UT (#296) [Hua Liu]
41789ca 2023-09-14 | Support interface speed for PortChannels (#262) [Lukas Stockner]
Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
* [build] Use public storage for public resources. (#18038)

* fix

* fix