
update 201811 from azure #39

Open: wants to merge 219 commits into base: 201811


@bbinxie (Collaborator) commented Nov 5, 2019

Description of PR

Summary:
Fixes # (issue)

Type of change

  • [] Bug fix
  • [] Testbed and Framework(new/improvement)
  • [] Test case(new/improvement)

Approach

How did you do it?

How did you verify/test it?

Any platform specific information?

Supported testbed topology if it's a new test case?

Documentation

wangxin and others added 30 commits March 28, 2019 17:31
The existing script does not restore the LAG rate setting on VMs in case
of failure. This improvement restores the LAG setting on VMs if the
test fails.

Signed-off-by: Xin Wang <xinw@mellanox.com>
…842)

After applying acltb_test_rules_part_1.json, BGP sessions may go down
before we apply acltb_test_rules_part_2.json (which has the BGP ACL forward
rules). This results in BGP flaps during the PTF test run.
It is safer to apply the BGP ACL forward rules first to avoid BGP flapping.

Signed-off-by: Stepan Blyschak <stepanb@mellanox.com>
…cted (#844)

PR #831 does not fully fix the issue introduced by PR #822. Ansible's
include_vars module cannot override a variable value previously defined
by set_fact, so variables in vars/run_config_test_vars.yml may still have
their old values.

The change is to avoid using include_vars. The variables defined in
run_config_test_vars.yml are moved into script
run_command_with_log_analyzer.yml. The vars files are deleted.

The same change is made to other scripts using the same pattern.

Signed-off-by: Xin Wang <xinw@mellanox.com>
Using the apt_key module to fetch Docker's official GPG key runs into a
certificate validation issue. Replace the apt_key module with the 'curl'
command recommended on Docker's official documentation site.
… PTF container (#836)

The PTF container is destroyed when testbed-cli.sh remove-topo is
executed. Running testbed-cli.sh add-topo adds a new PTF container,
which usually has a new MAC address. If add-topo is executed
immediately after remove-topo, the ARP tables of neighbor switches and
hosts may still hold an entry with the old PTF MAC address. This causes
connectivity issues to the new PTF container until the old PTF MAC
address entry expires.

This workaround sends an ARP request (ARPing) from the PTF container querying
mgmt_gw after the new PTF container is deployed and attached to the network.
The ARP request is broadcast to all neighbors on the same LAN and refreshes
their ARP tables with the new MAC address of the new PTF container.

Signed-off-by: Xin Wang <xinw@mellanox.com>
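For illustration, the frame an ARPing-style query puts on the wire can be built by hand. A minimal sketch, assuming example MAC/IP values (the helper `build_arp_request` and all addresses are hypothetical): the request goes to the broadcast MAC and carries the sender's MAC/IP pair, and per RFC 826 a host that already caches the sender IP overwrites the cached MAC with the one in the packet.

```python
import struct

def mac_bytes(mac: str) -> bytes:
    """Convert 'aa:bb:cc:dd:ee:ff' into 6 raw bytes."""
    return bytes(int(part, 16) for part in mac.split(":"))

def ip_bytes(ip: str) -> bytes:
    """Convert a dotted-quad IPv4 address into 4 raw bytes."""
    return bytes(int(part) for part in ip.split("."))

def build_arp_request(src_mac: str, src_ip: str, target_ip: str) -> bytes:
    """Build an Ethernet frame carrying an ARP 'who-has target_ip' request.

    The frame is broadcast, so every host on the LAN sees src_mac/src_ip.
    """
    broadcast = b"\xff" * 6
    # Ethernet header: destination MAC, source MAC, EtherType 0x0806 (ARP)
    eth = broadcast + mac_bytes(src_mac) + struct.pack("!H", 0x0806)
    # ARP header: htype=1 (Ethernet), ptype=0x0800 (IPv4), hlen=6, plen=4, op=1 (request)
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
    arp += mac_bytes(src_mac) + ip_bytes(src_ip)   # sender hardware / protocol address
    arp += b"\x00" * 6 + ip_bytes(target_ip)       # target MAC unknown, target IP
    return eth + arp
```

In the testbed the request is sent by the arping tool inside the PTF container; the sketch only shows which fields let neighbors learn the new MAC.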
Otherwise, if two systems have names where one is a prefix of the other, parsing
the shorter name will match two lines.

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
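The commit title is truncated above, so the exact fix is not shown. A minimal sketch of the failure mode, assuming each output line starts with the system name (the sample lines and both helper names are hypothetical): prefix matching returns two lines for the shorter name, while comparing the whole first field returns one.

```python
def lines_for_system_prefix(lines, name):
    """Buggy variant: a prefix match also picks up longer names."""
    return [line for line in lines if line.startswith(name)]

def lines_for_system(lines, name):
    """Return only the lines whose first whitespace-separated field equals name."""
    matches = []
    for line in lines:
        fields = line.split()
        # Compare the whole first field, not just the start of the line
        if fields and fields[0] == name:
            matches.append(line)
    return matches
```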
sonic-net/sonic-utilities#504

This is to make all the commands backwards compatible

Signed-off-by: Shu0T1an ChenG <shuche@microsoft.com>
…#823)

* [ptf_runner] Save ptf log to script executing host in case of failure

The PTF log and pcap files are useful for debugging when a PTF script
fails. However, these files live in the PTF container and can be lost
when the PTF container is re-deployed.

This improvement saves the log and pcap files to the host executing the
script when the PTF script fails.

Signed-off-by: Xin Wang <xinw@mellanox.com>

* [ptf_runner] Add option for specifying whether to save ptf log

The previous commit changed the default behavior. This change adds
an option for specifying whether to save the PTF log in case of failure.
For example: ansible-playbook <some_test>.yml ... -e save_ptf_log=yes
…hange (#866)

PFC_WD_TABLE --> PFC_WD

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
Signed-off-by: Qi Luo <qiluo-msft@users.noreply.github.com>
* [fast/warm reboot] improve new image installation code

- Allow new_sonic_image to be defined as an empty string, which skips image installation.
- Rename new_image_location to a more generic name.
- Display the defined new image URL.

Signed-off-by: Ying Xie <ying.xie@microsoft.com>

* [fast/warm reboot] allow DUT to stay in the warm/fast reboot target release

This feature is needed to test upgrade paths, where we might upgrade from one version
to another and possibly further. We want the system to stay in the target release for the next steps.

Signed-off-by: Ying Xie <ying.xie@microsoft.com>

* Address review comments, test issues and some minor touch-ups

Signed-off-by: Ying Xie <ying.xie@microsoft.com>

* [fast/warm reboot] add knob to clean up old images on DUT before warm/fast reboot

When a new image is specified for fast/warm reboot, the new image will be installed.
However, if the specified image is already installed on the target DUT, then
sonic_install will fail and fast/warm reboot will reboot into the current image.

Add a knob to clean up old images so that installing the new image has a
better chance to succeed.

Signed-off-by: Ying Xie <ying.xie@microsoft.com>

* address review issue
…les (#865)

ntpd may generate 'ERR ntpd' messages in syslog, causing unnecessary test
case failures. Previous PR #816
added a matching pattern of 'ERR ntpd' to the loganalyzer ignore files to
ignore ntpd error messages. However, ntpd may generate error messages in two
formats, and the previously added pattern only matches one of them. This
change updates the pattern to match both formats.

Signed-off-by: Xin Wang <xinw@mellanox.com>
* Add support for many test cases to t0-56
* Fix bgp_speaker for t0-56
…g warm reboot (#890)

* [advanced-reboot] move Arista class to separate module

Signed-off-by: Stepan Blyschak <stepanb@mellanox.com>

* [advanced-reboot] use lock to synchronize fast data plane and
reachability_watcher threads

Signed-off-by: Stepan Blyschak <stepanb@mellanox.com>
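The synchronized code itself lives in ptftests/advanced-reboot.py and is not shown here. As a generic sketch of the technique, one `threading.Lock` shared by both threads keeps each thread's critical section from interleaving with the other's (the `SharedLog` class and the "sender"/"watcher" names are hypothetical stand-ins for the fast data plane and reachability_watcher threads).

```python
import threading

class SharedLog:
    """Hypothetical shared resource touched by two worker threads."""

    def __init__(self):
        self.lock = threading.Lock()
        self.entries = []

    def record_pair(self, tag, i):
        # Hold the lock across both appends so the other thread cannot interleave
        with self.lock:
            self.entries.append((tag, "start", i))
            self.entries.append((tag, "end", i))

def run_workers(n=1000):
    log = SharedLog()

    def worker(tag):
        for i in range(n):
            log.record_pair(tag, i)

    threads = [threading.Thread(target=worker, args=(tag,)) for tag in ("sender", "watcher")]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log.entries
```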

* [advanced-reboot] stabilize test while fast data plane send is running

* Apply a filter on socket before sending fast data plane IO
* Save sniffed packets after the traffic test is done

Signed-off-by: Stepan Blyschak <stepanb@mellanox.com>

* [advanced-reboot] refactor fast data plane generator code

* reuse from_t1 and from_vlan_server generated packets in
  generate_bidirectional
* use TCP instead of UDP in generate_bidirectional

Signed-off-by: Stepan Blyschak <stepanb@mellanox.com>

* [advanced-reboot] add space back

Signed-off-by: Stepan Blyschak <stepanb@mellanox.com>
Some test setups have a delay in the I/O path, so the original sniff wait time doesn't guarantee
that all 36000 packets are received. Increase the sniffing wait time by 30 seconds.

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
…put (#907)

* Modified sensors data for S6100/Z9100 according to the latest output
* [advanced-reboot] start watcher thread after initializing Event objects

Signed-off-by: Stepan Blyschak <stepanb@mellanox.com>
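Presumably the ordering matters because a thread started before its `Event` attributes exist can hit an `AttributeError` on its first access. A minimal sketch of the safe ordering (the `Watcher` class and its attribute names are hypothetical, not the test's actual code): every `Event` is created in `__init__` before `Thread.start()` is called.

```python
import threading

class Watcher:
    def __init__(self):
        # Create every Event the thread uses *before* starting it,
        # so _run() can never observe a missing attribute.
        self.stop_event = threading.Event()
        self.ready_event = threading.Event()
        self.ticks = 0
        self.thread = threading.Thread(target=self._run)
        self.thread.start()

    def _run(self):
        self.ready_event.set()
        # wait() returns True once stop_event is set, ending the loop
        while not self.stop_event.wait(0.01):
            self.ticks += 1

    def stop(self):
        self.stop_event.set()
        self.thread.join()
```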

* [advanced-reboot] improve error messages when DUT is not ready for test

Signed-off-by: Stepan Blyschak <stepanb@mellanox.com>
…ot (#904)

Signed-off-by: Stepan Blyschak <stepanb@mellanox.com>
```
Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 763, in run
    self.__target(*self.__args, **self.__kwargs)
  File "ptftests/advanced-reboot.py", line 769, in peer_state_check
    self.fails[ip], self.info[ip], self.cli_info[ip], self.logs_info[ip] = ssh.run()
  File "ptftests/arista.py", line 157, in run
    log_data = self.parse_logs(log_lines)
  File "ptftests/arista.py", line 218, in parse_logs
    result_bgp, initial_time_bgp = self.extract_from_logs(bgp_r, data)
  File "ptftests/arista.py", line 205, in extract_from_logs
    raw_data.append((datetime.datetime.strptime(m.group(1), "%b %d %X"), m.group(2), m.group(3)))
AttributeError: 'module' object has no attribute '_strptime'

```

Signed-off-by: Stepan Blyschak <stepanb@mellanox.com>
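The traceback matches a well-known CPython 2 race: the first call to `datetime.strptime` lazily imports the `_strptime` module, and that import can fail when it first happens inside a worker thread. The commit title is truncated, so the exact fix is not shown; a common workaround is to import `_strptime` in the main thread before any worker starts. A sketch using the same format string as the traceback (the helper names are hypothetical):

```python
import _strptime  # noqa: F401  pre-load so worker threads never trigger the lazy import
import datetime
import threading

def parse_log_timestamps(stamps):
    """Parse 'Mon DD HH:MM:SS' timestamps as in the Arista log lines (year defaults to 1900)."""
    return [datetime.datetime.strptime(s, "%b %d %X") for s in stamps]

def parse_in_threads(stamps, workers=4):
    """Run the parser concurrently in several threads, as the peer checks do."""
    results = [None] * workers

    def work(idx):
        results[idx] = parse_log_timestamps(stamps)

    threads = [threading.Thread(target=work, args=(i,)) for i in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```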
* [deploy] Wait for vEOS to come back after restart
* [deploy] Replace handlers with tasks to ensure execution sequence
* [deploy] Replace ping cmd with wait_for module
* [deploy] Remove unused handlers
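Replacing a ping with Ansible's wait_for means waiting until a TCP port actually accepts connections, not just until the host answers ICMP. A rough Python equivalent of that polling loop, as a sketch only (`wait_for_port` and its defaults are hypothetical, not part of the playbooks):

```python
import socket
import time

def wait_for_port(host, port, timeout=60.0, interval=1.0):
    """Poll until a TCP connection to host:port succeeds, or raise TimeoutError."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means the service (e.g. SSH on vEOS) is back
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                raise TimeoutError(f"{host}:{port} not reachable after {timeout}s")
            time.sleep(interval)
```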
* Improved error handling when not all interfaces are up

* Fixed PR 853.
…nout (#909)

Signed-off-by: Stepan Blyschak <stepanb@mellanox.com>
* Moved image processing from advanced-reboot.yml to separate file reboot-image-handle.yml
renukamanavalan and others added 30 commits April 8, 2020 22:32
* If the core-storage secret key is available, add it to /etc/sonic/core_analyzer.rc.json and enable and start the core_uploader service.
If https_proxy is provided, update /etc/sonic/core_analyzer.rc.json.

* Check the entire dict path before de-referencing.

* Improved regex per comments.

* Fixed syntax error.

* Add a sample file for newly introduced ansible facts.

* Removed a redundant empty line.

Co-authored-by: Ubuntu <remanava@remanava-kube-1.hblknyhzkmnujibhxvn3dmavjb.xx.internal.cloudapp.net>
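"Check the entire dict path before de-referencing" amounts to verifying every level of a nested lookup instead of indexing straight into it. A minimal sketch, assuming hypothetical fact keys (`secrets`, `core_storage`, `key`) that are not taken from the change:

```python
def get_path(data, *keys, default=None):
    """Walk data[k1][k2]... and return default if any level is missing or not a dict."""
    current = data
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current
```

Usage: `get_path(facts, "secrets", "core_storage", "key")` returns `None` instead of raising `KeyError` or `TypeError` when any level is absent.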
The 10-second default timeout is not enough for some devices to complete the
reboot process. This PR increases the timeout of the reboot task.

Signed-off-by: Tamer Ahmed <tamer.ahmed@microsoft.com>
Signed-off-by: Nazarii Hnydyn <nazariig@mellanox.com>
Signed-off-by: Neetha John <nejo@microsoft.com>
Make sure pgrep matches only syncd.
… image (#1722)

When minigraph.xml exists, config_db.json can be generated from it.
In this case, before booting into a new image, remove config_db.json
from /host/old_config to force the new image to load the minigraph.

This is needed when the nightly testbed moves from a higher version to a
lower version.

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
Hence, add a check. Apparently the fast-reboot test from an old
image failed because this file was missing.
* QoS SAI test restructure

Signed-off-by: Neetha John <nejo@microsoft.com>
…shared buffer (#1663)

* [qos] Support designating the packet size when testing the watermark of the shared buffer
Signed-off-by: Neetha John <nejo@microsoft.com>
If the DUT already has the target image installed, then there
will be no /host/old_config/config_db.json after installation.
Add the -f option to ignore the file-not-found error.

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
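Taken together, the two commits above amount to: if a minigraph exists, delete /host/old_config/config_db.json, and do not fail if it is already gone, which is what `rm -f` provides. A Python equivalent, as a sketch only (the function `force_minigraph_load` and its parameters are hypothetical; the commits themselves use `rm`):

```python
import os

def force_minigraph_load(old_config_dir, minigraph_path):
    """Remove config_db.json from old_config_dir when a minigraph is available.

    Returns True if a minigraph exists (so the new image should load it).
    A missing config_db.json is not an error, mirroring `rm -f`.
    """
    if not os.path.exists(minigraph_path):
        return False
    try:
        os.remove(os.path.join(old_config_dir, "config_db.json"))
    except FileNotFoundError:
        pass  # already absent, e.g. the target image was installed before
    return True
```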
…llanox fanout (#1764)

Signed-off-by: Volodymyr Samotiy <volodymyrs@mellanox.com>
There is a check that the PFC watchdog poll interval is <= the PFC watchdog detection/restoration time.
So, before setting the poll interval, the test script stops the PFC watchdog
if it is enabled by default, because the default detection/restoration time
can be smaller than the poll interval, which would make the script fail.
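The ordering can be expressed as a small decision function. A sketch only: the function names, the per-port dict layout and the millisecond values are hypothetical, and the real constraint is enforced outside this sketch.

```python
def can_set_poll_interval(poll_ms, detection_ms, restoration_ms):
    """The constraint described above: poll interval <= detection and restoration time."""
    return poll_ms <= detection_ms and poll_ms <= restoration_ms

def plan_poll_interval_change(new_poll_ms, ports):
    """Return 'set' if the new interval is valid against every port's current timers,
    otherwise 'stop_pfcwd_first' (stop the watchdog, then set the interval)."""
    ok = all(
        can_set_poll_interval(new_poll_ms, p["detection_ms"], p["restoration_ms"])
        for p in ports
    )
    return "set" if ok else "stop_pfcwd_first"
```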
… to timeout (#1820)

With the recent changes to caclmgrd, this check was not consistent, and could potentially fail with the following error:

```
TASK [test : Ensure the SSH port on the DuT becomes closed to us] ********************************************************************************************************************
Friday 26 June 2020  01:17:14 +0000 (0:00:05.323)       0:01:29.829 ***********
fatal: [sonic-dut-1]: FAILED! => {"msg": "The conditional check ''Timeout when waiting for search string OpenSSH' not in result.msg' failed. The error was: error while evaluating conditional ('Timeout when waiting for search string OpenSSH' not in result.msg): Unable to look up a name or access an attribute in template string ({% if 'Timeout when waiting for search string OpenSSH' not in result.msg %} True {% else %} False {% endif %}).\nMake sure your variable name does not contain invalid characters like '-': argument of type 'AnsibleUndefined' is not iterable"}
```

`result` could potentially be `AnsibleUndefined`.

This changes the logic to match that of the new pytest, which is a more appropriate method of checking that we are no longer able to SSH to the device, and no longer relies on parsing an error message (see https://github.com/Azure/sonic-mgmt/blob/master/tests/cacl/test_control_plane_acl.py#L34). Also remove unnecessary sleep, which also aligns more with the pytest version.
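The linked pytest is not reproduced here. As a rough sketch of the idea (not the pytest's actual code), a "closed to us" check can simply attempt a TCP connection and treat failure as the expected outcome, with no error-string parsing; `port_is_closed` is a hypothetical helper.

```python
import socket

def port_is_closed(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port cannot be established.

    A refused connection and a timeout (packets silently dropped, as an
    ACL might do) both count as 'closed'.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return False
    except OSError:
        return True
```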
* wrong DIP on packet

BGP, SNMP and SSH IP2ME packets could not be received on t1-lag; the root cause is that copp_test.py configured the wrong destination IP (DIP) on the packets.

* fix comment indentation

Co-authored-by: Ying Xie <yxieca@users.noreply.github.com>
A test server may have VMs for multiple test setups. The existing
tool sets can start the first N VMs and remove all VMs on the
server.

This change adds support for starting and stopping only the subset
of VMs used by a specified test setup.

Signed-off-by: Xin Wang <xiwang5@microsoft.com>
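Selecting only the VMs of one setup, rather than the first N or all of them, comes down to computing that setup's VM names from a base name and a VM count. A sketch assuming a `<letters><zero-padded number>` naming scheme such as `VM0100`; both the helper `vm_names` and the naming scheme are assumptions, not taken from the change itself.

```python
import re

def vm_names(vm_base, count):
    """Expand a base VM name such as 'VM0100' into count consecutive names."""
    match = re.fullmatch(r"([A-Za-z]+)(\d+)", vm_base)
    if match is None:
        raise ValueError(f"unexpected VM base name: {vm_base!r}")
    prefix, digits = match.groups()
    start = int(digits)
    # Keep the original zero-padding width when incrementing
    return [f"{prefix}{start + i:0{len(digits)}d}" for i in range(count)]
```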
Verified with Dell 6000 Platforms.
…mmand line (#1908)

Add `-f` flag to pkill so that it will send the signal to processes where "exabgp" appears anywhere in the command line. Without this flag, it only sends the signal to processes where "exabgp" is the actual file being executed, thus leaving two `sh exabgp/start.sh` processes running. This change ensures all "exabgp" processes as well as the `sh exabgp/start.sh` processes are stopped.

Also update the comment to state precisely which signal is sent; if we need to be more forceful in the future, we could send SIGKILL instead.
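The difference between plain `pkill` and `pkill -f` can be modeled as matching the pattern against the process name versus the full command line. A simplified sketch with a hypothetical process table (real pkill reads these fields from /proc; the PIDs and command lines below are invented):

```python
import re

def matches_name(proc, pattern):
    """Simplified model of plain `pkill pattern`: search the process name only."""
    return re.search(pattern, proc["name"]) is not None

def matches_cmdline(proc, pattern):
    """Simplified model of `pkill -f pattern`: search the full command line."""
    return re.search(pattern, proc["cmdline"]) is not None

# Hypothetical process table for illustration
PROCS = [
    {"pid": 101, "name": "exabgp", "cmdline": "exabgp /etc/exabgp/t1.conf"},
    {"pid": 102, "name": "sh", "cmdline": "sh exabgp/start.sh"},
    {"pid": 103, "name": "sshd", "cmdline": "/usr/sbin/sshd -D"},
]

def pids_to_signal(procs, pattern, full=False):
    match = matches_cmdline if full else matches_name
    return [p["pid"] for p in procs if match(p, pattern)]
```

Without `full=True` the `sh exabgp/start.sh` wrapper (PID 102) is left running, which is the situation the commit describes.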
…ncd (#1917)

Signed-off-by: Danny Allen <daall@microsoft.com>
a) Instead of using teamdctl to add/remove port-channel members,
   we should use the config port-channel command. The reason is that show interface
   status does not get updated, because teamdctl bypasses the config DB.

b) One of the changes I made as part of PR
#1893 was incomplete, as I
forgot to update the buffer size in swap_syncd.yml. Fixed now.

Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
…ion (#2024)

Sample old output:

```
      Interface            Lanes    Speed    MTU         Alias             Vlan    Oper    Admin             Type    Asym PFC
---------------  ---------------  -------  -----  ------------  ---------------  ------  -------  ---------------  ----------
      Ethernet0            77,78      50G   9100   Ethernet1/1            trunk    down       up  QSFP28 or later         off
```

Sample new output:

```
      Interface            Lanes    Speed    MTU    FEC         Alias             Vlan    Oper    Admin             Type    Asym PFC
---------------  ---------------  -------  -----  -----  ------------  ---------------  ------  -------  ---------------  ----------
     Ethernet48      57,58,59,60     100G   9100     rs  Ethernet13/1  PortChannel0001      up       up  QSFP28 or later         off
```

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
for internal-201811 so that we can run the 209111 image using
ansible-playbook.

Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
* minigraph: Add the ability to set a per-port speed in the minigraph (#3527)

* minigraph: Add the ability to set a per-port speed in the minigraph

(cherry picked from commit cef1f77)

* minigraph: Fix with_dict syntax in the playbook

That entry needs to be specified as referring to a variable.

Signed-off-by: Saikrishna Arcot <saiarcot895@gmail.com>

* [topo] Add test topology for 7050QX-32S-S4Q31 (#3568)

(cherry picked from commit 6d1720b)
…for populating the speed (#3584)

When filling in the speed of the port from `device_conn`, the SONiC port name must be used as the dict key, not the alias name.

Also, indexing by the hostname isn't required for 201811, since the returned structure gets rid of that "layer" if the hostname exists in that data structure.

Signed-off-by: Saikrishna Arcot <saiarcot895@gmail.com>
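A sketch of the corrected lookup, assuming a hypothetical `alias_to_name` map and a `device_conn` dict keyed directly by SONiC port name (the 201811 shape described above, with no hostname layer); the sample data in the usage line is invented for illustration.

```python
def port_speeds(aliases, alias_to_name, device_conn):
    """Map each front-panel alias to its configured speed.

    device_conn is keyed by SONiC port name (e.g. 'Ethernet0'), so each alias
    is translated first; indexing with the alias itself would miss every entry.
    """
    speeds = {}
    for alias in aliases:
        sonic_name = alias_to_name[alias]
        conn = device_conn.get(sonic_name)
        if conn is not None and "speed" in conn:
            speeds[alias] = conn["speed"]
    return speeds
```

Usage: with `alias_to_name = {"Ethernet1/1": "Ethernet0"}` and `device_conn = {"Ethernet0": {"speed": "40000"}}`, `port_speeds(["Ethernet1/1"], alias_to_name, device_conn)` yields `{"Ethernet1/1": "40000"}`.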
…tion (#3680)

* [topo] Fix up t0-35 to match the definition for other topos in the 201811 branch

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Partial revert of 1511ae6

Ansible syntax changes are not needed.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>