Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support PCI IDs #3420

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open

Conversation

SRIKKANTH
Copy link
Collaborator

@SRIKKANTH SRIKKANTH commented Sep 14, 2024

Support PCI IDs
This will help in identifying devices even better.
Example: NVMe local devices vs remote storage devices when disc controller type == NVMe.

An issue with below testcases detected was fixed as part of this PR.
stress_sriov_with_max_nics_reboot_from_platform
stress_sriov_with_max_nics_stop_start_from_platform
Issue:
It takes a while to populate/detect all VFs after VM boot.
These tests were passing even if not all VF PCI devices were detected. Added 120s wait time to populate all VFs correctly in below testcases to fix the issue.

@SRIKKANTH SRIKKANTH marked this pull request as draft September 14, 2024 07:15
@SRIKKANTH SRIKKANTH marked this pull request as ready for review September 14, 2024 08:29
@SRIKKANTH SRIKKANTH force-pushed the smyakam/lspci_support_for_ids_14_09_2024 branch from 56c2acf to f17921b Compare September 14, 2024 12:45
@SRIKKANTH SRIKKANTH marked this pull request as draft September 15, 2024 16:30
@SRIKKANTH SRIKKANTH force-pushed the smyakam/lspci_support_for_ids_14_09_2024 branch 4 times, most recently from 1941252 to 19dca4b Compare September 16, 2024 17:08
return device_type_list

@retry(KeyError, tries=10, delay=20)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why retry can fix key errors?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some times difference seen between outputs of "lspci -n" and "lspci -m" used in this routine.
It usually occurs while checking for VFs right after VM booting when the VF count is 8.
All VF becomes available only after couple of minutes of VM booting.
This is for re-trying in such cases instead of simply failing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you check which one of "-n" or "-m" is incorrect, and why it's fixed by rerun? 10*20 means 200 seconds, if this method called many times, it slows down the test cases significantly.

Copy link
Collaborator Author

@SRIKKANTH SRIKKANTH Sep 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both -n and -m are correct. Its not an issue of lspci commands. Its the number of PCI devices newly added between these command executions is causing the exception. With out this wait, testcases will be dealing with less number of VF devices than actually attached.

@SRIKKANTH SRIKKANTH force-pushed the smyakam/lspci_support_for_ids_14_09_2024 branch 2 times, most recently from 76bd4a9 to 33623ea Compare September 19, 2024 05:32
@SRIKKANTH SRIKKANTH marked this pull request as ready for review September 20, 2024 16:22
This will help in identifying devices even better.
Example: NVMe local devices vs remote storage devices when disc controller type == NVMe.

An issue with below testcases detected was fixed as part of this PR.
stress_sriov_with_max_nics_reboot_from_platform
stress_sriov_with_max_nics_stop_start_from_platform
Issue:
It takes a while to populate/detect all VFs after VM boot.
These tests were passing even if not all VF PCI devices were detected. Added 120s wait time to populate all VFs correctly in below testcases to fix the issue.
if use_pci_ids:
for device in devices_list:
if (
device.controller_id in CONTROLLER_ID_DICT[device_type.upper()]
Copy link
Member

@squirrelsc squirrelsc Sep 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Define a class instead of 3 dict with the same key. The class can define a method to accept a device, and compare if exists in one dict.

@@ -142,47 +165,107 @@ def _install(self) -> bool:
return self._check_exists()

def get_device_names_by_type(
self, device_type: str, force_run: bool = False
self, device_type: str, force_run: bool = False, use_pci_ids: bool = False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks the new argument is to filter devices by name? If so, you can add a new method called get_devices_by_allowed_list, it calls get_device_names_by_type and filter them.

if matched_pci_device_info_list:
matched_pci_device_info = matched_pci_device_info_list[0]
self.slot = matched_pci_device_info.get("slot", "").strip()
def parse(self, raw_str: str, pci_ids: Dict[str, Any]) -> None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method looks not need to be changed, the filter happens on upper caller too. It doesn't need to filter in this method.

@@ -274,6 +275,8 @@ def stress_sriov_with_max_nics_reboot_from_platform(
for node in environment.nodes.list():
start_stop = node.features[StartStop]
start_stop.restart()
# Add delay to wait for the network interface ready.
sleep(120)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't add sleep, and it's so long. Find a way to wait some signal.

@LiliDeng
Copy link
Collaborator

Make sure you have tested it. like verify_device_statistics

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants