Skip to content

Conversation

@abhsahu
Copy link

@abhsahu abhsahu commented Jul 31, 2025

This PR adds a new driver for managing PCIe link when NVIDIA CX7 hot plug/unplug happens

@abhsahu abhsahu force-pushed the 6.11_mt8901_cx7_hp_driver_31Jul branch from fd87a89 to 8020a11 Compare July 31, 2025 08:07
@nvmochs
Copy link
Collaborator

nvmochs commented Jul 31, 2025

@abhsahu I took a look at this and have several questions/comments...

cx7_hp_probe():

       hp_dev->pdev = pdev;

        hp_dev->pd = pd;
        if (!hp_dev->pd || hp_dev->pd->port_nums >= HP_PORT_MAX) {
                ret = -ENODEV;
                goto err_ctx_search;
        }

Nit: Couldn’t these check be performed earlier in probe?

 if (IS_ERR(app_ctx->desc)) {
                        dev_err(&pdev->dev, "Failed to get GPIO descriptor: %d\n", i);
                        ret = PTR_ERR(app_ctx->desc);
                        app_ctx->desc = 0;
                        goto err_ctx_search;
                }

Nit: app_ctx->desc is a pointer, no? Use NULL instead.

app_ctx->ctx = gpio_acpi_setup(pdev, app_ctx->desc);
                if (app_ctx->ctx) {
                        gpiod_set_debounce(app_ctx->desc, app_ctx->ctx->debounce_timeout_us);
                        if (app_ctx->ctx->connection_type == ACPI_RESOURCE_GPIO_TYPE_INT) {
                                app_ctx->hp_dev = hp_dev;
                                ret = cx7_hp_setup_irq(app_ctx);
                                if (ret)
                                        goto err_irq;
                        }
                } else {
                        goto err_ctx_search;
                }

Nit: Flip the conditional, if (!app_ctx->ctx) , to avoid indenting the the good path

err_irq:
err_ctx_search:
        for (i = 0; i < hp_dev->gpio_count; i++) {

Question: Why does err_irq exist?

       ret = cx7_hp_pinctrl_init(pdev);
        if (ret)
                dev_err(&pdev->dev, "Pinmux init failed, ret: %d\n", ret);
...
 /* sysfs */
        ret = sysfs_create_group(&pdev->dev.kobj, &root_attr_group);
        if (ret < 0)
                dev_err(&pdev->dev, "Sysfs create failed, ret: %d\n", ret);

Question: It is valid/safe to continue onward when these fail?

hp_dev_global = hp_dev;
Question: Is it possible to come through this probe twice? If so, do we need better protection for this global (and/or detection that it was already set)?

plugin = 0;
Nit: It’s generally a good practice to have a global identified, maybe use plugin_global?

        /* Update cable removal or plug-in */
        if (gpiod_get_value(hp_dev->pins[PCIE_PIN_PRSNT].desc))
                plugin = 0;
        else
                plugin = 1;

        /* Send uevent */
        if (plugin)
                cx7_send_uevent(pdev, PLUG_IN_EVT);
        else
                cx7_send_uevent(pdev, REMOVAL_EVT);

Suggestion: It might be more straightforward to merge these 2 so there is just one set of conditionals

Nit: It would be cleaner to use an enum or #define for these plugin values (I had to look at plugin_store() to understand what the numbers meant)


plugin_store()

Question: Is the lack of serialization for “state” between this thread an in interrupt kthread simply because this is a debug mechanism?


**acpi_gpio_resource_handler()**

Question: Is it valid to return “AE_OK” when ares->type is not ACPI_RESOURCE_TYPE_GPIO ?


cx7_hp_ckm_control()

Nit: use bool instead of int for the disable parameter, and then true/false at the call sites

@clsotog
Copy link
Collaborator

clsotog commented Jul 31, 2025

I tried this PR in one system but when I did the echo for hotplug did not go well. So waiting for another system at colossus where I can have console.
Now I see a lot of hard coded buses for the CX7 card.
For example the system I tried earlier has this
0000:03:00.0 Ethernet controller: Mellanox Technologies MT2910 Family [ConnectX-7]
0000:03:00.1 Ethernet controller: Mellanox Technologies MT2910 Family [ConnectX-7]
0002:01:00.0 Ethernet controller: Mellanox Technologies MT2910 Family [ConnectX-7]
0002:01:00.1 Ethernet controller: Mellanox Technologies MT2910 Family [ConnectX-7]
So I was able to see the /sys/devices/platform/MTKP0001:00/cx7_dbg
But last week I have been helping another team with some nvbugs and at their system the CX7 are in different locations
0000:01:00.0 Ethernet controller: Mellanox Technologies MT2910 Family [ConnectX-7]
0000:01:00.1 Ethernet controller: Mellanox Technologies MT2910 Family [ConnectX-7]
0002:01:00.0 Ethernet controller: Mellanox Technologies MT2910 Family [ConnectX-7]
0002:01:00.1 Ethernet controller: Mellanox Technologies MT2910 Family [ConnectX-7]
So how this code will work in this system since will not see CX7 at 0000:03:00.X

@clsotog
Copy link
Collaborator

clsotog commented Aug 1, 2025

@abhsahu
Why we need this part of the code at probing:

/* Update cable removal or plug-in */
	if (gpiod_get_value(hp_dev->pins[PCIE_PIN_PRSNT].desc))
		plugin = 0;
	else
		plugin = 1;

	/* Send uevent */
	if (plugin)
		cx7_send_uevent(pdev, PLUG_IN_EVT);
	else
		cx7_send_uevent(pdev, REMOVAL_EVT);

Im seeing that the PRSNT is 1 so after the CX7s load at boot time then they get removed. This will confuse people.
Also why this code will do hotplug to both cards, we can never do just hotplug to one card?

@shgarg26
Copy link

shgarg26 commented Aug 4, 2025

@abhsahu Why we need this part of the code at probing:

/* Update cable removal or plug-in */
	if (gpiod_get_value(hp_dev->pins[PCIE_PIN_PRSNT].desc))
		plugin = 0;
	else
		plugin = 1;

	/* Send uevent */
	if (plugin)
		cx7_send_uevent(pdev, PLUG_IN_EVT);
	else
		cx7_send_uevent(pdev, REMOVAL_EVT);

Im seeing that the PRSNT is 1 so after the CX7s load at boot time then they get removed. This will confuse people.

yes, this code is required to keep hotplug status as expected during driver probe
for ex: if no devices are attached then PRSNT=1--> plugin=0---> driver will send REMOVAL event

Also why this code will do hotplug to both cards, we can never do just hotplug to one card?

The intent is not to have individual cable detect but just know if either port has a cable for powering CX-7 on and off.

@abhsahu abhsahu force-pushed the 6.11_mt8901_cx7_hp_driver_31Jul branch from 8020a11 to 79106ce Compare August 4, 2025 13:17
@abhsahu
Copy link
Author

abhsahu commented Aug 4, 2025

@abhsahu I took a look at this and have several questions/comments...

cx7_hp_probe():

       hp_dev->pdev = pdev;

        hp_dev->pd = pd;
        if (!hp_dev->pd || hp_dev->pd->port_nums >= HP_PORT_MAX) {
                ret = -ENODEV;
                goto err_ctx_search;
        }

Nit: Couldn’t these check be performed earlier in probe?

I have moved this earlier.

 if (IS_ERR(app_ctx->desc)) {
                        dev_err(&pdev->dev, "Failed to get GPIO descriptor: %d\n", i);
                        ret = PTR_ERR(app_ctx->desc);
                        app_ctx->desc = 0;
                        goto err_ctx_search;
                }

Nit: app_ctx->desc is a pointer, no? Use NULL instead.

I have changed to app_ctx->desc = NULL

app_ctx->ctx = gpio_acpi_setup(pdev, app_ctx->desc);
                if (app_ctx->ctx) {
                        gpiod_set_debounce(app_ctx->desc, app_ctx->ctx->debounce_timeout_us);
                        if (app_ctx->ctx->connection_type == ACPI_RESOURCE_GPIO_TYPE_INT) {
                                app_ctx->hp_dev = hp_dev;
                                ret = cx7_hp_setup_irq(app_ctx);
                                if (ret)
                                        goto err_irq;
                        }
                } else {
                        goto err_ctx_search;
                }

Nit: Flip the conditional, if (!app_ctx->ctx) , to avoid indenting the the good path

I have filled the conditional.

err_irq:
err_ctx_search:
        for (i = 0; i < hp_dev->gpio_count; i++) {

Question: Why does err_irq exist?

       ret = cx7_hp_pinctrl_init(pdev);
        if (ret)
                dev_err(&pdev->dev, "Pinmux init failed, ret: %d\n", ret);
...
 /* sysfs */
        ret = sysfs_create_group(&pdev->dev.kobj, &root_attr_group);
        if (ret < 0)
                dev_err(&pdev->dev, "Sysfs create failed, ret: %d\n", ret);

Question: It is valid/safe to continue onward when these fail?

I have changed kzalloc() to devm_kzalloc() so that we don't have to handle free explicitly.
Also, I have made probe to fail if we any condition fails.

hp_dev_global = hp_dev; Question: Is it possible to come through this probe twice? If so, do we need better protection for this global (and/or detection that it was already set)?

We are not expecting the probe to be called twice.
I have added a check in the probe function to have only one instance.

plugin = 0; Nit: It’s generally a good practice to have a global identified, maybe use plugin_global?

I have renamed this to dbg_plugin_global.

        /* Update cable removal or plug-in */
        if (gpiod_get_value(hp_dev->pins[PCIE_PIN_PRSNT].desc))
                plugin = 0;
        else
                plugin = 1;

        /* Send uevent */
        if (plugin)
                cx7_send_uevent(pdev, PLUG_IN_EVT);
        else
                cx7_send_uevent(pdev, REMOVAL_EVT);

Suggestion: It might be more straightforward to merge these 2 so there is just one set of conditionals

I have merged these into 1.

Nit: It would be cleaner to use an enum or #define for these plugin values (I had to look at plugin_store() to understand what the numbers meant)

I have updated code to use enum.

plugin_store()

Question: Is the lack of serialization for “state” between this thread an in interrupt kthread simply because this is a debug mechanism?

This is debug mechanism, so we have not added any serialization currently.

acpi_gpio_resource_handler()
Question: Is it valid to return “AE_OK” when ares->type is not ACPI_RESOURCE_TYPE_GPIO ?

I think, its fine to return AE_OK so that we will continue to traverse the tree.

cx7_hp_ckm_control()

Nit: use bool instead of int for the disable parameter, and then true/false at the call sites

I have updated this to use bool.

@abhsahu
Copy link
Author

abhsahu commented Aug 4, 2025

So how this code will work in this system since will not see CX7 at 0000:03:00.X

This hardcoding is according to DGX Spark build.
We can add CX7 vendor id and device id check also to confirm if the PCIe device is CX7 only.

But last week I have been helping another team with some nvbugs and at their system the CX7 are in different locations

Was that system DGX spark (DIGITS) or some other system ?

@clsotog
Copy link
Collaborator

clsotog commented Aug 4, 2025

Im seeing that the PRSNT is 1 so after the CX7s load at boot time then they get removed. This will confuse people.

yes, this code is required to keep hotplug status as expected during driver probe
for ex: if no devices are attached then PRSNT=1--> plugin=0---> driver will send REMOVAL event
I still do not follow this. Why removing the cards at boot time if they were able to load. Customers will complain about this.

Also why this code will do hotplug to both cards, we can never do just hotplug to one card?

The intent is not to have individual cable detect but just know if either port has a cable for powering CX-7 on and off.
Which cable and port we are talking here? These CX7 have 2 ports, we are talking about the port that we use for traffic?

@clsotog
Copy link
Collaborator

clsotog commented Aug 4, 2025

Was that system DGX spark (DIGITS) or some other system ?
I believe they are DGX spark system. The rest in lspci looks the same except the first CX7. These systems are used by NBU for CX7 validation.

@nvmochs
Copy link
Collaborator

nvmochs commented Aug 4, 2025

Thanks for the responses and code updates @abhsahu!

Two additional comments on the updated probe():

  • Suggestion: Consider adding an error print in all error paths. Right now some are covered and some are not.
  • Question: Should hp_dev_global be reset in the event of a failure?

@shgarg26
Copy link

shgarg26 commented Aug 5, 2025

Im seeing that the PRSNT is 1 so after the CX7s load at boot time then they get removed. This will confuse people.

yes, this code is required to keep hotplug status as expected during driver probe
for ex: if no devices are attached then PRSNT=1--> plugin=0---> driver will send REMOVAL event
I still do not follow this. Why removing the cards at boot time if they were able to load. Customers will complain about this.

Also why this code will do hotplug to both cards, we can never do just hotplug to one card?

The intent is not to have individual cable detect but just know if either port has a cable for powering CX-7 on and off.
Which cable and port we are talking here? These CX7 have 2 ports, we are talking about the port that we use for traffic?

QSFP cables to be plugged at end points of CX7 ports. Any of the 2 CX7 ports.

@abhsahu abhsahu force-pushed the 6.11_mt8901_cx7_hp_driver_31Jul branch from 79106ce to 6ca3135 Compare August 5, 2025 10:40
@abhsahu
Copy link
Author

abhsahu commented Aug 5, 2025

Thanks for the responses and code updates @abhsahu!

Two additional comments on the updated probe():

  • Suggestion: Consider adding an error print in all error paths. Right now some are covered and some are not.
  • Question: Should hp_dev_global be reset in the event of a failure?

Thanks @clsotog I have added error prints in all error path and did reser of hp_dev_global

@nvmochs
Copy link
Collaborator

nvmochs commented Aug 5, 2025

Thanks @abhsahu for addressing these...no further issues for me.

Acked-by: Matthew R. Ochs <mochs@nvidia.com>

@clsotog
Copy link
Collaborator

clsotog commented Aug 5, 2025

Hi @abhsahu
I still do not like the hotplug after boot if no cable but if this was discussed with NBU then I will approve.
I asked around and this card has socket direct so thats why is one physical card but 2 logical cards.
I hope that the case of the NBU system showing a different lspci for CX7 is just one off. I try to ask tomorrow to the owner of the system.

Sorry removing my arb. I found out why the system from nbu has different lscpi. It is because of FW.

@abhsahu abhsahu force-pushed the 6.11_mt8901_cx7_hp_driver_31Jul branch from 6ca3135 to dc31a94 Compare August 6, 2025 11:53
@abhsahu
Copy link
Author

abhsahu commented Aug 6, 2025

I have updated the commit to add check for CX7 vendor ID and device ID.
It will help in ensuring that CX7 NIC is present in the hardcoded PCI address.

https://github.com/NVIDIA/NV-Kernels/compare/6ca313511a213e22c864441299d992e6fd18235b..dc31a9498b40e54aac045b4fa2d6e0ebc44c401a

@abhsahu abhsahu force-pushed the 6.11_mt8901_cx7_hp_driver_31Jul branch from dc31a94 to 4883b0c Compare August 7, 2025 13:41
@abhsahu
Copy link
Author

abhsahu commented Aug 7, 2025

I missed to return error in my previous change so fixed the same.

@abhsahu abhsahu force-pushed the 6.11_mt8901_cx7_hp_driver_31Jul branch from 4883b0c to 4d4ac87 Compare August 11, 2025 10:17
@nvmochs
Copy link
Collaborator

nvmochs commented Aug 11, 2025

@abhsahu - Can you provide more context about the changes you just pushed?

@clsotog
Copy link
Collaborator

clsotog commented Aug 11, 2025

@abhsahu - Can you provide more context about the changes you just pushed?
Abishek is out this week. Let my try today's changes but the changes are related that latest Mellanox FW changed the lspci so with these changes would look for CX7 than hardcoded bus ids. It should work with old and new Mellanox FW.

@clsotog
Copy link
Collaborator

clsotog commented Aug 11, 2025

I tried both Mellanox FWs and the code worked.

@clsotog clsotog self-requested a review August 11, 2025 15:56
Copy link
Collaborator

@clsotog clsotog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acked-by: Carol L Soto <csoto@nvidia.com>

@abhsahu
Copy link
Author

abhsahu commented Aug 11, 2025

Thanks @clsotog and @nvmochs.

In earlier version, it is hardcoding the PCI address.

With older version of Mellanox firmware:

000:00:00.0 PCI bridge: NVIDIA Corporation Device 22ce (rev 01)
0000:01:00.0 PCI bridge: Mellanox Technologies MT2910 Family [ConnectX-7 PCIe Bridge]
0000:02:00.0 PCI bridge: Mellanox Technologies MT2910 Family [ConnectX-7 PCIe Bridge]
0000:02:02.0 PCI bridge: Mellanox Technologies MT2910 Family [ConnectX-7 PCIe Bridge]
0000:03:00.0 Ethernet controller: Mellanox Technologies MT2910 Family [ConnectX-7]
0000:03:00.1 Ethernet controller: Mellanox Technologies MT2910 Family [ConnectX-7]
0000:04:00.0 PCI bridge: Mellanox Technologies MT2910 Family [ConnectX-7 PCIe Bridge]
0000:05:00.0 PCI bridge: Mellanox Technologies MT2910 Family [ConnectX-7 PCIe Bridge]
0002:00:00.0 PCI bridge: NVIDIA Corporation Device 22ce (rev 01)
0002:01:00.0 Ethernet controller: Mellanox Technologies MT2910 Family [ConnectX-7]
0002:01:00.1 Ethernet controller: Mellanox Technologies MT2910 Family [ConnectX-7]

With newer version of Mellanox firmware

0000:00:00.0 PCI bridge: NVIDIA Corporation Device 22ce (rev 01)
0000:01:00.0 Ethernet controller: Mellanox Technologies MT2910 Family [ConnectX-7]
0000:01:00.1 Ethernet controller: Mellanox Technologies MT2910 Family [ConnectX-7]
0002:00:00.0 PCI bridge: NVIDIA Corporation Device 22ce (rev 01)
0002:01:00.0 Ethernet controller: Mellanox Technologies MT2910 Family [ConnectX-7]
0002:01:00.1 Ethernet controller: Mellanox Technologies MT2910 Family [ConnectX-7]

with newer firmware version, it has removed the ConnectX-7 PCIe Bridge and now we are seeing CX7 devices in PCI address 0000:01:00.0 and 0000:01:00.0 instead of 0000:03:00.0 and 0000:03:00.1 for domain 0.

With latest patch, instead of hardcoding PCI address (Domain:Bus:Device.Function), it check the devices with CX7 PCI vendor ID and device ID in Domain 0 and 2 and counts that. If this count matches with GB10 board expected count (4), then probe continues, otherwise probe will fail.

@nvmochs
Copy link
Collaborator

nvmochs commented Aug 11, 2025

Thanks @clsotog and @nvmochs.

In earlier version, it is hardcoding the PCI address.

With older version of Mellanox firmware:

000:00:00.0 PCI bridge: NVIDIA Corporation Device 22ce (rev 01) 0000:01:00.0 PCI bridge: Mellanox Technologies MT2910 Family [ConnectX-7 PCIe Bridge] 0000:02:00.0 PCI bridge: Mellanox Technologies MT2910 Family [ConnectX-7 PCIe Bridge] 0000:02:02.0 PCI bridge: Mellanox Technologies MT2910 Family [ConnectX-7 PCIe Bridge] 0000:03:00.0 Ethernet controller: Mellanox Technologies MT2910 Family [ConnectX-7] 0000:03:00.1 Ethernet controller: Mellanox Technologies MT2910 Family [ConnectX-7] 0000:04:00.0 PCI bridge: Mellanox Technologies MT2910 Family [ConnectX-7 PCIe Bridge] 0000:05:00.0 PCI bridge: Mellanox Technologies MT2910 Family [ConnectX-7 PCIe Bridge] 0002:00:00.0 PCI bridge: NVIDIA Corporation Device 22ce (rev 01) 0002:01:00.0 Ethernet controller: Mellanox Technologies MT2910 Family [ConnectX-7] 0002:01:00.1 Ethernet controller: Mellanox Technologies MT2910 Family [ConnectX-7]

With newer version of Mellanox firmware

0000:00:00.0 PCI bridge: NVIDIA Corporation Device 22ce (rev 01) 0000:01:00.0 Ethernet controller: Mellanox Technologies MT2910 Family [ConnectX-7] 0000:01:00.1 Ethernet controller: Mellanox Technologies MT2910 Family [ConnectX-7] 0002:00:00.0 PCI bridge: NVIDIA Corporation Device 22ce (rev 01) 0002:01:00.0 Ethernet controller: Mellanox Technologies MT2910 Family [ConnectX-7] 0002:01:00.1 Ethernet controller: Mellanox Technologies MT2910 Family [ConnectX-7]

with newer firmware version, it has removed the ConnectX-7 PCIe Bridge and now we are seeing CX7 devices in PCI address 0000:01:00.0 and 0000:01:00.0 instead of 0000:03:00.0 and 0000:03:00.1 for domain 0.

With latest patch, instead of hardcoding PCI address (Domain:Bus:Device.Function), it check the devices with CX7 PCI vendor ID and device ID in Domain 0 and 2 and counts that. If this count matches with GB10 board expected count (4), then probe continues, otherwise probe will fail.

Got it, thanks for clarifying!

Jerry.Guo and others added 2 commits August 12, 2025 04:09
This driver is used to manage PCIe link for NVIDIA ConnectX-7 (CX7)
hot-plug/unplug on DGX Spark. We need to disable PCIe link when
CX7 cable plug out happens and enable pcie link when
CX7 cable plug in happens.

It also creates a sysfs entry to emulate cable plug in/out
behavior as below:

plug in - echo 1 > /sys/devices/platform/MTKP0001\:00/cx7_dbg/plugin
plug out - echo 0 > /sys/devices/platform/MTKP0001\:00/cx7_dbg/plugin

We also implement uevent to notify user-space applications when a
cable is plugged in or removed. Below are the details of our process:

* cable plug-in:

 1. report plug-in uevent (driver)
 2. enable pcie link (application)
 3. rescan devices (application)

* cable removal:

 1. report removal uevent (driver)
 2. remove devices (application)
 3. disable pcie link (application)

Signed-off-by: Jerry.Guo <jerry.guo@mediatek.com>
Signed-off-by: Yenchia Chen <yenchia.chen@mediatek.com>
Signed-off-by: Shubhi Garg <shgarg@nvidia.com>
Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
…T_MTK_HP

Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
@abhsahu abhsahu force-pushed the 6.11_mt8901_cx7_hp_driver_31Jul branch from 4d4ac87 to 31209a6 Compare August 12, 2025 04:19
@abhsahu
Copy link
Author

abhsahu commented Aug 12, 2025

struct cx7_hp_ep structure is also unused now, so I have removed that and updated.

@nvmochs
Copy link
Collaborator

nvmochs commented Aug 12, 2025

struct cx7_hp_ep structure is also unused now, so I have removed that and updated.

Confirmed this was the only change in the latest push.

@clsotog
Copy link
Collaborator

clsotog commented Aug 12, 2025

Acked-by: Carol L Soto <csoto@nvidia.com>

@clsotog
Copy link
Collaborator

clsotog commented Oct 21, 2025

Closing this PR I do not think this will get in based on comments from PR 220.

@clsotog clsotog closed this Oct 21, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants