Talos v1.7.0 mlx5_core kernel panic #8624

Closed
buroa opened this issue Apr 19, 2024 · 7 comments · Fixed by #8684

buroa commented Apr 19, 2024

Bug Report

Getting a kernel panic with my nodes that use Mellanox cards.

Description

Talos boots, but as soon as it starts initializing the network it kernel panics. If I disable the Mellanox PCIe card in the BIOS, Talos boots fine.

Logs

  <TASK>
  ? __die+0x23/0x70
  ? page_fault_oops+0x171/0x4c0
  ? exc_page_fault+0x171/0x130
  ? asm_exc_page_fault+0x26/0x30
  ? esw_port_metadata_get+0x19/0x30 [mlx5_core]
  ? __alloc_skb+0x8c/0x1b0
  devlink_param_notify.constprop.0+0x72/0xd0
  devl_params_register+0x130/0x2d0
  esw_offloads_init+0x165/0x180 [mlx5_core]
  mlx5_eswitch_init+0x3b2/0x650 [mlx5_core]
  mlx5_init_one_devl_locked+0x16d/0x670 [mlx5_core]
  probe_one+0x325/0x4a0 [mlx5_core]
  local_pci_probe+0x42/0xa0
  work_for_cpu_fn+0x17/0x30
  process_one_work+0x176/0x310
  ? __pfx_worker_thread+0x10/0x10
  kthread+0xcd/0x100
  ? __pfx_kthread+0x10/0x10
  ret_from_fork+0x31/0x50
  ? __pfx_kthread+0x10/0x10
  ret_from_fork_asm+0x1b/0x30
  </TASK>
Modules linked in: wdat_wdt mlx5_core(+) ahci watchdog i2c_i801 lpc_ich mlxfw libahci mfd_core i2c_smbus
---[end trace 0000000000000000 ]---
RIP: 0010:esw_port_metadata_get+0x19/0x30 [mlx5_core]
Kernel panic - not syncing: Fatal exception
Kernel Offset: 0x28c0000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
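
For readers less used to kernel traces, the log above can be read mechanically. Each frame has the form `symbol+offset/size [module]`, and a leading `?` marks an address found on the stack that the unwinder could not confirm as part of the call chain. The following is a minimal Python sketch of that reading, not an existing kernel tool: the `FRAME_RE` pattern and the `parse_frame` helper are illustrative names, and the sample lines are copied from the trace. The last lines show how the `Kernel Offset` value relates to the default base address printed next to it, assuming the usual KASLR meaning of that line.

```python
import re

# Illustrative pattern for one call-trace line (an assumption, not a kernel API):
# optional "?", symbol, hex offset, hex size, optional "[module]".
FRAME_RE = re.compile(
    r"^\s*(?P<unreliable>\?\s+)?(?P<symbol>[\w.]+)"
    r"\+(?P<offset>0x[0-9a-f]+)/(?P<size>0x[0-9a-f]+)"
    r"(?:\s+\[(?P<module>\w+)\])?\s*$"
)


def parse_frame(line):
    """Return a dict describing one frame, or None if the line is not a frame."""
    m = FRAME_RE.match(line)
    if m is None:
        return None
    return {
        "symbol": m.group("symbol"),
        "offset": int(m.group("offset"), 16),
        "size": int(m.group("size"), 16),
        "module": m.group("module"),
        # Frames prefixed with "?" are not confirmed by the unwinder.
        "reliable": m.group("unreliable") is None,
    }


# A few lines copied from the trace in this report.
trace = """
  ? esw_port_metadata_get+0x19/0x30 [mlx5_core]
  devlink_param_notify.constprop.0+0x72/0xd0
  devl_params_register+0x130/0x2d0
  esw_offloads_init+0x165/0x180 [mlx5_core]
"""

frames = [f for f in map(parse_frame, trace.splitlines()) if f]
reliable = [f["symbol"] for f in frames if f["reliable"]]
print(reliable)

# "Kernel Offset: 0x28c0000 from 0xffffffff81000000": the offset is the
# KASLR slide added to the default kernel text base.
default_base = 0xFFFFFFFF81000000
kaslr_offset = 0x28C0000
runtime_base = default_base + kaslr_offset
print(hex(runtime_base))
```

Read this way, the confirmed frames run from `probe_one` in `mlx5_core` through `esw_offloads_init` into `devl_params_register`, and the faulting instruction reported on the `RIP` line is in `esw_port_metadata_get`.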

Environment

  • Talos version: v1.7.0
  • Kubernetes version: N/A
  • Platform: Metal
buroa changed the title from "Talos v1.7.0 mxl5_core kernel panic" to "Talos v1.7.0 mlx5_core kernel panic" on Apr 19, 2024
Member

smira commented Apr 19, 2024

Are you using any system extensions?

Author

buroa commented Apr 19, 2024

Yes, intel-ucode, nonfree-kmod-nvidia and nvidia-container-toolkit.

This may be relevant as well: https://lore.kernel.org/netdev/20240409190820.227554-2-tariqt@nvidia.com/T/

Member

smira commented Apr 19, 2024

yep, might be fixed in future Linux 6.6 releases

Member

smira commented Apr 22, 2024

Seems to be reported upstream already: https://lore.kernel.org/lkml/20240420135914.2AD9.409509F4@e16-tech.com/

Author

buroa commented Apr 22, 2024

> Seems to be reported upstream already: lore.kernel.org/lkml/20240420135914.2AD9.409509F4@e16-tech.com

Thanks for keeping up on this, really appreciate it.

Member

smira commented Apr 29, 2024

Looks like 6.6.29 got mlx5 updates https://cdn.kernel.org/pub/linux/kernel/v6.x/ChangeLog-6.6.29
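
To check whether a given node's kernel is at or past that release, a version comparison along these lines may help. This is only a sketch: `FIRST_UPDATED`, `parse_kernel_release`, and `includes_mlx5_updates` are hypothetical names, the release strings are made-up examples, and the thread only says that 6.6.29 "got mlx5 updates", so a `True` result is not a guarantee that the panic is gone.

```python
import re

# Assumption taken from the comment above: 6.6.29 is the first 6.6.y
# release carrying the mlx5 updates. Not confirmed as the exact fix.
FIRST_UPDATED = (6, 6, 29)


def parse_kernel_release(release):
    """Extract (major, minor, patch) from a string such as '6.6.28-talos'."""
    m = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if m is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    major, minor, patch = m.groups()
    return int(major), int(minor), int(patch or 0)


def includes_mlx5_updates(release):
    """True/False within the 6.6.y series, None for any other series.

    Other series are reported as unknown because the thread says nothing
    about which of their releases carry the same changes.
    """
    version = parse_kernel_release(release)
    if version[:2] != FIRST_UPDATED[:2]:
        return None
    return version >= FIRST_UPDATED


# Hypothetical release strings, for illustration only.
for release in ("6.6.28-talos", "6.6.29-talos", "6.1.82-talos"):
    print(release, includes_mlx5_updates(release))
```

On a Linux host, `platform.release()` from the Python standard library returns the running kernel's release string, which can be passed to `includes_mlx5_updates` directly.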

smira added a commit to smira/pkgs that referenced this issue Apr 30, 2024
Should fix the `mlx5` issues: siderolabs/talos#8624

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
smira added a commit to smira/pkgs that referenced this issue Apr 30, 2024
Should fix the `mlx5` issues: siderolabs/talos#8624

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
(cherry picked from commit 28c5696)
This was referenced Apr 30, 2024
smira linked a pull request May 1, 2024 that will close this issue
smira closed this as completed May 1, 2024
Author

buroa commented May 1, 2024

Great news, thanks @smira!

github-actions bot locked as resolved and limited conversation to collaborators Jul 1, 2024
2 participants