This repository was archived by the owner on Sep 22, 2025. It is now read-only.

Error occurs at the first MPI run #5

@tetsushinto


Hello,

My customer, Fujitsu, has reported the following issue.

In the field, the following errors occur when an MPI program is run for the first time immediately after server startup.
The errors occur only on that first MPI run; they do not occur on the second or any later run.
This phenomenon occurred between 16:00 and 18:00 on July 14th.

When I asked NVIDIA to check the MOFED driver, they told me that there was no error on the driver side.
They also said that Intel MPI is based on libfabric, which OFED does not support, and that a customer who wants to use Intel MPI needs the full stack, including the IB driver and libraries, from Intel. Running Intel MPI on NVIDIA OFED is outside the scope of NVIDIA support.

Does Intel support Intel MPI running on NVIDIA MOFED, given that NVIDIA does not support this combination?
If so, could you please investigate this issue?

Or does Intel recommend using Intel MPI only with Intel's IB driver?

Intel MPI
OS: RHEL7.9
MOFED: 5.2-1.0.4.0
HCA: CX5 (EDR) (FW: 16.29.1016)

----
[0] MPI startup(): Intel(R) MPI Library, Version 2021.2 Build 20210302 (id: f4f7c92cd)
[0] MPI startup(): Copyright (C) 2003-2021 Intel Corporation. All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric version: 1.11.0-impi
[0] MPI startup(): libfabric provider: mlx
[1657788057.566554] [cmp-044:38365:0] mpool.c:193 UCX ERROR Failed to allocate memory pool (name=devx dbrec) chunk: Out of memory
[1657788057.582960] [cmp-046:37100:0] dc_mlx5_devx.c:66 UCX ERROR mlx5dv_devx_obj_create(DCT) failed, syndrome 0: Resource temporarily unavailable
[1657788057.586338] [cmp-038:42220:0] dc_mlx5_devx.c:66 UCX ERROR mlx5dv_devx_obj_create(DCT) failed, syndrome 0: Resource temporarily unavailable
----
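To line the UCX errors up against the reported time window, each UCX ERROR line can be split into its epoch timestamp, its `host:pid:thread` tag, the source location, and the message. The Python sketch below does this for the three UCX lines quoted above. The regular expression is written against these lines only, not against a documented UCX log specification. It also assumes the site runs on JST (UTC+9); that time zone is an assumption, not something stated in the report. Under that assumption, the first error (1657788057.566554) falls at 17:40:57 on 2022-07-14, inside the 16:00 to 18:00 window, and all three errors occur within about 20 ms of each other on three different nodes.

```python
import re
from datetime import datetime, timedelta, timezone

# Pattern derived from the UCX lines in this report (not from a UCX spec):
# [<epoch>.<usec>] [<host>:<pid>:<thread>] <file>:<line> UCX ERROR <message>
UCX_LINE = re.compile(
    r"\[(?P<ts>\d+\.\d+)\]\s+"
    r"\[(?P<host>[^:\]]+):(?P<pid>\d+):(?P<tid>\d+)\]\s+"
    r"(?P<src>\S+:\d+)\s+UCX\s+ERROR\s+"
    r"(?P<msg>[^\[\n]*)"  # stop at newline or at the next "[timestamp]"
)

# Assumption: the cluster's local time is JST (UTC+9).
JST = timezone(timedelta(hours=9), "JST")


def parse_ucx_errors(text):
    """Return one dict per UCX ERROR entry found in `text`."""
    errors = []
    for m in UCX_LINE.finditer(text):
        utc = datetime.fromtimestamp(float(m["ts"]), tz=timezone.utc)
        errors.append({
            "host": m["host"],
            "pid": int(m["pid"]),
            "source": m["src"],
            "message": m["msg"].strip(),
            "utc": utc,
            "local": utc.astimezone(JST),
        })
    return errors


# The three UCX lines from the log above; "[0] MPI startup()" lines are ignored.
LOG = """\
[1657788057.566554] [cmp-044:38365:0] mpool.c:193 UCX ERROR Failed to allocate memory pool (name=devx dbrec) chunk: Out of memory
[1657788057.582960] [cmp-046:37100:0] dc_mlx5_devx.c:66 UCX ERROR mlx5dv_devx_obj_create(DCT) failed, syndrome 0: Resource temporarily unavailable
[1657788057.586338] [cmp-038:42220:0] dc_mlx5_devx.c:66 UCX ERROR mlx5dv_devx_obj_create(DCT) failed, syndrome 0: Resource temporarily unavailable
"""

if __name__ == "__main__":
    errs = parse_ucx_errors(LOG)
    for e in errs:
        print(e["local"].strftime("%Y-%m-%d %H:%M:%S.%f %Z"),
              e["host"], e["source"], e["message"])
    spread = (errs[-1]["utc"] - errs[0]["utc"]).total_seconds()
    print(f"spread across nodes: {spread * 1000:.1f} ms")
```

Because the message group stops at the next `[`, the same function also works on the flattened single-line copy of the log, which is how it appeared when pasted into the issue.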

Thanks,
Shinto
