Error occurs at the first MPI run #5
Description
Hello,
My customer, Fujitsu, reports the issue below.
In the field, the following errors occur when an MPI program is executed for the first time immediately after server startup.
They occur only on the first MPI execution and do not occur from the second run onwards.
This phenomenon occurred between 16:00 and 18:00 on July 14th.
When I asked Nvidia to check the MOFED driver, they told me that there was no error on the driver side.
Furthermore, they said that IntelMPI is based on libfabric, which OFED does not support. If a customer wants to use IntelMPI, they need the full stack, including the IB driver and libraries, from Intel. A hybrid of IntelMPI with NVIDIA OFED is outside Nvidia's support scope.
Does Intel support IntelMPI with Nvidia MOFED without Nvidia support?
If so, could you please investigate this issue?
Or does Intel only recommend using IntelMPI with Intel's IB driver?
Intel MPI
OS: RHEL7.9
MOFED: 5.2-1.0.4.0
HCA: CX5 (EDR) (FW: 16.29.1016)
----
[0] MPI startup(): Intel(R) MPI Library, Version 2021.2 Build 20210302 (id: f4f7c92cd)
[0] MPI startup(): Copyright (C) 2003-2021 Intel Corporation. All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric version: 1.11.0-impi
[0] MPI startup(): libfabric provider: mlx
[1657788057.566554] [cmp-044:38365:0] mpool.c:193 UCX ERROR Failed to allocate memory pool (name=devx dbrec) chunk: Out of memory
[1657788057.582960] [cmp-046:37100:0] dc_mlx5_devx.c:66 UCX ERROR mlx5dv_devx_obj_create(DCT) failed, syndrome 0: Resource temporarily unavailable
[1657788057.586338] [cmp-038:42220:0] dc_mlx5_devx.c:66 UCX ERROR mlx5dv_devx_obj_create(DCT) failed, syndrome 0: Resource temporarily unavailable
----
Thanks,
Shinto
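
For anyone triaging this, the UCX error lines in the log above carry epoch timestamps, so they can be checked against the reported 16:00 to 18:00 window. Below is a minimal Python sketch that splits those lines into fields and converts the timestamps. The regular expression is inferred only from the three lines in this report, not from any UCX specification, and the UTC+9 offset (JST) is an assumption about the site's local time. The helper name `parse_ucx_line` is hypothetical.

```python
import re
from datetime import datetime, timedelta, timezone

# The three UCX error lines copied from the report above.
LOG_LINES = [
    "[1657788057.566554] [cmp-044:38365:0] mpool.c:193 UCX ERROR "
    "Failed to allocate memory pool (name=devx dbrec) chunk: Out of memory",
    "[1657788057.582960] [cmp-046:37100:0] dc_mlx5_devx.c:66 UCX ERROR "
    "mlx5dv_devx_obj_create(DCT) failed, syndrome 0: Resource temporarily unavailable",
    "[1657788057.586338] [cmp-038:42220:0] dc_mlx5_devx.c:66 UCX ERROR "
    "mlx5dv_devx_obj_create(DCT) failed, syndrome 0: Resource temporarily unavailable",
]

# Pattern inferred from the lines above (an assumption, not an official format):
# [epoch.micros] [host:pid:tid] source.c:line UCX LEVEL message
UCX_LINE = re.compile(
    r"^\[(?P<ts>\d+\.\d+)\]\s+"
    r"\[(?P<host>[^:\]]+):(?P<pid>\d+):(?P<tid>\d+)\]\s+"
    r"(?P<src>\S+):(?P<line>\d+)\s+"
    r"UCX\s+(?P<level>\w+)\s+(?P<msg>.*)$"
)

# Assumed local time zone of the cluster (JST, UTC+9); adjust if the site differs.
ASSUMED_LOCAL_TZ = timezone(timedelta(hours=9))


def parse_ucx_line(text):
    """Return a dict of fields for a UCX log line, or None if it does not match."""
    m = UCX_LINE.match(text)
    if m is None:
        return None
    utc = datetime.fromtimestamp(float(m.group("ts")), tz=timezone.utc)
    return {
        "utc": utc,
        "local": utc.astimezone(ASSUMED_LOCAL_TZ),
        "host": m.group("host"),
        "pid": int(m.group("pid")),
        "source": m.group("src"),
        "source_line": int(m.group("line")),
        "level": m.group("level"),
        "message": m.group("msg"),
    }


for raw in LOG_LINES:
    rec = parse_ucx_line(raw)
    if rec is not None:
        print(rec["local"].isoformat(), rec["host"], rec["source"], rec["message"])
```

Under the UTC+9 assumption, all three errors fall at about 17:40 on July 14th, on three different nodes within roughly 20 ms of each other, which matches the reported time window.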