[Mvapich-discuss] mvapich 3.0b srun start failure

christof.koehler at bccms.uni-bremen.de
Thu May 18 07:47:35 EDT 2023



Hello everybody,

I have now started to test the mvapich 3.0b build. It was compiled on Rocky
Linux 9.1 with Slurm 23.02.2 and gcc 11.3.1. See the end of the email for
the mpichversion output.

When I try to start a simple MPI hello world with srun --mpi=pmi2,
I see error messages concerning PMI and a segfault; see the end of the
email. The same MPI hello world source code with the same
srun --mpi=pmi2 invocation (but obviously different binaries) works fine
with mvapich2 2.3.7, mpich 4.1.1 and openmpi 4.1.5.
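
For reference, the test program is essentially the usual minimal hello
world; my exact source and task count may differ slightly, but it is
along these lines:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &len);
    printf("Hello from rank %d of %d on %s\n", rank, size, name);
    MPI_Finalize();
    return 0;
}

compiled and launched inside a Slurm allocation roughly as

$ mpicc hello.c -o hello
$ srun --mpi=pmi2 -n 10 ./hello

(the -n 10 just mirrors the 10 tasks in the log below).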

Should I try another launcher, e.g. Hydra, by not setting --with-pm and
--with-pmi? Would the Hydra launcher be able to communicate with Slurm,
though? A rough sketch of what I would try is below.
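
In that case I would reconfigure roughly like this (prefix and make
options are only illustrative) and launch with mpiexec from within the
same allocation:

$ ./configure --with-device=ch4:ofi \
      --prefix=/cluster/mpi/mvapich2/3.0b/gcc11.3.1
$ make -j && make install
$ mpiexec -n 10 ./hello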

Best Regards

Christof

$ mpichversion
MVAPICH Version:        3.0b
MVAPICH Release date:   04/10/2023
MVAPICH Device:         ch4:ofi
MVAPICH configure:      --with-pm=slurm --with-pmi=pmi1
--with-device=ch4:ofi --prefix=/cluster/mpi/mvapich2/3.0a/gcc11.3.1
MVAPICH CC:     gcc    -DNDEBUG -DNVALGRIND -O2
MVAPICH CXX:    g++   -DNDEBUG -DNVALGRIND -O2
MVAPICH F77:    gfortran -fallow-argument-mismatch  -O2
MVAPICH FC:     gfortran   -O2
MVAPICH Custom Information:     @MVAPICH_CUSTOM_STRING@

Error Message:

INTERNAL ERROR: invalid error code 6163 (Ring ids do not match) in
MPIR_NODEMAP_build_nodemap_fallback:355
Abort(2141455) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init:
Other MPI error, error stack:
MPIR_Init_thread(175)...................: 
MPID_Init(509)..........................: 
MPIR_pmi_init(119)......................: 
build_nodemap(882)......................: 
MPIR_NODEMAP_build_nodemap_fallback(355): 
In: PMI_Abort(2141455, Fatal error in PMPI_Init: Other MPI error, error
stack:
MPIR_Init_thread(175)...................: 
MPID_Init(509)..........................: 
MPIR_pmi_init(119)......................: 
build_nodemap(882)......................: 
MPIR_NODEMAP_build_nodemap_fallback(355): )
INTERNAL ERROR: invalid error code 6106 (Ring ids do not match) in
MPIR_NODEMAP_build_nodemap_fallback:355
Abort(2141455) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init:
Other MPI error, error stack:
MPIR_Init_thread(175)...................: 
MPID_Init(509)..........................: 
MPIR_pmi_init(119)......................: 
build_nodemap(882)......................: 
MPIR_NODEMAP_build_nodemap_fallback(355): 
In: PMI_Abort(2141455, Fatal error in PMPI_Init: Other MPI error, error
stack:
MPIR_Init_thread(175)...................: 
MPID_Init(509)..........................: 
MPIR_pmi_init(119)......................: 
build_nodemap(882)......................: 
MPIR_NODEMAP_build_nodemap_fallback(355): )
srun: error: gpu001: tasks 0-9: Segmentation fault (core dumped)




-- 
Dr. rer. nat. Christof Köhler       email: c.koehler at uni-bremen.de
Universitaet Bremen/FB1/BCCMS       phone:  +49-(0)421-218-62334
Am Fallturm 1/ TAB/ Raum 3.06       fax: +49-(0)421-218-62770
28359 Bremen  


