<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<style type="text/css" style="display:none;"> P {margin-top:0;margin-bottom:0;} </style>
</head>
<body dir="ltr">
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof">
Hi Christof, <br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof">
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof">
Thanks for reporting this. What appears to be happening is that srun cannot obtain your process mapping from the slurm daemon, so it falls back to an alternative method. We overrode that fallback to support other launchers with PMI1 support, but it looks like we did not add the safeties needed to keep it working with slurm. I should be able to provide you with a patch shortly. In the meantime, yes, you can try building with hydra and/or mpirun_rsh by removing the slurm configure arguments (a rough sketch is below). Both of those launchers have some degree of integration with slurm. <br>
</div>
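<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof">
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof">
As a rough sketch of what that reconfigure could look like (this just drops the two slurm options and reuses your existing prefix; hydra is the default process manager when --with-pm is not given):<br>
<br>
./configure --with-device=ch4:ofi --prefix=/cluster/mpi/mvapich2/3.0a/gcc11.3.1<br>
make -j<br>
make install<br>
<br>
You could then launch with mpiexec inside a slurm allocation, or go back to srun once the patch is available. For an mpirun_rsh build, please check ./configure --help for the exact option name in 3.0b.<br>
</div>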
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof">
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof">
Thanks,</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof">
Nat<br>
</div>
<div id="appendonsend"></div>
<hr style="display:inline-block;width:98%" tabindex="-1">
<div id="divRplyFwdMsg" dir="ltr"><font face="Calibri, sans-serif" style="font-size:11pt" color="#000000"><b>From:</b> Mvapich-discuss <mvapich-discuss-bounces@lists.osu.edu> on behalf of christof.koehler--- via Mvapich-discuss <mvapich-discuss@lists.osu.edu><br>
<b>Sent:</b> Thursday, May 18, 2023 07:47<br>
<b>To:</b> mvapich-discuss@lists.osu.edu <mvapich-discuss@lists.osu.edu><br>
<b>Subject:</b> [Mvapich-discuss] mvapich 3.0b srun start failure</font>
<div> </div>
</div>
<div class="BodyFragment"><font size="2"><span style="font-size:11pt;">
<div class="PlainText">!-------------------------------------------------------------------|<br>
This Message Is From an External Sender<br>
This message came from outside your organization.<br>
|-------------------------------------------------------------------!<br>
<br>
Hello everybody,<br>
<br>
I have now started testing the mvapich 3.0b build. It was compiled on Rocky<br>
Linux 9.1 with slurm 23.02.2 and gcc 11.3.1; see the end of this email for<br>
the mpichversion output.<br>
<br>
When I try to start a simple MPI hello world with srun --mpi=pmi2,<br>
I see error messages concerning PMI and a segfault; see also the<br>
end of the email. The same MPI hello world source code using the same<br>
srun --mpi=pmi2 invocation (but obviously different binaries) works fine<br>
with mvapich2 2.3.7, mpich 4.1.1 and openmpi 4.1.5 (an illustrative invocation is below).<br>
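<br>
For reference, the invocation is along these lines (the binary name and task count<br>
are illustrative; the segfault report below mentions tasks 0-9 on gpu001):<br>
<br>
srun --mpi=pmi2 -N 1 -n 10 ./mpi_hello<br>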
<br>
Should I try another launcher, e.g. hydra, by not setting --with-pm and<br>
--with-pmi? Would the hydra launcher be able to communicate with slurm,<br>
though?<br>
<br>
Best Regards<br>
<br>
Christof<br>
<br>
$ mpichversion<br>
MVAPICH Version: 3.0b<br>
MVAPICH Release date: 04/10/2023<br>
MVAPICH Device: ch4:ofi<br>
MVAPICH configure: --with-pm=slurm --with-pmi=pmi1<br>
--with-device=ch4:ofi --prefix=/cluster/mpi/mvapich2/3.0a/gcc11.3.1<br>
MVAPICH CC: gcc -DNDEBUG -DNVALGRIND -O2<br>
MVAPICH CXX: g++ -DNDEBUG -DNVALGRIND -O2<br>
MVAPICH F77: gfortran -fallow-argument-mismatch -O2<br>
MVAPICH FC: gfortran -O2<br>
MVAPICH Custom Information: @MVAPICH_CUSTOM_STRING@<br>
<br>
Error Message:<br>
<br>
INTERNAL ERROR: invalid error code 6163 (Ring ids do not match) in<br>
MPIR_NODEMAP_build_nodemap_fallback:355<br>
Abort(2141455) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init:<br>
Other MPI error, error stack:<br>
MPIR_Init_thread(175)...................: <br>
MPID_Init(509)..........................: <br>
MPIR_pmi_init(119)......................: <br>
build_nodemap(882)......................: <br>
MPIR_NODEMAP_build_nodemap_fallback(355): <br>
In: PMI_Abort(2141455, Fatal error in PMPI_Init: Other MPI error, error<br>
stack:<br>
MPIR_Init_thread(175)...................: <br>
MPID_Init(509)..........................: <br>
MPIR_pmi_init(119)......................: <br>
build_nodemap(882)......................: <br>
MPIR_NODEMAP_build_nodemap_fallback(355): )<br>
INTERNAL ERROR: invalid error code 6106 (Ring ids do not match) in<br>
MPIR_NODEMAP_build_nodemap_fallback:355<br>
Abort(2141455) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init:<br>
Other MPI error, error stack:<br>
MPIR_Init_thread(175)...................: <br>
MPID_Init(509)..........................: <br>
MPIR_pmi_init(119)......................: <br>
build_nodemap(882)......................: <br>
MPIR_NODEMAP_build_nodemap_fallback(355): <br>
In: PMI_Abort(2141455, Fatal error in PMPI_Init: Other MPI error, error<br>
stack:<br>
MPIR_Init_thread(175)...................: <br>
MPID_Init(509)..........................: <br>
MPIR_pmi_init(119)......................: <br>
build_nodemap(882)......................: <br>
MPIR_NODEMAP_build_nodemap_fallback(355): )<br>
srun: error: gpu001: tasks 0-9: Segmentation fault (core dumped)<br>
<br>
<br>
<br>
<br>
-- <br>
Dr. rer. nat. Christof Köhler email: c.koehler@uni-bremen.de<br>
Universitaet Bremen/FB1/BCCMS phone: +49-(0)421-218-62334<br>
Am Fallturm 1/ TAB/ Raum 3.06 fax: +49-(0)421-218-62770<br>
28359 Bremen <br>
_______________________________________________<br>
Mvapich-discuss mailing list<br>
Mvapich-discuss@lists.osu.edu<br>
<a href="https://lists.osu.edu/mailman/listinfo/mvapich-discuss">https://lists.osu.edu/mailman/listinfo/mvapich-discuss</a><br>
</div>
</span></font></div>
</body>
</html>