[Mvapich-discuss] mvapich 3.0b, hangs depending on number of ranks

christof.koehler at bccms.uni-bremen.de christof.koehler at bccms.uni-bremen.de
Tue May 30 12:23:59 EDT 2023


Hello Nat,

On Tue, May 30, 2023 at 02:02:13PM +0000, Shineman, Nat wrote:
> Hi Christof,
> 
> Thanks for the detailed bug report. The odd process count looks like an issue in our CPU binding. I will take a look at that and see where the error is occurring, that one should be fairly straightforward to fix. Was this a PMI1 or PMI2 build of MVAPICH?

]$ mpichversion 
MVAPICH Version:        3.0b
MVAPICH Release date:   04/10/2023
MVAPICH Device:         ch4:ofi
MVAPICH configure:      --with-pm=slurm --with-pmi=pmi1 --with-device=ch4:ofi --with-libfabric=/cluster/libraries/libfabric/1.18.0/ --prefix=/cluster/mpi/mvapich2/3.0a/gcc11.3.1
MVAPICH CC:     gcc    -DNDEBUG -DNVALGRIND -O2
MVAPICH CXX:    g++   -DNDEBUG -DNVALGRIND -O2
MVAPICH F77:    gfortran -fallow-argument-mismatch  -O2
MVAPICH FC:     gfortran   -O2
MVAPICH Custom Information:     @MVAPICH_CUSTOM_STRING@

and

]$ mpichversion 
MVAPICH Version:        3.0b
MVAPICH Release date:   04/10/2023
MVAPICH Device:         ch4:ofi
MVAPICH configure:      --with-pm=slurm --with-pmi=pmi1 --with-device=ch4:ofi --prefix=/cluster/mpi/mvapich2/3.0a-libfabric16/gcc11.3.1
MVAPICH CC:     gcc    -DNDEBUG -DNVALGRIND -O2
MVAPICH CXX:    g++   -DNDEBUG -DNVALGRIND -O2
MVAPICH F77:    gfortran -fallow-argument-mismatch  -O2
MVAPICH FC:     gfortran   -O2
MVAPICH Custom Information:     @MVAPICH_CUSTOM_STRING@


Both builds behave the same with respect to the failure to start with 57 ppn
(and, see below, probably with any odd ppn). Running ldd on the MPI hello world
binary built without "--with-libfabric=/cluster/libraries/libfabric/1.18.0/"
shows that it uses

        linux-vdso.so.1 (0x00007ffdeefed000)
        libmpi.so.12 => /cluster/mpi/mvapich2/3.0a-libfabric16/gcc11.3.1/lib/libmpi.so.12 (0x00007fe23275a000)
        libc.so.6 => /lib64/libc.so.6 (0x00007fe23254a000)
        libm.so.6 => /lib64/libm.so.6 (0x00007fe23246f000)
        libpciaccess.so.0 => /lib64/libpciaccess.so.0 (0x00007fe232463000)
        libxml2.so.2 => /lib64/libxml2.so.2 (0x00007fe2322da000)
        libfabric.so.1 => /lib64/libfabric.so.1 (0x00007fe2318e0000)
        libpsm2.so.2 => /lib64/libpsm2.so.2 (0x00007fe231849000)
        libnuma.so.1 => /lib64/libnuma.so.1 (0x00007fe23183b000)
        libibverbs.so.1 => /lib64/libibverbs.so.1 (0x00007fe231819000)
        libuuid.so.1 => /lib64/libuuid.so.1 (0x00007fe231810000)
        librdmacm.so.1 => /lib64/librdmacm.so.1 (0x00007fe2317f3000)
        libefa.so.1 => /lib64/libefa.so.1 (0x00007fe2317e5000)
        libnl-3.so.200 => /lib64/libnl-3.so.200 (0x00007fe2317c1000)
        libnl-route-3.so.200 => /lib64/libnl-route-3.so.200 (0x00007fe231730000)
        libatomic.so.1 => /lib64/libatomic.so.1 (0x00007fe231727000)
        libpmi.so.0 => /lib64/libpmi.so.0 (0x00007fe23171f000)
        /lib64/ld-linux-x86-64.so.2 (0x00007fe2336ea000)
        libz.so.1 => /lib64/libz.so.1 (0x00007fe231705000)
        liblzma.so.5 => /lib64/liblzma.so.5 (0x00007fe2316d7000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007fe2316bc000)
        libslurm_pmi.so => /usr/lib64/slurm/libslurm_pmi.so (0x00007fe2314db000)
        libresolv.so.2 => /lib64/libresolv.so.2 (0x00007fe2314c7000)

The libfabric and psm2 libraries here are the ones supplied by Cornelis. The
libslurm_pmi.so entry shows, I assume, that this build is indeed using PMI1.


> 
> For the hanging issue, let me take a look and see if I can reproduce this on our system. There may be a conflict with our shared memory implementation and the underlying OFI shared memory support. Can you please try running with the environment variable MVP_USE_SHARED_MEM=0? set? This will disable our enhanced shared memory designs so that we can check if they are somehow conflicting with PSM2/OPX. Am I correct in assuming that anything less than 58 ppn is working?

It is good that you ask for confirmation: something over here must have changed
when I rebooted all machines yesterday (see below), which is quite unfortunate.

Some details: each node has two sockets with Intel 8362 CPUs (32 cores per
socket), so 64 cores in total. Slurm is set to bind to cores, and I can see
that working in top and in Slurm's verbose output. The nodes are stateless,
i.e. ramdisk-based. Some configuration changes were introduced yesterday, but
as far as I was able to revert them they do not appear to have an impact; this
includes reverting the num_user_contexts hfi1 module setting. So I am at a loss
as to what exactly changed.
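
For reference, the num_user_contexts parameter is set roughly like this on the
nodes (file name illustrative, value as mentioned in my earlier mail):

# e.g. /etc/modprobe.d/hfi1.conf
options hfi1 num_user_contexts=128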

Before the reboot I believe everything worked for even ppn counts up to 56;
now the situation is as follows (a sketch of the test invocation follows the
matrix below):

libfabric 1.18.0 version, no environment set
--------------------------------------------
- testing only even ppn: 2 to 32 ppn ok, 34 to 62 ppn hang, 64 ppn ok
- start failures as described for 1, 3 and 5 ppn; larger odd counts not tested

libfabric 1.18.0 version, export MVP_USE_SHARED_MEM=0
-----------------------------------------------------
- testing only even ppn: 2 to 64 ppn ok, no problems
- start failures for 1, 3 and 5 ppn; larger odd counts not tested

libfabric 1.16.1 version, no environment set
--------------------------------------------
- testing only even ppn: 2 to 32 ppn ok, 34 to 62 ppn hang, 64 ppn ok
- start failures as described for 1, 3 and 5 ppn; larger odd counts not tested

libfabric 1.16.1 version, export MVP_USE_SHARED_MEM=0
-----------------------------------------------------
- testing only even ppn: 2 to 64 ppn ok, no problems
- start failures for 1, 3 and 5 ppn; larger odd counts not tested
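
For reference, the runs with the environment variable set were done roughly
like this (job script fragment; node count, ppn and paths illustrative):

export MVP_USE_SHARED_MEM=0
srun --mpi=pmi2 --nodes=1 --ntasks-per-node=34 ./mpi_hello_world

with --ntasks-per-node varied as listed above; the --mpi flag is the one
mentioned in my first mail.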

The stacks (pstack) of the hanging programs I looked at do not appear to have
changed. This is a bit tedious to test, so I hope I did not mess anything up.
That the program stopped working for 34 and more ppn and then worked again at
64 ppn was a real surprise.
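
In case it is useful for reproducing, the stacks were collected roughly like
this (node name illustrative):

ssh node013
pgrep -f mpi_hello_world    # pick one of the hanging ranks
pstack <pid>                # gives stack traces as in my first mail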

Best Regards

Christof

> 
> Thanks,
> Nat
> 
> 
> ________________________________
> From: Mvapich-discuss <mvapich-discuss-bounces at lists.osu.edu> on behalf of christof.koehler--- via Mvapich-discuss <mvapich-discuss at lists.osu.edu>
> Sent: Sunday, May 28, 2023 08:56
> To: mvapich-discuss at lists.osu.edu <mvapich-discuss at lists.osu.edu>
> Subject: Re: [Mvapich-discuss] mvapich 3.0b, hangs depending on number of ranks
> 
> Addendum:
> 
> While the stack traces show the MPI library locations as
> mvapich2/3.0a-libfabric16 and mvapich2/3.0a, I can assure you that the
> MVAPICH 3.0 library source used is in fact 3.0b and not 3.0a!
> 
> On Sun, May 28, 2023 at 02:47:27PM +0200, christof.koehler--- via Mvapich-discuss wrote:
> > Hello everybody,
> >
> > first, my apologies for the very lengthy email.
> >
> > I have been testing mvapich 3.0b on our cluster (Rocky Linux 9.1,
> > OmniPath interconnect, 64 cores per node) a bit more. I see an MPI hello
> > world program hanging depending on the number of ranks used, with some
> > slight variations in the pstack traces depending on the libfabric version.
> > In this email I will also report a possibly unrelated problem with
> > launching an odd number of ranks (tasks) on single and multiple nodes
> > when using mvapich2 2.3.7 and mvapich 3.0b.
> >
> > I did not test this number of ranks before. Actually, before assembling
> > the data for the email I thought it was a problem when trying to
> > run on more than one node. I discovered this when changing hfi1 module
> > parameters for openmpi and re-testing everything.
> >
> > Note that we have the hfi1 module parameter num_user_contexts=128 set
> > to accommodate openmpi. The launcher is always srun --mpi=pmi2 ... with
> > CpuBind=cores set in the partition definition.
> >
> > The same program appears to work fine with mpich 4.1.1 (ofi) and
> > openmpi 4.1.5 (apparently ofi) in all situations on one and two nodes.
> >
> > PART 1, launching odd number of ranks
> > -------------------------------------
> > When trying to launch 57 ranks (--ntasks-per-node=57), the launchers of
> > mvapich2 and mvapich 3.0b fail with the error messages below. Note that
> > mvapich2 (!, not 3.0b) works when using an even number of ranks up to 64
> > on a single node; I did not test the odd counts (59, 61, 63) in between. I
> > assume mvapich2 will work for any even number of ranks.
> >
> > mvapich2
> > Error parsing CPU mapping string
> > INTERNAL ERROR: invalid error code ffffffff (Ring Index out of range) in
> > smpi_setaffinity:2741
> > Fatal error in MPI_Init: Other MPI error, error stack:
> > MPIR_Init_thread(493)........:
> > MPID_Init(400)...............:
> > MPIDI_CH3I_set_affinity(3594):
> > smpi_setaffinity(2741).......: Error parsing CPU mapping string
> > srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> > slurmstepd: error: *** STEP 931.1 ON node013 CANCELLED AT
> > 2023-05-28T13:38:46 ***
> > srun: error: node013: tasks 0-55: Killed
> > srun: error: node013: task 56: Exited with exit code 1
> >
> > mvapich 3.0b
> >
> > Error parsing CPU mapping string
> > Invalid error code (-1) (error ring index 127 invalid)
> > INTERNAL ERROR: invalid error code ffffffff (Ring Index out of range) in
> > smpi_setaffinity:2789
> > Abort(2141583) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init:
> > Other MPI error, error stack:
> > MPIR_Init_thread(175)...........:
> > MPID_Init(597)..................:
> > srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> > MPIDI_MVP_mpi_init_hook(268)....:
> > MPIDI_MVP_CH4_set_affinity(3745):
> > smpi_setaffinity(2789)..........: Error parsing CPU mapping string
> > In: PMI_Abort(2141583, Fatal error in PMPI_Init: Other MPI error, error
> > stack:
> > MPIR_Init_thread(175)...........:
> > MPID_Init(597)..................:
> > MPIDI_MVP_mpi_init_hook(268)....:
> > MPIDI_MVP_CH4_set_affinity(3745):
> > smpi_setaffinity(2789)..........: Error parsing CPU mapping string)
> > slurmstepd: error: *** STEP 932.0 ON node013 CANCELLED AT
> > 2023-05-28T13:45:25 ***
> > srun: error: node013: tasks 0-55: Killed
> > srun: error: node013: task 56: Exited with exit code 143
> >
> >
> > PART 2, hanging mpi hello world with mvapich 3.0b
> > -------------------------------------------------
> > Launching mvapich 3.0b via srun with --ntasks-per-node=58 or more even
> > ranks just hangs (Cornelis-provided libfabric 1.16.1 and self-compiled 1.18.0).
> > I did not test odd rank counts. I see the MPI processes being spawned, but the
> > program never finishes or progresses. I include some pstack traces which might
> > be helpful. Please advise me how to obtain better debugging information if needed.
> >
> >
> > libfabric 1.16.1 (appears to use psm2 or ofi psm2 provider under the hood?)
> > Thread 2 (Thread 0x7f402c460640 (LWP 24222) "mpi_hello_world"):
> > #0  0x00007f409739771f in poll () from target:/lib64/libc.so.6
> > #1  0x00007f409657e245 in ips_ptl_pollintr () from target:/lib64/libpsm2.so.2
> > #2  0x00007f40972f4802 in start_thread () from target:/lib64/libc.so.6
> > #3  0x00007f4097294450 in clone3 () from target:/lib64/libc.so.6
> > Thread 1 (Thread 0x7f40961cbac0 (LWP 24118) "mpi_hello_world"):
> > #0  0x00007f409657cabd in ips_ptl_poll () from target:/lib64/libpsm2.so.2
> > #1  0x00007f409657a96f in psmi_poll_internal () from target:/lib64/libpsm2.so.2
> > #2  0x00007f409657482d in psm2_mq_ipeek () from target:/lib64/libpsm2.so.2
> > #3  0x00007f4096dd9459 in psmx2_cq_poll_mq (cq=cq at entry=0x697d00, trx_ctxt=0x6856d0, event_in=event_in at entry=0x7ffd119acdc0,
> > count=count at entry=8, src_addr=src_addr at entry=0x0) at prov/psm2/src/psmx2_cq.c:1086
> > #4  0x00007f4096ddc295 in psmx2_cq_readfrom (cq=0x697d00, buf=0x7ffd119acdc0, count=8, src_addr=0x0) at prov/psm2/src/psmx2_cq.c:1591
> > #5  0x00007f4097bf22e7 in MPIDI_OFI_progress () from target:/cluster/mpi/mvapich2/3.0a-libfabric16/gcc11.3.1/lib/libmpi.so.12
> > #6  0x00007f4097c19bfd in MPIDI_MVP_progress () from target:/cluster/mpi/mvapich2/3.0a-libfabric16/gcc11.3.1/lib/libmpi.so.12
> > #7  0x00007f4097bcb3c2 in progress_test () from target:/cluster/mpi/mvapich2/3.0a-libfabric16/gcc11.3.1/lib/libmpi.so.12
> > #8  0x00007f4097bcb793 in MPID_Progress_wait () from target:/cluster/mpi/mvapich2/3.0a-libfabric16/gcc11.3.1/lib/libmpi.so.12
> > #9  0x00007f4097684681 in MPIR_Wait_state () from target:/cluster/mpi/mvapich2/3.0a-libfabric16/gcc11.3.1/lib/libmpi.so.12
> > #10 0x00007f4097b4d479 in MPIC_Wait () from target:/cluster/mpi/mvapich2/3.0a-libfabric16/gcc11.3.1/lib/libmpi.so.12
> > #11 0x00007f4097b4d9ee in MPIC_Recv () from target:/cluster/mpi/mvapich2/3.0a-libfabric16/gcc11.3.1/lib/libmpi.so.12
> > #12 0x00007f40976fd73f in MPIR_Allreduce_intra_recursive_doubling () from target:/cluster/mpi/mvapich2/3.0a-libfabric16/gcc11.3.1/lib/libmpi.so.12
> > #13 0x00007f4097609ca2 in MPIR_Allreduce_impl () from target:/cluster/mpi/mvapich2/3.0a-libfabric16/gcc11.3.1/lib/libmpi.so.12
> > #14 0x00007f4097b523de in create_2level_comm () from target:/cluster/mpi/mvapich2/3.0a-libfabric16/gcc11.3.1/lib/libmpi.so.12
> > #15 0x00007f4097c07a38 in MPIDI_MVP_post_init () from target:/cluster/mpi/mvapich2/3.0a-libfabric16/gcc11.3.1/lib/libmpi.so.12
> > #16 0x00007f4097bc8b40 in MPID_InitCompleted () from target:/cluster/mpi/mvapich2/3.0a-libfabric16/gcc11.3.1/lib/libmpi.so.12
> > #17 0x00007f409766dd4f in MPIR_Init_thread () from target:/cluster/mpi/mvapich2/3.0a-libfabric16/gcc11.3.1/lib/libmpi.so.12
> > #18 0x00007f409766dae2 in PMPI_Init () from target:/cluster/mpi/mvapich2/3.0a-libfabric16/gcc11.3.1/lib/libmpi.so.12
> > #19 0x000000000040119d in main ()
> >
> > libfabric 1.18.0 (this appears to be the fi_opx provider now)
> > #0  0x00007fd43d31b087 in fi_opx_shm_poll_many.constprop () from target:/cluster/libraries/libfabric/1.18.0/lib/libfabric.so.1
> > #1  0x00007fd43d300367 in fi_opx_cq_read_FI_CQ_FORMAT_TAGGED_0_OFI_RELIABILITY_KIND_ONLOAD_FI_OPX_HDRQ_MASK_2048_0x0018000000000000ull () from target:/cluster/libraries/libfabric/1.18.0/lib/libfabric.so.1
> > #2  0x00007fd43e2392e7 in MPIDI_OFI_progress () from target:/cluster/mpi/mvapich2/3.0a/gcc11.3.1/lib/libmpi.so.12
> > #3  0x00007fd43e260bfd in MPIDI_MVP_progress () from target:/cluster/mpi/mvapich2/3.0a/gcc11.3.1/lib/libmpi.so.12
> > #4  0x00007fd43e2123c2 in progress_test () from target:/cluster/mpi/mvapich2/3.0a/gcc11.3.1/lib/libmpi.so.12
> > #5  0x00007fd43e212793 in MPID_Progress_wait () from target:/cluster/mpi/mvapich2/3.0a/gcc11.3.1/lib/libmpi.so.12
> > #6  0x00007fd43dccb681 in MPIR_Wait_state () from target:/cluster/mpi/mvapich2/3.0a/gcc11.3.1/lib/libmpi.so.12
> > #7  0x00007fd43e194479 in MPIC_Wait () from target:/cluster/mpi/mvapich2/3.0a/gcc11.3.1/lib/libmpi.so.12
> > #8  0x00007fd43e1949ee in MPIC_Recv () from target:/cluster/mpi/mvapich2/3.0a/gcc11.3.1/lib/libmpi.so.12
> > #9  0x00007fd43dd4473f in MPIR_Allreduce_intra_recursive_doubling () from target:/cluster/mpi/mvapich2/3.0a/gcc11.3.1/lib/libmpi.so.12
> > #10 0x00007fd43dc50ca2 in MPIR_Allreduce_impl () from target:/cluster/mpi/mvapich2/3.0a/gcc11.3.1/lib/libmpi.so.12
> > #11 0x00007fd43e1993de in create_2level_comm () from target:/cluster/mpi/mvapich2/3.0a/gcc11.3.1/lib/libmpi.so.12
> > #12 0x00007fd43e24ea38 in MPIDI_MVP_post_init () from target:/cluster/mpi/mvapich2/3.0a/gcc11.3.1/lib/libmpi.so.12
> > #13 0x00007fd43e20fb40 in MPID_InitCompleted () from target:/cluster/mpi/mvapich2/3.0a/gcc11.3.1/lib/libmpi.so.12
> > #14 0x00007fd43dcb4d4f in MPIR_Init_thread () from target:/cluster/mpi/mvapich2/3.0a/gcc11.3.1/lib/libmpi.so.12
> > #15 0x00007fd43dcb4ae2 in PMPI_Init () from target:/cluster/mpi/mvapich2/3.0a/gcc11.3.1/lib/libmpi.so.12
> > #16 0x000000000040119d in main ()
> >
> > #0  0x00007fd43d0140e0 in __errno_location at plt () from target:/cluster/libraries/libfabric/1.18.0/lib/libfabric.so.1
> > #1  0x00007fd43d300dd8 in fi_opx_cq_read_FI_CQ_FORMAT_TAGGED_0_OFI_RELIABILITY_KIND_ONLOAD_FI_OPX_HDRQ_MASK_2048_0x0018000000000000ull () from target:/cluster/libraries/libfabric/1.18.0/lib/libfabric.so.1
> > #2  0x00007fd43e2392e7 in MPIDI_OFI_progress () from target:/cluster/mpi/mvapich2/3.0a/gcc11.3.1/lib/libmpi.so.12
> > #3  0x00007fd43e260bfd in MPIDI_MVP_progress () from target:/cluster/mpi/mvapich2/3.0a/gcc11.3.1/lib/libmpi.so.12
> > #4  0x00007fd43e2123c2 in progress_test () from target:/cluster/mpi/mvapich2/3.0a/gcc11.3.1/lib/libmpi.so.12
> > #5  0x00007fd43e212793 in MPID_Progress_wait () from target:/cluster/mpi/mvapich2/3.0a/gcc11.3.1/lib/libmpi.so.12
> > #6  0x00007fd43dccb681 in MPIR_Wait_state () from target:/cluster/mpi/mvapich2/3.0a/gcc11.3.1/lib/libmpi.so.12
> > #7  0x00007fd43e194479 in MPIC_Wait () from target:/cluster/mpi/mvapich2/3.0a/gcc11.3.1/lib/libmpi.so.12
> > #8  0x00007fd43e1949ee in MPIC_Recv () from target:/cluster/mpi/mvapich2/3.0a/gcc11.3.1/lib/libmpi.so.12
> > #9  0x00007fd43dd4473f in MPIR_Allreduce_intra_recursive_doubling () from target:/cluster/mpi/mvapich2/3.0a/gcc11.3.1/lib/libmpi.so.12
> > #10 0x00007fd43dc50ca2 in MPIR_Allreduce_impl () from target:/cluster/mpi/mvapich2/3.0a/gcc11.3.1/lib/libmpi.so.12
> > #11 0x00007fd43e1993de in create_2level_comm () from target:/cluster/mpi/mvapich2/3.0a/gcc11.3.1/lib/libmpi.so.12
> > #12 0x00007fd43e24ea38 in MPIDI_MVP_post_init () from target:/cluster/mpi/mvapich2/3.0a/gcc11.3.1/lib/libmpi.so.12
> > #13 0x00007fd43e20fb40 in MPID_InitCompleted () from target:/cluster/mpi/mvapich2/3.0a/gcc11.3.1/lib/libmpi.so.12
> > #14 0x00007fd43dcb4d4f in MPIR_Init_thread () from target:/cluster/mpi/mvapich2/3.0a/gcc11.3.1/lib/libmpi.so.12
> > #15 0x00007fd43dcb4ae2 in PMPI_Init () from target:/cluster/mpi/mvapich2/3.0a/gcc11.3.1/lib/libmpi.so.12
> > #16 0x000000000040119d in main ()
> >
> > Best Regards
> >
> > Christof
> >
> >
> > --
> > Dr. rer. nat. Christof Köhler       email: c.koehler at uni-bremen.de
> > Universitaet Bremen/FB1/BCCMS       phone:  +49-(0)421-218-62334
> > Am Fallturm 1/ TAB/ Raum 3.06       fax: +49-(0)421-218-62770
> > 28359 Bremen
> 
> 
> 
> > _______________________________________________
> > Mvapich-discuss mailing list
> > Mvapich-discuss at lists.osu.edu
> > https://lists.osu.edu/mailman/listinfo/mvapich-discuss
> 
> 
> --
> Dr. rer. nat. Christof Köhler       email: c.koehler at uni-bremen.de
> Universitaet Bremen/FB1/BCCMS       phone:  +49-(0)421-218-62334
> Am Fallturm 1/ TAB/ Raum 3.06       fax: +49-(0)421-218-62770
> 28359 Bremen

-- 
Dr. rer. nat. Christof Köhler       email: c.koehler at uni-bremen.de
Universitaet Bremen/FB1/BCCMS       phone:  +49-(0)421-218-62334
Am Fallturm 1/ TAB/ Raum 3.06       fax: +49-(0)421-218-62770
28359 Bremen  