[mvapich-discuss] Deadlock in reduce/allreduce; Intel PSM2

Christof Koehler christof.koehler at bccms.uni-bremen.de
Tue Nov 22 10:46:24 EST 2016


Hello everybody,

I would like to report deadlock issues connected to reduce/allreduce.
Setting MV2_USE_SHMEM_COLL=0 apparently prevents them, as far as we have
been able to test. Please see the bottom of this mail for the output of
mpiname -a. We are using a relatively new network (Intel Omni-Path, PSM2,
newest Intel driver package 10.2) and are therefore unsure how to
interpret this. Is this an mvapich or a psm2 related issue (or both)?

The program we observe the deadlocks with is vasp 5.3.5, for specific
inputs. The same inputs appear to run without problems under
openmpi/1.10, and we have other inputs for which no deadlocks have been
observed with mvapich either. Without MV2_USE_SHMEM_COLL=0 the program
deadlocks after a varying runtime, but always in the same vasp and MPI
subroutines.

Most often the deadlock occurs in the following call chain (obtained
with pstack on the master rank; the psm2 progress thread is shown as
well):

Thread 2 (Thread 0x2b881a4ff700 (LWP 27731)):
#0  0x00002b8812fa569d in poll () from /lib64/libc.so.6
#1  0x00002b881424ece2 in ips_ptl_pollintr () from /lib64/libpsm2.so.2
#2  0x00002b8812ca4dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00002b8812fafced in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x2b8811cf0ec0 (LWP 27344)):
#0  0x00002b881424da0c in ips_ptl_poll () from /lib64/libpsm2.so.2
#1  0x00002b881424cccb in psmi_poll_internal () from /lib64/libpsm2.so.2
#2  0x00002b8814248238 in psm2_mq_ipeek2 () from /lib64/libpsm2.so.2
#3  0x00002b881259004f in psm_progress_wait () from /cluster/mpi/mvapich2/2.2u1/intel2016/lib/libmpi.so.12
#4  0x00002b881258ff93 in psm_try_complete () from /cluster/mpi/mvapich2/2.2u1/intel2016/lib/libmpi.so.12
#5  0x00002b881258fa33 in psm_recv () from /cluster/mpi/mvapich2/2.2u1/intel2016/lib/libmpi.so.12
#6  0x00002b88125865fd in MPID_Recv () from /cluster/mpi/mvapich2/2.2u1/intel2016/lib/libmpi.so.12
#7  0x00002b881254880b in MPIC_Recv () from /cluster/mpi/mvapich2/2.2u1/intel2016/lib/libmpi.so.12
#8  0x00002b88124c6ff4 in MPIR_Reduce_binomial_MV2 () from /cluster/mpi/mvapich2/2.2u1/intel2016/lib/libmpi.so.12
#9  0x00002b88124c7b67 in MPIR_Reduce_index_tuned_intra_MV2 () from /cluster/mpi/mvapich2/2.2u1/intel2016/lib/libmpi.so.12
#10 0x00002b88124c6d4a in MPIR_Reduce_MV2 () from /cluster/mpi/mvapich2/2.2u1/intel2016/lib/libmpi.so.12
#11 0x00002b8812473ef0 in MPIR_Reduce_impl () from /cluster/mpi/mvapich2/2.2u1/intel2016/lib/libmpi.so.12
#12 0x00002b88124747a0 in PMPI_Reduce () from /cluster/mpi/mvapich2/2.2u1/intel2016/lib/libmpi.so.12
#13 0x00002b8812161c9b in pmpi_reduce__ () from /cluster/mpi/mvapich2/2.2u1/intel2016/lib/libmpifort.so.12
#14 0x0000000000460337 in m_sum_master_d_ ()
#15 0x00000000004dce31 in kpar_sync_fertot_ ()
#16 0x00000000004dcb11 in kpar_sync_all_ ()
#17 0x0000000000434c8f in MAIN__ ()
#18 0x000000000040d81e in main ()
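
To make the call pattern at the top of this trace a bit more concrete:
as far as we can tell, m_sum_master_d reduces double-precision data onto
the master rank. A minimal sketch of that pattern at the MPI level
(buffer size, iteration count and communicator are our assumptions, not
taken from vasp) would be:

/* sketch of the reduce-to-master pattern seen in the first trace */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int n = 1 << 20;                     /* assumed message size */
    double *sendbuf, *recvbuf;
    int i;

    MPI_Init(&argc, &argv);
    sendbuf = calloc(n, sizeof(double));
    recvbuf = calloc(n, sizeof(double));

    /* repeated reductions of doubles onto rank 0 */
    for (i = 0; i < 1000; ++i)
        MPI_Reduce(sendbuf, recvbuf, n, MPI_DOUBLE, MPI_SUM, 0,
                   MPI_COMM_WORLD);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}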

Less frequently (which makes the influence of MV2_USE_SHMEM_COLL=0
harder to test), we observe another deadlock in the following call
chain:

Thread 2 (Thread 0x2aeac8617700 (LWP 136844)):
#0  0x00002aeac10bd69d in poll () from /lib64/libc.so.6
#1  0x00002aeac2366ce2 in ips_ptl_pollintr () from /lib64/libpsm2.so.2
#2  0x00002aeac0dbcdc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00002aeac10c7ced in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x2aeabfe08ec0 (LWP 136791)):
#0  0x00002aeac2364cec in psmi_poll_internal () from /lib64/libpsm2.so.2
#1  0x00002aeac2360238 in psm2_mq_ipeek2 () from /lib64/libpsm2.so.2
#2  0x00002aeac06a804f in psm_progress_wait () from /cluster/mpi/mvapich2/2.2u1/intel2016/lib/libmpi.so.12
#3  0x00002aeac0660f47 in MPIC_Wait () from /cluster/mpi/mvapich2/2.2u1/intel2016/lib/libmpi.so.12
#4  0x00002aeac0660d62 in MPIC_Sendrecv () from /cluster/mpi/mvapich2/2.2u1/intel2016/lib/libmpi.so.12
#5  0x00002aeac05ce5d4 in MPIR_Allreduce_pt2pt_rs_MV2 () from /cluster/mpi/mvapich2/2.2u1/intel2016/lib/libmpi.so.12
#6  0x00002aeac05d0adb in MPIR_Allreduce_index_tuned_intra_MV2 () from /cluster/mpi/mvapich2/2.2u1/intel2016/lib/libmpi.so.12
#7  0x00002aeac05d023d in MPIR_Allreduce_MV2 () from /cluster/mpi/mvapich2/2.2u1/intel2016/lib/libmpi.so.12
#8  0x00002aeac0582549 in MPIR_Allreduce_impl () from /cluster/mpi/mvapich2/2.2u1/intel2016/lib/libmpi.so.12
#9  0x00002aeac05822dd in PMPI_Allreduce () from /cluster/mpi/mvapich2/2.2u1/intel2016/lib/libmpi.so.12
#10 0x00002aeac0279d02 in pmpi_allreduce__ () from /cluster/mpi/mvapich2/2.2u1/intel2016/lib/libmpifort.so.12
#11 0x000000000045e689 in m_sumb_d_ ()
#12 0x00000000005f9c13 in charge_mp_soft_charge_ ()
#13 0x000000000066ec50 in us_mp_set_charge_ ()
#14 0x0000000000859b6b in elmin_ ()
#15 0x00000000004333fa in MAIN__ ()
#16 0x000000000040d81e in main ()
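
The second trace corresponds to an allreduce of double-precision data
(m_sumb_d). The analogous sketch, again with an assumed message size and
loop count, would be:

/* sketch of the allreduce pattern seen in the second trace */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int n = 1 << 20;                     /* assumed message size */
    double *buf;
    int i;

    MPI_Init(&argc, &argv);
    buf = calloc(n, sizeof(double));

    /* repeated in-place allreduce of doubles over all ranks */
    for (i = 0; i < 1000; ++i)
        MPI_Allreduce(MPI_IN_PLACE, buf, n, MPI_DOUBLE, MPI_SUM,
                      MPI_COMM_WORLD);

    free(buf);
    MPI_Finalize();
    return 0;
}

These sketches are meant only to illustrate the call pattern, not as a
standalone reproducer.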


With MV2_USE_SHMEM_COLL=0 set we have not observed these deadlocks any
more, even across several test runs with runtimes five times as long as
the original test case.
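
We set the variable in the environment of the job before launching.
Assuming MVAPICH2 also picks MV2_* parameters up from each process's
environment during MPI_Init (we have not verified this), the same
setting should be expressible inside a small test code, e.g.:

/* sketch only: disable shared-memory collectives before MPI_Init */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    setenv("MV2_USE_SHMEM_COLL", "0", 1);   /* must precede MPI_Init */
    MPI_Init(&argc, &argv);
    /* ... application ... */
    MPI_Finalize();
    return 0;
}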

Unfortunately the reproducer needs 160 CPU cores (ca. 750 MByte of
memory per core), so it is very difficult for us to analyse this further
or to check memory contents.


Thank you very much for your help! Please let us know if more
information is needed or if we can assist in any other way.

Best Regards

Christof




MVAPICH2 2.2 Thu Sep 08 22:00:00 EST 2016 ch3:psm

Compilation
CC: icc -O1 -fp-model precise   -O1
CXX: icpc -O1 -fp-model precise  -O1
F77: ifort -O1 -fp-model precise  -O1
FC: ifort -O1 -fp-model precise  -O1

Configuration
CC=icc CXX=icpc FC=ifort CFLAGS=-O1 -fp-model precise CXXFLAGS=-O1
-fp-model precise FFLAGS=-O1 -fp-model precise FCFLAGS=-O1 -fp-model
precise --enable-shared --without-mpe --with-device=ch3:psm --with-psm2
--enable-fast=O1 --enable-option-checking
--prefix=/cluster/mpi/mvapich2/2.2u1/intel2016

-- 
Dr. rer. nat. Christof Köhler       email: c.koehler at bccms.uni-bremen.de
Universitaet Bremen/ BCCMS          phone:  +49-(0)421-218-62334
Am Fallturm 1/ TAB/ Raum 3.12       fax: +49-(0)421-218-62770
28359 Bremen  

PGP: http://www.bccms.uni-bremen.de/cms/people/c_koehler/

