[mvapich-discuss] Stuck in waitall

Maksym Planeta mplaneta at os.inf.tu-dresden.de
Thu Sep 22 07:11:24 EDT 2016


Hello,

I have an MPI_Reduce call that I want to replace with a non-blocking 
collective.

Each rank contributes a double, and I expect to get the sum on rank 0. 
I call the reduce as follows:

        MPI_Reduce(&ts->time, &ts->sum, 1, MPI_DOUBLE, MPI_SUM,
                    0, scr_comm_node);

scr_comm_node is a communicator that contains the ranks running on the 
same node.

I replaced this call with a combination of MPI_Ireduce, MPI_Test, and 
MPI_Wait, but the code hung every time.

I then simplified the code until only these MPI calls were left, one 
following more or less directly after the other:

        MPI_Ireduce(&ts->time, &ts->sum, 1, MPI_DOUBLE, MPI_SUM,
                    0, scr_comm_node, &ts->request[ts->num_req++]);
        ...
        MPI_Waitall(ts->num_req, &ts->request[0], &ts->status[0]);

        ... <Then follow collectives on other communicators>

Looking at the stack traces, I see that on some of the nodes there are 
ranks that simply never leave this MPI_Waitall.

Am I doing something wrong here, or does it look like a bug?

I set the following environment variables:

export MV2_USE_BLOCKING=1
# I set affinity on my own, because I have 2 processes per CPU
export MV2_ENABLE_AFFINITY=0
export MV2_RDMA_NUM_EXTRA_POLLS=1
export MV2_CM_MAX_SPIN_COUNT=1
export MV2_SPIN_COUNT=1


mpiname output:

MVAPICH2 2.2 Thu Sep 08 22:00:00 EST 2016 ch3:mrail

Compilation
CC: gcc    -g -O0
CXX: g++   -g -O0
F77: gfortran -L/lib -L/lib   -g -O0
FC: gfortran   -g -O0

Configuration
--enable-fortran=all --enable-cxx --enable-error-checking=all 
--enable-error-messages=none --enable-timing=none 
--enable-check-compiler-flags --enable-threads=multiple 
--enable-weak-symbols --disable-dependency-tracking 
--enable-fast-install --disable-rdma-cm --with-pm=mpirun:hydra 
--with-rdma=gen2 --with-device=ch3:mrail --enable-alloca --enable-hwloc 
--disable-fast --enable-g=dbg --enable-error-messages=all 
--enable-error-checking=all 
--prefix=/home/s9951545/apps.taurus/mvapich2/2.2-mpirun-dbg/



-- 
Regards,
Maksym Planeta

