[mvapich-discuss] Stuck in waitall
Maksym Planeta
mplaneta at os.inf.tu-dresden.de
Thu Sep 22 11:27:20 EDT 2016
I was lucky and managed to write a reproducer:
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Comm comm;
    MPI_Comm_dup(MPI_COMM_WORLD, &comm);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double send = rank;
    double sum = 0;

    MPI_Request req;
    MPI_Ireduce(&send, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, comm, &req);
    /* hangs here when MV2_SPIN_COUNT=1 */
    MPI_Waitall(1, &req, MPI_STATUSES_IGNORE);

    MPI_Finalize();
}
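To reproduce I compile and run it roughly like this (the file name and
rank count are illustrative; the environment variables from my first
mail below are still exported):

mpicc -g reproducer.c -o reproducer
mpirun -np 2 ./reproducer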
I narrowed the cause down to the value of the environment variable
MV2_SPIN_COUNT: if it is 1, the program reliably hangs; if it is
larger, it does not seem to hang. For the time being I will work
around it with a bigger spin count, e.g.:
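export MV2_SPIN_COUNT=100   # illustrative value; any spin count above 1 did not seem to hang
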
On 09/22/2016 03:54 PM, Hari Subramoni wrote:
> Thanks for the report. Sorry to hear that you are facing issues. Do you
> have a reproducer that we could use? Does the hang go away if you remove
> any of the environment variables?
>
> Thanks,
> Hari.
>
>
> On Sep 22, 2016 7:11 AM, "Maksym Planeta" <mplaneta at os.inf.tu-dresden.de> wrote:
>
> Hello,
>
> I have an MPI_Reduce which I want to replace with a non-blocking
> collective.
>
> Each rank passes a double, and I expect to gather the sum on rank 0.
> I call the reduce as follows:
>
> MPI_Reduce(&ts->time, &ts->sum, 1, MPI_DOUBLE, MPI_SUM,
> 0, scr_comm_node);
>
> scr_comm_node is a communicator that groups the ranks on the same node.
>
> I replaced this call with a combination of MPI_Ireduce, MPI_Test, and
> MPI_Wait, but the code got stuck every time (a self-contained sketch
> of the pattern follows below).
>
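> A minimal self-contained sketch of that pattern (the names and the use
> of MPI_COMM_WORLD in place of scr_comm_node are illustrative):
>
> #include <mpi.h>
> #include <stdio.h>
>
> int main(int argc, char **argv)
> {
>     MPI_Init(&argc, &argv);
>
>     int rank;
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>
>     double time = rank, sum = 0;
>     MPI_Request request;
>     MPI_Ireduce(&time, &sum, 1, MPI_DOUBLE, MPI_SUM,
>                 0, MPI_COMM_WORLD, &request);
>
>     /* poll once, then block until the reduction completes */
>     int done = 0;
>     MPI_Test(&request, &done, MPI_STATUS_IGNORE);
>     if (!done)
>         MPI_Wait(&request, MPI_STATUS_IGNORE);
>
>     if (rank == 0)
>         printf("sum = %f\n", sum);
>
>     MPI_Finalize();
>     return 0;
> }
>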
> I started to simplify the code and ended up with these MPI calls,
> which follow one right after the other:
>
> MPI_Ireduce(&ts->time, &ts->sum, 1, MPI_DOUBLE, MPI_SUM,
> 0, scr_comm_node, &ts->request[ts->num_req++]);
> ...
> MPI_Waitall(ts->num_req, &ts->request[0], &ts->status[0]);
>
> ... <Then follow collectives on other communicators>
>
> And looking at different stack traces I see that on some of the nodes
> there are ranks which simply cannot leave this MPI_Waitall.
>
> Am I doing something wrong here, or does it look like a bug?
>
> I set the following environment variables:
>
> export MV2_USE_BLOCKING=1
> # I set affinity on my own, because I have 2 processes per CPU
> export MV2_ENABLE_AFFINITY=0
> export MV2_RDMA_NUM_EXTRA_POLLS=1
> export MV2_CM_MAX_SPIN_COUNT=1
> export MV2_SPIN_COUNT=1
>
>
> mpiname output:
>
> MVAPICH2 2.2 Thu Sep 08 22:00:00 EST 2016 ch3:mrail
>
> Compilation
> CC: gcc -g -O0
> CXX: g++ -g -O0
> F77: gfortran -L/lib -L/lib -g -O0
> FC: gfortran -g -O0
>
> Configuration
> --enable-fortran=all --enable-cxx --enable-error-checking=all
> --enable-error-messages=none --enable-timing=none
> --enable-check-compiler-flags --enable-threads=multiple
> --enable-weak-symbols --disable-dependency-tracking
> --enable-fast-install --disable-rdma-cm --with-pm=mpirun:hydra
> --with-rdma=gen2 --with-device=ch3:mrail --enable-alloca
> --enable-hwloc --disable-fast --enable-g=dbg
> --enable-error-messages=all --enable-error-checking=all
> --prefix=/home/s9951545/apps.taurus/mvapich2/2.2-mpirun-dbg/
>
>
>
> --
> Regards,
> Maksym Planeta
--
Regards,
Maksym Planeta