[mvapich-discuss] ictest.c hangs with np > 2

Mark Debbage markdebbage at pathscale.com
Fri Jun 23 15:11:06 EDT 2006


On Fri, 2006-06-23 at 08:19 -0500, Sayantan Sur wrote:
> If RPUT or RGET support is included, then by default MVAPICH utilizes 
> the RDMA based collectives. The RDMA collectives require exchange of 
> addresses/keys and such, so the comm_exch_addr routine is triggered. Is 
> this what you meant?

Yes, exactly.

> Yes. We are able to see a hang on ictest.

OK, thanks for confirming that you see the problem too.

> We haven't yet analyzed the problem completely. We will get back to you 
> after studying the problem properly and finding proper fixes. For the 
> time being, maybe you can disable the RDMA collectives.

There doesn't appear to be a define to disable the code in
comm_exch_addr. Even if I set:

    export DISABLE_RDMA_BARRIER=1
    export DISABLE_RDMA_ALLTOALL=1
    export DISABLE_RDMA_ALLGATHER=1

the code in comm_exch_addr is still used. So I went through the
collective files and added my own #define to disable all of the RDMA
collective mechanism, and now ictest passes at np=4. I think this is
good enough for our internal testing. However, I'm not sure how we will
deal with this for our customers who want to use MVAPICH. We don't
distribute a built version of MVAPICH, so one way would be to distribute
a patch to turn off these collectives. Of course, the same problem
affects other MVAPICH users too; it is not specific to the InfiniPath
adapters.
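
For illustration, a compile-time guard of the kind described above might
look roughly like the sketch below. This is not the actual patch: the
macro name DISABLE_RDMA_COLLECTIVES and the functions
exchange_rdma_addresses and setup_collectives are hypothetical stand-ins,
not MVAPICH identifiers. The only point is that the address/key exchange
is compiled out entirely, rather than skipped per collective at run time.

```c
/* Hypothetical build-time switch, e.g. -DDISABLE_RDMA_COLLECTIVES=0 to
 * re-enable. It defaults to 1 (disabled) in this sketch. */
#ifndef DISABLE_RDMA_COLLECTIVES
#define DISABLE_RDMA_COLLECTIVES 1
#endif

/* Counts how often the (stand-in) exchange runs, so the effect of the
 * guard can be observed. */
int exchange_calls = 0;

/* Hypothetical stand-in for the address/key exchange step that the RDMA
 * collectives need at communicator setup. */
void exchange_rdma_addresses(void)
{
    exchange_calls++;
}

/* Hypothetical communicator setup hook.
 * Returns 1 if the RDMA collective path was set up, 0 if the code falls
 * back to the ordinary point-to-point collectives. */
int setup_collectives(void)
{
#if DISABLE_RDMA_COLLECTIVES
    /* Exchange is compiled out; nothing RDMA-specific happens here. */
    return 0;
#else
    exchange_rdma_addresses();
    return 1;
#endif
}
```

With the default setting in this sketch, setup_collectives() returns 0 and
the exchange function is never called, which is the behaviour the
per-collective environment variables did not give.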

Let me know the results of your analysis. I think this is a longstanding
bug in the underlying MPICH source code base, but I'm not sure of the
fix.

Regards, 

Mark.

> 
> Thanks,
> Sayantan.
> 


