[mvapich-discuss] ictest.c hangs with np > 2

Sayantan Sur surs at cse.ohio-state.edu
Fri Jun 23 09:19:37 EDT 2006


Mark,

Thanks for your detailed report!

>The MVAPICH configuration comes from make.mvapich.gen2 with no
>significant changes using the settings for SDR and PCI-Express. I think
>that the underlying problem comes from the MPICH source code base and is
>exposed by the extra code in MVAPICH in the function comm_exch_addr. I
>believe that this will always happen when either VIADEV_RPUT_SUPPORT or
>VIADEV_RGET_SUPPORT is defined (and multi-rail is not enabled).
If RPUT or RGET support is included, MVAPICH uses the RDMA-based 
collectives by default. These collectives require the processes to 
exchange addresses and keys, so the comm_exch_addr routine is triggered. 
Is this what you meant?

>My question is - do you see a failure on ictest when run with more than
>2 processes?
>  
>
Yes. We are able to reproduce the hang with ictest.

>I debugged the hang quite extensively and came to the following
>understanding. This looks like an underlying MPICH bug that is exposed
>by MVAPICH. The hang occurs at this line in ictest.c:
>
>    MPI_Intercomm_merge ( mySecondComm, 1, &merge4 );
>  
>
We haven't analyzed the problem completely yet. We will get back to you 
once we have studied it properly and found a proper fix. For the time 
being, maybe you can disable the RDMA collectives.

Thanks,
Sayantan.

-- 
http://www.cse.ohio-state.edu/~surs


