[mvapich-discuss] MPI_Comm_connect/accept segfault during MPIDI_CH3I_comm_create

Jian Lin lin.2180 at osu.edu
Tue Nov 25 18:12:58 EST 2014


Hi, Neil,

We noticed that you are using the MV2 2.0 release. This could be a known
bug in the non-blocking collectives (NBC) in 2.0. You can try 2.0 with NBC
disabled to see whether the error can still be reproduced:

  mpiexec -n 1 -genv MV2_USE_OSU_NB_COLLECTIVES 0 ./mpi_connect_accept_sink 
  mpiexec -n 1 -genv MV2_USE_OSU_NB_COLLECTIVES 0 ./mpi_connect_accept HOSTNAME

If this works, we suggest upgrading to MV2 2.0.1 or 2.1a directly; the bug
has been fixed in these newer versions.
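For reference, the connect/accept pattern this reproducer exercises looks
roughly like the sketch below. This is a minimal illustration only, not
Neil's attached code; the port file path and the argv-based role switch are
assumptions made for the example. It requires an MPI installation and two
coordinated launches (one per role) to actually run.

```c
/* Minimal sketch of the dynamic connect/accept pattern discussed in this
 * thread. Structure and file name are illustrative, not the reproducer. */
#include <stdio.h>
#include <string.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    char port[MPI_MAX_PORT_NAME];
    MPI_Comm inter;
    int provided;

    /* DPM (MV2_SUPPORT_DPM=1) is typically used with THREAD_MULTIPLE */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    if (argc > 1 && strcmp(argv[1], "sink") == 0) {
        /* "Sink" side: open a port and publish the port string via a file */
        MPI_Open_port(MPI_INFO_NULL, port);
        FILE *f = fopen("/tmp/mpi_port.txt", "w");  /* illustrative path */
        fprintf(f, "%s\n", port);
        fclose(f);
        MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
        MPI_Close_port(port);
    } else {
        /* Client side: read the published port string and connect to it */
        FILE *f = fopen("/tmp/mpi_port.txt", "r");
        fgets(port, MPI_MAX_PORT_NAME, f);
        fclose(f);
        port[strcspn(port, "\n")] = '\0';
        MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
    }

    MPI_Comm_disconnect(&inter);
    MPI_Finalize();
    return 0;
}
```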

Please let us know if your problem has been solved. Thanks!

On Fri, 21 Nov 2014 14:20:54 -0800
Neil Spruit <nrspruit at gmail.com> wrote:

> Sure, please see my attached reproducer. To build and run it, please
> follow these steps:
> 
> 1) to build the test binaries run ./build.sh
> 2) once built scp the mpi_connect_accept_sink to /tmp on your target
> remote host
> 3) from the remote host, go to /tmp and run "mpiexec -n 1
> ./mpi_connect_accept_sink" (this is how this scenario is launched);
> this binary will open a port and write the MPI port string to a file
> 4) from your main host, run "mpiexec -n 1 ./mpi_connect_accept
> remote_hostname", where remote_hostname is the hostname of the system
> that launched mpi_connect_accept_sink
> 5) Once launched on the host, mpi_connect_accept will wait for a key
> press from the user, then read the remote host's opened port and
> attempt to connect.
> 
> My current configuration uses a 1-1 InfiniBand connection between two
> machines running OFED 3.12 with Mellanox cards.
> 
> So far I have tested this with both mpich and Intel MPI and both are
> able to connect and exit cleanly.
> 
> Thank you very much for looking into this issue!
> 
> Respectfully,
> Neil Spruit
> 
> On Fri, Nov 21, 2014 at 1:49 PM, Jonathan Perkins <
> perkinjo at cse.ohio-state.edu> wrote:
> 
> > Thanks for the info, Neil. Is there a simple reproducer that you
> > can share with us? We'll take a look at it and see what the problem
> > may be. On Nov 21, 2014 3:47 PM, "Neil Spruit" <nrspruit at gmail.com>
> > wrote:
> >
> >> Yes, I have MV2_ENABLE_AFFINITY=0 and MV2_SUPPORT_DPM=1 both set
> >> in this case, since I am performing dynamic process creation and
> >> using thread level "multiple". mvapich on my boxes is built
> >> with --enable-threads=multiple.
> >>
> >> Thanks,
> >> Neil
> >>
> >> On Fri, Nov 21, 2014 at 12:35 PM, Jonathan Perkins <
> >> perkinjo at cse.ohio-state.edu> wrote:
> >>
> >>> Hi, have you tried setting MV2_SUPPORT_DPM=1? Please take a look
> >>> at
> >>>
> >>> http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.0.1-userguide.html#x1-22700011.73
> >>> for more information on this runtime variable.
> >>>
> >>
> >>



-- 
Jian Lin
http://linjian.org


