[mvapich-discuss] MPI_Comm_connect/accept segfault during MPIDI_CH3I_comm_create

Neil Spruit nrspruit at gmail.com
Tue Nov 25 18:58:56 EST 2014


Hello,

Ok, apparently that solved the issue: setting MV2_USE_OSU_NB_COLLECTIVES=0
removed the segfault, and the connection now completes successfully.

So I will go ahead and install MV2 2.0.1 or 2.1a. Thank you all very much
for your time in helping me root-cause this; it is appreciated!

Respectfully,
Neil Spruit

On Tue, Nov 25, 2014 at 3:12 PM, Jian Lin <lin.2180 at osu.edu> wrote:

> Hi, Neil,
>
> We noticed that you are using the MV2 2.0 release. This could be a known
> bug in the non-blocking collectives in 2.0. You can try 2.0 with NBC
> disabled to see whether the error can still be reproduced:
>
>   mpiexec -n 1 -genv MV2_USE_OSU_NB_COLLECTIVES 0 ./mpi_connect_accept_sink
>   mpiexec -n 1 -genv MV2_USE_OSU_NB_COLLECTIVES 0 ./mpi_connect_accept HOSTNAME
>
> If it works, we suggest you use MV2 2.0.1 or 2.1a directly. The bug has
> been fixed in these new versions.
>
> Please let us know if your problem has been solved. Thanks!
>
> On Fri, 21 Nov 2014 14:20:54 -0800
> Neil Spruit <nrspruit at gmail.com> wrote:
>
> > Sure, please see my attached reproducer. To build and run it, please
> > follow these steps:
> >
> > 1) To build the test binaries, run ./build.sh.
> > 2) Once built, scp mpi_connect_accept_sink to /tmp on your target
> > remote host.
> > 3) From the remote host, go to /tmp and run "mpiexec -n 1
> > ./mpi_connect_accept_sink" (this is how the scenario is launched).
> > This binary opens a port and writes the MPI port string to a file.
> > 4) From your main host, run "mpiexec -n 1 ./mpi_connect_accept
> > remote_hostname", where remote_hostname is the hostname of the system
> > that launched mpi_connect_accept_sink.
> > 5) Once launched on the host, mpi_connect_accept waits for a key press
> > from the user, then reads the remote host's opened port and attempts
> > to connect (a minimal sketch of this flow follows the steps below).
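> >
> > That accept/connect flow is roughly the following. This is only a
> > minimal sketch with a placeholder port-file path, not the attached
> > reproducer itself (which also handles the key-press prompt and getting
> > the port string from the remote host):
> >
> >   /* sketch: run "./ca server" on the sink host, then "./ca client" on
> >      the main host once the port string has been copied over */
> >   #include <mpi.h>
> >   #include <stdio.h>
> >   #include <string.h>
> >
> >   int main(int argc, char **argv)
> >   {
> >       char port[MPI_MAX_PORT_NAME];
> >       MPI_Comm inter;
> >       int provided;
> >
> >       /* launched with MV2_SUPPORT_DPM=1 and MV2_ENABLE_AFFINITY=0, as noted below */
> >       MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
> >
> >       if (argc > 1 && strcmp(argv[1], "server") == 0) {
> >           MPI_Open_port(MPI_INFO_NULL, port);        /* get a port string */
> >           FILE *f = fopen("/tmp/mpi_port.txt", "w"); /* publish it (placeholder path) */
> >           fprintf(f, "%s\n", port);
> >           fclose(f);
> >           /* block until a client connects to this port */
> >           MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
> >           MPI_Close_port(port);
> >       } else {
> >           FILE *f = fopen("/tmp/mpi_port.txt", "r"); /* port string from the sink host */
> >           fgets(port, MPI_MAX_PORT_NAME, f);
> >           fclose(f);
> >           port[strcspn(port, "\n")] = '\0';
> >           MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
> >       }
> >
> >       MPI_Comm_disconnect(&inter);
> >       MPI_Finalize();
> >       return 0;
> >   }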
> >
> > My current configuration uses a direct (1-to-1) InfiniBand connection
> > between two machines with Mellanox cards, running OFED 3.12.
> >
> > So far I have tested this with both MPICH and Intel MPI, and both are
> > able to connect and exit cleanly.
> >
> > Thank you very much for looking into this issue!
> >
> > Respectfully,
> > Neil Spruit
> >
> > On Fri, Nov 21, 2014 at 1:49 PM, Jonathan Perkins <
> > perkinjo at cse.ohio-state.edu> wrote:
> >
> > > Thanks for the info, Neil. Is there a simple reproducer that you can
> > > share with us? We'll take a look at it and see what the problem may be.
> > >
> > > On Nov 21, 2014 3:47 PM, "Neil Spruit" <nrspruit at gmail.com> wrote:
> > >
> > >> Yes, I have both MV2_ENABLE_AFFINITY=0 and MV2_SUPPORT_DPM=1 set in
> > >> this case, since I am performing dynamic process creation and using
> > >> thread level "multiple". MVAPICH2 on my boxes is built with
> > >> --enable-threads=multiple.
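> > >>
> > >> The thread level is requested in the usual way; as a minimal sketch
> > >> (not my actual code), the init looks like:
> > >>
> > >>   #include <mpi.h>
> > >>   #include <stdio.h>
> > >>
> > >>   int main(int argc, char **argv)
> > >>   {
> > >>       int provided;
> > >>       /* ask for MPI_THREAD_MULTIPLE and check what was granted */
> > >>       MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
> > >>       if (provided < MPI_THREAD_MULTIPLE)
> > >>           fprintf(stderr, "wanted MPI_THREAD_MULTIPLE, got %d\n", provided);
> > >>       MPI_Finalize();
> > >>       return 0;
> > >>   }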
> > >>
> > >> Thanks,
> > >> Neil
> > >>
> > >> On Fri, Nov 21, 2014 at 12:35 PM, Jonathan Perkins <
> > >> perkinjo at cse.ohio-state.edu> wrote:
> > >>
> > >>> Hi, have you tried setting MV2_SUPPORT_DPM=1? Please take a look at
> > >>> http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.0.1-userguide.html#x1-22700011.73
> > >>> for more information on this runtime variable.
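> > >>>
> > >>> For example, something along these lines (the binary name here is
> > >>> just a placeholder):
> > >>>
> > >>>   mpiexec -n 1 -genv MV2_SUPPORT_DPM 1 ./mpi_app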
> > >>>
> > >>
> > >>
>
>
>
> --
> Jian Lin
> http://linjian.org
>
>