[mvapich-discuss] -DDISABLE_PTMALLOC, MPI_Bcast, and -DMCST_SUPPORT

David_Kewley at Dell.com David_Kewley at Dell.com
Mon Jun 23 23:16:08 EDT 2008


I should have added: VIADEV_USE_SHMEM_BCAST=0 does in fact prevent this
segfault.

David

> -----Original Message-----
> From: mvapich-discuss-bounces at cse.ohio-state.edu
> [mailto:mvapich-discuss-bounces at cse.ohio-state.edu] On Behalf Of
> David_Kewley at dell.com
> Sent: Monday, June 23, 2008 4:29 PM
> To: koop at cse.ohio-state.edu
> Cc: mvapich-discuss at cse.ohio-state.edu
> Subject: RE: [mvapich-discuss] -DDISABLE_PTMALLOC, MPI_Bcast, and
> -DMCST_SUPPORT
> 
> Matt,
> 
> Thanks for clarifying the effect of -DDISABLE_PTMALLOC, and the fact
> that hardware-based multicast is not enabled right now.  I think that's
> all I need to know on those topics for now.
> 
> I have a reproducer and observations about the apparent MPI_Bcast
> segfault bug.  This is on x86_64, using Intel Fortran 10.1.015 (Build
> 20080312), and the executable ends up using the Intel implementation of
> memcpy(), in case that's significant -- see the backtrace below.  This
> is with MVAPICH 1.0.
> 
> The segfault occurs whenever these two conditions both hold:
> 
> 1) length of the character array sent is > 8MB-11kB
> 2) #procs is > (7 nodes) * (N procs per node)
> 
> For the second condition I tested with N=1,2,4 procs per node, in which
> cases the segfault occurred when #procs in the job exceeded 7,14,28
> procs respectively.
> 
> If either of the conditions does not hold, the segfault does not occur.
> The threshold is exactly 8MB-11kB.  If the length of the char array is
> 8MB-11kB, it's fine, but if it's 8MB-11kB+1, it segfaults.
> 
> The segfault occurs in the memcpy function (again, it's the Intel
> memcpy), when it tries to copy into rhandle->buf beyond the 8MB-11kB
> mark.  The backtrace is, for example:
> 
> #0  0x00000000004045c1 in __intel_new_memcpy ()
> #1  0x0000000000401ee8 in _intel_fast_memcpy.J ()
> #2  0x0000002a9560010e in MPID_VIA_self_start () from
> /opt/mvapich/1.0/intel/10.1.015/lib/shared/libmpich.so.1.0
> #3  0x0000002a955d8e82 in MPID_IsendContig () from
> /opt/mvapich/1.0/intel/10.1.015/lib/shared/libmpich.so.1.0
> #4  0x0000002a955d7564 in MPID_IsendDatatype () from
> /opt/mvapich/1.0/intel/10.1.015/lib/shared/libmpich.so.1.0
> #5  0x0000002a955cc4d6 in PMPI_Isend () from
> /opt/mvapich/1.0/intel/10.1.015/lib/shared/libmpich.so.1.0
> #6  0x0000002a955e95d2 in PMPI_Sendrecv () from
> /opt/mvapich/1.0/intel/10.1.015/lib/shared/libmpich.so.1.0
> #7  0x0000002a955bf7e9 in intra_Bcast_Large () from
> /opt/mvapich/1.0/intel/10.1.015/lib/shared/libmpich.so.1.0
> #8  0x0000002a955bcfa0 in intra_newBcast () from
> /opt/mvapich/1.0/intel/10.1.015/lib/shared/libmpich.so.1.0
> #9  0x0000002a95594e00 in PMPI_Bcast () from
> /opt/mvapich/1.0/intel/10.1.015/lib/shared/libmpich.so.1.0
> #10 0x0000000000401e3d in main ()
> 
> Attached find a simple reproducer C program.
> 
> David
> 
> > -----Original Message-----
> > From: Matthew Koop [mailto:koop at cse.ohio-state.edu]
> > Sent: Friday, June 20, 2008 4:15 AM
> > To: Kewley, David
> > Cc: mvapich-discuss at cse.ohio-state.edu
> > Subject: Re: [mvapich-discuss] -DDISABLE_PTMALLOC, MPI_Bcast, and -
> > DMCST_SUPPORT
> >
> > David,
> >
> > I'll answer your questions inline:
> >
> > > What are the likely performance impacts of using -DDISABLE_PTMALLOC
> > > (including memory use)?  Does this differ between MVAPICH and
> > > MVAPICH2?  We are considering seeing what effect this has on certain
> > > applications that have seen problems with realloc.
> >
> > The effects of turning off PTMALLOC (using -DDISABLE_PTMALLOC) will
> > be the same between MVAPICH and MVAPICH2.
> >
> > The point of using the PTMALLOC library is to allow caching of
> > InfiniBand memory registrations. To ensure correctness we need to
> > know if memory is being free'd, etc. Since registration for
> > InfiniBand is very expensive we attempt to cache these registrations
> > so if the same buffer is re-used again for communication it will
> > already be registered (speeding up the application).
> >
> > So the performance change will be application-dependent. If the
> > application makes frequent re-use of buffers for communication the
> > performance will likely be hurt. On the flip side, if the application
> > has very poor buffer re-use the performance may actually be better by
> > not using the registration cache (you can always turn it off at
> > runtime with VIADEV_USE_DREG_CACHE=0 on MVAPICH). When the
> > registration cache is not turned on, a copy-based approach is used
> > for messages under a certain size -- so the zero-copy path that is
> > normally used is skipped, but no registration is needed.
> >
> > I hope this helps. Please let me know if you need additional
> > clarification.
> >
> > > Topic #2:
> > >
> > > We are using the OpenIB components of OFED 1.2.5.5, and are
> > > building our own MVAPICH and MVAPICH2, with various versions of MV*
> > > and compiler.
> > >
> > > We have an application apparently failing during MVAPICH MPI_Bcast
> > > of many MB of data to dozens to hundreds of MPI ranks.  (Actually I
> > > believe it's Fortran, so I guess MPI_BCAST.)  We have already
> > > implemented VIADEV_USE_SHMEM_BCAST=0 just in case, but we are still
> > > having problems.  (I'm not 100% reassured by the user's reports
> > > that the problem is still in MPI_Bcast, but I think it's likely.)
> >
> > We have not seen this error before, so we're very interested to
> > track this down. If there is a reproducer for this we would be very
> > interested to try it out here.
> >
> > Does the same error occur with MVAPICH2 as with MVAPICH? Also, does
> > turning off all shared memory collectives avoid the error?
> > (VIADEV_USE_SHMEM_COLL=0)
> >
> > > Topic #3:
> > >
> > > As I looked through the MVAPICH code to see how MPI_Bcast is
> > > implemented for ch_gen2, I see MCST_SUPPORT repeatedly checked.  It
> > > appears this is not set by default (by make.mvapich.gen2).
> > >
> > > If MCST_SUPPORT is disabled, what algorithm is used to implement
> > > MPI_Bcast?  If MCST_SUPPORT is enabled, does MPI_Bcast use IB
> > > multicast?  Should it greatly speed up MPI_Bcast if enabled?
> > >
> > > It seems like MCST_SUPPORT would be beneficial, but the fact that
> > > it is not enabled by default makes me wonder what the risks are of
> > > enabling it.
> >
> > MCST support (hardware-based multicast) is not supported right now.
> > InfiniBand's multicast is unreliable and supports sending only in
> > 2KB chunks, and we haven't seen good performance for it on large
> > systems.  Mellanox is planning on adding reliable multicast support
> > to the ConnectX adapter soon, at which point we'll re-evaluate the
> > benefits. So at this point the MCST support should not be enabled.
> >
> > Let us know if you have any more questions.
> >
> > Thanks,
> > Matt



