[mvapich-discuss] -DDISABLE_PTMALLOC, MPI_Bcast, and -DMCST_SUPPORT

David_Kewley at Dell.com David_Kewley at Dell.com
Mon Jun 23 22:28:31 EDT 2008


Matt,

Thanks for clarifying the effect of -DDISABLE_PTMALLOC, and the fact
that hardware-based multicast is not enabled right now.  I think that's
all I need to know on those topics for now.

I have a reproducer and observations about the apparent MPI_Bcast
segfault bug.  This is on x86_64, using Intel Fortran 10.1.015 (Build
20080312), and the executable ends up using the Intel implementation of
memcpy(), in case that's significant -- see the backtrace below.  This
is with MVAPICH 1.0.

The segfault occurs whenever these two conditions both hold:

1) length of the character array sent is > 8MB-11kB
2) #procs is > (7 nodes) * (N procs per node)

For the second condition I tested with N = 1, 2, and 4 procs per node; in
those cases the segfault occurred when the job size exceeded 7, 14, and 28
procs respectively.

If either of the conditions does not hold, the segfault does not occur.
The threshold is exactly 8MB-11kB.  If the length of the char array is
8MB-11kB, it's fine, but if it's 8MB-11kB+1, it segfaults.

The segfault occurs in the memcpy function (again, the Intel memcpy) when
it tries to copy into rhandle->buf beyond the 8MB-11kB mark.  A
representative backtrace:

#0  0x00000000004045c1 in __intel_new_memcpy ()
#1  0x0000000000401ee8 in _intel_fast_memcpy.J ()
#2  0x0000002a9560010e in MPID_VIA_self_start () from
/opt/mvapich/1.0/intel/10.1.015/lib/shared/libmpich.so.1.0
#3  0x0000002a955d8e82 in MPID_IsendContig () from
/opt/mvapich/1.0/intel/10.1.015/lib/shared/libmpich.so.1.0
#4  0x0000002a955d7564 in MPID_IsendDatatype () from
/opt/mvapich/1.0/intel/10.1.015/lib/shared/libmpich.so.1.0
#5  0x0000002a955cc4d6 in PMPI_Isend () from
/opt/mvapich/1.0/intel/10.1.015/lib/shared/libmpich.so.1.0
#6  0x0000002a955e95d2 in PMPI_Sendrecv () from
/opt/mvapich/1.0/intel/10.1.015/lib/shared/libmpich.so.1.0
#7  0x0000002a955bf7e9 in intra_Bcast_Large () from
/opt/mvapich/1.0/intel/10.1.015/lib/shared/libmpich.so.1.0
#8  0x0000002a955bcfa0 in intra_newBcast () from
/opt/mvapich/1.0/intel/10.1.015/lib/shared/libmpich.so.1.0
#9  0x0000002a95594e00 in PMPI_Bcast () from
/opt/mvapich/1.0/intel/10.1.015/lib/shared/libmpich.so.1.0
#10 0x0000000000401e3d in main ()

Attached is a simple reproducer C program.
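In case the attachment doesn't survive the list archive, the test is
essentially the following.  This is a minimal sketch of the same idea rather
than the attached file verbatim; the buffer size is the threshold described
above plus one byte, taking MB and kB as binary units
(8*1048576 - 11*1024 + 1 = 8377345 bytes):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* 8MB - 11kB, assuming binary units: 8*1048576 - 11*1024 = 8377344 bytes */
#define THRESHOLD (8 * 1048576 - 11 * 1024)
#define BUFSIZE   (THRESHOLD + 1)   /* one byte past the threshold */

int main(int argc, char **argv)
{
    int rank, nprocs;
    char *buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    buf = malloc(BUFSIZE);
    if (buf == NULL) {
        fprintf(stderr, "rank %d: malloc failed\n", rank);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* Root fills the character array that will be broadcast. */
    if (rank == 0)
        memset(buf, 'x', BUFSIZE);

    /* Broadcast the large char array from rank 0 to all ranks.  With
     * more than (7 nodes) * (N procs per node) processes this is where
     * the segfault shows up for us. */
    MPI_Bcast(buf, BUFSIZE, MPI_CHAR, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("MPI_Bcast of %d bytes completed on %d procs\n",
               BUFSIZE, nprocs);

    free(buf);
    MPI_Finalize();
    return 0;
}

To trigger the failure it needs to be run on more than 7 nodes' worth of
processes, e.g. 8 nodes at N procs per node.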

David

> -----Original Message-----
> From: Matthew Koop [mailto:koop at cse.ohio-state.edu]
> Sent: Friday, June 20, 2008 4:15 AM
> To: Kewley, David
> Cc: mvapich-discuss at cse.ohio-state.edu
> Subject: Re: [mvapich-discuss] -DDISABLE_PTMALLOC, MPI_Bcast, and -
> DMCST_SUPPORT
> 
> David,
> 
> I'll answer your questions inline:
> 
> > What are the likely performance impacts of using -DDISABLE_PTMALLOC
> > (including memory use)?  Does this differ between MVAPICH and MVAPICH2?
> > We are considering seeing what effect this has on certain applications
> > that have seen problems with realloc.
> 
> The effects of turning off PTMALLOC (using -DDISABLE_PTMALLOC) will be
> the same between MVAPICH and MVAPICH2.
> 
> The point of using the PTMALLOC library is to allow caching of InfiniBand
> memory registrations. To ensure correctness we need to know if memory is
> being freed, etc. Since registration for InfiniBand is very expensive, we
> attempt to cache these registrations so that if the same buffer is re-used
> again for communication it will already be registered (speeding up the
> application).
> 
> So the performance change will be application-dependent. If the
> application makes frequent re-use of buffers for communication, the
> performance will likely be hurt. On the flip side, if the application has
> very poor buffer re-use, the performance may actually be better without
> the registration cache (you can always turn it off at runtime with
> VIADEV_USE_DREG_CACHE=0 on MVAPICH). When the registration cache is not
> turned on, a copy-based approach is used for messages under a certain size
> -- so the zero-copy path that is normally used is skipped, but no
> registration is needed either.
> 
> I hope this helps. Please let me know if you need additional
> clarification.
> 
> > Topic #2:
> >
> > We are using the OpenIB components of OFED 1.2.5.5, and are building our
> > own MVAPICH and MVAPICH2, with various versions of MV* and compiler.
> >
> > We have an application apparently failing during an MVAPICH MPI_Bcast of
> > many MB of data to dozens to hundreds of MPI ranks.  (Actually I believe
> > it's Fortran, so I guess MPI_BCAST.)  We have already implemented
> > VIADEV_USE_SHMEM_BCAST=0 just in case, but we are still having problems.
> > (I'm not 100% reassured by the user's reports that the problem is still
> > in MPI_Bcast, but I think it's likely.)
> 
> We have not seen this error before, so we're very interested in tracking
> this down. If there is a reproducer for this, we would be very interested
> in trying it out here.
> 
> Does the same error occur with MVAPICH2 as with MVAPICH? Also, does
> turning off all shared memory collectives (VIADEV_USE_SHMEM_COLL=0) avoid
> the error?
> 
> > Topic #3:
> >
> > As I looked through the MVAPICH code to see how MPI_Bcast is implemented
> > for ch_gen2, I saw MCST_SUPPORT checked repeatedly.  It appears this is
> > not set by default (by make.mvapich.gen2).
> >
> > If MCST_SUPPORT is disabled, what algorithm is used to implement
> > MPI_Bcast?  If MCST_SUPPORT is enabled, does MPI_Bcast use IB multicast?
> > Should enabling it greatly speed up MPI_Bcast?
> >
> > It seems like MCST_SUPPORT would be beneficial, but the fact that it is
> > not enabled by default makes me wonder: what are the risks of enabling
> > it?
> 
> MCST (hardware-based multicast) is not supported right now.  InfiniBand's
> multicast is unreliable, it supports sending only in 2KB chunks, and we
> haven't seen good performance from it on large systems.  Mellanox is
> planning to add reliable multicast support to the ConnectX adapter soon,
> at which point we'll re-evaluate the benefits.  So at this point MCST
> support should not be enabled.
> 
> Let us know if you have any more questions.
> 
> Thanks,
> Matt
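
To make sure I understand the registration-cache point above: my reading is
that the difference between the two patterns below is what matters.  This is
just my own illustration of the buffer re-use idea, not code taken from
MVAPICH; pattern A should benefit from cached registrations because the same
buffer (and hence the same registration) is re-used every iteration, while
pattern B allocates a fresh buffer each time and gets little out of the
cache.

#include <mpi.h>
#include <stdlib.h>
#include <string.h>

#define MSG_SIZE (1 << 20)   /* 1 MB, large enough that registration matters */
#define ITERS    100

int main(int argc, char **argv)
{
    int rank, nprocs, i;
    char *reused, *fresh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    if (nprocs < 2)
        MPI_Abort(MPI_COMM_WORLD, 1);

    /* Pattern A: one buffer re-used for every transfer -- its registration
     * can be cached once and hit on every later iteration. */
    reused = malloc(MSG_SIZE);
    memset(reused, 0, MSG_SIZE);
    for (i = 0; i < ITERS; i++) {
        if (rank == 0)
            MPI_Send(reused, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(reused, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
    }
    free(reused);

    /* Pattern B: a fresh buffer every iteration -- cached registrations are
     * rarely re-used, so the cache buys little here. */
    for (i = 0; i < ITERS; i++) {
        fresh = malloc(MSG_SIZE);
        memset(fresh, 0, MSG_SIZE);
        if (rank == 0)
            MPI_Send(fresh, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(fresh, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        free(fresh);
    }

    MPI_Finalize();
    return 0;
}

If I've followed the explanation correctly, with -DDISABLE_PTMALLOC (or
VIADEV_USE_DREG_CACHE=0) it is pattern A that I'd expect to slow down.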

-------------- next part --------------
A non-text attachment was scrubbed...
Name: test_bcast.c
Type: application/octet-stream
Size: 4292 bytes
Desc: test_bcast.c
Url : http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20080623/f2dcb8f2/test_bcast.obj

