[mvapich-discuss] -DDISABLE_PTMALLOC, MPI_Bcast, and -DMCST_SUPPORT
David_Kewley at Dell.com
Mon Jun 23 22:28:31 EDT 2008
Matt,
Thanks for clarifying the effect of -DDISABLE_PTMALLOC, and the fact
that hardware-based multicast is not enabled right now. I think that's
all I need to know on those topics for now.
I have a reproducer and observations about the apparent MPI_Bcast
segfault bug. This is on x86_64, using Intel Fortran 10.1.015 (Build
20080312), and the executable ends up using the Intel implementation of
memcpy(), in case that's significant -- see the backtrace below. This
is with MVAPICH 1.0.
The segfault occurs whenever these two conditions both hold:
1) length of the character array sent is > 8MB-11kB
2) #procs is > (7 nodes) * (N procs per node)
For the second condition I tested with N = 1, 2, and 4 procs per node; in
those cases the segfault occurred when the number of procs in the job
exceeded 7, 14, and 28 respectively.
If either of the conditions does not hold, the segfault does not occur.
The threshold is exactly 8MB-11kB. If the length of the char array is
8MB-11kB, it's fine, but if it's 8MB-11kB+1, it segfaults.
The segfault occurs in the memcpy function (again, it's the Intel
memcpy), when it tries to copy into the rhandle->buf beyond the 8MB-11kB
mark. The backtrace is, for example:
#0 0x00000000004045c1 in __intel_new_memcpy ()
#1 0x0000000000401ee8 in _intel_fast_memcpy.J ()
#2 0x0000002a9560010e in MPID_VIA_self_start () from
/opt/mvapich/1.0/intel/10.1.015/lib/shared/libmpich.so.1.0
#3 0x0000002a955d8e82 in MPID_IsendContig () from
/opt/mvapich/1.0/intel/10.1.015/lib/shared/libmpich.so.1.0
#4 0x0000002a955d7564 in MPID_IsendDatatype () from
/opt/mvapich/1.0/intel/10.1.015/lib/shared/libmpich.so.1.0
#5 0x0000002a955cc4d6 in PMPI_Isend () from
/opt/mvapich/1.0/intel/10.1.015/lib/shared/libmpich.so.1.0
#6 0x0000002a955e95d2 in PMPI_Sendrecv () from
/opt/mvapich/1.0/intel/10.1.015/lib/shared/libmpich.so.1.0
#7 0x0000002a955bf7e9 in intra_Bcast_Large () from
/opt/mvapich/1.0/intel/10.1.015/lib/shared/libmpich.so.1.0
#8 0x0000002a955bcfa0 in intra_newBcast () from
/opt/mvapich/1.0/intel/10.1.015/lib/shared/libmpich.so.1.0
#9 0x0000002a95594e00 in PMPI_Bcast () from
/opt/mvapich/1.0/intel/10.1.015/lib/shared/libmpich.so.1.0
#10 0x0000000000401e3d in main ()
Attached is a simple reproducer C program.
David
> -----Original Message-----
> From: Matthew Koop [mailto:koop at cse.ohio-state.edu]
> Sent: Friday, June 20, 2008 4:15 AM
> To: Kewley, David
> Cc: mvapich-discuss at cse.ohio-state.edu
> Subject: Re: [mvapich-discuss] -DDISABLE_PTMALLOC, MPI_Bcast, and -
> DMCST_SUPPORT
>
> David,
>
> I'll answer your questions inline:
>
> > What are the likely performance impacts of using -DDISABLE_PTMALLOC
> > (including memory use)? Does this differ between MVAPICH and MVAPICH2?
> > We are considering seeing what effect this has on certain applications
> > that have seen problems with realloc.
>
> The effects of turning off PTMALLOC (using -DDISABLE_PTMALLOC) will be
> the same between MVAPICH and MVAPICH2.
>
> The point of using the PTMALLOC library is to allow caching of
> InfiniBand memory registrations. To ensure correctness we need to know
> if memory is being free'd, etc. Since registration for InfiniBand is
> very expensive, we attempt to cache these registrations so that if the
> same buffer is re-used again for communication it will already be
> registered (speeding up the application).
>
> So the performance change will be application-dependent. If the
> application makes frequent re-use of buffers for communication, the
> performance will likely be hurt. On the flip side, if the application
> has very poor buffer re-use, the performance may actually be better
> without the registration cache (you can always turn it off at runtime
> with VIADEV_USE_DREG_CACHE=0 on MVAPICH). When the registration cache
> is not turned on, a copy-based approach is used for messages under a
> certain size -- so the zero-copy path that is normally used is skipped,
> but registration is not needed.
>
> I hope this helps. Please let me know if you need additional
> clarification.
>
> > Topic #2:
> >
> > We are using the OpenIB components of OFED 1.2.5.5, and are building
> > our own MVAPICH and MVAPICH2, with various versions of MV* and compiler.
> >
> > We have an application apparently failing during an MVAPICH MPI_Bcast
> > of many MB of data to dozens to hundreds of MPI ranks. (Actually I
> > believe it's Fortran, so I guess MPI_BCAST.) We have already
> > implemented VIADEV_USE_SHMEM_BCAST=0 just in case, but we are still
> > having problems. (I'm not 100% reassured by the user's reports that
> > the problem is still in MPI_Bcast, but I think it's likely.)
>
> We have not seen this error before, so we're very interested in
> tracking this down. If there is a reproducer for this we would be very
> interested to try it out here.
>
> Does the same error occur with MVAPICH2 as with MVAPICH? Also, does
> turning off all shared memory collectives (VIADEV_USE_SHMEM_COLL=0)
> avoid the error?
>
> > Topic #3:
> >
> > As I looked through the MVAPICH code to see how MPI_Bcast is
> > implemented for ch_gen2, I see MCST_SUPPORT repeatedly checked. It
> > appears this is not set by default (by make.mvapich.gen2).
> >
> > If MCST_SUPPORT is disabled, what algorithm is used to implement
> > MPI_Bcast? If MCST_SUPPORT is enabled, does MPI_Bcast use IB
> > multicast? Should it greatly speed up MPI_Bcast if enabled?
> >
> > It seems like MCST_SUPPORT would be beneficial, but the fact that it
> > is not enabled by default makes me wonder what the risks are of
> > enabling it.
>
> MCST support (hardware-based multicast) is not supported right now.
> InfiniBand's multicast is unreliable and supports sending only in 2KB
> chunks, and we haven't seen good performance for it on large systems.
> Mellanox is planning on adding reliable multicast support to the
> ConnectX adapter soon, at which point we'll re-evaluate the benefits.
> So at this point MCST support should not be enabled.
>
> Let us know if you have any more questions.
>
> Thanks,
> Matt
-------------- next part --------------
A non-text attachment was scrubbed...
Name: test_bcast.c
Type: application/octet-stream
Size: 4292 bytes
Desc: test_bcast.c
Url : http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20080623/f2dcb8f2/test_bcast.obj