[mvapich-discuss] -DDISABLE_PTMALLOC, MPI_Bcast, and -DMCST_SUPPORT

Matthew Koop koop at cse.ohio-state.edu
Fri Jun 20 10:14:38 EDT 2008


David,

I'll answer your questions inline:

> What are the likely performance impacts of using -DDISABLE_PTMALLOC
> (including memory use)?  Does this differ between MVAPICH and MVAPICH2?
> We are considering seeing what effect this has on certain applications
> that have seen problems with realloc.

The effects of turning off PTMALLOC (using -DDISABLE_PTMALLOC) will be the
same between MVAPICH and MVAPICH2.

The point of using the PTMALLOC library is to allow caching of InfiniBand
memory registrations. To ensure correctness we need to know when memory is
freed, etc. Since registering memory with InfiniBand is very expensive, we
cache these registrations so that if the same buffer is re-used for
communication it is already registered, which speeds up the application.
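
To give a rough idea of what the cache does (this is only a sketch, not
the actual MVAPICH dreg code -- the list structure and function names
below are made up), the pattern looks like this, with ibv_reg_mr() and
ibv_dereg_mr() being the expensive verbs calls we try to avoid repeating:

    /* Sketch of a registration cache (hypothetical structure and names;
     * MVAPICH's dreg cache is more elaborate).  Needs libibverbs. */
    #include <infiniband/verbs.h>
    #include <stdlib.h>

    struct reg_entry {
        void             *addr;
        size_t            len;
        struct ibv_mr    *mr;      /* region returned by ibv_reg_mr() */
        struct reg_entry *next;
    };

    static struct reg_entry *reg_cache = NULL;

    /* Return a registration covering [addr, addr+len), registering only
     * on a cache miss. */
    struct ibv_mr *cached_reg_mr(struct ibv_pd *pd, void *addr, size_t len)
    {
        for (struct reg_entry *e = reg_cache; e; e = e->next)
            if (e->addr == addr && e->len >= len)
                return e->mr;      /* hit: skip the expensive ibv_reg_mr() */

        struct ibv_mr *mr = ibv_reg_mr(pd, addr, len,
                                       IBV_ACCESS_LOCAL_WRITE |
                                       IBV_ACCESS_REMOTE_WRITE);
        if (!mr)
            return NULL;

        struct reg_entry *e = malloc(sizeof(*e));
        e->addr = addr;
        e->len  = len;
        e->mr   = mr;
        e->next = reg_cache;
        reg_cache = e;
        return mr;
    }

    /* This is why the PTMALLOC hooks matter: when a buffer is freed, its
     * cached registration must be dropped, or a later lookup would hand
     * back a stale memory region. */
    void cache_invalidate(void *addr)
    {
        struct reg_entry **p = &reg_cache;
        while (*p) {
            if ((*p)->addr == addr) {
                struct reg_entry *dead = *p;
                *p = dead->next;
                ibv_dereg_mr(dead->mr);
                free(dead);
            } else {
                p = &(*p)->next;
            }
        }
    }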

So the performance change will be application-dependent. If the
application frequently re-uses the same buffers for communication,
disabling the cache will likely hurt performance. On the flip side, if
the application has very poor buffer re-use, performance may actually
improve without the registration cache (you can always turn it off at
runtime with VIADEV_USE_DREG_CACHE=0 on MVAPICH). When the registration
cache is not turned on, a copy-based approach is used for messages under
a certain size -- so the zero-copy transfer that is normally used is
skipped, but no registration of the user buffer is needed.
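
As a rough illustration (a made-up loop, not taken from any real
application), the pattern below re-uses the same buffer every iteration,
which is exactly the case where the registration cache helps; an
application that allocates and frees its buffers around every transfer
would see little benefit and is a candidate for VIADEV_USE_DREG_CACHE=0:

    /* Made-up loop to show buffer re-use; run with at least two ranks. */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        size_t n   = 1 << 20;                 /* 1 MB payload */
        char  *buf = malloc(n);

        for (int iter = 0; iter < 1000; iter++) {
            /* Same buffer every iteration: after the first transfer its
             * registration sits in the cache, so later iterations pay no
             * registration cost. */
            if (rank == 0)
                MPI_Send(buf, (int)n, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            else if (rank == 1)
                MPI_Recv(buf, (int)n, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
        }

        free(buf);
        MPI_Finalize();
        return 0;
    }

If the loop instead allocated and freed the buffer each iteration, every
transfer would need a fresh registration (or a copy), and that is when
turning the cache off can be the better trade.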

I hope this helps. Please let me know if you need additional
clarification.

> Topic #2:
>
> We are using the OpenIB components of OFED 1.2.5.5, and are building our
> own MVAPICH and MVAPICH2, with various versions of MV* and compiler.
>
> We have an application apparently failing during MVAPICH MPI_Bcast of
> many MB of data to dozens to hundreds of MPI ranks.  (Actually I believe
> it's Fortran, so I guess MPI_BCAST.)  We have already implemented
> VIADEV_USE_SHMEM_BCAST=0 just in case, but we are still having problems.
> (I'm not 100% reassured by the user's reports that the problem is still
> in MPI_Bcast, but I think it's likely.)

We have not seen this error before, so we're very interested in tracking
it down. If there is a reproducer for this, we would be glad to try it
out here.
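
Even a stand-alone test along the lines of the sketch below would help --
the 64 MB payload, the data pattern, and the check are only guesses at
your scenario, not your actual code:

    /* Stand-alone MPI_Bcast test; the payload size and data pattern are
     * placeholders for the scenario described above. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        size_t bytes = 64UL * 1024 * 1024;    /* "many MB": 64 MB here */
        char  *buf   = malloc(bytes);

        if (rank == 0)                        /* root fills a known pattern */
            for (size_t i = 0; i < bytes; i++)
                buf[i] = (char)(i & 0xff);

        MPI_Bcast(buf, (int)bytes, MPI_CHAR, 0, MPI_COMM_WORLD);

        /* Every rank checks what it received. */
        size_t bad = 0;
        for (size_t i = 0; i < bytes; i++)
            if (buf[i] != (char)(i & 0xff))
                bad++;
        printf("rank %d of %d: %zu bad bytes\n", rank, size, bad);

        free(buf);
        MPI_Finalize();
        return 0;
    }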

Does the same error occur with MVAPICH2 as well as with MVAPICH? Also,
does turning off all shared-memory collectives (VIADEV_USE_SHMEM_COLL=0)
avoid the error?

> Topic #3:
>
> As I looked through the MVAPICH code to see how MPI_Bcast is implemented
> for ch_gen2, I see MCST_SUPPORT repeatedly checked.  It appears this is
> not set by default (by make.mvapich.gen2).
>
> If MCST_SUPPORT is disabled, what algorithm is used to implement
> MPI_Bcast?  If MCST_SUPPORT is enabled, does MPI_Bcast use IB multicast?
> Should it greatly speed up MPI_Bcast if enabled?
>
> It seems like MCST_SUPPORT would be beneficial, but the fact that it is
> not enabled by default makes me wonder what the risks are of enabling
> it?

MCST_SUPPORT (hardware-based multicast) is not supported right now.
InfiniBand's hardware multicast is unreliable and limited to sending in
2KB chunks, and we haven't seen good performance from it on large
systems. Mellanox is planning to add reliable multicast support to the
ConnectX adapter soon, at which point we'll re-evaluate the benefits. So
at this point MCST_SUPPORT should not be enabled.

Let us know if you have any more questions.

Thanks,
Matt


