[mvapich-discuss] -DDISABLE_PTMALLOC, MPI_Bcast,
and -DMCST_SUPPORT
Rahul Kumar
kumarra at cse.ohio-state.edu
Fri Jun 27 01:43:23 EDT 2008
Hi David,
I have created a patch for the error you reported. The patch adds a
runtime option to control the message size up to which shmem broadcast
is used, and it prevents the use of shmem broadcast for very large
message sizes. The patch is taken against mvapich-1.0.1. I hope you will
be able to try it out; please let us know your feedback.
Thanks,
Rahul.
> Date: Mon, 23 Jun 2008 21:28:31 -0500
> From: David_Kewley at Dell.com
> To: koop at cse.ohio-state.edu
> Cc: mvapich-discuss at cse.ohio-state.edu
> Subject: RE: [mvapich-discuss] -DDISABLE_PTMALLOC, MPI_Bcast,
> and -DMCST_SUPPORT
>
> Matt,
>
> Thanks for clarifying the effect of -DDISABLE_PTMALLOC, and the fact
> that hardware-based multicast is not enabled right now. I think that's
> all I need to know on those topics for now.
>
> I have a reproducer and observations about the apparent MPI_Bcast
> segfault bug. This is on x86_64, using Intel Fortran 10.1.015 (Build
> 20080312), and the executable ends up using the Intel implementation of
> memcpy(), in case that's significant -- see the backtrace below. This
> is with MVAPICH 1.0.
>
> The segfault occurs whenever these two conditions both hold:
>
> 1) length of the character array sent is > 8MB-11kB
> 2) #procs is > (7 nodes) * (N procs per node)
>
> For the second condition I tested with N=1,2,4 procs per node; the
> segfault occurred once the number of procs in the job exceeded 7, 14,
> and 28 respectively.
>
> If either of the conditions does not hold, the segfault does not occur.
> The threshold is exactly 8MB-11kB. If the length of the char array is
> 8MB-11kB, it's fine, but if it's 8MB-11kB+1, it segfaults.
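>
> A minimal reproducer along these lines (a hypothetical sketch -- the
> actual attached program is not preserved in the archive) broadcasts a
> char buffer one byte past that threshold:

```c
/* Hypothetical reproducer sketch: MPI_Bcast of a character buffer one
 * byte past the observed 8MB - 11kB threshold. Not the attached
 * program; buffer size and structure are reconstructed from the
 * description above. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    /* One byte past the observed threshold of 8MB - 11kB. */
    const int len = 8 * 1024 * 1024 - 11 * 1024 + 1;
    int rank;
    char *buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    buf = malloc(len);
    if (rank == 0)
        memset(buf, 'x', len);
    MPI_Bcast(buf, len, MPI_CHAR, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("broadcast of %d bytes completed\n", len);
    free(buf);
    MPI_Finalize();
    return 0;
}
```

> Run it with more than (7 nodes) * (N procs per node) ranks to match
> the second condition above.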
>
> The segfault occurs in the memcpy function (again, it's the Intel
> memcpy), when it tries to copy into the rhandle->buf beyond the 8MB-11kB
> mark. The backtrace is, for example:
>
> #0 0x00000000004045c1 in __intel_new_memcpy ()
> #1 0x0000000000401ee8 in _intel_fast_memcpy.J ()
> #2 0x0000002a9560010e in MPID_VIA_self_start () from
> /opt/mvapich/1.0/intel/10.1.015/lib/shared/libmpich.so.1.0
> #3 0x0000002a955d8e82 in MPID_IsendContig () from
> /opt/mvapich/1.0/intel/10.1.015/lib/shared/libmpich.so.1.0
> #4 0x0000002a955d7564 in MPID_IsendDatatype () from
> /opt/mvapich/1.0/intel/10.1.015/lib/shared/libmpich.so.1.0
> #5 0x0000002a955cc4d6 in PMPI_Isend () from
> /opt/mvapich/1.0/intel/10.1.015/lib/shared/libmpich.so.1.0
> #6 0x0000002a955e95d2 in PMPI_Sendrecv () from
> /opt/mvapich/1.0/intel/10.1.015/lib/shared/libmpich.so.1.0
> #7 0x0000002a955bf7e9 in intra_Bcast_Large () from
> /opt/mvapich/1.0/intel/10.1.015/lib/shared/libmpich.so.1.0
> #8 0x0000002a955bcfa0 in intra_newBcast () from
> /opt/mvapich/1.0/intel/10.1.015/lib/shared/libmpich.so.1.0
> #9 0x0000002a95594e00 in PMPI_Bcast () from
> /opt/mvapich/1.0/intel/10.1.015/lib/shared/libmpich.so.1.0
> #10 0x0000000000401e3d in main ()
>
> Attached find a simple reproducer C program.
>
> David
>
>
>> -----Original Message-----
>> From: Matthew Koop [mailto:koop at cse.ohio-state.edu]
>> Sent: Friday, June 20, 2008 4:15 AM
>> To: Kewley, David
>> Cc: mvapich-discuss at cse.ohio-state.edu
>> Subject: Re: [mvapich-discuss] -DDISABLE_PTMALLOC, MPI_Bcast, and -
>> DMCST_SUPPORT
>>
>> David,
>>
>> I'll answer your questions inline:
>>
>>
>>> What are the likely performance impacts of using -DDISABLE_PTMALLOC
>>> (including memory use)? Does this differ between MVAPICH and
>>> MVAPICH2? We are considering seeing what effect this has on certain
>>> applications that have seen problems with realloc.
>>>
>> The effects of turning off PTMALLOC (using -DDISABLE_PTMALLOC) will be
>> the same between MVAPICH and MVAPICH2.
>>
>> The point of using the PTMALLOC library is to allow caching of
>> InfiniBand memory registrations. To ensure correctness we need to know
>> if memory is being free'd, etc. Since registration for InfiniBand is
>> very expensive, we attempt to cache these registrations so if the same
>> buffer is re-used again for communication it will already be registered
>> (speeding up the application).
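>>
>> The idea can be sketched in a few lines (a toy illustration, not
>> MVAPICH's actual dreg code; all names here are made up): remember
>> (addr, len) ranges that are already registered, and invalidate them
>> when the memory is freed -- which is exactly why malloc/free must be
>> interposed:

```c
/* Toy sketch of a registration cache (not MVAPICH's actual dreg code).
 * Registered (addr, len) ranges are remembered so that a re-used
 * buffer skips the expensive registration step. */
#include <assert.h>
#include <stddef.h>

#define CACHE_SLOTS 64

struct reg_entry { void *addr; size_t len; int valid; };
static struct reg_entry cache[CACHE_SLOTS];
static int regs_performed;          /* counts "expensive" registrations */

/* Return 1 on a cache hit; otherwise perform the (simulated)
 * registration, cache the range, and return 0. -1 if the cache is full. */
static int register_buffer(void *addr, size_t len)
{
    int i;
    for (i = 0; i < CACHE_SLOTS; i++)
        if (cache[i].valid && cache[i].addr == addr && cache[i].len >= len)
            return 1;                       /* hit: already registered */
    for (i = 0; i < CACHE_SLOTS; i++)
        if (!cache[i].valid) {
            cache[i].addr = addr;
            cache[i].len = len;
            cache[i].valid = 1;
            regs_performed++;               /* the expensive operation */
            return 0;
        }
    return -1;                              /* a real cache would evict */
}

/* On free() the cached entry must be invalidated -- this is why the
 * library interposes on malloc/free via PTMALLOC. */
static void invalidate_buffer(void *addr)
{
    int i;
    for (i = 0; i < CACHE_SLOTS; i++)
        if (cache[i].valid && cache[i].addr == addr)
            cache[i].valid = 0;
}
```

>> Re-using the same buffer then costs one registration rather than one
>> per transfer.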
>>
>> So the performance change will be application-dependent. If the
>> application makes frequent re-use of buffers for communication,
>> performance will likely be hurt. On the flip side, if the application
>> has very poor buffer re-use, performance may actually be better without
>> the registration cache (you can always turn it off at runtime with
>> VIADEV_USE_DREG_CACHE=0 on MVAPICH). When the registration cache is not
>> turned on, a copy-based approach is used for messages under a certain
>> size -- so the zero-copy path that is normally used is skipped, but no
>> registration is needed.
>>
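>>
>> For example, to turn the cache off for a single run (a hypothetical
>> launcher invocation -- substitute your own launcher, host list, and
>> process count):

```shell
# Disable the MVAPICH registration cache at runtime (hypothetical
# host file and process count):
mpirun_rsh -np 64 -hostfile ./hosts VIADEV_USE_DREG_CACHE=0 ./app
```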
>> I hope this helps. Please let me know if you need additional
>> clarification.
>>
>>
>>> Topic #2:
>>>
>>> We are using the OpenIB components of OFED 1.2.5.5, and are building
>>> our own MVAPICH and MVAPICH2, with various versions of MV* and
>>> compiler.
>>>
>>> We have an application apparently failing during MVAPICH MPI_Bcast of
>>> many MB of data to dozens to hundreds of MPI ranks. (Actually I
>>> believe it's Fortran, so I guess MPI_BCAST.) We have already
>>> implemented VIADEV_USE_SHMEM_BCAST=0 just in case, but we are still
>>> having problems. (I'm not 100% reassured by the user's reports that
>>> the problem is still in MPI_Bcast, but I think it's likely.)
>>>
>> We have not seen this error before, so we're very interested to track
>> this down. If there is a reproducer for this we would be very
>> interested to try out here.
>>
>> Does the same error occur with MVAPICH2 as with MVAPICH? Also, does
>> turning off all shared memory collectives (VIADEV_USE_SHMEM_COLL=0)
>> avoid the error?
>>
>>> Topic #3:
>>>
>>> As I looked through the MVAPICH code to see how MPI_Bcast is
>>> implemented for ch_gen2, I see MCST_SUPPORT repeatedly checked. It
>>> appears this is not set by default (by make.mvapich.gen2).
>>>
>>> If MCST_SUPPORT is disabled, what algorithm is used to implement
>>> MPI_Bcast? If MCST_SUPPORT is enabled, does MPI_Bcast use IB
>>> multicast? Should it greatly speed up MPI_Bcast if enabled?
>>>
>>> It seems like MCST_SUPPORT would be beneficial, but the fact that it
>>> is not enabled by default makes me wonder what the risks are of
>>> enabling it.
>>>
>> MCST support (hardware-based multicast) is not supported right now.
>> InfiniBand's multicast is unreliable and supports sending only in 2KB
>> chunks, and we haven't seen good performance for it on large systems.
>> Mellanox is planning on adding reliable multicast support to the
>> ConnectX adapter soon, at which point we'll re-evaluate the benefits.
>> So at this point the MCST support should not be enabled.
>>
>> Let us know if you have any more questions.
>>
>> Thanks,
>> Matt
>>
>
>
> ------------------------------------------------------------------------
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
-------------- next part --------------
Index: src/include/mpiimpl.h
===================================================================
--- src/include/mpiimpl.h (revision 2582)
+++ src/include/mpiimpl.h (working copy)
@@ -3406,8 +3406,13 @@
#define MPIR_BCAST_SHORT_MSG 12288
#define MPIR_BCAST_LONG_MSG 524288
#define MPIR_BCAST_MIN_PROCS 8
+#if defined(_OSU_MVAPICH_)
+#define MPIR_ALLTOALL_SHORT_MSG 8192
+#define MPIR_ALLTOALL_MEDIUM_MSG 8192
+#else
#define MPIR_ALLTOALL_SHORT_MSG 256
#define MPIR_ALLTOALL_MEDIUM_MSG 32768
+#endif
#define MPIR_REDSCAT_COMMUTATIVE_LONG_MSG 524288
#define MPIR_REDSCAT_NONCOMMUTATIVE_SHORT_MSG 512
#define MPIR_ALLGATHER_SHORT_MSG 81920
Index: src/mpi/coll/alltoallv.c
===================================================================
--- src/mpi/coll/alltoallv.c (revision 2582)
+++ src/mpi/coll/alltoallv.c (working copy)
@@ -63,6 +63,10 @@
MPI_Request *reqarray;
int dst, rank, req_cnt;
MPI_Comm comm;
+#if defined(_OSU_MVAPICH_)
+ int pof2, src;
+ MPI_Status status;
+#endif
comm = comm_ptr->handle;
comm_size = comm_ptr->local_size;
@@ -75,6 +79,58 @@
/* check if multiple threads are calling this collective function */
MPIDU_ERR_CHECK_MULTIPLE_THREADS_ENTER( comm_ptr );
+#if defined(_OSU_MVAPICH_)
+ mpi_errno = MPIR_Localcopy(((char *)sendbuf +
+ sdispls[rank]*send_extent),
+ sendcnts[rank], sendtype,
+ ((char *)recvbuf +
+ rdispls[rank]*recv_extent),
+ recvcnts[rank], recvtype);
+
+ if (mpi_errno)
+ {
+ mpi_errno = MPIR_Err_create_code(mpi_errno, MPIR_ERR_RECOVERABLE, FCNAME, __LINE__, MPI_ERR_OTHER, "**fail", 0);
+ return mpi_errno;
+ }
+
+ /* Is comm_size a power-of-two? */
+ i = 1;
+ while (i < comm_size)
+ i *= 2;
+ if (i == comm_size)
+ pof2 = 1;
+ else
+ pof2 = 0;
+
+ /* Do the pairwise exchanges */
+ for (i=1; i<comm_size; i++) {
+ if (pof2 == 1) {
+ /* use exclusive-or algorithm */
+ src = dst = rank ^ i;
+ }
+ else {
+ src = (rank - i + comm_size) % comm_size;
+ dst = (rank + i) % comm_size;
+ }
+
+ mpi_errno = MPIC_Sendrecv(((char *)sendbuf +
+ sdispls[dst]*send_extent),
+ sendcnts[dst], sendtype, dst,
+ MPIR_ALLTOALL_TAG,
+ ((char *)recvbuf +
+ rdispls[src]*recv_extent),
+ recvcnts[src], recvtype, src,
+ MPIR_ALLTOALL_TAG, comm, &status);
+
+ if (mpi_errno)
+ {
+ mpi_errno = MPIR_Err_create_code(mpi_errno, MPIR_ERR_RECOVERABLE, FCNAME, __LINE__, MPI_ERR_OTHER, "**fail", 0);
+ return mpi_errno;
+ }
+
+ }
+#else
+
starray = (MPI_Status *) MPIU_Malloc(2*comm_size*sizeof(MPI_Status));
/* --BEGIN ERROR HANDLING-- */
if (!starray) {
@@ -141,6 +197,7 @@
MPIU_Free(reqarray);
MPIU_Free(starray);
+#endif
/* check if multiple threads are calling this collective function */
MPIDU_ERR_CHECK_MULTIPLE_THREADS_EXIT( comm_ptr );
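
The pairwise exchange added above picks, for each step i, a source and a
destination partner. The selection logic can be exercised stand-alone (a
plain-C sketch with no MPI; the helper names are mine):

```c
#include <assert.h>

/* Returns 1 if comm_size is a power of two (same loop as the patch). */
static int is_pof2(int comm_size)
{
    int i = 1;
    while (i < comm_size)
        i *= 2;
    return i == comm_size;
}

/* Partners for step i: an XOR schedule when the communicator size is a
 * power of two (src == dst), otherwise a ring shift where they differ. */
static void step_partners(int rank, int i, int comm_size,
                          int pof2, int *src, int *dst)
{
    if (pof2) {
        *src = *dst = rank ^ i;
    } else {
        *src = (rank - i + comm_size) % comm_size;
        *dst = (rank + i) % comm_size;
    }
}
```

In the XOR schedule each pair of ranks exchanges exactly once over the
comm_size-1 steps, which is why the patch prefers it when the size is a
power of two.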