[mvapich-discuss] -DDISABLE_PTMALLOC, MPI_Bcast, and -DMCST_SUPPORT

Rahul Kumar kumarra at cse.ohio-state.edu
Fri Jun 27 01:43:23 EDT 2008


Hi David,
I have created a patch for the error you reported. The patch provides a 
runtime option to control the message size up to which shmem broadcast 
is used, and it prevents the use of shmem broadcast for very large 
message sizes. The patch is against mvapich-1.0.1. I hope you will be 
able to try it out; please let us know your feedback.
Thanks,
Rahul.
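
(The patch itself is attached at the bottom of this message. As a purely
illustrative aside -- not the actual patch code -- a runtime size cap of
this kind is typically implemented by reading an environment variable once
at startup and comparing it against the message size before taking the
shmem-broadcast path. The variable and function names below are
hypothetical placeholders, not names taken from the patch:)

/* Illustrative sketch only: hypothetical names, not the actual patch code. */
#include <stdlib.h>

static int shmem_bcast_threshold = 1 << 20;   /* hypothetical default cap */

void init_shmem_bcast_threshold(void)
{
    /* "SHMEM_BCAST_MAX_MSG" is a made-up variable name for illustration;
     * see the attached patch for the real runtime option. */
    char *val = getenv("SHMEM_BCAST_MAX_MSG");
    if (val != NULL)
        shmem_bcast_threshold = atoi(val);
}

int use_shmem_bcast(int nbytes)
{
    /* Fall back to the point-to-point broadcast algorithm for messages
     * larger than the runtime cap. */
    return nbytes <= shmem_bcast_threshold;
}
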
> Date: Mon, 23 Jun 2008 21:28:31 -0500
> From: David_Kewley at Dell.com
> To: koop at cse.ohio-state.edu
> Cc: mvapich-discuss at cse.ohio-state.edu
> Subject: RE: [mvapich-discuss] -DDISABLE_PTMALLOC, MPI_Bcast,
>      and -DMCST_SUPPORT
>
> Matt,
>
> Thanks for clarifying the effect of -DDISABLE_PTMALLOC, and the fact
> that hardware-based multicast is not enabled right now.  I think that's
> all I need to know on those topics for now.
>
> I have a reproducer and observations about the apparent MPI_Bcast
> segfault bug.  This is on x86_64, using Intel Fortran 10.1.015 (Build
> 20080312), and the executable ends up using the Intel implementation of
> memcpy(), in case that's significant -- see the backtrace below.  This
> is with MVAPICH 1.0.
>
> The segfault occurs whenever these two conditions both hold:
>
> 1) length of the character array sent is > 8MB-11kB
> 2) #procs is > (7 nodes) * (N procs per node)
>
> For the second condition I tested with N=1,2,4 procs per node, in which
> cases the segfault occurred when the number of procs in the job exceeded
> 7, 14, and 28 procs respectively.
>
> If either of the conditions does not hold, the segfault does not occur.
> The threshold is exactly 8MB-11kB.  If the length of the char array is
> 8MB-11kB, it's fine, but if it's 8MB-11kB+1, it segfaults.
>
> The segfault occurs in the memcpy function (again, it's the Intel
> memcpy), when it tries to copy into the rhandle->buf beyond the 8MB-11kB
> mark.  The backtrace is, for example:
>
> #0  0x00000000004045c1 in __intel_new_memcpy ()
> #1  0x0000000000401ee8 in _intel_fast_memcpy.J ()
> #2  0x0000002a9560010e in MPID_VIA_self_start () from
> /opt/mvapich/1.0/intel/10.1.015/lib/shared/libmpich.so.1.0
> #3  0x0000002a955d8e82 in MPID_IsendContig () from
> /opt/mvapich/1.0/intel/10.1.015/lib/shared/libmpich.so.1.0
> #4  0x0000002a955d7564 in MPID_IsendDatatype () from
> /opt/mvapich/1.0/intel/10.1.015/lib/shared/libmpich.so.1.0
> #5  0x0000002a955cc4d6 in PMPI_Isend () from
> /opt/mvapich/1.0/intel/10.1.015/lib/shared/libmpich.so.1.0
> #6  0x0000002a955e95d2 in PMPI_Sendrecv () from
> /opt/mvapich/1.0/intel/10.1.015/lib/shared/libmpich.so.1.0
> #7  0x0000002a955bf7e9 in intra_Bcast_Large () from
> /opt/mvapich/1.0/intel/10.1.015/lib/shared/libmpich.so.1.0
> #8  0x0000002a955bcfa0 in intra_newBcast () from
> /opt/mvapich/1.0/intel/10.1.015/lib/shared/libmpich.so.1.0
> #9  0x0000002a95594e00 in PMPI_Bcast () from
> /opt/mvapich/1.0/intel/10.1.015/lib/shared/libmpich.so.1.0
> #10 0x0000000000401e3d in main ()
>
> Attached find a simple reproducer C program.
>
> David
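
(The reproducer attachment is not reproduced inline in this archive. As a
rough sketch of what such a reproducer might look like -- assuming a
broadcast of a character buffer just past the reported threshold of
8MB-11kB = 8*1024*1024 - 11*1024 = 8377344 bytes -- a program along these
lines exercises the same path; the program David actually attached may
differ:)

/* Minimal MPI_Bcast reproducer sketch -- not David's attached program. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define THRESHOLD (8*1024*1024 - 11*1024)   /* 8MB - 11kB = 8377344 bytes */

int main(int argc, char **argv)
{
    int rank, nprocs;
    int len = THRESHOLD + 1;   /* one byte past the reported threshold */
    char *buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (argc > 1)
        len = atoi(argv[1]);   /* allow the message length to be varied */

    buf = (char *) malloc(len);
    if (buf == NULL) {
        fprintf(stderr, "rank %d: malloc of %d bytes failed\n", rank, len);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    if (rank == 0)
        memset(buf, 'x', len);

    /* The reported segfault occurred here once len > 8MB-11kB and the job
     * used more than 7 nodes' worth of processes. */
    MPI_Bcast(buf, len, MPI_CHAR, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("broadcast of %d bytes on %d procs completed\n", len, nprocs);

    free(buf);
    MPI_Finalize();
    return 0;
}

(Running such a test with and without VIADEV_USE_SHMEM_COLL=0, as suggested
later in the thread, would help confirm whether the shared-memory collective
path is involved.)
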
>
>   
>> -----Original Message-----
>> From: Matthew Koop [mailto:koop at cse.ohio-state.edu]
>> Sent: Friday, June 20, 2008 4:15 AM
>> To: Kewley, David
>> Cc: mvapich-discuss at cse.ohio-state.edu
>> Subject: Re: [mvapich-discuss] -DDISABLE_PTMALLOC, MPI_Bcast, and -
>> DMCST_SUPPORT
>>
>> David,
>>
>> I'll answer your questions inline:
>>
>>> What are the likely performance impacts of using -DDISABLE_PTMALLOC
>>> (including memory use)?  Does this differ between MVAPICH and MVAPICH2?
>>> We are considering seeing what effect this has on certain applications
>>> that have seen problems with realloc.
>> The effects of turning off PTMALLOC (using -DDISABLE_PTMALLOC) will be the
>> same between MVAPICH and MVAPICH2.
>>
>> The point of using the PTMALLOC library is to allow caching of InfiniBand
>> memory registrations. To ensure correctness we need to know if memory is
>> being free'd, etc. Since registration for InfiniBand is very expensive we
>> attempt to cache these registrations so if the same buffer is re-used
>> again for communication it will already be registered (speeding up the
>> application).
>>
>> So the performance change will be application-dependent. If the
>> application makes frequent re-use of buffers for communication,
>> performance will likely be hurt by disabling the cache. On the flip side,
>> if the application has very poor buffer re-use, performance may actually
>> be better without the registration cache (you can always turn it off at
>> runtime with VIADEV_USE_DREG_CACHE=0 on MVAPICH). When the registration
>> cache is not turned on, a copy-based approach is used for messages under
>> a certain size -- so the zero-copy transfer that is normally used is
>> skipped, but no registration is needed either.
>>
>> I hope this helps. Please let me know if you need additional
>> clarification.
>>
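
(To make the registration-cache trade-off above concrete, here is a rough
conceptual sketch of what such a cache does. It is illustrative only --
not MVAPICH's actual dreg implementation -- and the structure and function
names are invented for the example:)

/* Conceptual sketch of an InfiniBand registration cache -- illustrative
 * only.  The idea: pinning/registering memory with the HCA is expensive,
 * so remember (addr, len) -> registration mappings and reuse them when
 * the same buffer is communicated again. */
#include <stddef.h>
#include <stdlib.h>

struct reg_entry {
    void             *addr;
    size_t            len;
    void             *mr;      /* opaque handle to the pinned memory region */
    struct reg_entry *next;
};

static struct reg_entry *reg_cache = NULL;

/* Placeholder for the (expensive) InfiniBand registration call. */
extern void *register_with_hca(void *addr, size_t len);

void *get_registration(void *addr, size_t len)
{
    struct reg_entry *e;

    /* Cache hit: buffer already registered, skip the expensive call. */
    for (e = reg_cache; e != NULL; e = e->next)
        if (e->addr == addr && e->len >= len)
            return e->mr;

    /* Cache miss: register the buffer and remember it for future transfers. */
    e = (struct reg_entry *) malloc(sizeof(*e));
    e->addr = addr;
    e->len  = len;
    e->mr   = register_with_hca(addr, len);
    e->next = reg_cache;
    reg_cache = e;
    return e->mr;
}

(This is also why PTMALLOC matters for correctness: the cache is only safe
if the library notices when a cached buffer is freed or remapped so the
stale entry can be invalidated, which is what the malloc-library
interception provides.)
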
>>> Topic #2:
>>>
>>> We are using the OpenIB components of OFED 1.2.5.5, and are building our
>>> own MVAPICH and MVAPICH2, with various versions of MV* and compiler.
>>>
>>> We have an application apparently failing during MVAPICH MPI_Bcast of
>>> many MB of data to dozens to hundreds of MPI ranks.  (Actually I believe
>>> it's Fortran, so I guess MPI_BCAST.)  We have already implemented
>>> VIADEV_USE_SHMEM_BCAST=0 just in case, but we are still having problems.
>>> (I'm not 100% reassured by the user's reports that the problem is still
>>> in MPI_Bcast, but I think it's likely.)
>> We have not seen this error before, so we're very interested to track this
>> down. If there is a reproducer for this we would be very interested to try
>> it out here.
>>
>> Does the same error occur with MVAPICH2 as with MVAPICH? Also, does
>> turning off all shared memory collectives (VIADEV_USE_SHMEM_COLL=0) avoid
>> the error?
>>
>>> Topic #3:
>>>
>>> As I looked through the MVAPICH code to see how MPI_Bcast is implemented
>>> for ch_gen2, I see MCST_SUPPORT repeatedly checked.  It appears this is
>>> not set by default (by make.mvapich.gen2).
>>>
>>> If MCST_SUPPORT is disabled, what algorithm is used to implement
>>> MPI_Bcast?  If MCST_SUPPORT is enabled, does MPI_Bcast use IB multicast?
>>> Should it greatly speed up MPI_Bcast if enabled?
>>>
>>> It seems like MCST_SUPPORT would be beneficial, but the fact that it is
>>> not enabled by default makes me wonder what the risks are of enabling it?
>> MCST (hardware-based multicast) is not supported right now. InfiniBand's
>> multicast is unreliable and supports sending only in 2KB chunks, and we
>> haven't seen good performance for it on large systems. Mellanox is
>> planning on adding reliable multicast support to the ConnectX adapter
>> soon, at which point we'll re-evaluate the benefits. So at this point
>> MCST support should not be enabled.
>>
>> Let us know if you have any more questions.
>>
>> Thanks,
>> Matt
>>     
-------------- next part --------------
Index: src/include/mpiimpl.h
===================================================================
--- src/include/mpiimpl.h	(revision 2582)
+++ src/include/mpiimpl.h	(working copy)
@@ -3406,8 +3406,13 @@
 #define MPIR_BCAST_SHORT_MSG          12288
 #define MPIR_BCAST_LONG_MSG           524288
 #define MPIR_BCAST_MIN_PROCS          8
+#if defined(_OSU_MVAPICH_)
+#define MPIR_ALLTOALL_SHORT_MSG       8192
+#define MPIR_ALLTOALL_MEDIUM_MSG      8192
+#else
 #define MPIR_ALLTOALL_SHORT_MSG       256
 #define MPIR_ALLTOALL_MEDIUM_MSG      32768
+#endif
 #define MPIR_REDSCAT_COMMUTATIVE_LONG_MSG 524288
 #define MPIR_REDSCAT_NONCOMMUTATIVE_SHORT_MSG 512
 #define MPIR_ALLGATHER_SHORT_MSG      81920
Index: src/mpi/coll/alltoallv.c
===================================================================
--- src/mpi/coll/alltoallv.c	(revision 2582)
+++ src/mpi/coll/alltoallv.c	(working copy)
@@ -63,6 +63,10 @@
     MPI_Request *reqarray;
     int dst, rank, req_cnt;
     MPI_Comm comm;
+#if defined(_OSU_MVAPICH_)
+    int pof2, src;
+    MPI_Status status;
+#endif
     
     comm = comm_ptr->handle;
     comm_size = comm_ptr->local_size;
@@ -75,6 +79,58 @@
     /* check if multiple threads are calling this collective function */
     MPIDU_ERR_CHECK_MULTIPLE_THREADS_ENTER( comm_ptr );
 
+#if defined(_OSU_MVAPICH_)
+    mpi_errno = MPIR_Localcopy(((char *)sendbuf +
+                                sdispls[rank]*send_extent),
+                               sendcnts[rank], sendtype,
+                               ((char *)recvbuf +
+                                rdispls[rank]*recv_extent),
+                               recvcnts[rank], recvtype);
+
+    if (mpi_errno)
+    {
+        mpi_errno = MPIR_Err_create_code(mpi_errno, MPIR_ERR_RECOVERABLE, FCNAME, __LINE__, MPI_ERR_OTHER, "**fail", 0);
+        return mpi_errno;
+    }
+
+    /* Is comm_size a power-of-two? */
+    i = 1;
+    while (i < comm_size)
+        i *= 2;
+    if (i == comm_size)
+        pof2 = 1;
+    else
+        pof2 = 0;
+
+    /* Do the pairwise exchanges */
+    for (i=1; i<comm_size; i++) {
+        if (pof2 == 1) {
+            /* use exclusive-or algorithm */
+            src = dst = rank ^ i;
+        }
+        else {
+            src = (rank - i + comm_size) % comm_size;
+            dst = (rank + i) % comm_size;
+        }
+
+        mpi_errno = MPIC_Sendrecv(((char *)sendbuf +
+                                   sdispls[dst]*send_extent),
+                                  sendcnts[dst], sendtype, dst,
+                                  MPIR_ALLTOALL_TAG,
+                                  ((char *)recvbuf +
+                                   rdispls[src]*recv_extent),
+                                  recvcnts[src], recvtype, src,
+                                  MPIR_ALLTOALL_TAG, comm, &status);
+
+        if (mpi_errno)
+        {
+            mpi_errno = MPIR_Err_create_code(mpi_errno, MPIR_ERR_RECOVERABLE, FCNAME, __LINE__, MPI_ERR_OTHER, "**fail", 0);
+            return mpi_errno;
+        }
+
+    }
+#else
+
     starray = (MPI_Status *) MPIU_Malloc(2*comm_size*sizeof(MPI_Status));
     /* --BEGIN ERROR HANDLING-- */
     if (!starray) {
@@ -141,6 +197,7 @@
     
     MPIU_Free(reqarray);
     MPIU_Free(starray);
+#endif
     
     /* check if multiple threads are calling this collective function */
     MPIDU_ERR_CHECK_MULTIPLE_THREADS_EXIT( comm_ptr );


More information about the mvapich-discuss mailing list