[mvapich-discuss] -DDISABLE_PTMALLOC, MPI_Bcast, and -DMCST_SUPPORT (fwd)

Rahul Kumar kumarra at cse.ohio-state.edu
Tue Jun 24 12:20:50 EDT 2008


Hi David,
Thanks for the test program. We were able to reproduce the problem and 
have a temporary fix for you.
If you increase the value of the variable "file_size" in file 
src/coll/intra_fns_new.c, around line 1655, it should solve your problem. 
Currently it is set to (1<<23), which is right around the message size at 
which you see the failure. You should increase it to a value much larger 
than your largest broadcast message size. In the meantime we will 
generate a better patch.
Thanks,
Rahul.
>
> ---------- Forwarded message ----------
> Date: Mon, 23 Jun 2008 21:28:31 -0500
> From: David_Kewley at Dell.com
> To: koop at cse.ohio-state.edu
> Cc: mvapich-discuss at cse.ohio-state.edu
> Subject: RE: [mvapich-discuss] -DDISABLE_PTMALLOC, MPI_Bcast,
>      and -DMCST_SUPPORT
>
> Matt,
>
> Thanks for clarifying the effect of -DDISABLE_PTMALLOC, and the fact
> that hardware-based multicast is not enabled right now.  I think that's
> all I need to know on those topics for now.
>
> I have a reproducer and observations about the apparent MPI_Bcast
> segfault bug.  This is on x86_64, using Intel Fortran 10.1.015 (Build
> 20080312), and the executable ends up using the Intel implementation of
> memcpy(), in case that's significant -- see the backtrace below.  This
> is with MVAPICH 1.0.
>
> The segfault occurs whenever these two conditions both hold:
>
> 1) length of the character array sent is > 8MB-11kB
> 2) #procs is > (7 nodes) * (N procs per node)
>
> For the second condition I tested with N=1, 2, and 4 procs per node; in
> those cases the segfault occurred when the number of procs in the job
> exceeded 7, 14, and 28 respectively.
>
> If either of the conditions does not hold, the segfault does not occur.
> The threshold is exactly 8MB-11kB.  If the length of the char array is
> 8MB-11kB, it's fine, but if it's 8MB-11kB+1, it segfaults.
>
> The segfault occurs in the memcpy function (again, it's the Intel
> memcpy) when it tries to copy into rhandle->buf beyond the 8MB-11kB
> mark.  An example backtrace:
>
> #0  0x00000000004045c1 in __intel_new_memcpy ()
> #1  0x0000000000401ee8 in _intel_fast_memcpy.J ()
> #2  0x0000002a9560010e in MPID_VIA_self_start () from
> /opt/mvapich/1.0/intel/10.1.015/lib/shared/libmpich.so.1.0
> #3  0x0000002a955d8e82 in MPID_IsendContig () from
> /opt/mvapich/1.0/intel/10.1.015/lib/shared/libmpich.so.1.0
> #4  0x0000002a955d7564 in MPID_IsendDatatype () from
> /opt/mvapich/1.0/intel/10.1.015/lib/shared/libmpich.so.1.0
> #5  0x0000002a955cc4d6 in PMPI_Isend () from
> /opt/mvapich/1.0/intel/10.1.015/lib/shared/libmpich.so.1.0
> #6  0x0000002a955e95d2 in PMPI_Sendrecv () from
> /opt/mvapich/1.0/intel/10.1.015/lib/shared/libmpich.so.1.0
> #7  0x0000002a955bf7e9 in intra_Bcast_Large () from
> /opt/mvapich/1.0/intel/10.1.015/lib/shared/libmpich.so.1.0
> #8  0x0000002a955bcfa0 in intra_newBcast () from
> /opt/mvapich/1.0/intel/10.1.015/lib/shared/libmpich.so.1.0
> #9  0x0000002a95594e00 in PMPI_Bcast () from
> /opt/mvapich/1.0/intel/10.1.015/lib/shared/libmpich.so.1.0
> #10 0x0000000000401e3d in main ()
>
> Attached is a simple reproducer C program.
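>
> A minimal sketch of such a reproducer (hypothetical, reconstructed from
> the observations above; binary units assumed for the 8MB-11kB
> threshold):
>
>     /* bcast_repro.c -- broadcast a char buffer one byte past the
>      * observed 8MB-11kB threshold; run with more than (7 nodes) *
>      * (N procs per node) ranks to match the failing configuration. */
>     #include <mpi.h>
>     #include <stdio.h>
>     #include <stdlib.h>
>     #include <string.h>
>
>     #define THRESHOLD (8*1024*1024 - 11*1024)   /* 8377344 bytes */
>
>     int main(int argc, char **argv)
>     {
>         int rank;
>         int len = THRESHOLD + 1;        /* one byte past -> segfault */
>         char *buf;
>
>         MPI_Init(&argc, &argv);
>         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>
>         buf = malloc(len);
>         if (buf == NULL)
>             MPI_Abort(MPI_COMM_WORLD, 1);
>         if (rank == 0)
>             memset(buf, 'x', len);      /* root fills the payload */
>
>         MPI_Bcast(buf, len, MPI_CHAR, 0, MPI_COMM_WORLD);
>
>         if (rank == 0)
>             printf("MPI_Bcast of %d bytes completed\n", len);
>
>         free(buf);
>         MPI_Finalize();
>         return 0;
>     }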
>
> David
>
>   
>> -----Original Message-----
>> From: Matthew Koop [mailto:koop at cse.ohio-state.edu]
>> Sent: Friday, June 20, 2008 4:15 AM
>> To: Kewley, David
>> Cc: mvapich-discuss at cse.ohio-state.edu
>> Subject: Re: [mvapich-discuss] -DDISABLE_PTMALLOC, MPI_Bcast, and -
>> DMCST_SUPPORT
>>
>> David,
>>
>> I'll answer your questions inline:
>>
>>     
>>> What are the likely performance impacts of using -DDISABLE_PTMALLOC
>>> (including memory use)?  Does this differ between MVAPICH and
>>> MVAPICH2?
>>>
>>> We are considering seeing what effect this has on certain applications
>>> that have seen problems with realloc.
>>>
>> The effects of turning off PTMALLOC (using -DDISABLE_PTMALLOC) will be
>> the same between MVAPICH and MVAPICH2.
>>
>> The point of using the PTMALLOC library is to allow caching of
>> InfiniBand memory registrations. To ensure correctness we need to know
>> if memory is being freed, etc. Since registration for InfiniBand is
>> very expensive, we attempt to cache these registrations so that if the
>> same buffer is re-used for communication it will already be registered
>> (speeding up the application).
>>
>> So the performance change will be application-dependent. If the
>> application makes frequent re-use of buffers for communication, the
>> performance will likely be hurt. On the flip side, if the application
>> has very poor buffer re-use, the performance may actually be better
>> without the registration cache (you can always turn it off at runtime
>> with VIADEV_USE_DREG_CACHE=0 on MVAPICH). When the registration cache
>> is not turned on, a copy-based approach is used for messages under a
>> certain size -- so the zero-copy path normally used is skipped, but no
>> registration is needed either.
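>>
>> For example, with mpirun_rsh (a sketch -- "./your_app" and the launcher
>> options stand in for your actual job):
>>
>>     mpirun_rsh -np 16 -hostfile ./hosts VIADEV_USE_DREG_CACHE=0 ./your_app
>>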
>> I hope this helps. Please let me know if you need additional
>> clarification.
>>
>>     
>>> Topic #2:
>>>
>>> We are using the OpenIB components of OFED 1.2.5.5, and are building
>>> our own MVAPICH and MVAPICH2, with various versions of MV* and
>>> compiler.
>>>
>>> We have an application apparently failing during MVAPICH MPI_Bcast of
>>> many MB of data to dozens or hundreds of MPI ranks.  (Actually I
>>> believe it's Fortran, so I guess MPI_BCAST.)  We have already
>>> implemented VIADEV_USE_SHMEM_BCAST=0 just in case, but we are still
>>> having problems.  (I'm not 100% reassured by the user's reports that
>>> the problem is still in MPI_Bcast, but I think it's likely.)
>>>
>> We have not seen this error before, so we're very interested to track
>> this down. If there is a reproducer for this we would be very
>> interested to try out here.
>>
>> Does the same error occur with MVAPICH2 as with MVAPICH? Also, does
>> turning off all shared memory collectives (VIADEV_USE_SHMEM_COLL=0)
>> avoid the error?
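>>
>> That setting can be passed the same way as VIADEV_USE_SHMEM_BCAST -- a
>> sketch, adjust for your launcher and job:
>>
>>     mpirun_rsh -np 64 -hostfile ./hosts VIADEV_USE_SHMEM_COLL=0 ./your_app
>>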
>>> Topic #3:
>>>
>>> As I looked through the MVAPICH code to see how MPI_Bcast is
>>> implemented for ch_gen2, I see MCST_SUPPORT repeatedly checked.  It
>>> appears this is not set by default (by make.mvapich.gen2).
>>>
>>> If MCST_SUPPORT is disabled, what algorithm is used to implement
>>> MPI_Bcast?  If MCST_SUPPORT is enabled, does MPI_Bcast use IB
>>> multicast?  Would enabling it greatly speed up MPI_Bcast?
>>>
>>> It seems like MCST_SUPPORT would be beneficial, but the fact that it
>>> is not enabled by default makes me wonder what the risks of enabling
>>> it are.
>>>
>> MCST (hardware-based multicast) is not supported right now.
>> InfiniBand's multicast is unreliable and supports sending only in 2KB
>> chunks, and we haven't seen good performance for it on large systems.
>> Mellanox is planning on adding reliable multicast support to the
>> ConnectX adapter soon, at which point we'll re-evaluate the benefits.
>> So at this point MCST support should not be enabled.
>>
>> Let us know if you have any more questions.
>>
>> Thanks,
>> Matt
>>     
>

