[mvapich-discuss] Assertion failure

Martin Pokorny mpokorny at nrao.edu
Tue Jan 22 11:44:51 EST 2013


Hi Devendar,

On 01/16/2013 07:54 AM, Devendar Bureddy wrote:
> It seems the assertion is being hit during a large message transfer
> using the non-default RNDV protocol, i.e. R3 (the default is RPUT).  A
> large message transfer should switch to the R3 protocol only if the IB
> memory registration has failed internally.
>
> Can you try forcing all large message transfers to use the R3
> protocol with the run-time parameter MV2_RNDV_PROTOCOL=R3, and see if
> this increases the frequency of failure?

Good suggestion! When I set MV2_RNDV_PROTOCOL=R3, the failure occurs in 
every trial, immediately upon opening a file. From the backtrace I see 
that none of the modified ADIO Lustre code has been called, so I'm 
tempted to eliminate my modifications as a possible cause of the error 
(although I'm willing to do more to confirm that.)
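
For anyone wanting to repeat the test, here is a minimal sketch of how the
run-time parameters mentioned in this thread could be assembled into a launch
command. It assumes the mpirun_rsh launcher, which accepts VAR=value pairs
before the program name; the process count, hostfile name, and program name
are hypothetical placeholders, not my actual setup.

```python
def build_mpirun_rsh_argv(nprocs, hostfile, program, env):
    """Build an mpirun_rsh argument list with VAR=value pairs before the program."""
    argv = ["mpirun_rsh", "-np", str(nprocs), "-hostfile", hostfile]
    # mpirun_rsh takes environment settings as VAR=value arguments.
    argv += [f"{name}={value}" for name, value in env.items()]
    argv.append(program)
    return argv


# Parameters taken from this thread; forcing R3 made the failure reproducible.
settings = {
    "MV2_USE_RDMA_CM": "1",
    "MV2_ENABLE_AFFINITY": "0",
    "MV2_RNDV_PROTOCOL": "R3",
}

# "hosts" and "./my_io_test" are hypothetical names used only for illustration.
argv = build_mpirun_rsh_argv(4, "hosts", "./my_io_test", settings)
print(" ".join(argv))
```

Removing MV2_RNDV_PROTOCOL from the settings restores the default protocol
(RPUT), which makes it easy to compare the two cases.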

Your statement about failed IB memory registration also got me looking
at my IB configuration. It looks like I should increase the values of
the log_num_mtt and log_mtts_per_seg parameters of the mlx4_core
kernel module, as they are currently pitifully low.
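
As a rough guide to what those two parameters control, the sketch below
estimates the maximum amount of memory that can be registered, using the
commonly cited relation
max_reg_mem = 2^log_num_mtt * 2^log_mtts_per_seg * page_size.
The example values are illustrative assumptions, not my actual settings, and
the sysfs path is assumed to be where mlx4_core exposes its parameters.

```python
import os

# Assumed location of the mlx4_core module parameters on a Linux host.
SYSFS_DIR = "/sys/module/mlx4_core/parameters"


def max_registerable_bytes(log_num_mtt, log_mtts_per_seg, page_size=4096):
    """Estimate registrable memory: 2^log_num_mtt * 2^log_mtts_per_seg * page_size."""
    return (2 ** log_num_mtt) * (2 ** log_mtts_per_seg) * page_size


def read_param(name, directory=SYSFS_DIR):
    """Return the integer value of a module parameter, or None if unavailable."""
    path = os.path.join(directory, name)
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


# Illustrative values only: log_num_mtt=20, log_mtts_per_seg=3, 4 KiB pages.
example = max_registerable_bytes(20, 3)
print(example / 2**30, "GiB")
```

A common rule of thumb is to choose values so that this estimate is at least
as large as physical memory, and preferably about twice that.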

>
> -Devendar
>
> On Tue, Jan 15, 2013 at 10:24 AM, Martin Pokorny<mpokorny at nrao.edu>  wrote:
>> Hi Devendar,
>>
>>
>> On 01/14/2013 09:58 PM, Devendar Bureddy wrote:
>>>
>>> We haven't seen this assertion before.  Is this happening even without
>>> your modifications?  Did you specify any run-time parameters?
>>
>>
>> Reproducing the error without my modifications is certainly high on my list
>> of things to do, but I haven't done it yet. (To clarify my earlier comment
>> about modified MPI-IO routines, I should have been more specific: the
>> routines I'm working with are part of the ADIO Lustre code.) The fault
>> occurs only rarely, and I'm trying to find a way to increase its frequency
>> of occurrence to help with debugging. The only run-time parameters I've
>> currently set are MV2_USE_RDMA_CM=1 and MV2_ENABLE_AFFINITY=0.
>>
>>
>>> On Mon, Jan 14, 2013 at 7:01 PM, Martin Pokorny<mpokorny at nrao.edu>   wrote:
>>>>
>>>> Hello everyone,
>>>>
>>>> I've been occasionally seeing the following assertion error under
>>>> mvapich2-1.9a2. The conditions leading to the failure are not clear to me
>>>> (I'm working on a real-time data processing system), and the failure
>>>> occurs only sporadically.
>>>>
>>>> Assertion failed in file
>>>> src/mpid/ch3/channels/mrail/src/rdma/ch3_rndvtransfer.c at line 922:
>>>> vc->ch.pending_r3_data == 0
>>>>
>>>> Note that I have been making some modifications to MPI-IO routines, which
>>>> muddies the waters a bit, but are there any known conditions that might
>>>> trigger this assertion failure? Are there any configuration variables
>>>> that I might change to (try to) avoid this failure?

-- 
Martin Pokorny
Software Engineer - Karl G. Jansky Very Large Array
National Radio Astronomy Observatory - New Mexico Operations

