[mvapich-discuss] Assertion failure

Devendar Bureddy bureddy at cse.ohio-state.edu
Wed Jan 16 09:54:46 EST 2013


Hi Martin

It seems the assertion is being hit during a large message transfer
that uses the non-default rendezvous (RNDV) protocol, i.e. R3 (the
default is RPUT).  A large message transfer should switch to the R3
protocol only if the IB memory registration fails internally.

Can you try forcing all large message transfers to use the R3
protocol with the run-time parameter MV2_RNDV_PROTOCOL=R3 and see if
this increases the frequency of the failure?
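
One way to compare the failure frequency with and without the forced
protocol is a small harness like the sketch below. It is only an
illustration, not part of MVAPICH2: the function name, the run count and
the launch command are hypothetical, and it assumes the launcher forwards
environment variables to the MPI ranks (with some launchers the variable
has to be given on the launch command line instead). It counts a run as
failed when the text "Assertion failed" from the report appears in its
output.

```python
import os
import subprocess
import sys

# Text that appears in the assertion message quoted in the original report.
ASSERTION_MARKER = "Assertion failed"


def count_failures(cmd, extra_env, runs):
    """Run `cmd` `runs` times with `extra_env` added to the environment.

    Returns the number of runs whose stdout or stderr contains the
    assertion marker. This is a hypothetical helper for illustration.
    """
    env = dict(os.environ)
    env.update(extra_env)
    failures = 0
    for _ in range(runs):
        result = subprocess.run(cmd, env=env, capture_output=True, text=True)
        if ASSERTION_MARKER in result.stdout or ASSERTION_MARKER in result.stderr:
            failures += 1
    return failures


if __name__ == "__main__" and len(sys.argv) > 1:
    cmd = sys.argv[1:]  # your own MPI launch command (not taken from the thread)
    runs = 20           # arbitrary illustrative run count
    default_count = count_failures(cmd, {}, runs)
    r3_count = count_failures(cmd, {"MV2_RNDV_PROTOCOL": "R3"}, runs)
    print(f"default protocol: {default_count}/{runs} runs hit the assertion")
    print(f"R3 forced:        {r3_count}/{runs} runs hit the assertion")
```

Pass the usual launch command for the application as the script's
arguments; a clearly higher count in the second line would suggest the
failure is tied to the R3 path.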

-Devendar

On Tue, Jan 15, 2013 at 10:24 AM, Martin Pokorny <mpokorny at nrao.edu> wrote:
> Hi Devendar,
>
>
> On 01/14/2013 09:58 PM, Devendar Bureddy wrote:
>>
>> We haven't seen this assertion before.  Is this happening even without
>> your modifications?  Did you specify any run-time parameters?
>
>
> Reproducing the error without my modifications is certainly high on my list
> of things to do, but I haven't done it yet. (To clarify my earlier comment
> about modified MPI-IO routines, I should have been more specific: the
> routines I'm working with are part of the ADIO Lustre code.) The fault
> occurs only rarely, and I'm trying to find a way to increase its frequency
> of occurrence to help with debugging. The only run-time parameters I've
> currently set are MV2_USE_RDMA_CM=1 and MV2_ENABLE_AFFINITY=0.
>
>
>> On Mon, Jan 14, 2013 at 7:01 PM, Martin Pokorny<mpokorny at nrao.edu>  wrote:
>>>
>>> Hello everyone,
>>>
>>> I've been occasionally seeing the following assertion error under
>>> mvapich2-1.9a2. The conditions leading to the failure are not clear to me
>>> (I'm working on a real-time data processing system), but this failure
>>> only occurs sporadically.
>>>
>>> Assertion failed in file
>>> src/mpid/ch3/channels/mrail/src/rdma/ch3_rndvtransfer.c at line 922:
>>> vc->ch.pending_r3_data == 0
>>>
>>> Note that I have been making some modifications to MPI-IO routines, so
>>> that muddies the waters a bit, but are there any known conditions that
>>> might trigger this assertion failure? Are there any configuration
>>> variables that I might change to (try to) avoid this failure?
>
>
> --
> Martin Pokorny
> Software Engineer - Karl G. Jansky Very Large Array
> National Radio Astronomy Observatory - New Mexico Operations



-- 
Devendar

