[mvapich-discuss] mvapich2 over uDAPL gives QP_FATAL error

Lei Chai chai.15 at osu.edu
Mon Mar 23 17:30:49 EDT 2009


Hi Ajay,

Thanks for your report and the detailed information. We found a bug in this
part. Basically, the noop message is there to send credits to the remote side.
So after a certain number of receives
(udapl_dynamic_credit_threshold/udapl_credit_notify_threshold, 10 by default)
the process sends a noop message. Since the total number of receive buffers
is 80 by default (udapl_prepost_depth), there can be as many as 80/10 = 8
noop messages outstanding, and the remote side has to have at least 8
receive buffers for receiving these noop messages. The current value of
udapl_prepost_noop_extra is 5, and that's why it failed. So your
workarounds are actually fixes. I suggest you use
udapl_prepost_noop_extra=8. We will fix it in our code base also.
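
The arithmetic above can be sketched as a small check. This is only an
illustration of the static bound described in this message, not code from
mvapich2; the function names are hypothetical, and the arguments simply
mirror udapl_prepost_depth, udapl_credit_notify_threshold and
udapl_prepost_noop_extra. It does not model the actual credit flow (for
example the effect of udapl_initial_credits).

```python
# Hypothetical helpers; they are NOT part of mvapich2.
# Assumption: one noop is sent per `credit_notify_threshold` receives,
# so the worst case is one noop per full group of preposted buffers.

def max_outstanding_noops(prepost_depth: int, credit_notify_threshold: int) -> int:
    """Upper bound on noop messages that can be in flight at once."""
    return prepost_depth // credit_notify_threshold


def noop_buffers_sufficient(prepost_depth: int,
                            credit_notify_threshold: int,
                            prepost_noop_extra: int) -> bool:
    """True if the extra receive buffers cover the worst-case noop count."""
    needed = max_outstanding_noops(prepost_depth, credit_notify_threshold)
    return prepost_noop_extra >= needed


if __name__ == "__main__":
    # Defaults discussed in this thread: depth 80, threshold 10, extra 5.
    print(max_outstanding_noops(80, 10))
    print(noop_buffers_sufficient(80, 10, 5))  # default setting
    print(noop_buffers_sufficient(80, 10, 8))  # suggested setting
    print(noop_buffers_sufficient(59, 10, 5))  # reduced prepost depth
```

With the defaults the bound is 8, which exceeds the 5 extra buffers; with
udapl_prepost_noop_extra=8, or with a prepost depth of 59 (bound of 5), the
check passes.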

Thanks,
Lei


Ajay wrote:
> Hello,
>
> I am using "mvapich2-1.2p1" from "OFED-1.4.1-RC1". I am trying to run
> mvapich2 over uDAPL ("compat-dapl-1.2.12" with "OpenIB-CMA"), and while
> running the IMB-EXT application (from Intel MPI Benchmarks 3.2) I get an
> "IBV_EVENT_QP_FATAL" event. The error occurs in the "Bidir_Get"
> benchmark of IMB-EXT. If I run only Bidir_Get of IMB-EXT, the test
> works fine.
>
> Basically, the error is that I have received some data but do not have
> buffers posted for it.
>
> I found some workarounds and would like to understand more. Here are my
> questions:
>
>    1. If the value of the "MV2_PREPOST_DEPTH" variable (udapl_prepost_depth)
>       is <= 59, this test works fine. I don't understand how
>       reducing the number of preposted buffers resolves this issue. Any
>       suggestions?
>    2. If the value of "udapl_prepost_noop_extra" is changed from
>       5 to 8, this test works fine. Alternatively, if the value of
>       "udapl_initial_credits" is changed from 5 to 2, the test works
>       fine (with no change to "udapl_prepost_noop_extra").
>       Similarly, if the value of "remote_credit" is changed from 5 to 2
>       inside the function MRAILI_Init_vc(), this test works fine. I
>       would like to know what the "remote_credit" variable is used for.
>       And what are the "remote_cc" and "rdma_credit" variables used for?
>    3. If the values of "udapl_dynamic_credit_threshold" and
>       "udapl_credit_notify_threshold" are changed from 10 to >= 13,
>       this test works fine (basically I was trying to send a NOOP after
>       15 post_recvs() instead of 10). I think NOOP sends are for
>       synchronization, so how does sending fewer NOOPs resolve
>       this issue?
>
> Thanks in advance.
>
> Regards,
> Ajay