[mvapich-discuss] mvapich2 over uDAPL gives QP_FATAL error
Lei Chai
chai.15 at osu.edu
Mon Mar 23 17:30:49 EDT 2009
Hi Ajay,
Thanks for your report and detailed information. We found a bug in this
part. Basically, the noop message exists to send credits to the remote
side. After a certain number of receives
(udapl_dynamic_credit_threshold/udapl_credit_notify_threshold, 10 by
default), the process sends a noop message. Since the total number of
receive buffers is 80 by default (udapl_prepost_depth), there could be
as many as 8 noop messages outstanding, and the remote side has to have
at least 8 receive buffers available for these noop messages. The
current value (udapl_prepost_noop_extra) is 5, and that's why it failed.
So your workarounds are actually fixes. I suggest you use
udapl_prepost_noop_extra=8. We will fix it in our code base as well.
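
As a rough illustration of the arithmetic above (a sketch only, not
MVAPICH2 code: the function names are hypothetical, and the use of
integer division to count outstanding noops is an assumption):

```python
# Hypothetical helpers that mirror the reasoning in this message.
# They are NOT part of MVAPICH2; only the parameter names and the
# default values come from this thread.

def max_outstanding_noops(prepost_depth: int, credit_threshold: int) -> int:
    """One noop is sent per `credit_threshold` receives, so with
    `prepost_depth` receive buffers this many noops can be in flight
    (assumed: integer division)."""
    return prepost_depth // credit_threshold


def noop_buffers_sufficient(prepost_depth: int,
                            credit_threshold: int,
                            noop_extra: int) -> bool:
    """True if the remote side reserves enough extra receive buffers
    (udapl_prepost_noop_extra) for every noop that may be outstanding."""
    return noop_extra >= max_outstanding_noops(prepost_depth, credit_threshold)


if __name__ == "__main__":
    # Defaults discussed here: prepost depth 80, threshold 10.
    print(max_outstanding_noops(80, 10))        # 8
    print(noop_buffers_sufficient(80, 10, 5))   # False (current value)
    print(noop_buffers_sufficient(80, 10, 8))   # True  (suggested value)
```

With the defaults, 8 noops can be outstanding but only 5 extra buffers
are reserved, which matches the failure described above.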
Thanks,
Lei
Ajay wrote:
> Hello,
>
> I am using "mvapich2-1.2p1" from "OFED-1.4.1-RC1". I am trying to run
> mvapich2 over uDAPL ("compat-dapl-1.2.12" with "OpenIB-CMA"), and while
> running the IMB-EXT application (from Intel MPI Benchmarks 3.2) I get
> an "IBV_EVENT_QP_FATAL" event. The error occurs in the "Bidir_Get"
> benchmark of IMB-EXT. If I run only Bidir_Get of IMB-EXT, the test
> works fine.
>
> Basically, the error is that I have received some data but have no
> buffers posted for it.
>
> I found some workarounds and would like to understand them better. My
> questions are:
>
> 1. If the value of the "MV2_PREPOST_DEPTH" variable
> (udapl_prepost_depth) is <= 59, this test works fine. I don't
> understand how reducing the number of preposted buffers resolves
> this issue. Any suggestions?
> 2. If the value of the variable "udapl_prepost_noop_extra" is changed
> from 5 to 8, this test works fine. Alternatively, if the value of
> "udapl_initial_credits" is changed from 5 to 2, the test works
> fine (with no change to "udapl_prepost_noop_extra").
> Similarly, if the value of "remote_credit" is changed from 5 to 2
> inside the function MRAILI_Init_vc(), this test works fine. I
> would like to know what the "remote_credit" variable is used for,
> and likewise the "remote_cc" and "rdma_credit" variables.
> 3. If the values of the variables "udapl_dynamic_credit_threshold" and
> "udapl_credit_notify_threshold" are changed from 10 to >= 13, this
> test works fine (basically I was trying to send a NOOP after
> 15 post_recvs() instead of 10). I think NOOP sends are for
> synchronization, so how does sending fewer NOOPs resolve
> this issue?
>
> Thanks in Advance.
>
> Regards,
> Ajay