[mvapich-discuss] cm_enable_qp_init_to_rtr error

Sayantan Sur surs at cse.ohio-state.edu
Mon Mar 29 10:15:59 EDT 2010


Hi Steve,

On Mon, Mar 29, 2010 at 8:31 AM, Repsher, Stephen J
<stephen.j.repsher at boeing.com> wrote:
> Hello,
>
> Let me answer your questions first...
> (1) OFED 1.4.2
> (2) Have seen it on a few different problems with the same application, typically with 48 processes (8 processes x 6 nodes). One user even told me he experienced it (non-repeatably) with 6 nodes, but not with 5 or 7.
> (3) The application is a NASA-developed CFD code called OVERFLOW, which I assume you've heard of (?) and know is free to government contractors.
> (4) Have not tested with other compilers.
>
> By on-demand you mean XRC, right?  So far, we have only seen the issue on our newer cluster nodes, where I enable VIADEV_USE_XRC=1 because the HCAs support it.  I also set VIADEV_CLUSTER_SIZE=AUTO for all cases.
>
> As a debug measure, I've disabled XRC and we'll see if anyone still has the issue.

Thanks for this information. It helps to narrow down the problem a
little further.

By on-demand connections, I meant the mode in which MVAPICH does not
initialize all connections. Rather, connections are set up in a lazy
fashion depending on the application's actual communication pattern.
XRC is a different connection type which scales better than regular
reliable connection (RC). However, when VIADEV_USE_XRC=1, on-demand
connections are turned on by default irrespective of the number of
processes in the job. When VIADEV_USE_XRC=0, on-demand connections
are only attempted for process counts > 32.
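
To make the rule above concrete, here is a rough sketch in Python. It is
not MVAPICH code: the function name and the constant are made up for
illustration, and the only facts it encodes are the two cases described
above (XRC on, or XRC off with more than 32 processes).

```python
# Illustrative sketch only; these names are hypothetical, not MVAPICH APIs.
# The threshold of 32 is taken from the description above.
ON_DEMAND_THRESHOLD = 32


def uses_on_demand(num_procs: int, use_xrc: bool) -> bool:
    """Return True if on-demand (lazy) connection setup would be in effect."""
    if use_xrc:
        # VIADEV_USE_XRC=1: on-demand is on regardless of process count.
        return True
    # VIADEV_USE_XRC=0: on-demand is only attempted above the threshold.
    return num_procs > ON_DEMAND_THRESHOLD


if __name__ == "__main__":
    for use_xrc in (True, False):
        for num_procs in (16, 32, 48):
            print(num_procs, use_xrc, uses_on_demand(num_procs, use_xrc))
```

By this rule, a 48-process job would still use on-demand connections with
VIADEV_USE_XRC=0, so that test changes the connection type but not the
lazy setup.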

It will be useful to keep us in the loop about the outcome of your
tests with VIADEV_USE_XRC=0. I searched online a little for NASA
Overflow, and although I found descriptions of the code, I couldn't
find anything I could download and try.

Thanks.


>
> Hope that helps.
>
> Steve
>
>
> -----Original Message-----
> From: sayantan.sur at gmail.com [mailto:sayantan.sur at gmail.com] On Behalf Of Sayantan Sur
> Sent: Friday, March 26, 2010 5:43 PM
> To: Repsher, Stephen J
> Cc: mvapich-discuss at cse.ohio-state.edu
> Subject: Re: [mvapich-discuss] cm_enable_qp_init_to_rtr error
>
> Hi Steve,
>
> Thanks for your report. This may be related to the on-demand connection manager in MVAPICH. It could also be some weird IB stack issue where connection creation fails sometimes. It will be hard to say which way without more details about the system and the workload you are running:
>
> 1) Which version of OFED are you running? What is your platform?
> 2) At how many processes do you see this failure?
> 3) Can the application code be tried out by others to see if they reproduce this error?
> 4) Do you see this failure with any other compilers, or is it specific to icc?
>
> Hopefully, with this information we will better understand your problem.
>
> Thanks.
>
> On Fri, Mar 26, 2010 at 11:55 AM, Repsher, Stephen J <stephen.j.repsher at boeing.com> wrote:
>> Hello,
>>
>> I'm experiencing some random hanging behavior with my application compiled with MVAPICH 1.1 and the Intel 11.1 compiler.  Most of the time there are no errors and the code hangs, but once in a while I get an error like this...
>>
>> [Rank 33][cm.c: line 398]Failed to modify QP to RTR
>> [Rank 33][cm.c: line 582]cm_enable_qp_init_to_rtr failed
>>
>> Anyone have an idea what this might be related to?
>>
>> Thanks for your help.
>>
>> ============================================
>> Steve Repsher
>> Boeing Defense, Space, & Security - Rotorcraft Aerodynamics/CFD
>> Phone: (610) 591-1510
>> Fax: (610) 591-6263
>> stephen.j.repsher at boeing.com
>>
>>
>>
>> _______________________________________________
>> mvapich-discuss mailing list
>> mvapich-discuss at cse.ohio-state.edu
>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>
>>
>
>
>
> --
> Sayantan Sur
>
> Research Scientist
> Department of Computer Science
> The Ohio State University.
>
>



-- 
Sayantan Sur

Research Scientist
Department of Computer Science
The Ohio State University.


