[mvapich-discuss] cm_enable_qp_init_to_rtr error

Repsher, Stephen J stephen.j.repsher at boeing.com
Mon Mar 29 08:31:10 EDT 2010


Hello,

Let me answer your questions first...
(1) OFED 1.4.2
(2) Have seen it on a few different problems with the same application, typically with 48 processes (8 processes x 6 nodes). One user even told me he experienced it (non-repeatably) with 6 nodes, but not with 5 or 7.
(3) The application is a NASA-developed CFD code called OVERFLOW, which I assume you've heard of (?) and know is free to government contractors.
(4) Have not tested with other compilers.

By on-demand, you mean XRC, right?  So far, we have only seen the issue on our newer cluster nodes, where I enable VIADEV_USE_XRC=1 because the HCAs support it.  I also set VIADEV_CLUSTER_SIZE=AUTO for all cases.

As a debug measure, I've disabled XRC and we'll see if anyone still has the issue.
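
A minimal sketch, assuming job output is captured as plain text, of how logs could be scanned for the error signature quoted further down in this thread, to check whether the failure recurs with XRC disabled. The function names are hypothetical and the pattern is inferred only from the two messages quoted in the original report; other MVAPICH messages may be formatted differently.

```python
import re
from collections import Counter

# Pattern inferred from messages such as
#   [Rank 33][cm.c: line 398]Failed to modify QP to RTR
# \s* after the colon tolerates a line wrap inside "[cm.c: line N]".
CM_ERROR_RE = re.compile(
    r"\[Rank (\d+)\]\[([^:\]]+):\s*line (\d+)\]([^\[\n]*)"
)


def scan_cm_errors(text):
    """Return (rank, source_file, line_number, message) for each match."""
    return [
        (int(rank), src, int(line), msg.strip())
        for rank, src, line, msg in CM_ERROR_RE.findall(text)
    ]


def errors_by_rank(text):
    """Count how many matching messages each rank reported."""
    return Counter(rank for rank, _, _, _ in scan_cm_errors(text))
```

Running errors_by_rank over the output of each job would show whether the same ranks (and hence, depending on process placement, the same nodes) keep reporting the failure.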

Hope that helps.

Steve
 

-----Original Message-----
From: sayantan.sur at gmail.com [mailto:sayantan.sur at gmail.com] On Behalf Of Sayantan Sur
Sent: Friday, March 26, 2010 5:43 PM
To: Repsher, Stephen J
Cc: mvapich-discuss at cse.ohio-state.edu
Subject: Re: [mvapich-discuss] cm_enable_qp_init_to_rtr error

Hi Steve,

Thanks for your report. This may be related to the on-demand connection manager in MVAPICH. It could also be some weird IB stack issue where connection creation fails sometimes. It will be hard to say which way without more details about the system and the workload you are running:

1) Which version of OFED are you running? What is your platform?
2) At how many processes do you see this failure?
3) Can the application code be tried out by others to see if they reproduce this error?
4) Do you see this failure with any other compilers, or is it specific to icc?

Hopefully, with this information we will better understand your problem.

Thanks.

On Fri, Mar 26, 2010 at 11:55 AM, Repsher, Stephen J <stephen.j.repsher at boeing.com> wrote:
> Hello,
>
> I'm experiencing some random hanging behavior with my application compiled with MVAPICH 1.1 and the Intel 11.1 compiler.  Most of the time there are no errors and the code hangs, but once in a while I get an error like this...
>
> [Rank 33][cm.c: line 398]Failed to modify QP to RTR
> [Rank 33][cm.c: line 582]cm_enable_qp_init_to_rtr failed
>
> Anyone have an idea what this might be related to?
>
> Thanks for your help.
>
> ============================================
> Steve Repsher
> Boeing Defense, Space, & Security - Rotorcraft Aerodynamics/CFD
> Phone: (610) 591-1510
> Fax: (610) 591-6263
> stephen.j.repsher at boeing.com
>
>
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>
>



-- 
Sayantan Sur

Research Scientist
Department of Computer Science
The Ohio State University.


