[mvapich-discuss] Multi-rail mvapich issue with more than 10 systems

Matthew Koop koop at cse.ohio-state.edu
Sun Apr 13 23:47:26 EDT 2008


Tatek,

If you run 192 cores on 12 nodes and skip over cn35 and cn36, do you still
see any issue? Also, is this a new cluster? Code 12 errors can occur when
there are loose cables or bad internal switch links.
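
One way to set up that test is to filter the two nodes out of the existing
hostfile rather than editing it by hand. This is only a sketch: it assumes
the hostfile has the hostname as the first field on each line, and the
helper name exclude_hosts and the output file hostfile.skip are made up for
illustration.

```shell
#!/bin/sh
# Hypothetical helper: print a hostfile with every line for the named
# hosts removed. Assumes the hostname is the first field on each line.
exclude_hosts() {
    src=$1
    shift
    awk -v skip="$*" '
        # Build a lookup table of the hosts to drop.
        BEGIN { n = split(skip, s, " "); for (i = 1; i <= n; i++) drop[s[i]] = 1 }
        # Keep every line whose first field is not in the table.
        !($1 in drop) { print }
    ' "$src"
}
```

For example, "exclude_hosts hostfile cn35 cn36 > hostfile.skip" followed by
"mpirun_rsh -np 192 -hostfile hostfile.skip ./xhpl". The filtered file still
has to list 12 other nodes for the 192-core run to be a fair comparison.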

Also, I'd encourage you to use MVAPICH2 since it has additional multirail
features that are not present in MVAPICH.

Thanks,

Matt

On Sun, 13 Apr 2008, tek mobster wrote:

>
> Hello,
>  I have a 64-node cluster that I am trying to run Linpack on.  Each
> node has 16 cores and 4 HCAs.  After building multirail mvapich, I can
> successfully run Linpack on 10 nodes (160 cores).  However, when I try to
> run any more than that, I get the following errors.  Note that nodes
> cn35 and cn36 are my 11th and 12th nodes in this case.  I used
> mpirun_rsh -np 192 -hostfile hostfile ./xhpl as the command. If I
> delete, say, 2 other nodes and use cn35 and cn36 as two of my 10 nodes
> and run with np 160, it completes just fine.  I did set ulimit -l to
> be unlimited and each node has MaxStartups set to 32 in
> /etc/ssh/sshd_config.  Any help would be greatly appreciated.
>
>
> [176] Abort: [cn35:176] Got completion with error code 12 at line 1277 in file viacheck.c
> [175] Abort: [cn35:175] Got completion with error code 12 at line 1277 in file viacheck.c
> [172] Abort: [cn35:172] Got completion with error code 12 at line 1277 in file viacheck.c
> [173] Abort: [cn35:173] Got completion with error code 12 at line 1277 in file viacheck.c
> [170] Abort: [cn35:170] Got completion with error code 12 at line 1277 in file viacheck.c
> [171] Abort: [cn35:171] Got completion with error code 12 at line 1277 in file viacheck.c
> [174] Abort: [cn35:174] Got completion with error code 12 at line 1277 in file viacheck.c
> [178] Abort: [cn36:178] Got completion with error code 12 at line 1277 in file viacheck.c
> [181] Abort: [cn36:181] Got completion with error code 12 at line 1277 in file viacheck.c
> [180] Abort: [cn36:180] Got completion with error code 12 at line 1277 in file viacheck.c
> [189] Abort: [cn36:189] Got completion with error code 12 at line 1277 in file viacheck.c
> [183] Abort: [cn36:183] Got completion with error code 12 at line 1277 in file viacheck.c
> [187] Abort: [191] Abort: [cn36:191] Got completion with error code 12 at line 1277 in file viacheck.c
> [179] Abort: [182] Abort: [184] Abort: [cn36:184] Got completion with error code 12 at line 1277 in file viacheck.c
> [186] Abort: [cn36:186] Got completion with error code 12 at line 1277 in file viacheck.c
> [185] Abort: [cn36:185] Got completion with error code 12 at line 1277 in file viacheck.c
> [188] Abort: [cn36:188] Got completion with error code 12 at line 1277 in file viacheck.c
> [177] Abort: [cn36:177] Got completion with error code 12 at line 1277 in file viacheck.c
> [190] Abort: [cn36:190] Got completion with error code 12 at line 1277 in file viacheck.c
> [cn36:187] Got completion with error code 12 at line 1277 in file viacheck.c
> [cn36:182] Got completion with error code 12 at line 1277 in file viacheck.c
> [cn36:179] Got completion with error code 12 at line 1277 in file viacheck.c
> Timeout alarm signaled
> Cleaning up all processes ...done.
>
> Thanks
> Tatek



