[mvapich-discuss] Multi-rail mvapich issue with more than 10 systems

tek mobster tgetachew at hotmail.com
Sun Apr 13 22:30:13 EDT 2008





Hello,

I have a 64-node cluster that I am trying to run Linpack on. Each node has 16 cores and 4 HCAs. After building multi-rail mvapich, I can successfully run Linpack on 10 nodes (160 cores). However, when I try to run on any more than that, I get the errors below. Note that cn35 and cn36 are my 11th and 12th nodes in this case. The command I used was:

    mpirun_rsh -np 192 -hostfile hostfile ./xhpl

If I delete, say, 2 other nodes and use cn35 and cn36 as two of my 10 nodes and run with -np 160, it completes just fine. I did set ulimit -l to unlimited, and each node has MaxStartups set to 32 in /etc/ssh/sshd_config. Any help would be greatly appreciated.

[176] Abort: [cn35:176] Got completion with error code 12 at line 1277 in file viacheck.c
[175] Abort: [cn35:175] Got completion with error code 12 at line 1277 in file viacheck.c
[172] Abort: [cn35:172] Got completion with error code 12 at line 1277 in file viacheck.c
[173] Abort: [cn35:173] Got completion with error code 12 at line 1277 in file viacheck.c
[170] Abort: [cn35:170] Got completion with error code 12 at line 1277 in file viacheck.c
[171] Abort: [cn35:171] Got completion with error code 12 at line 1277 in file viacheck.c
[174] Abort: [cn35:174] Got completion with error code 12 at line 1277 in file viacheck.c
[178] Abort: [cn36:178] Got completion with error code 12 at line 1277 in file viacheck.c
[181] Abort: [cn36:181] Got completion with error code 12 at line 1277 in file viacheck.c
[180] Abort: [cn36:180] Got completion with error code 12 at line 1277 in file viacheck.c
[189] Abort: [cn36:189] Got completion with error code 12 at line 1277 in file viacheck.c
[183] Abort: [cn36:183] Got completion with error code 12 at line 1277 in file viacheck.c
[187] Abort:
[191] Abort: [cn36:191] Got completion with error code 12 at line 1277 in file viacheck.c
[179] Abort:
[182] Abort:
[184] Abort: [cn36:184] Got completion with error code 12 at line 1277 in file viacheck.c
[186] Abort: [cn36:186] Got completion with error code 12 at line 1277 in file viacheck.c
[185] Abort: [cn36:185] Got completion with error code 12 at line 1277 in file viacheck.c
[188] Abort: [cn36:188] Got completion with error code 12 at line 1277 in file viacheck.c
[177] Abort: [cn36:177] Got completion with error code 12 at line 1277 in file viacheck.c
[190] Abort: [cn36:190] Got completion with error code 12 at line 1277 in file viacheck.c
[cn36:187] Got completion with error code 12 at line 1277 in file viacheck.c
[cn36:182] Got completion with error code 12 at line 1277 in file viacheck.c
[cn36:179] Got completion with error code 12 at line 1277 in file viacheck.c
Timeout alarm signaled
Cleaning up all processes ...done.

Thanks,
Tatek
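
Because the abort messages from different ranks are interleaved, it can help to tally them per host before comparing against the hostfile. The following is only a minimal sketch, not part of mvapich: the function name failures_by_host and the sample string are hypothetical, and the regular expression assumes the message wording seen in the output above.

```python
import re
from collections import defaultdict

# Matches the host-tagged part of each message, for example
# "[cn35:176] Got completion with error code 12".
# \s+ between words tolerates line wraps inside a message.
PATTERN = re.compile(
    r"\[(\w+):(\d+)\]\s+Got\s+completion\s+with\s+error\s+code\s+(\d+)"
)

def failures_by_host(log_text):
    """Return {host: sorted list of ranks} for every completion error found."""
    hosts = defaultdict(set)
    for host, rank, _code in PATTERN.findall(log_text):
        hosts[host].add(int(rank))
    return {host: sorted(ranks) for host, ranks in hosts.items()}

# Hypothetical sample built from a few of the messages above,
# including an interleaved pair and a wrapped line.
sample = (
    "[176] Abort: [cn35:176] Got completion with error code 12 at line 1277 in file viacheck.c"
    "[187] Abort: [191] Abort: [cn36:191] Got completion with \n"
    "error code 12 at line 1277 in file viacheck.c"
    "[cn36:187] Got completion with error code 12 at line 1277 in file viacheck.c"
)

if __name__ == "__main__":
    for host, ranks in failures_by_host(sample).items():
        print(host, ranks)
```

Feeding the full error output to failures_by_host gives one list of failing ranks per node, which makes it easier to see whether every rank placed on the added nodes failed or only some of them.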

