[mvapich-discuss] Multi-rail mvapich issue with more than 10 systems

tek mobster tgetachew at hotmail.com
Mon Apr 14 04:10:38 EDT 2008


Hello Matt,
 
Thanks for the quick reply.  Yes, I have tried different sets of 12 nodes and the same error occurs.  It always happens on the last 2 nodes in my hostfile.  If I move cn35 and cn36 two positions up so that the last 2 nodes are cn33 and cn34, the exact same error now occurs on cn33 and cn34.  I have also already verified the IB fabric and was able to run Linpack using Open MPI on all 64 nodes.  Also, using mvapich, I can run 10 nodes at a time without any problem (including the nodes that give the errors).  So I think the issue is something else.  I will try mvapich2 and see what I get, but the issue seems to be a limit of some sort on how many nodes I can run.
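
The reordering test described above can be scripted. What follows is only a minimal sketch, assuming the hostfile is a plain list of node names, one per line; the helper names `with_tail` and `hostfile_text` are made up for illustration and are not part of mvapich. It moves a chosen pair of nodes to the end of the list so each variant can be run with the same `mpirun_rsh` command.

```python
from typing import List


def with_tail(hosts: List[str], tail: List[str]) -> List[str]:
    """Return a new ordering with the nodes in `tail` moved to the end.

    All other nodes keep their original relative order.
    """
    rest = [h for h in hosts if h not in tail]
    return rest + list(tail)


def hostfile_text(hosts: List[str]) -> str:
    """Render one node name per line (assumed hostfile layout)."""
    return "\n".join(hosts) + "\n"


if __name__ == "__main__":
    # Only cn33..cn36 are named in this thread; a real list would have 12 entries.
    nodes = ["cn33", "cn34", "cn35", "cn36"]
    for pair in (["cn35", "cn36"], ["cn33", "cn34"]):
        print("tail =", pair)
        print(hostfile_text(with_tail(nodes, pair)))
```

If the errors follow whichever pair is last, as they did here, that points at position in the hostfile rather than at the nodes themselves.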
 
Thanks
Tatek

> Date: Sun, 13 Apr 2008 23:47:26 -0400
> From: koop at cse.ohio-state.edu
> To: tgetachew at hotmail.com
> CC: mvapich-discuss at cse.ohio-state.edu
> Subject: Re: [mvapich-discuss] Multi-rail mvapich issue with more than 10 systems
>
> Tatek,
>
> If you run 192 cores on 12 nodes and skip over cn35 and cn36, do you still
> see any issue? Also, is this a new cluster? Code 12 errors can occur when
> there are loose cables or bad internal switch links.
>
> Also, I'd encourage you to use MVAPICH2, since it has additional multirail
> features that are not present in MVAPICH.
>
> Thanks,
>
> Matt
>
> On Sun, 13 Apr 2008, tek mobster wrote:
>
> > Hello,
> >
> > I have a 64-node cluster that I am trying to run Linpack on. Each
> > node has 16 cores and 4 HCAs. After building multirail mvapich, I can
> > successfully run Linpack on 10 nodes (160 cores). However, when I try to
> > run any more than that, I get the following errors. Note that nodes
> > cn35 and cn36 are my 11th and 12th nodes in this case. I used
> > mpirun_rsh -np 192 -hostfile hostfile ./xhpl as the command. If I
> > delete, say, 2 other nodes and use cn35 and cn36 as two of my 10 nodes
> > and run with np 160, it completes just fine. I did set ulimit -l to
> > unlimited, and each node has MaxStartups set to 32 in
> > /etc/ssh/sshd_config. Any help would be greatly appreciated.
> >
> > [176] Abort: [cn35:176] Got completion with error code 12 at line 1277 in file viacheck.c
> > [175] Abort: [cn35:175] Got completion with error code 12 at line 1277 in file viacheck.c
> > [172] Abort: [cn35:172] Got completion with error code 12 at line 1277 in file viacheck.c
> > [173] Abort: [cn35:173] Got completion with error code 12 at line 1277 in file viacheck.c
> > [170] Abort: [cn35:170] Got completion with error code 12 at line 1277 in file viacheck.c
> > [171] Abort: [cn35:171] Got completion with error code 12 at line 1277 in file viacheck.c
> > [174] Abort: [cn35:174] Got completion with error code 12 at line 1277 in file viacheck.c
> > [178] Abort: [cn36:178] Got completion with error code 12 at line 1277 in file viacheck.c
> > [181] Abort: [cn36:181] Got completion with error code 12 at line 1277 in file viacheck.c
> > [180] Abort: [cn36:180] Got completion with error code 12 at line 1277 in file viacheck.c
> > [189] Abort: [cn36:189] Got completion with error code 12 at line 1277 in file viacheck.c
> > [183] Abort: [cn36:183] Got completion with error code 12 at line 1277 in file viacheck.c
> > [187] Abort: [191] Abort: [cn36:191] Got completion with error code 12 at line 1277 in file viacheck.c
> > [179] Abort: [182] Abort: [184] Abort: [cn36:184] Got completion with error code 12 at line 1277 in file viacheck.c
> > [186] Abort: [cn36:186] Got completion with error code 12 at line 1277 in file viacheck.c
> > [185] Abort: [cn36:185] Got completion with error code 12 at line 1277 in file viacheck.c
> > [188] Abort: [cn36:188] Got completion with error code 12 at line 1277 in file viacheck.c
> > [177] Abort: [cn36:177] Got completion with error code 12 at line 1277 in file viacheck.c
> > [190] Abort: [cn36:190] Got completion with error code 12 at line 1277 in file viacheck.c
> > [cn36:187] Got completion with error code 12 at line 1277 in file viacheck.c
> > [cn36:182] Got completion with error code 12 at line 1277 in file viacheck.c
> > [cn36:179] Got completion with error code 12 at line 1277 in file viacheck.c
> > Timeout alarm signaled
> > Cleaning up all processes ...done.
> >
> > Thanks
> > Tatek
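
Because the abort messages from different ranks are interleaved, it can be easier to tally them by host than to read them line by line. The sketch below is a rough helper of my own (the name `tally_aborts` is hypothetical, not an mvapich tool). It assumes the message format shown in the output above, `[host:rank] Got completion with error code N`, and groups the failing ranks by node.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Matches e.g. "[cn35:176] Got completion with error code 12".
# The bare "[176] Abort:" prefixes have no colon and are ignored.
ABORT_RE = re.compile(
    r"\[(?P<host>[A-Za-z0-9_.-]+):(?P<rank>\d+)\] "
    r"Got completion with error code (?P<code>\d+)"
)


def tally_aborts(log: str) -> Dict[str, List[int]]:
    """Map each host name to the sorted list of ranks that reported an abort."""
    by_host = defaultdict(set)
    for m in ABORT_RE.finditer(log):
        by_host[m.group("host")].add(int(m.group("rank")))
    return {host: sorted(ranks) for host, ranks in by_host.items()}


if __name__ == "__main__":
    # A few messages copied from the output above, interleaving included.
    sample = (
        "[176] Abort: [cn35:176] Got completion with error code 12 "
        "at line 1277 in file viacheck.c"
        "[187] Abort: [191] Abort: [cn36:191] Got completion with error code 12 "
        "at line 1277 in file viacheck.c"
        "[cn36:187] Got completion with error code 12 "
        "at line 1277 in file viacheck.c"
    )
    for host, ranks in tally_aborts(sample).items():
        print(host, ranks)
```

Since the pattern is searched across the whole text rather than per line, it does not matter whether the messages arrive on separate lines or run together.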

