[mvapich-discuss] MVAPICH on large clusters - timeouts - any advice?

Jonathan Follows jonathan_follows at uk.ibm.com
Thu Feb 22 13:30:51 EST 2007


Hello,
I'm running on a relatively large cluster (160 nodes, dual-core 
dual-socket) with IB connecting all nodes.
I recompiled MVAPICH 0.9.8 because I wanted to run under IBM's batch 
scheduler, LoadLeveler, and that worked fine.
The IB hardware is Voltaire PCIe adapters, and I compiled MVAPICH using 
the "make.mvapich.gen2" script with appropriate modifications (for 
example, to use the PathScale compilers).
At any "reasonable" number of nodes (sometimes as few as 16, and always 
at 64 or more) jobs fail with:
[chpcc022:14] Got completion with error, code=12, dest rank=78 at line 397 in file viacheck.c
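
For anyone decoding the failure message above, here is a minimal sketch 
in Python. It assumes the code printed by viacheck.c is the raw libibverbs 
ibv_wc_status value, which I have not confirmed; under that assumption, 
code 12 is IBV_WC_RETRY_EXC_ERR (transport retry counter exceeded). The 
helper name parse_completion_error, the regular expression, and the 
reading of "[chpcc022:14]" as host plus process id are illustrative 
assumptions, not anything taken from MVAPICH itself.

```python
import re

# Subset of enum ibv_wc_status from <infiniband/verbs.h>.
# Assumption: the "code=" field in the message is this enum's value.
IBV_WC_STATUS = {
    0: "IBV_WC_SUCCESS",
    5: "IBV_WC_WR_FLUSH_ERR",
    12: "IBV_WC_RETRY_EXC_ERR",
    13: "IBV_WC_RNR_RETRY_EXC_ERR",
}

# Hypothetical pattern for the one message format quoted in this post.
_PATTERN = re.compile(
    r"\[(?P<host>[^:\]]+):(?P<proc>\d+)\]\s+Got completion with error,\s+"
    r"code=(?P<code>\d+),\s+dest rank=(?P<dest>\d+)"
)


def parse_completion_error(line):
    """Return the fields of a completion-error line, or None if no match."""
    m = _PATTERN.search(line)
    if m is None:
        return None
    code = int(m.group("code"))
    return {
        "host": m.group("host"),
        "proc": int(m.group("proc")),  # assumed to identify the local process
        "dest_rank": int(m.group("dest")),
        "code": code,
        "status": IBV_WC_STATUS.get(code, "unknown"),
    }
```

Run over a job's stderr, this makes it easy to count failures per 
destination rank and see whether a few nodes account for most of them.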

I have now recompiled MVAPICH with -DON_DEMAND and, at run time, set 
VIADEV_CM_TIMEOUT=5000000.
[REQUEST: the documentation is unclear, but I believe the value of this 
parameter must be specified in microseconds.]
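
To make the unit explicit when trying other values, a small sketch in 
Python follows. It assumes VIADEV_CM_TIMEOUT really is in microseconds, 
which is only my belief and is not confirmed by the documentation; on 
that assumption 5000000 corresponds to 5 seconds. The helper name 
cm_timeout_env is hypothetical, and how the environment reaches the MPI 
processes (launcher or batch scheduler) is not shown.

```python
import os

US_PER_SECOND = 1_000_000  # assumption: the parameter is in microseconds


def cm_timeout_env(seconds, base_env=None):
    """Return a copy of the environment with VIADEV_CM_TIMEOUT set.

    The value is written as an integer number of microseconds, which is
    an assumed unit rather than a documented one.
    """
    if seconds <= 0:
        raise ValueError("timeout must be positive")
    env = dict(os.environ if base_env is None else base_env)
    env["VIADEV_CM_TIMEOUT"] = str(int(round(seconds * US_PER_SECOND)))
    return env
```

For example, cm_timeout_env(5) sets the same value used above, and 
smaller values such as cm_timeout_env(0.5) can then be tried without 
counting zeros by hand.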
My job now runs, but probably performs very badly; in due course I plan 
to reduce this timeout to something smaller (but still greater than the 
default).
For now, I'm just looking for any comments, ideas, experiences, or advice.
Gratefully received of course,
Thanks,
Jonathan Follows
Deep Computing, Consulting I/T Specialist
IBM UK, Manchester [Internal 487099]
POST: c/o IBM UK Limited, NHBR-1PH, Portsmouth PO6 3AU
Tel: (+44) 1619057099 FAX: (+44) 870 1385642
Mobile: (+44) 7764660714 MOBX 273842
E-mail: Jonathan_Follows at uk.ibm.com
Text messaging: http://www.jonathanfollows.com/pageme.html










