[mvapich-discuss] MVAPICH on large clusters - timeouts - any advice?
Jonathan Follows
jonathan_follows at uk.ibm.com
Thu Feb 22 13:30:51 EST 2007
Hello,
I'm running on a relatively large cluster (160 nodes, dual-core
dual-socket) with IB connecting all nodes.
I recompiled MVAPICH 0.9.8 because I wanted to run under IBM's batch
scheduler, LoadLeveler, and that worked fine.
The IB implementation uses Voltaire PCIe adapters, and I compiled
MVAPICH using the "make.mvapich.gen2" script with appropriate
modifications (to use the PathScale compilers, for example).
With anything like a "reasonable" number of nodes (sometimes even 16, but
>=64 for sure) I'm getting failures:
[chpcc022:14] Got completion with error, code=12, dest rank=78 at line 397 in file viacheck.c
I have now recompiled MVAPICH with -DON_DEMAND and, at run-time, set
VIADEV_CM_TIMEOUT=5000000.
[REQUEST: the documentation is unclear but the value for this parameter
needs to be specified in microseconds, I believe]
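To make the arithmetic explicit, here is a minimal sketch of how that value is derived. It assumes the parameter really is read in microseconds, which the documentation does not clearly confirm; the helper name cm_timeout_env is hypothetical and is not part of MVAPICH.

```python
# Sketch only: assumes VIADEV_CM_TIMEOUT is interpreted in microseconds,
# which is my belief rather than something the documentation states clearly.
MICROSECONDS_PER_SECOND = 1_000_000


def cm_timeout_env(seconds):
    """Return an environment fragment with the timeout given in microseconds.

    Hypothetical helper for illustration; not part of MVAPICH.
    """
    microseconds = int(round(seconds * MICROSECONDS_PER_SECOND))
    return {"VIADEV_CM_TIMEOUT": str(microseconds)}


# A 5 second timeout corresponds to the value used in the run above.
env = cm_timeout_env(5)
print(env["VIADEV_CM_TIMEOUT"])  # prints 5000000
```

Under that assumption, 5000000 corresponds to a 5 second timeout, and the resulting variable would be exported into the job environment before launch.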
Now my job is running, but it's probably running very badly; in due course
I plan to reduce this timeout to something smaller (but still greater than
the default).
For now I'm just looking for any comments, ideas, experiences or advice.
Gratefully received of course,
Thanks,
Jonathan Follows
Deep Computing, Consulting I/T Specialist
IBM UK, Manchester [Internal 487099]
POST: c/o IBM UK Limited, NHBR-1PH, Portsmouth PO6 3AU
Tel: (+44) 1619057099 FAX: (+44) 870 1385642
Mobile: (+44) 7764660714 MOBX 273842
E-mail: Jonathan_Follows at uk.ibm.com
Text messaging: http://www.jonathanfollows.com/pageme.html
Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number
741598.
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU