[mvapich-discuss] MVAPICH on large clusters - timeouts - any advice?

Qi Gao gaoq at cse.ohio-state.edu
Thu Feb 22 14:10:35 EST 2007


Hi Jonathan,

Thanks for using MVAPICH. We are glad to work with you to solve the problems.

The "Got completion with error, code=12" failure is not related to the VIADEV_CM_TIMEOUT environment variable. You can try increasing VIADEV_DEFAULT_TIME_OUT to 22. The unit of VIADEV_DEFAULT_TIME_OUT is defined by the IB Spec (page 340): the timeout is 4.096 us * 2^(<5-bit timeout value>).
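
To give a feel for what a setting of 22 means in wall-clock time, here is a minimal Python sketch of the formula above. The function name is hypothetical and is not part of MVAPICH. The value 0 is excluded on the assumption that it is treated specially rather than through the formula.

```python
# Sketch of the timeout formula quoted above: 4.096 us * 2^value,
# where value is a 5-bit field. ib_timeout_seconds is a hypothetical
# helper name used only for illustration.

def ib_timeout_seconds(value: int) -> float:
    """Return the timeout in seconds for a 5-bit timeout value."""
    # Assumption: 0 is handled specially, so only 1..31 is accepted here.
    if not 1 <= value <= 31:
        raise ValueError("timeout value must be in the range 1..31")
    return 4.096e-6 * (2 ** value)

if __name__ == "__main__":
    # 2^22 = 4194304, so this prints roughly 17.18 seconds.
    print(f"VIADEV_DEFAULT_TIME_OUT=22 -> {ib_timeout_seconds(22):.2f} s")
```

Each increment of the value doubles the timeout, so small changes to this parameter have a large effect.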

VIADEV_CM_TIMEOUT is used only for connection setup, and its unit is milliseconds (the default value is 500 milliseconds). Thanks for the suggestion; we will update the user guide to make this clearer.
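
Since the unit is milliseconds, the value 5000000 from the original message quoted below corresponds to a much longer timeout than a microsecond reading would suggest. A small Python sketch of the conversion (the helper and constant names are hypothetical, for illustration only):

```python
# Unit conversion for VIADEV_CM_TIMEOUT, which is given in milliseconds.
# The names below are hypothetical and not part of MVAPICH.

CM_TIMEOUT_DEFAULT_MS = 500  # default value stated in this message

def cm_timeout_seconds(value_ms: int) -> float:
    """Convert a VIADEV_CM_TIMEOUT setting (milliseconds) to seconds."""
    return value_ms / 1000.0

if __name__ == "__main__":
    for value in (CM_TIMEOUT_DEFAULT_MS, 5000, 5000000):
        print(f"VIADEV_CM_TIMEOUT={value} -> {cm_timeout_seconds(value)} s")
```

Under this reading, a 5-second connection-setup timeout would be written as 5000, not 5000000.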

Please let us know if you have any questions.

Regards,
--Qi
  ----- Original Message ----- 
  From: Jonathan Follows 
  To: mvapich-discuss at cse.ohio-state.edu 
  Sent: Thursday, February 22, 2007 1:30 PM
  Subject: [mvapich-discuss] MVAPICH on large clusters - timeouts - any advice?



  Hello, 
  I'm running on a relatively large cluster (160 nodes, dual-core dual-socket) with IB connecting all nodes. 

  I recompiled MVAPICH 0.9.8 because I wanted to run under IBM's batch scheduler, LoadLeveler, and that worked fine. 

  The IB implementation is with Voltaire PCIe adapters and I compiled MVAPICH using the "make.mvapich.gen2" script with appropriate modifications. I'm using Pathscale compilers, for example. 

  With anything like a "reasonable" number of nodes (sometimes even 16, but >=64 for sure) I'm getting failures: 

  [chpcc022:14] Got completion with error, code=12, dest rank=78 at line 397 in file viacheck.c 

  I have now recompiled MVAPICH with -DON_DEMAND and, at run-time, VIADEV_CM_TIMEOUT=5000000. 

  [REQUEST: the documentation is unclear but the value for this parameter needs to be specified in microseconds, I believe] 

  Now my job is running, but it's probably running very badly; in due course I plan on changing this timeout value to something less (but greater than the default). 

  Just looking for now for any comments, ideas, experiences, advice? 

  Gratefully received of course, 

  Thanks, 

  Jonathan Follows
  Deep Computing, Consulting I/T Specialist
  IBM UK, Manchester [Internal 487099]
  POST: c/o IBM UK Limited, NHBR-1PH, Portsmouth PO6 3AU
  Tel: (+44) 1619057099 FAX: (+44) 870 1385642
  Mobile: (+44) 7764660714 MOBX 273842
  E-mail: Jonathan_Follows at uk.ibm.com
  Text messaging: http://www.jonathanfollows.com/pageme.html





