[mvapich-discuss] cpmd job failure

Sangamesh B forum.san at gmail.com
Sat Jan 31 03:03:06 EST 2009


Hello mvapich2 team,

     The CPMD (www.cpmd.org) application was built with the Intel
compilers and mvapich2 version 1.2p1 on an InfiniBand cluster running
Rocks 4.3 Linux.

A 40-process job runs for some time and then fails with the following output:

 LINE SEARCH : LAMBDA=.164E-01 PREDICTED ENERGY = -1890.824133217
  57  9.731E-05   7.571E-06   -1890.824133   -8.483E-07     47.38
 LINE SEARCH : LAMBDA=.166E-01 PREDICTED ENERGY = -1890.824133946
  58  9.831E-05   7.265E-06   -1890.824134   -7.234E-07     47.41
 LINE SEARCH : LAMBDA=.178E-01 PREDICTED ENERGY = -1890.824134657
  59  9.529E-05   6.389E-06   -1890.824135   -6.945E-07     47.36
rank 17 in job 1  node-0-5.local_32810   caused collective abort of all ranks
  exit status of rank 17: killed by signal 9
rank 1 in job 1  node-0-5.local_32810   caused collective abort of all ranks
  exit status of rank 1: killed by signal 9

Several identical jobs have all failed around the same point (though not
at exactly the same step).

What could be causing this, and how can it be fixed?

Thanks,
Sangamesh

