[mvapich-discuss] cpmd job failure
Sangamesh B
forum.san at gmail.com
Sat Jan 31 03:03:06 EST 2009
Hello mvapich2 team,
The CPMD (www.cpmd.org) application is built with the Intel
compilers on a Rocks 4.3 Linux-based cluster with InfiniBand support,
using mvapich2 version 1.2p1.
The 40-process job runs for some time and then fails with the following output:
LINE SEARCH : LAMBDA=.164E-01 PREDICTED ENERGY = -1890.824133217
57 9.731E-05 7.571E-06 -1890.824133 -8.483E-07 47.38
LINE SEARCH : LAMBDA=.166E-01 PREDICTED ENERGY = -1890.824133946
58 9.831E-05 7.265E-06 -1890.824134 -7.234E-07 47.41
LINE SEARCH : LAMBDA=.178E-01 PREDICTED ENERGY = -1890.824134657
59 9.529E-05 6.389E-06 -1890.824135 -6.945E-07 47.36
rank 17 in job 1 node-0-5.local_32810 caused collective abort of all ranks
exit status of rank 17: killed by signal 9
rank 1 in job 1 node-0-5.local_32810 caused collective abort of all ranks
exit status of rank 1: killed by signal 9
Several identical jobs all fail around the same point (but not at
exactly the same step).
What could be causing this, and how can it be fixed?
Thanks,
Sangamesh