[mvapich-discuss] cpmd job failure

Dhabaleswar Panda panda at cse.ohio-state.edu
Sat Jan 31 09:38:08 EST 2009


Thanks for reporting this. Are you running MVAPICH2 1.2p1 in the
`default' mode, or with any environment variables set? Can you also
provide details of your platform (processor, number of cores per node,
amount of memory per core, IB HCA speed, etc.)?

Thanks,

DK

On Sat, 31 Jan 2009, Sangamesh B wrote:

> Hello mvapich2 team,
>
>      The CPMD (www.cpmd.org) application is installed with Intel
> compilers on a Rocks 4.3 Linux-based cluster with InfiniBand,
> using MVAPICH2 version 1.2p1.
>
> The 40-process job runs for some time and then fails with the following output:
>
>  LINE SEARCH : LAMBDA=.164E-01 PREDICTED ENERGY = -1890.824133217
>   57  9.731E-05   7.571E-06   -1890.824133   -8.483E-07     47.38
>  LINE SEARCH : LAMBDA=.166E-01 PREDICTED ENERGY = -1890.824133946
>   58  9.831E-05   7.265E-06   -1890.824134   -7.234E-07     47.41
>  LINE SEARCH : LAMBDA=.178E-01 PREDICTED ENERGY = -1890.824134657
>   59  9.529E-05   6.389E-06   -1890.824135   -6.945E-07     47.36
> rank 17 in job 1  node-0-5.local_32810   caused collective abort of all ranks
>   exit status of rank 17: killed by signal 9
> rank 1 in job 1  node-0-5.local_32810   caused collective abort of all ranks
>   exit status of rank 1: killed by signal 9
>
> Across several identical jobs, it fails at around the same point (but
> not at exactly the same step).
>
> What could be the solution for this?
>
> Thanks,
> Sangamesh
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>


