[mvapich-discuss] cpmd job failure
Sangamesh B
forum.san at gmail.com
Sun Feb 1 10:22:04 EST 2009
Hello Sir,
On Sat, Jan 31, 2009 at 8:08 PM, Dhabaleswar Panda
<panda at cse.ohio-state.edu> wrote:
> Thanks for reporting this. Are you running MVAPICH2 1.2p1 with the
> `default' mode or with any environment variables? Can you also indicate
> the details on your platform (processor, number of cores/node, amount of
> memory per core, IB HCA speed, etc.).
>
I'm running it in the 'default' mode, without any additional environment
variables. The platform details are:
Processors: two quad-core Intel Xeon per node (8 cores/node)
Memory: 4 GB RAM per node (512 MB/core)
Compilers: Intel compilers 10
The same job runs fine with Open MPI.
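The per-core memory figure above can be checked with a few lines of Python. This is only a minimal sketch: it assumes one MPI process per core and fully packed nodes, which the thread does not state, and the function names are made up for illustration.

```python
# Figures taken from the platform description in this message.
RAM_PER_NODE_MB = 4 * 1024   # 4 GB RAM per node
CORES_PER_NODE = 2 * 4       # two quad-core Xeon processors per node
JOB_PROCESSES = 40           # size of the failing CPMD job


def memory_per_core_mb(ram_mb, cores):
    """Memory available per core if RAM is shared evenly."""
    return ram_mb / cores


def nodes_needed(processes, cores_per_node):
    """Nodes required at one process per core (assumption), rounded up."""
    return -(-processes // cores_per_node)  # ceiling division


mem_per_core = memory_per_core_mb(RAM_PER_NODE_MB, CORES_PER_NODE)
nodes = nodes_needed(JOB_PROCESSES, CORES_PER_NODE)
print(f"{mem_per_core:.0f} MB per core, {nodes} nodes for {JOB_PROCESSES} processes")
```

Under those assumptions the 40-process job fills five nodes, leaving each process at most 512 MB before the operating system and MPI library take their share.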
Thanks,
Sangamesh
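
For reference, the abort lines in the failure output quoted below can be pulled out with a short script. This is a minimal sketch, assuming the launcher always prints the "exit status of rank N: killed by signal M" wording seen in this one log; the helper name `killed_ranks` is hypothetical. Signal 9 is SIGKILL on Linux, which a process cannot catch, so the ranks were terminated from outside; the thread does not establish by what.

```python
import re

# Abort lines copied from the failure output quoted in this thread.
LOG = """\
rank 17 in job 1 node-0-5.local_32810 caused collective abort of all ranks
exit status of rank 17: killed by signal 9
rank 1 in job 1 node-0-5.local_32810 caused collective abort of all ranks
exit status of rank 1: killed by signal 9
"""

# Pattern is an assumption based on the wording in this single log.
KILLED = re.compile(r"exit status of rank (\d+): killed by signal (\d+)")


def killed_ranks(text):
    """Return (rank, signal) pairs for every 'killed by signal' line."""
    return [(int(rank), int(sig)) for rank, sig in KILLED.findall(text)]


for rank, sig in killed_ranks(LOG):
    print(f"rank {rank} was killed by signal {sig}")
```

Running the same script over the output of several failed runs would show whether the same ranks, and hence the same nodes, are killed each time.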
> Thanks,
>
> DK
>
> On Sat, 31 Jan 2009, Sangamesh B wrote:
>
>> Hello mvapich2 team,
>>
>> The CPMD (www.cpmd.org) application is installed on a Rocks 4.3
>> Linux cluster with InfiniBand, using the Intel compilers and
>> MVAPICH2 version 1.2p1.
>>
>> The 40-process job runs for some time and then fails with the following output:
>>
>> LINE SEARCH : LAMBDA=.164E-01 PREDICTED ENERGY = -1890.824133217
>> 57 9.731E-05 7.571E-06 -1890.824133 -8.483E-07 47.38
>> LINE SEARCH : LAMBDA=.166E-01 PREDICTED ENERGY = -1890.824133946
>> 58 9.831E-05 7.265E-06 -1890.824134 -7.234E-07 47.41
>> LINE SEARCH : LAMBDA=.178E-01 PREDICTED ENERGY = -1890.824134657
>> 59 9.529E-05 6.389E-06 -1890.824135 -6.945E-07 47.36
>> rank 17 in job 1 node-0-5.local_32810 caused collective abort of all ranks
>> exit status of rank 17: killed by signal 9
>> rank 1 in job 1 node-0-5.local_32810 caused collective abort of all ranks
>> exit status of rank 1: killed by signal 9
>>
>> Across several identical jobs, it fails around the same point (but not
>> at exactly the same step).
>>
>> What could be the solution for this?
>>
>> Thanks,
>> Sangamesh