[mvapich-discuss] mvapich2 runtime failure
Dhabaleswar Panda
panda at cse.ohio-state.edu
Mon Oct 12 11:19:25 EDT 2009
Can you try your siesta application with the latest version from the trunk
available from the following URL:
http://mvapich.cse.ohio-state.edu/nightly/mvapich2/trunk/
Several fixes have gone into this version after the RC2 release. If the
problem persists with the latest trunk version, we will take a look at it
in detail.
DK
On Mon, 12 Oct 2009, Sangamesh B wrote:
> Hi,
>
> mvapich2 (versions 1.2p1 and 1.4rc1) is installed with the Intel 10.1
> compilers on a Rocks 5.1 HPC Linux cluster.
>
> The siesta-2.0.2 (Fortran) application is compiled with MKL library support.
>
> The job fails after running for 20-30 minutes.
>
> $ cat err.362.mvapi2_24h_12
> Warning! Rndv Receiver is receiving (36864 < 46080) less than as expected
> Fatal error in MPI_Bcast:
> Message truncated, error stack:
> MPI_Bcast(1145)...................: MPI_Bcast(buf=0x3c14e90, count=1,
> dtype=USER<vector>, root=0, comm=0xc4000005) failed
> MPIR_Bcast(229)...................:
> MPIDI_CH3U_Receive_data_found(254): Message from rank 0 and tag 2 truncated;
> 46080 bytes received but buffer size is 36864
> Fatal error in MPI_Bcast:
> Message truncated, error stack:
> MPI_Bcast(1145)........................: MPI_Bcast(buf=0x866c130, count=1,
> dtype=USER<vector>, root=0, comm=0xc4000006) failed
> MPIR_Bcast(229)........................:
> MPIDI_CH3U_Post_data_receive_found(439): Message from rank 0 and tag 2
> truncated; 46080 bytes received but buffer size is 36864
> rm: cannot remove `/tmp/362.1.all.q/rsh': No such file or directory
>
>
> The siesta output file ends with the following error:
>
> siesta: 27 -8036.3459 -8035.3935 -8035.4038 0.0751 -3.9174
> siesta: 28 -8036.3396 -8035.4433 -8035.4554 0.0707 -3.9601
> siesta: 29 -8036.3531 -8035.5953 -8035.6096 0.0709 -3.9417
> rank 9 in job 1 compute-0-12.local_50891 caused collective abort of all
> ranks
> exit status of rank 9: killed by signal 9
>
>
> The HCA card is Mellanox:
>
> # ibstat
> CA 'mthca0'
>     CA type: MT25204
>     Number of ports: 1
>     Firmware version: 1.2.0
>     Hardware version: a0
>     Node GUID: 0x0002c9020028de58
>     System image GUID: 0x0002c9020028de5b
>     Port 1:
>         State: Active
>         Physical state: LinkUp
>         Rate: 20
>         Base lid: 1
>         LMC: 0
>         SM lid: 1
>         Capability mask: 0x02510a6a
>         Port GUID: 0x0002c9020028de59
>
> We've used OFED-1.4.
>
> The same job fails even with mvapich2-1.4rc1, at the same point.
>
> Why does this error occur, and how can it be resolved? Is there a problem
> with the IB setup?
>
> The ib pingpong tests work fine for all the nodes, so the OFED drivers do
> not appear to be the problem.
>
> Please help us to resolve the error.
>
> Thanks in advance
>
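
For reference, the error stack quoted above reports two sizes: the bytes that arrived from rank 0 and the size of the buffer the receiving rank had posted. A minimal sketch for pulling those numbers out of an error file is shown here; the helper name `truncation_sizes` and the regular expression are illustrative assumptions, not part of mvapich2 or siesta. The log alone does not show whether the mismatch originates in the application or in the library.

```python
import re

# Matches the size report in the truncation message quoted above.
TRUNCATION_RE = re.compile(r"(\d+) bytes received but buffer size is (\d+)")

def truncation_sizes(log_text):
    """Return (bytes_received, buffer_bytes) pairs found in an error log."""
    # Drop '>' quote markers and collapse whitespace so that messages
    # wrapped across lines by the mail client still match.
    flat = " ".join(log_text.replace(">", " ").split())
    return [(int(recv), int(buf)) for recv, buf in TRUNCATION_RE.findall(flat)]

# Sample text copied from the error stack in the quoted message.
log = """MPIDI_CH3U_Receive_data_found(254): Message from rank 0 and tag 2 truncated;
46080 bytes received but buffer size is 36864"""

for recv, buf in truncation_sizes(log):
    print(f"received {recv} bytes, buffer holds {buf} bytes, excess {recv - buf}")
```

For the message above, the excess is 46080 - 36864 = 9216 bytes.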