[mvapich-discuss] mvapich2 runtime failure

Dhabaleswar Panda panda at cse.ohio-state.edu
Mon Oct 12 11:19:25 EDT 2009


Can you try your siesta application with the latest version from the trunk
available from the following URL:

http://mvapich.cse.ohio-state.edu/nightly/mvapich2/trunk/

Several fixes have gone into this version after the RC2 release. If the
problem persists with the latest trunk version, we will take a look at it
in detail.

DK

On Mon, 12 Oct 2009, Sangamesh B wrote:

> Hi,
>
>   MVAPICH2 (1.2p1 and 1.4rc1) is installed with the Intel 10.1 compilers on a
> Rocks 5.1 HPC Linux cluster.
>
> The siesta-2.0.2 (Fortran) application is compiled with MKL library support.
>
> The job fails after running for 20-30 minutes.
>
> $ cat err.362.mvapi2_24h_12
> Warning! Rndv Receiver is receiving (36864 < 46080) less than as expected
> Fatal error in MPI_Bcast:
> Message truncated, error stack:
> MPI_Bcast(1145)...................: MPI_Bcast(buf=0x3c14e90, count=1,
> dtype=USER<vector>, root=0, comm=0xc4000005) failed
> MPIR_Bcast(229)...................:
> MPIDI_CH3U_Receive_data_found(254): Message from rank 0 and tag 2 truncated;
> 46080 bytes received but buffer size is 36864
> Fatal error in MPI_Bcast:
> Message truncated, error stack:
> MPI_Bcast(1145)........................: MPI_Bcast(buf=0x866c130, count=1,
> dtype=USER<vector>, root=0, comm=0xc4000006) failed
> MPIR_Bcast(229)........................:
> MPIDI_CH3U_Post_data_receive_found(439): Message from rank 0 and tag 2
> truncated; 46080 bytes received but buffer size is 36864
> rm: cannot remove `/tmp/362.1.all.q/rsh': No such file or directory
>
>
> The siesta output file ends with the following error:
>
> siesta:   27    -8036.3459    -8035.3935    -8035.4038  0.0751 -3.9174
> siesta:   28    -8036.3396    -8035.4433    -8035.4554  0.0707 -3.9601
> siesta:   29    -8036.3531    -8035.5953    -8035.6096  0.0709 -3.9417
> rank 9 in job 1  compute-0-12.local_50891   caused collective abort of all
> ranks
>   exit status of rank 9: killed by signal 9
>
>
> The HCA card is Mellanox:
>
> # ibstat
> CA 'mthca0'
>         CA type: MT25204
>         Number of ports: 1
>         Firmware version: 1.2.0
>         Hardware version: a0
>         Node GUID: 0x0002c9020028de58
>         System image GUID: 0x0002c9020028de5b
>         Port 1:
>                 State: Active
>                 Physical state: LinkUp
>                 Rate: 20
>                 Base lid: 1
>                 LMC: 0
>                 SM lid: 1
>                 Capability mask: 0x02510a6a
>                 Port GUID: 0x0002c9020028de59
>
> We've used OFED-1.4.
>
> The same job fails even with mvapich2-1.4rc1, at the same point.
>
> Why does this error occur, and how can we resolve it? Is there a problem
> with the IB setup?
>
> The ib pingpong tests work fine on all the nodes, so the OFED drivers are
> unlikely to be the problem.
>
> Please help us to resolve the error.
>
> Thanks in advance
>
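
The error stack quoted above reports that a message from rank 0 carried
46080 bytes while the receiving buffer held 36864 bytes. MPI_Bcast expects
every rank to describe the same amount of data, so the two sizes are the
first thing to compare. The following is a minimal sketch in Python that
pulls those sizes out of such a log and reduces them to a ratio. The names
find_truncations and size_ratio are hypothetical helpers written for this
illustration; they are not part of MVAPICH2 or siesta.

```python
import re
from math import gcd

# Matches the truncation report in the error stack quoted above.
TRUNC_RE = re.compile(
    r"Message from rank (\d+) and tag (\d+)\s+truncated;\s+"
    r"(\d+) bytes received but buffer size is (\d+)"
)

# Two reports copied from the quoted log, including the mail quote markers.
SAMPLE_LOG = """\
> MPIDI_CH3U_Receive_data_found(254): Message from rank 0 and tag 2 truncated;
> 46080 bytes received but buffer size is 36864
> MPIDI_CH3U_Post_data_receive_found(439): Message from rank 0 and tag 2
> truncated; 46080 bytes received but buffer size is 36864
"""


def find_truncations(log_text):
    """Return (rank, tag, received_bytes, buffer_bytes) for each report."""
    # Drop quote markers and join wrapped lines so that a report split
    # across two lines still matches the pattern.
    flat = " ".join(line.lstrip("> ").strip() for line in log_text.splitlines())
    return [tuple(int(g) for g in m.groups()) for m in TRUNC_RE.finditer(flat)]


def size_ratio(received, buffer_size):
    """Reduce the two sizes to (received_units, buffer_units, unit_bytes)."""
    unit = gcd(received, buffer_size)
    return received // unit, buffer_size // unit, unit


if __name__ == "__main__":
    for rank, tag, received, buffer_size in find_truncations(SAMPLE_LOG):
        print(rank, tag, size_ratio(received, buffer_size))
```

For the sizes in this log the ratio is 5:4 in units of 9216 bytes. That
figure alone does not distinguish between ranks building the USER<vector>
datatype from different dimensions and a problem inside the library, so it
is only a starting point for the comparison.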
