[mvapich-discuss] mvapich2 runtime failure

Sangamesh B forum.san at gmail.com
Mon Oct 12 10:29:01 EDT 2009


Hi,

  MVAPICH2 (1.2p1 and 1.4rc1) is installed with the Intel 10.1 compilers on a
Rocks 5.1 HPC Linux cluster.

The siesta-2.0.2 (Fortran) application is compiled with MKL library support.

The job fails after running for 20-30 minutes.

$ cat err.362.mvapi2_24h_12
Warning! Rndv Receiver is receiving (36864 < 46080) less than as expected
Fatal error in MPI_Bcast:
Message truncated, error stack:
MPI_Bcast(1145)...................: MPI_Bcast(buf=0x3c14e90, count=1,
dtype=USER<vector>, root=0, comm=0xc4000005) failed
MPIR_Bcast(229)...................:
MPIDI_CH3U_Receive_data_found(254): Message from rank 0 and tag 2 truncated;
46080 bytes received but buffer size is 36864
Fatal error in MPI_Bcast:
Message truncated, error stack:
MPI_Bcast(1145)........................: MPI_Bcast(buf=0x866c130, count=1,
dtype=USER<vector>, root=0, comm=0xc4000006) failed
MPIR_Bcast(229)........................:
MPIDI_CH3U_Post_data_receive_found(439): Message from rank 0 and tag 2
truncated; 46080 bytes received but buffer size is 36864
rm: cannot remove `/tmp/362.1.all.q/rsh': No such file or directory
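
To make the size mismatch in the error stack explicit, here is a minimal sketch in Python that extracts the byte counts from the two truncation messages (rejoined where the mail wrapped them). The 8-byte element size is an assumption (double precision); the log does not state the element type, and the helper name `truncation_sizes` is made up for illustration.

```python
import re

# Two messages copied from the error stack, with the mail's line wraps rejoined.
LOG = """\
MPIDI_CH3U_Receive_data_found(254): Message from rank 0 and tag 2 truncated; 46080 bytes received but buffer size is 36864
MPIDI_CH3U_Post_data_receive_found(439): Message from rank 0 and tag 2 truncated; 46080 bytes received but buffer size is 36864
"""

PATTERN = re.compile(r"(\d+) bytes received but buffer size is (\d+)")


def truncation_sizes(log_text):
    """Return (sent_bytes, buffer_bytes) pairs found in the error text."""
    return [(int(sent), int(buf)) for sent, buf in PATTERN.findall(log_text)]


# Assumption: elements are 8-byte double precision values (not stated in the log).
ELEMENT_BYTES = 8

for sent, buf in truncation_sizes(LOG):
    print(
        f"root sent {sent} bytes ({sent // ELEMENT_BYTES} elements), "
        f"receiver buffer holds {buf} bytes ({buf // ELEMENT_BYTES} elements), "
        f"difference {sent - buf} bytes"
    )
```

Both failing ranks report the same pair of sizes. In MPI, a "Message truncated" error in MPI_Bcast means the count and datatype passed on the receiving rank describe fewer bytes than the root actually sent.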


The siesta output file ends with the following error:

siesta:   27    -8036.3459    -8035.3935    -8035.4038  0.0751 -3.9174
siesta:   28    -8036.3396    -8035.4433    -8035.4554  0.0707 -3.9601
siesta:   29    -8036.3531    -8035.5953    -8035.6096  0.0709 -3.9417
rank 9 in job 1  compute-0-12.local_50891   caused collective abort of all
ranks
  exit status of rank 9: killed by signal 9


The HCA is a Mellanox card:

# ibstat
CA 'mthca0'
        CA type: MT25204
        Number of ports: 1
        Firmware version: 1.2.0
        Hardware version: a0
        Node GUID: 0x0002c9020028de58
        System image GUID: 0x0002c9020028de5b
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 20
                Base lid: 1
                LMC: 0
                SM lid: 1
                Capability mask: 0x02510a6a
                Port GUID: 0x0002c9020028de59

We are using OFED-1.4.

The same job also fails with mvapich2-1.4rc1, at the same point.

What causes this error, and how can it be resolved? Is there a problem with
the IB setup?

The IB pingpong tests work fine on all the nodes, so the OFED drivers are
probably not the problem.

Please help us to resolve the error.

Thanks in advance