[mvapich-discuss] mvapich2 runtime failure

Sangamesh B forum.san at gmail.com
Tue Oct 13 02:14:18 EDT 2009


Hi,

   The latest nightly links (10/10 and 09/10) did not work, but the 08/10
trunk downloaded fine.
The trunk did not include a configure script, and running autoconf did not
work either.

I copied the configure script and the other required header files from
mvapich2-1.2p1, but the build then failed with the following error:

/opt/intel/cce/10.1.018/bin/icc -DHAVE_CONFIG_H -I.
-I/root/Packages/softwares_for_newcluster/mvapich2-trunk-2009-10-08/src/mpid/common/locks
-I/root/Packages/softwares_for_newcluster/mvapich2-trunk-2009-10-08/src/mpid/common/locks
-I../../../include
-I/root/Packages/softwares_for_newcluster/mvapich2-trunk-2009-10-08/src/include
-O3 -xT -DNDEBUG
-I/root/Packages/softwares_for_newcluster/mvapich2-trunk-2009-10-08/src/mpid/ch3/include
-I/root/Packages/softwares_for_newcluster/mvapich2-trunk-2009-10-08/src/mpid/ch3/include
-I/root/Packages/softwares_for_newcluster/mvapich2-trunk-2009-10-08/src/mpid/common/datatype
-I/root/Packages/softwares_for_newcluster/mvapich2-trunk-2009-10-08/src/mpid/common/datatype
-I/root/Packages/softwares_for_newcluster/mvapich2-trunk-2009-10-08/src/mpid/common/locks
-I/root/Packages/softwares_for_newcluster/mvapich2-trunk-2009-10-08/src/mpid/common/locks
-I/root/Packages/softwares_for_newcluster/mvapich2-trunk-2009-10-08/src/mpid/ch3/channels/mrail/include
-I/root/Packages/softwares_for_newcluster/mvapich2-trunk-2009-10-08/src/mpid/ch3/channels/mrail/include
-I/root/Packages/softwares_for_newcluster/mvapich2-trunk-2009-10-08/src/mpid/ch3/channels/mrail/src/gen2
-I/root/Packages/softwares_for_newcluster/mvapich2-trunk-2009-10-08/src/mpid/ch3/channels/mrail/src/gen2
-I/root/Packages/softwares_for_newcluster/mvapich2-trunk-2009-10-08/src/mpid/common/locks
-I/root/Packages/softwares_for_newcluster/mvapich2-trunk-2009-10-08/src/mpid/common/locks
-c mpidu_process_locks.c
/root/Packages/softwares_for_newcluster/mvapich2-trunk-2009-10-08/src/mpid/ch3/include/mpidpre.h(34):
error: identifier "MPIR_Pint" is undefined
  typedef MPIR_Pint MPIDI_msg_sz_t;
          ^

compilation aborted for mpidu_process_locks.c (code 2)
make[4]: *** [mpidu_process_locks.o] Error 2
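
To check whether the headers I copied from mvapich2-1.2p1 are the reason
MPIR_Pint is missing, one option is to search both source trees for a
definition of that identifier. Below is a minimal Python sketch of such a
search. The function name find_definitions and the choice of file extensions
are my own, and the assumption that MPIR_Pint should come from a #define or
typedef in some header is a guess, not something confirmed by the build log.

```python
import os
import re


def find_definitions(root, identifier, exts=(".h", ".h.in")):
    """Walk `root` and return (path, line_number, line) for every header line
    that looks like a #define or typedef introducing `identifier`.

    Lines that merely use the identifier, such as
    `typedef MPIR_Pint MPIDI_msg_sz_t;`, are not reported for MPIR_Pint.
    """
    name = re.escape(identifier)
    pattern = re.compile(
        r"^\s*(?:#\s*define\s+{0}\b|typedef\b.*\b{0}\s*;)".format(name)
    )
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            if not filename.endswith(exts):
                continue
            path = os.path.join(dirpath, filename)
            with open(path, errors="replace") as fh:
                for lineno, line in enumerate(fh, 1):
                    if pattern.search(line):
                        hits.append((path, lineno, line.strip()))
    return hits


# Example call (the path is the trunk directory from the compile log above):
# for hit in find_definitions(
#         "/root/Packages/softwares_for_newcluster/mvapich2-trunk-2009-10-08",
#         "MPIR_Pint"):
#     print(hit)
```

If the search finds no definition in the trunk tree, that would suggest the
definition is expected from a file that the trunk's own configure step
generates, which the 1.2p1 files cannot supply.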

Which version shall I use?

Thanks


On Mon, Oct 12, 2009 at 8:49 PM, Dhabaleswar Panda <panda at cse.ohio-state.edu> wrote:

> Can you try your siesta application with the latest version from the trunk
> available from the following URL:
>
> http://mvapich.cse.ohio-state.edu/nightly/mvapich2/trunk/
>
> Several fixes have gone into this version after the RC2 release. If the
> problem persists with the latest trunk version, we will take a look at it
> in detail.
>
> DK
>
> On Mon, 12 Oct 2009, Sangamesh B wrote:
>
> > Hi,
> >
> >   mvapich2 (1.2p1 and 1.4rc1) is installed with the Intel 10.1 compilers
> > on a Rocks 5.1 HPC Linux cluster.
> >
> > The siesta-2.0.2 (Fortran) application is compiled with MKL library
> > support.
> >
> > The job fails after running for 20-30 minutes.
> >
> > $ cat err.362.mvapi2_24h_12
> > Warning! Rndv Receiver is receiving (36864 < 46080) less than as expected
> > Fatal error in MPI_Bcast:
> > Message truncated, error stack:
> > MPI_Bcast(1145)...................: MPI_Bcast(buf=0x3c14e90, count=1,
> > dtype=USER<vector>, root=0, comm=0xc4000005) failed
> > MPIR_Bcast(229)...................:
> > MPIDI_CH3U_Receive_data_found(254): Message from rank 0 and tag 2
> > truncated; 46080 bytes received but buffer size is 36864
> > Fatal error in MPI_Bcast:
> > Message truncated, error stack:
> > MPI_Bcast(1145)........................: MPI_Bcast(buf=0x866c130, count=1,
> > dtype=USER<vector>, root=0, comm=0xc4000006) failed
> > MPIR_Bcast(229)........................:
> > MPIDI_CH3U_Post_data_receive_found(439): Message from rank 0 and tag 2
> > truncated; 46080 bytes received but buffer size is 36864
> > rm: cannot remove `/tmp/362.1.all.q/rsh': No such file or directory
> >
> >
> > The siesta output file ends with the following error:
> >
> > siesta:   27    -8036.3459    -8035.3935    -8035.4038  0.0751 -3.9174
> > siesta:   28    -8036.3396    -8035.4433    -8035.4554  0.0707 -3.9601
> > siesta:   29    -8036.3531    -8035.5953    -8035.6096  0.0709 -3.9417
> > rank 9 in job 1  compute-0-12.local_50891   caused collective abort of all
> > ranks
> >   exit status of rank 9: killed by signal 9
> >
> >
> > The HCA is a Mellanox card:
> >
> > # ibstat
> > CA 'mthca0'
> >         CA type: MT25204
> >         Number of ports: 1
> >         Firmware version: 1.2.0
> >         Hardware version: a0
> >         Node GUID: 0x0002c9020028de58
> >         System image GUID: 0x0002c9020028de5b
> >         Port 1:
> >                 State: Active
> >                 Physical state: LinkUp
> >                 Rate: 20
> >                 Base lid: 1
> >                 LMC: 0
> >                 SM lid: 1
> >                 Capability mask: 0x02510a6a
> >                 Port GUID: 0x0002c9020028de59
> >
> > We've used OFED-1.4.
> >
> > The same job fails even with mvapich2-1.4rc1, at the same point.
> >
> > Why does this error occur, and how can it be resolved? Is there a problem
> > with the IB setup?
> >
> > The IB pingpong tests work fine on all the nodes, so the OFED drivers are
> > unlikely to be the problem.
> >
> > Please help us to resolve the error.
> >
> > Thanks in advance
> >
>
>
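
Each truncation message in the quoted error log reports two sizes: the bytes
that arrived and the size of the buffer posted by the receiving rank. A small
Python sketch like the one below can pull those numbers out of a saved error
file, even when the mail client has wrapped the lines. The function names and
the dictionary keys are my own; the regular expression is written only against
the message wording shown in this thread and may not match other MPI versions.

```python
import re

# Matches the truncation message wording quoted in this thread.
TRUNC_RE = re.compile(
    r"Message from rank (\d+) and tag (\d+)\s+truncated;\s+"
    r"(\d+) bytes received but buffer size is (\d+)"
)


def strip_quote_prefix(text):
    """Drop leading '> ' mail-quote markers so wrapped lines rejoin."""
    return "\n".join(re.sub(r"^(?:>\s?)+", "", line) for line in text.splitlines())


def find_truncations(log_text):
    """Return one dict per truncation message found in `log_text`."""
    results = []
    for match in TRUNC_RE.finditer(strip_quote_prefix(log_text)):
        sender, tag, received, buffer_size = map(int, match.groups())
        results.append({
            "sender": sender,
            "tag": tag,
            "received": received,
            "buffer": buffer_size,
            "excess": received - buffer_size,  # bytes that did not fit
        })
    return results


# The two messages from the log above, including the mail-wrapped form.
SAMPLE = """\
> > MPIDI_CH3U_Receive_data_found(254): Message from rank 0 and tag 2
> truncated;
> > 46080 bytes received but buffer size is 36864
> > MPIDI_CH3U_Post_data_receive_found(439): Message from rank 0 and tag 2
> > truncated; 46080 bytes received but buffer size is 36864
"""

for entry in find_truncations(SAMPLE):
    print(entry)
```

In both messages the receiver expected 36864 bytes and got 46080 from rank 0,
so the broadcast root and the receiving ranks disagreed on the message size.
The log alone does not show whether that disagreement comes from the
application or from the MPI library.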