[mvapich-discuss] need advice for the program blocking problem

Jonathan Perkins perkinjo at cse.ohio-state.edu
Thu Aug 26 17:00:44 EDT 2010


If the problem is related to the mvapich2 library, it may be a good
idea to try the latest release, mvapich2-1.5, to see if it has been
fixed.  This will also provide you with many other
performance/usability enhancements.

You can download our latest version from this link:
http://mvapich.cse.ohio-state.edu/download/mvapich2/

On Thu, Aug 26, 2010 at 4:43 PM, Yunfang Sun <ysun4 at umassd.edu> wrote:
> Hi, all
>
>        I use MVAPICH2 1.0.2 on a cluster, and the FVCOM program compiles
> without problems.
>
>        When I run the program on 64 processors, there is no problem in the
> early time steps. After about 20000 time steps, the program stops moving
> forward, but the processes keep running without crashing. I found that
> the program blocks in the calls to 'mpi_send' and 'mpi_recv'.
>
>        The computed results before the program blocks are also correct.
>
>        Then I changed the blocking send and receive ('mpi_send', 'mpi_recv')
> to the nonblocking send and receive ('mpi_isend', 'mpi_irecv'). At the
> same time step (21194), the program stops, and the error output is as
> follows:
>
> [0][ch3_rndvtransfer.c:110] Unknown protocol 0 type from rndv req to send
> Internal Error: invalid error code ffffffff (Ring Index out of range) in MPIDI_CH3_iStartRndvTransfer:156
> Fatal error in MPI_Irecv:
> Other MPI error, error stack:
> MPI_Irecv(144)...................: MPI_Irecv(buf=0x1da707c0, count=58138, MPI_REAL, src=6, tag=30222, MPI_COMM_WORLD, request=0x7fffe90c25ec) failed
> MPID_Irecv(124)..................:
> MPIDI_CH3_RndvSend(500)..........: failure occurred while attempting to send CTS packet
> MPIDI_CH3_iStartRndvTransfer(156): failure occurred while attempting to send CTS packet
>
> forrtl: error (78): process killed (SIGTERM)
> Image              PC                Routine            Line        Source
> libmpich.so        00002AE32CF9534D  Unknown               Unknown  Unknown
> libmpich.so        00002AE32CF90C3D  Unknown               Unknown  Unknown
> libmpich.so        00002AE32D0786F8  Unknown               Unknown  Unknown
> libmpich.so        00002AE32D0783B8  Unknown               Unknown  Unknown
> fvcom              0000000000463268  Unknown               Unknown  Unknown
> fvcom              0000000000462268  Unknown               Unknown  Unknown
> fvcom              0000000000666F18  Unknown               Unknown  Unknown
> fvcom              000000000071070D  Unknown               Unknown  Unknown
> fvcom              00000000006E7617  Unknown               Unknown  Unknown
> fvcom              0000000000404BCC  Unknown               Unknown  Unknown
> libc.so.6          00000037D7A1D974  Unknown               Unknown  Unknown
> fvcom              0000000000404AD9  Unknown               Unknown  Unknown
>
>        The computed results are also correct before the crash.
>        Any advice on how to solve this problem?
>
> Thanks very much!
>
> Yunfang Sun



-- 
Jonathan Perkins


