[mvapich-discuss] need advise for the program blocking problem
Jonathan Perkins
perkinjo at cse.ohio-state.edu
Thu Aug 26 17:00:44 EDT 2010
If the problem is related to the MVAPICH2 library, it may be a good
idea to try the latest release, mvapich2-1.5, to see whether it has been
fixed. The new release also provides many other
performance/usability enhancements.
You can download our latest version by following this link:
http://mvapich.cse.ohio-state.edu/download/mvapich2/
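A side note on the error stack quoted below: the "rndv" and "CTS packet"
messages suggest that the failing receive was on the rendezvous path, which
MPI libraries generally use for messages too large to send eagerly. As a
rough sketch of how to gauge the message size from the log, the Python
snippet below parses the failing MPI_Irecv line and multiplies the element
count by an assumed element size. The helper name message_bytes and the size
table are illustrative only and are not part of MVAPICH2; the 4-byte size for
MPI_REAL is an assumption that holds for default Fortran REAL on common
platforms.

```python
import re

# Assumed element sizes in bytes. These are NOT read from the log; MPI_REAL
# is taken to be 4 bytes (default Fortran REAL on common platforms).
ASSUMED_TYPE_SIZES = {
    "MPI_REAL": 4,
    "MPI_INTEGER": 4,
    "MPI_DOUBLE_PRECISION": 8,
}


def message_bytes(error_line):
    """Parse 'count=N, MPI_TYPE' from an error-stack line.

    Returns (count, datatype, size_in_bytes). Hypothetical helper for
    illustration, not an MVAPICH2 API.
    """
    match = re.search(r"count=(\d+),\s*(MPI_\w+)", error_line)
    if match is None:
        raise ValueError("no count/datatype pair found in line")
    count = int(match.group(1))
    datatype = match.group(2)
    return count, datatype, count * ASSUMED_TYPE_SIZES[datatype]


# The failing call, copied from the error stack in the original report.
line = (
    "MPI_Irecv(144)...................: MPI_Irecv(buf=0x1da707c0, count=58138, "
    "MPI_REAL, src=6, tag=30222, MPI_COMM_WORLD, request=0x7fffe90c25ec) failed"
)

count, datatype, nbytes = message_bytes(line)
print(count, datatype, nbytes)
```

Under that assumption the failing receive is 58138 x 4 = 232552 bytes
(about 227 KiB), which is consistent with the rendezvous messages in the log.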
On Thu, Aug 26, 2010 at 4:43 PM, Yunfang Sun <ysun4 at umassd.edu> wrote:
> Hi all,
>
> I use MVAPICH2 1.0.2 on a cluster, and the FVCOM program compiles
> without problems.
>
> When I run the program on 64 processors, the early time steps complete
> without problems. After about 20000 time steps, the program stops making
> progress, but the processes keep running without crashing. I found
> that they are blocked in the 'mpi_send' and 'mpi_recv' calls.
>
> The results computed before the hang are correct.
>
> I then changed the blocking send and receive ('mpi_send', 'mpi_recv') to
> the nonblocking send and receive ('mpi_isend', 'mpi_irecv'). The program
> stops at the same time step (21194),
> and the error output is as follows:
>
> [0][ch3_rndvtransfer.c:110] Unknown protocol 0 type from rndv req to send
>
> Internal Error: invalid error code ffffffff (Ring Index out of range) in
> MPIDI_CH3_iStartRndvTransfer:156
>
> Fatal error in MPI_Irecv:
>
> Other MPI error, error stack:
>
> MPI_Irecv(144)...................: MPI_Irecv(buf=0x1da707c0, count=58138,
> MPI_REAL, src=6, tag=30222, MPI_COMM_WORLD, request=0x7fffe90c25ec) failed
>
> MPID_Irecv(124)..................:
>
> MPIDI_CH3_RndvSend(500)..........: failure occurred while attempting to
> send CTS packet
>
> MPIDI_CH3_iStartRndvTransfer(156): failure occurred while attempting to
> send CTS packet
>
> forrtl: error (78): process killed (SIGTERM)
>
> Image        PC                Routine  Line     Source
> libmpich.so  00002AE32CF9534D  Unknown  Unknown  Unknown
> libmpich.so  00002AE32CF90C3D  Unknown  Unknown  Unknown
> libmpich.so  00002AE32D0786F8  Unknown  Unknown  Unknown
> libmpich.so  00002AE32D0783B8  Unknown  Unknown  Unknown
> fvcom        0000000000463268  Unknown  Unknown  Unknown
> fvcom        0000000000462268  Unknown  Unknown  Unknown
> fvcom        0000000000666F18  Unknown  Unknown  Unknown
> fvcom        000000000071070D  Unknown  Unknown  Unknown
> fvcom        00000000006E7617  Unknown  Unknown  Unknown
> fvcom        0000000000404BCC  Unknown  Unknown  Unknown
> libc.so.6    00000037D7A1D974  Unknown  Unknown  Unknown
> fvcom        0000000000404AD9  Unknown  Unknown  Unknown
>
> The computed results are also correct up to the crash.
> Any advice on how to solve this problem?
>
> Thanks very much!
>
> Yunfang Sun
>
--
Jonathan Perkins