[mvapich-discuss] need advice for the program blocking problem
Yunfang Sun
ysun4 at umassd.edu
Thu Aug 26 16:43:05 EDT 2010
Hi all,
I am using MVAPICH2 1.0.2 on a cluster, and the FVCOM program compiles
without problems.
When I run the program on 64 processors, there is no problem during the
early time steps. After about 20000 time steps, the program stops making
progress, but the processes keep running without crashing. I found that
the hang occurs in the 'mpi_send' and 'mpi_recv' calls.
The computed results before the hang are correct.
I then changed the blocking send and receive ('mpi_send', 'mpi_recv') to
the nonblocking versions ('mpi_isend', 'mpi_irecv'). The program now stops
at the same time step (21194), and the error output is as follows:
[0][ch3_rndvtransfer.c:110] Unknown protocol 0 type from rndv req to send
Internal Error: invalid error code ffffffff (Ring Index out of range) in
MPIDI_CH3_iStartRndvTransfer:156
Fatal error in MPI_Irecv:
Other MPI error, error stack:
MPI_Irecv(144)...................: MPI_Irecv(buf=0x1da707c0, count=58138,
MPI_REAL, src=6, tag=30222, MPI_COMM_WORLD, request=0x7fffe90c25ec) failed
MPID_Irecv(124)..................:
MPIDI_CH3_RndvSend(500)..........: failure occurred while attempting to
send CTS packet
MPIDI_CH3_iStartRndvTransfer(156): failure occurred while attempting to
send CTS packet
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
libmpich.so 00002AE32CF9534D Unknown Unknown Unknown
libmpich.so 00002AE32CF90C3D Unknown Unknown Unknown
libmpich.so 00002AE32D0786F8 Unknown Unknown Unknown
libmpich.so 00002AE32D0783B8 Unknown Unknown Unknown
fvcom 0000000000463268 Unknown Unknown Unknown
fvcom 0000000000462268 Unknown Unknown Unknown
fvcom 0000000000666F18 Unknown Unknown Unknown
fvcom 000000000071070D Unknown Unknown Unknown
fvcom 00000000006E7617 Unknown Unknown Unknown
fvcom 0000000000404BCC Unknown Unknown Unknown
libc.so.6 00000037D7A1D974 Unknown Unknown Unknown
fvcom 0000000000404AD9 Unknown Unknown Unknown
The computed results are again correct up to the crash.
Any advice on how to solve this problem?
Thanks very much!
Yunfang Sun