[mvapich-discuss] (no subject)

Yunfang Sun ysun4 at umassd.edu
Thu Aug 26 16:23:30 EDT 2010


Hi all,

	I am using MVAPICH2 1.0.2 on a cluster, and my program compiles without
any problem.

	When I run the program on 64 processors, the early time steps complete
without any problem. After about 20000 time steps, the program stops making
progress, but the processes keep running and do not crash. I found that
it blocks in the calls to 'mpi_send' and 'mpi_recv'.

	The computed results before the hang are correct.

	I then replaced the blocking send and receive ('mpi_send', 'mpi_recv')
with the nonblocking versions ('mpi_isend', 'mpi_irecv'). The program now
stops at the same time step (21194), with the following error output:

[0][ch3_rndvtransfer.c:110] Unknown protocol 0 type from rndv req to send
Internal Error: invalid error code ffffffff (Ring Index out of range) in MPIDI_CH3_iStartRndvTransfer:156
Fatal error in MPI_Irecv:
Other MPI error, error stack:
MPI_Irecv(144)...................: MPI_Irecv(buf=0x1da707c0, count=58138, MPI_REAL, src=6, tag=30222, MPI_COMM_WORLD, request=0x7fffe90c25ec) failed
MPID_Irecv(124)..................:
MPIDI_CH3_RndvSend(500)..........: failure occurred while attempting to send CTS packet
MPIDI_CH3_iStartRndvTransfer(156): failure occurred while attempting to send CTS packet
forrtl: error (78): process killed (SIGTERM)

Image              PC                Routine            Line        Source
libmpich.so        00002AE32CF9534D  Unknown               Unknown  Unknown
libmpich.so        00002AE32CF90C3D  Unknown               Unknown  Unknown
libmpich.so        00002AE32D0786F8  Unknown               Unknown  Unknown
libmpich.so        00002AE32D0783B8  Unknown               Unknown  Unknown
fvcom              0000000000463268  Unknown               Unknown  Unknown
fvcom              0000000000462268  Unknown               Unknown  Unknown
fvcom              0000000000666F18  Unknown               Unknown  Unknown
fvcom              000000000071070D  Unknown               Unknown  Unknown
fvcom              00000000006E7617  Unknown               Unknown  Unknown
fvcom              0000000000404BCC  Unknown               Unknown  Unknown
libc.so.6          00000037D7A1D974  Unknown               Unknown  Unknown
fvcom              0000000000404AD9  Unknown               Unknown  Unknown
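
	For reference, here is a rough estimate of the size of the message named
in the MPI_Irecv line of the error stack, written as a small Python sketch.
It assumes that MPI_REAL is the default 4-byte Fortran REAL, which I have
not verified for this build (a compiler option that promotes reals to
8 bytes would double the figure). The names COUNT, BYTES_PER_REAL and
message_bytes are only for illustration. A message of this size is
presumably sent with the rendezvous protocol rather than eagerly, which
would be consistent with the "CTS packet" wording in the error.

```python
# Payload size of the receive that failed in the error stack above.
COUNT = 58138        # count= value copied from the MPI_Irecv error line
BYTES_PER_REAL = 4   # ASSUMPTION: default 4-byte Fortran REAL for MPI_REAL


def message_bytes(count, elem_size=BYTES_PER_REAL):
    """Return the payload size in bytes for `count` elements of `elem_size` bytes."""
    return count * elem_size


size = message_bytes(COUNT)
print(size, "bytes, about", round(size / 1024, 1), "KiB")
```

	Under the 4-byte assumption this gives 232552 bytes, roughly 227 KiB per
message.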



	The computed results are also correct up to the crash.
	Any advice on how to solve this problem?

Thanks very much!

Yunfang Sun
