[mvapich-discuss] (no subject)

Dhabaleswar Panda panda at cse.ohio-state.edu
Thu Aug 26 16:43:41 EDT 2010


Hi,

Thanks for your report. You are using a very old version of MVAPICH2
(released in Feb '08). Please upgrade your installation to the latest
1.5 release. There have been many enhancements and bug fixes over the
last two and a half years.

Let us know if you still see this issue with the latest 1.5 release (the
branch version includes a few more fixes made after the release), and we
will be happy to look into it further.

Thanks,

DK

On Thu, 26 Aug 2010, Yunfang Sun wrote:

> Hi, all
>
> 	I use MVAPICH2 1.0.2 on a cluster, and my program compiles without
> problems.
>
> 	When I run the program on 64 processors, there is no problem in the
> early time steps. After about 20000 time steps, the program stops making
> progress, but the processes keep running without crashing. I found that
> the program hangs in the 'mpi_send' and 'mpi_recv' calls.
>
> 	The computed results before the hang are correct.
>
> 	Then I changed the blocking send and receive ('mpi_send', 'mpi_recv') to
> the nonblocking send and receive ('mpi_isend', 'mpi_irecv'). The program
> stops at the same time step (21194), and the error output is as follows:
>
> [0][ch3_rndvtransfer.c:110] Unknown protocol 0 type from rndv req to send
> Internal Error: invalid error code ffffffff (Ring Index out of range) in
> MPIDI_CH3_iStartRndvTransfer:156
> Fatal error in MPI_Irecv:
> Other MPI error, error stack:
> MPI_Irecv(144)...................: MPI_Irecv(buf=0x1da707c0, count=58138,
> MPI_REAL, src=6, tag=30222, MPI_COMM_WORLD, request=0x7fffe90c25ec) failed
> MPID_Irecv(124)..................:
> MPIDI_CH3_RndvSend(500)..........: failure occurred while attempting to
> send CTS packet
> MPIDI_CH3_iStartRndvTransfer(156): failure occurred while attempting to
> send CTS packet
> forrtl: error (78): process killed (SIGTERM)
>
> Image              PC                Routine            Line        Source
> libmpich.so        00002AE32CF9534D  Unknown               Unknown  Unknown
> libmpich.so        00002AE32CF90C3D  Unknown               Unknown  Unknown
> libmpich.so        00002AE32D0786F8  Unknown               Unknown  Unknown
> libmpich.so        00002AE32D0783B8  Unknown               Unknown  Unknown
> fvcom              0000000000463268  Unknown               Unknown  Unknown
> fvcom              0000000000462268  Unknown               Unknown  Unknown
> fvcom              0000000000666F18  Unknown               Unknown  Unknown
> fvcom              000000000071070D  Unknown               Unknown  Unknown
> fvcom              00000000006E7617  Unknown               Unknown  Unknown
> fvcom              0000000000404BCC  Unknown               Unknown  Unknown
> libc.so.6          00000037D7A1D974  Unknown               Unknown  Unknown
> fvcom              0000000000404AD9  Unknown               Unknown  Unknown
>
>
>
> 	The computed results are also correct before the crash.
> 	Any advice on how to solve this problem?
>
> Thanks very much!
>
> Yunfang Sun
>
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>


