[mvapich-discuss] need advice for the program blocking problem
Yunfang Sun
ysun4 at umassd.edu
Thu Aug 26 17:15:43 EDT 2010
Thanks very much!
I will try mvapich2-1.5 and report back with the result.
Yunfang
> If the problem is related to the mvapich2 library, it may be a good
> idea to try the latest release, mvapich2-1.5, to see if it has been
> fixed. This will also provide you with many other
> performance/usability enhancements.
>
> You can download our latest version by following this link...
> http://mvapich.cse.ohio-state.edu/download/mvapich2/
>
> On Thu, Aug 26, 2010 at 4:43 PM, Yunfang Sun <ysun4 at umassd.edu> wrote:
>> Hi, all
>>
>> I use MVAPICH2 1.0.2 on a cluster, and the FVCOM program compiles
>> without any problem.
>>
>> When I run the program on 64 processors, there is no problem in the
>> early time steps. After about 20000 time steps, the program stops moving
>> forward, but the processes are still running and nothing crashes. I found
>> that the blocking point is in the calls to 'mpi_send' and 'mpi_recv'.
>>
>> The computed results before the hang are correct.
>>
>> I then changed the blocking send and receive ('mpi_send', 'mpi_recv')
>> into the nonblocking send and receive ('mpi_isend', 'mpi_irecv'). At the
>> same time step (21194), the program stops, and the error output is as
>> follows:
>>
>> [0][ch3_rndvtransfer.c:110] Unknown protocol 0 type from rndv req to send
>>
>> Internal Error: invalid error code ffffffff (Ring Index out of range) in MPIDI_CH3_iStartRndvTransfer:156
>>
>> Fatal error in MPI_Irecv:
>> Other MPI error, error stack:
>> MPI_Irecv(144)...................: MPI_Irecv(buf=0x1da707c0, count=58138, MPI_REAL, src=6, tag=30222, MPI_COMM_WORLD, request=0x7fffe90c25ec) failed
>> MPID_Irecv(124)..................:
>> MPIDI_CH3_RndvSend(500)..........: failure occurred while attempting to send CTS packet
>> MPIDI_CH3_iStartRndvTransfer(156): failure occurred while attempting to send CTS packet
>>
>> forrtl: error (78): process killed (SIGTERM)
>>
>> Image        PC                Routine  Line     Source
>> libmpich.so  00002AE32CF9534D  Unknown  Unknown  Unknown
>> libmpich.so  00002AE32CF90C3D  Unknown  Unknown  Unknown
>> libmpich.so  00002AE32D0786F8  Unknown  Unknown  Unknown
>> libmpich.so  00002AE32D0783B8  Unknown  Unknown  Unknown
>> fvcom        0000000000463268  Unknown  Unknown  Unknown
>> fvcom        0000000000462268  Unknown  Unknown  Unknown
>> fvcom        0000000000666F18  Unknown  Unknown  Unknown
>> fvcom        000000000071070D  Unknown  Unknown  Unknown
>> fvcom        00000000006E7617  Unknown  Unknown  Unknown
>> fvcom        0000000000404BCC  Unknown  Unknown  Unknown
>> libc.so.6    00000037D7A1D974  Unknown  Unknown  Unknown
>> fvcom        0000000000404AD9  Unknown  Unknown  Unknown
>>
>> The computed results are also correct up to the crash.
>> Any advice on how to solve this problem?
>>
>> Thanks very much!
>>
>> Yunfang Sun
>
> --
> Jonathan Perkins
>
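As a rough aid for reading the error stack quoted above, the MPI_Irecv line can be picked apart to see how large the failing message was. This is only a sketch: the helper names (parse_irecv, message_bytes) are made up for illustration, and the 4-byte size for MPI_REAL is an assumption (default Fortran REAL, no compiler flag promoting reals to 8 bytes).

```python
import re

# Error-stack line copied from the report, rejoined onto one line.
ERROR_LINE = (
    "MPI_Irecv(144)...................: MPI_Irecv(buf=0x1da707c0, "
    "count=58138, MPI_REAL, src=6, tag=30222, MPI_COMM_WORLD, "
    "request=0x7fffe90c25ec) failed"
)

# Assumption: MPI_REAL is a 4-byte Fortran REAL in this build.
ASSUMED_BYTES_PER_ELEMENT = {"MPI_REAL": 4}

# Matches the "count=..., <datatype>, src=..., tag=..." part of the line.
IRECV_ARGS = re.compile(
    r"count=(?P<count>\d+),\s*(?P<dtype>MPI_\w+),\s*"
    r"src=(?P<src>-?\d+),\s*tag=(?P<tag>-?\d+)"
)


def parse_irecv(line):
    """Extract count, datatype, source rank, and tag from the error line."""
    match = IRECV_ARGS.search(line)
    if match is None:
        raise ValueError("line does not look like an MPI_Irecv error entry")
    return {
        "count": int(match["count"]),
        "dtype": match["dtype"],
        "src": int(match["src"]),
        "tag": int(match["tag"]),
    }


def message_bytes(info):
    """Message size in bytes under the assumed element size."""
    return info["count"] * ASSUMED_BYTES_PER_ELEMENT[info["dtype"]]


info = parse_irecv(ERROR_LINE)
size = message_bytes(info)
print(info["src"], info["tag"], size)
```

Under that assumption the receive from rank 6 is a bit over 230 kB, which is consistent with the rendezvous (CTS) path named in the stack rather than a small eager message. It does not by itself say whether the fault is in the application or in the library.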