[mvapich-discuss] need advice for the program blocking problem

Yunfang Sun ysun4 at umassd.edu
Thu Aug 26 17:15:43 EDT 2010


Thanks very much!

I will try mvapich2-1.5 and report back with the result.

Yunfang

> If the problem is related to the mvapich2 library, it may be a good
> idea to try the latest release, mvapich2-1.5, to see if it has been
> fixed.  This will also provide you with many other
> performance/usability enhancements.
>
> You can download our latest version by following this link...
> http://mvapich.cse.ohio-state.edu/download/mvapich2/
>
> On Thu, Aug 26, 2010 at 4:43 PM, Yunfang Sun <ysun4 at umassd.edu> wrote:
>> Hi, all
>>
>>        I use MVAPICH2 1.0.2 on a cluster, and the FVCOM program
>> compiles without problems.
>>
>>        When I run the program on 64 processors, there is no problem
>> in the early time steps. After about 20000 time steps, the program
>> stops advancing, but the processes keep running without crashing. I
>> found that the hang occurs in the calls to 'mpi_send' and 'mpi_recv'.
>>
>>        Also, the computed results before the hang are correct.
>>
>>        Then I changed the blocking send and receive ('mpi_send',
>> 'mpi_recv') to the nonblocking send and receive ('mpi_isend',
>> 'mpi_irecv'). At the same time step (21194), the program stops,
>> and the error output is as follows:
>>
>> [0][ch3_rndvtransfer.c:110] Unknown protocol 0 type from rndv req to send
>> Internal Error: invalid error code ffffffff (Ring Index out of range) in MPIDI_CH3_iStartRndvTransfer:156
>> Fatal error in MPI_Irecv:
>> Other MPI error, error stack:
>> MPI_Irecv(144)...................: MPI_Irecv(buf=0x1da707c0, count=58138, MPI_REAL, src=6, tag=30222, MPI_COMM_WORLD, request=0x7fffe90c25ec) failed
>> MPID_Irecv(124)..................:
>> MPIDI_CH3_RndvSend(500)..........: failure occurred while attempting to send CTS packet
>> MPIDI_CH3_iStartRndvTransfer(156): failure occurred while attempting to send CTS packet
>>
>> forrtl: error (78): process killed (SIGTERM)
>> Image              PC                Routine            Line        Source
>> libmpich.so        00002AE32CF9534D  Unknown               Unknown  Unknown
>> libmpich.so        00002AE32CF90C3D  Unknown               Unknown  Unknown
>> libmpich.so        00002AE32D0786F8  Unknown               Unknown  Unknown
>> libmpich.so        00002AE32D0783B8  Unknown               Unknown  Unknown
>> fvcom              0000000000463268  Unknown               Unknown  Unknown
>> fvcom              0000000000462268  Unknown               Unknown  Unknown
>> fvcom              0000000000666F18  Unknown               Unknown  Unknown
>> fvcom              000000000071070D  Unknown               Unknown  Unknown
>> fvcom              00000000006E7617  Unknown               Unknown  Unknown
>> fvcom              0000000000404BCC  Unknown               Unknown  Unknown
>> libc.so.6          00000037D7A1D974  Unknown               Unknown  Unknown
>> fvcom              0000000000404AD9  Unknown               Unknown  Unknown
>>
>>        The computed results are also correct before the crash.
>>        Any advice on how to solve this problem?
>>
>> Thanks very much!
>>
>> Yunfang Sun
>>
>> _______________________________________________
>> mvapich-discuss mailing list
>> mvapich-discuss at cse.ohio-state.edu
>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>
>>
>
>
>
> --
> Jonathan Perkins
>
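
When comparing the failing receive against the matching send on rank 6, it may help to pull the arguments out of the error stack and work out the message size. The sketch below is only an illustration: the function name describe_receive and the table of element sizes are hypothetical, and it assumes MPI_REAL is 4 bytes (the default Fortran REAL; a compiler option that promotes REAL would change this). The "rndv" and "CTS packet" messages in the output suggest the transfer was using the rendezvous path, which is consistent with a message of this size, but the script does not check any threshold.

```python
import re

# The failing call as reported in the quoted error stack (line wraps joined).
ERROR_LINE = (
    "MPI_Irecv(144)...................: MPI_Irecv(buf=0x1da707c0, "
    "count=58138, MPI_REAL, src=6, tag=30222, MPI_COMM_WORLD, "
    "request=0x7fffe90c25ec) failed"
)

# Assumed element sizes in bytes. These are assumptions, not values read
# from the MPI library: MPI_REAL is 4 only for default Fortran REAL.
ASSUMED_TYPE_SIZES = {
    "MPI_REAL": 4,
    "MPI_INTEGER": 4,
    "MPI_DOUBLE_PRECISION": 8,
}

# Matches the "count=..., <datatype>, src=..., tag=..." part of the line.
PATTERN = re.compile(
    r"count=(?P<count>\d+),\s*(?P<dtype>MPI_\w+),\s*"
    r"src=(?P<src>-?\d+),\s*tag=(?P<tag>-?\d+)"
)


def describe_receive(line):
    """Extract count, datatype, source rank and tag, and estimate the size."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError("no MPI receive arguments found in line")
    count = int(match.group("count"))
    dtype = match.group("dtype")
    if dtype not in ASSUMED_TYPE_SIZES:
        raise ValueError("no assumed element size for " + dtype)
    return {
        "count": count,
        "dtype": dtype,
        "src": int(match.group("src")),
        "tag": int(match.group("tag")),
        "bytes": count * ASSUMED_TYPE_SIZES[dtype],
    }


if __name__ == "__main__":
    info = describe_receive(ERROR_LINE)
    for key in ("src", "tag", "count", "dtype", "bytes"):
        print(key, "=", info[key])
```

Under the 4-byte assumption, the receive above is 58138 * 4 = 232552 bytes from rank 6 with tag 30222. The same count, datatype, and tag should appear in the send posted by rank 6 at that time step.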



