[mvapich-discuss] need advice for the program blocking problem

Yunfang Sun ysun4 at umassd.edu
Tue Aug 31 14:58:11 EDT 2010


Hi Dr. Panda,



After installing MVAPICH2 1.5 and rerunning the same case as before, the
problem now shows different behavior:



When I use the code with nonblocking send and receive ('mpi_isend',
'mpi_irecv'), the program keeps running without any error, but it is
about 50% slower than before.



When I use the code with blocking send and receive ('mpi_send',
'mpi_recv'), the program stops at time step 13660, and the processors
stop running at that point. The error output is as follows:

forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
libifcore.so.5     00002B0534F042F1  Unknown               Unknown  Unknown
libpthread.so.0    0000003E39E0E4C0  Unknown               Unknown  Unknown
libifcore.so.5     00002B0534F042CA  Unknown               Unknown  Unknown
libpthread.so.0    0000003E39E0E4C0  Unknown               Unknown  Unknown

Stack trace terminated abnormally.

forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source
fvcom              00000000008E02E4  Unknown               Unknown  Unknown
fvcom              00000000008D0BD8  Unknown               Unknown  Unknown
fvcom              00000000008D08EE  Unknown               Unknown  Unknown
fvcom              00000000009124FF  Unknown               Unknown  Unknown
fvcom              00000000008C529B  Unknown               Unknown  Unknown
fvcom              0000000000479ADF  Unknown               Unknown  Unknown
fvcom              000000000072D02F  Unknown               Unknown  Unknown
fvcom              0000000000711439  Unknown               Unknown  Unknown
fvcom              00000000006E9145  Unknown               Unknown  Unknown
fvcom              0000000000406F3C  Unknown               Unknown  Unknown
libc.so.6          0000003E0081D974  Unknown               Unknown  Unknown
fvcom              0000000000406E49  Unknown               Unknown  Unknown

forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source
fvcom              00000000008D5F63  Unknown               Unknown  Unknown
fvcom              0000000000C34F10  Unknown               Unknown  Unknown

forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source
fvcom              00000000008E02E4  Unknown               Unknown  Unknown
fvcom              00000000008D0BD8  Unknown               Unknown  Unknown
fvcom              00000000008D08EE  Unknown               Unknown  Unknown
fvcom              00000000009124FF  Unknown               Unknown  Unknown
fvcom              00000000008C529B  Unknown               Unknown  Unknown
fvcom              00000000004821CA  Unknown               Unknown  Unknown
fvcom              0000000000481448  Unknown               Unknown  Unknown
fvcom              000000000072D16A  Unknown               Unknown  Unknown
fvcom              0000000000711439  Unknown               Unknown  Unknown
fvcom              00000000006E9145  Unknown               Unknown  Unknown
fvcom              0000000000406F3C  Unknown               Unknown  Unknown
libc.so.6          00000036A1E1D974  Unknown               Unknown  Unknown
fvcom              0000000000406E49  Unknown               Unknown  Unknown




Can you give me any advice on this behavior?

Thanks very much.

Yunfang Sun



> Hi,
>
> Thanks for your report. You are using a very old version of MVAPICH2
> (released during Feb '08). Please upgrade your installation to the latest
> 1.5 release. There have been many enhancements and bug fixes during the
> last two and a half years.
>
> Let us know if you see this issue with the latest 1.5 release (the branch
> version has a few more fixes after the release) and we will be happy to
> take a look at this issue further.
>
> Thanks,
>
> DK
>
> On Thu, 26 Aug 2010, Yunfang Sun wrote:
>
>> Hi, all
>>
>> 	I use MVAPICH2 1.0.2 on a cluster, and the program compiles without
>> any problem.
>>
>> 	When I run the program on 64 processors, there is no problem in the
>> early time steps. After about 20000 time steps, the program stops moving
>> forward, but the processors keep running and nothing crashes. I found
>> that the program blocks in the calls to 'mpi_send' and 'mpi_recv'.
>>
>> 	Also, the computed results before the blocking point are correct.
>>
>> 	Then I changed the blocking send and receive ('mpi_send', 'mpi_recv')
>> to the nonblocking send and receive ('mpi_isend', 'mpi_irecv'). The
>> program stops at the same time step (21194), and the error output is as
>> follows:
>>
>> [0][ch3_rndvtransfer.c:110] Unknown protocol 0 type from rndv req to send
>>
>> Internal Error: invalid error code ffffffff (Ring Index out of range) in MPIDI_CH3_iStartRndvTransfer:156
>>
>> Fatal error in MPI_Irecv:
>> Other MPI error, error stack:
>> MPI_Irecv(144)...................: MPI_Irecv(buf=0x1da707c0, count=58138, MPI_REAL, src=6, tag=30222, MPI_COMM_WORLD, request=0x7fffe90c25ec) failed
>> MPID_Irecv(124)..................:
>> MPIDI_CH3_RndvSend(500)..........: failure occurred while attempting to send CTS packet
>> MPIDI_CH3_iStartRndvTransfer(156): failure occurred while attempting to send CTS packet
>>
>> forrtl: error (78): process killed (SIGTERM)
>> Image              PC                Routine            Line        Source
>> libmpich.so        00002AE32CF9534D  Unknown               Unknown  Unknown
>> libmpich.so        00002AE32CF90C3D  Unknown               Unknown  Unknown
>> libmpich.so        00002AE32D0786F8  Unknown               Unknown  Unknown
>> libmpich.so        00002AE32D0783B8  Unknown               Unknown  Unknown
>> fvcom              0000000000463268  Unknown               Unknown  Unknown
>> fvcom              0000000000462268  Unknown               Unknown  Unknown
>> fvcom              0000000000666F18  Unknown               Unknown  Unknown
>> fvcom              000000000071070D  Unknown               Unknown  Unknown
>> fvcom              00000000006E7617  Unknown               Unknown  Unknown
>> fvcom              0000000000404BCC  Unknown               Unknown  Unknown
>> libc.so.6          00000037D7A1D974  Unknown               Unknown  Unknown
>> fvcom              0000000000404AD9  Unknown               Unknown  Unknown
>>
>>
>>
>> 	The computed results are also correct before the crash.
>> 	Any advice on how to solve this problem?
>>
>> Thanks very much!
>>
>> Yunfang Sun
>>
>>
>
>
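
As a rough size check on the failing receive in the quoted error stack
(count=58138, MPI_REAL), the sketch below computes the message payload.
It assumes MPI_REAL is 4 bytes, which is typical but is not confirmed
anywhere in this thread; the helper name message_bytes is made up for
illustration.

```python
# Values copied from the quoted MPI_Irecv error stack.
COUNT = 58138        # count= argument of the failing MPI_Irecv
BYTES_PER_REAL = 4   # ASSUMPTION: MPI_REAL is 4 bytes (not stated in the thread)


def message_bytes(count, bytes_per_element):
    """Return the payload size in bytes of a message with `count` elements."""
    return count * bytes_per_element


size = message_bytes(COUNT, BYTES_PER_REAL)
print(f"{size} bytes ({size / 1024:.1f} KiB)")
```

Under that assumption the message is 232552 bytes, about 227 KiB. A
message of this size is consistent with the rndv (rendezvous) and CTS
references in the error stack, although the actual eager/rendezvous
threshold depends on the MVAPICH2 build and settings and is not given
here.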



