[mvapich-discuss] need advice for the program blocking problem
Yunfang Sun
ysun4 at umassd.edu
Tue Aug 31 14:58:11 EDT 2010
Hi Dr. Panda,
After installing MVAPICH2 1.5 and rerunning the same case as before, the
behavior is now different:

With the nonblocking send and receive ('mpi_isend', 'mpi_irecv'), the
program keeps running without any error, but it is about 50% slower than
before.

With the blocking send and receive ('mpi_send', 'mpi_recv'), the program
stopped at time step 13660, and at that point the processes stopped
running (a rough sketch of the nonblocking exchange is included below,
after the traceback). The error output is as follows:
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
libifcore.so.5 00002B0534F042F1 Unknown Unknown Unknown
libpthread.so.0 0000003E39E0E4C0 Unknown Unknown Unknown
libifcore.so.5 00002B0534F042CA Unknown Unknown Unknown
libpthread.so.0 0000003E39E0E4C0 Unknown Unknown Unknown
Stack trace terminated abnormally.
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
fvcom 00000000008E02E4 Unknown Unknown Unknown
fvcom 00000000008D0BD8 Unknown Unknown Unknown
fvcom 00000000008D08EE Unknown Unknown Unknown
fvcom 00000000009124FF Unknown Unknown Unknown
fvcom 00000000008C529B Unknown Unknown Unknown
fvcom 0000000000479ADF Unknown Unknown Unknown
fvcom 000000000072D02F Unknown Unknown Unknown
fvcom 0000000000711439 Unknown Unknown Unknown
fvcom 00000000006E9145 Unknown Unknown Unknown
fvcom 0000000000406F3C Unknown Unknown Unknown
libc.so.6 0000003E0081D974 Unknown Unknown Unknown
fvcom 0000000000406E49 Unknown Unknown Unknown
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
fvcom 00000000008D5F63 Unknown Unknown Unknown
fvcom 0000000000C34F10 Unknown Unknown Unknown
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
fvcom 00000000008E02E4 Unknown Unknown Unknown
fvcom 00000000008D0BD8 Unknown Unknown Unknown
fvcom 00000000008D08EE Unknown Unknown Unknown
fvcom 00000000009124FF Unknown Unknown Unknown
fvcom 00000000008C529B Unknown Unknown Unknown
fvcom 00000000004821CA Unknown Unknown Unknown
fvcom 0000000000481448 Unknown Unknown Unknown
fvcom 000000000072D16A Unknown Unknown Unknown
fvcom 0000000000711439 Unknown Unknown Unknown
fvcom 00000000006E9145 Unknown Unknown Unknown
fvcom 0000000000406F3C Unknown Unknown Unknown
libc.so.6 00000036A1E1D974 Unknown Unknown Unknown
fvcom 0000000000406E49 Unknown Unknown Unknown
Can you give me any advice on this behavior?
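For reference, the kind of exchange I switched to (nonblocking send and
receive plus a wait) looks roughly like the sketch below. This is only a
minimal, self-contained illustration and not code taken from FVCOM: the
array size and message tag are borrowed loosely from the MPI_Irecv error
quoted further down, and the pairwise rank partnering is made up just so
the example runs.

   program exchange_sketch
      use mpi
      implicit none
      integer, parameter :: n = 58138   ! count borrowed from the MPI_Irecv error below
      integer :: ierr, rank, nprocs, partner
      integer :: reqs(2), stats(MPI_STATUS_SIZE, 2)
      real    :: sendbuf(n), recvbuf(n)

      call mpi_init(ierr)
      call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
      call mpi_comm_size(MPI_COMM_WORLD, nprocs, ierr)

      ! hypothetical pairing (0<->1, 2<->3, ...) just so each rank has a partner
      partner = ieor(rank, 1)
      sendbuf = real(rank)

      if (partner < nprocs) then
         ! post the receive first, then the send, then wait on both requests;
         ! unlike two ranks calling mpi_send at each other, this ordering does
         ! not deadlock when the message is too large for the eager protocol
         call mpi_irecv(recvbuf, n, MPI_REAL, partner, 30222, &
                        MPI_COMM_WORLD, reqs(1), ierr)
         call mpi_isend(sendbuf, n, MPI_REAL, partner, 30222, &
                        MPI_COMM_WORLD, reqs(2), ierr)
         call mpi_waitall(2, reqs, stats, ierr)
      end if

      call mpi_finalize(ierr)
   end program exchange_sketch

My (possibly wrong) understanding is that posting the receive and send as
nonblocking calls and waiting on both is what keeps that version from
hanging the way the matched blocking mpi_send/mpi_recv calls do, which is
why I tried it; I would still like to understand the crash and the slowdown.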
Thanks very much.
Yunfang Sun
> Hi,
>
> Thanks for your report. You are using a very old version of MVAPICH2
> (released in Feb '08). Please upgrade your installation to the latest
> 1.5 release. There have been many enhancements and bug fixes over the
> last two and a half years.
>
> Let us know if you see this issue with the latest 1.5 release (the branch
> version has a few more fixes after the release) and we will be happy to
> take a look at this issue further.
>
> Thanks,
>
> DK
>
> On Thu, 26 Aug 2010, Yunfang Sun wrote:
>
>> Hi, all
>>
>> I am using MVAPICH2 1.0.2 on a cluster; the program compiles without
>> any problem.
>>
>> When I run the program on 64 processors, there is no problem in the
>> early time steps. After about 20000 time steps, the program stops moving
>> forward, but the processes keep running without crashing. I found that
>> the blocking point is in the calls to 'mpi_send' and 'mpi_recv'.
>>
>> Also, the computed results before that blocking point are correct.
>>
>> I then changed the blocking send and receive ('mpi_send', 'mpi_recv')
>> to the nonblocking send and receive ('mpi_isend', 'mpi_irecv'). At the
>> same time step (21194), the program stops, and the error output is as
>> follows:
>>
>> [0][ch3_rndvtransfer.c:110] Unknown protocol 0 type from rndv req to send
>>
>> Internal Error: invalid error code ffffffff (Ring Index out of range) in
>> MPIDI_CH3_iStartRndvTransfer:156
>>
>> Fatal error in MPI_Irecv:
>> Other MPI error, error stack:
>> MPI_Irecv(144)...................: MPI_Irecv(buf=0x1da707c0, count=58138,
>>     MPI_REAL, src=6, tag=30222, MPI_COMM_WORLD, request=0x7fffe90c25ec) failed
>> MPID_Irecv(124)..................:
>> MPIDI_CH3_RndvSend(500)..........: failure occurred while attempting to send CTS packet
>> MPIDI_CH3_iStartRndvTransfer(156): failure occurred while attempting to send CTS packet
>>
>> forrtl: error (78): process killed (SIGTERM)
>>
>> Image          PC                Routine  Line     Source
>> libmpich.so    00002AE32CF9534D  Unknown  Unknown  Unknown
>> libmpich.so    00002AE32CF90C3D  Unknown  Unknown  Unknown
>> libmpich.so    00002AE32D0786F8  Unknown  Unknown  Unknown
>> libmpich.so    00002AE32D0783B8  Unknown  Unknown  Unknown
>> fvcom          0000000000463268  Unknown  Unknown  Unknown
>> fvcom          0000000000462268  Unknown  Unknown  Unknown
>> fvcom          0000000000666F18  Unknown  Unknown  Unknown
>> fvcom          000000000071070D  Unknown  Unknown  Unknown
>> fvcom          00000000006E7617  Unknown  Unknown  Unknown
>> fvcom          0000000000404BCC  Unknown  Unknown  Unknown
>> libc.so.6      00000037D7A1D974  Unknown  Unknown  Unknown
>> fvcom          0000000000404AD9  Unknown  Unknown  Unknown
>>
>>
>>
>> Again, the computed results before the crash are correct.
>> Do you have any advice on how to solve this problem?
>>
>> Thanks very much!
>>
>> Yunfang Sun
>>
>>
>> _______________________________________________
>> mvapich-discuss mailing list
>> mvapich-discuss at cse.ohio-state.edu
>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>
>
>