[mvapich-discuss] need advice for the program blocking problem

Dhabaleswar Panda panda at cse.ohio-state.edu
Tue Aug 31 15:56:05 EDT 2010


Hi Yunfang,

> After installing MVAPICH2 1.5 and running the same case as before, I
> got different results:

Thanks for trying out the 1.5 version and getting back to us.

> When I use the code with nonblocking send and receive ('mpi_isend',
> 'mpi_irecv'), the program keeps running without any error, but it is
> about 50% slower than before.

Good to know that it is running without any error; sorry to hear that it
is running so slowly. Can you provide some details about your cluster
(number of nodes, number of cores per node, InfiniBand HCA type, etc.)?
Are you running a single job, or multiple jobs sharing the CPUs in a
node? If it is a multi-core cluster and you are running multiple jobs
(within the same application) that share the CPUs in a node, you need to
disable affinity (the MV2_ENABLE_AFFINITY flag). Otherwise, multiple
processes may be mapped to the same CPU and performance will degrade.

You can find more details on the MV2_ENABLE_AFFINITY flag in the 1.5
user guide:

http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.5.html#x1-10600011.17
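
For example, with the mpirun_rsh launcher the flag can be passed on the
command line as an environment variable (the host file name below is a
placeholder; adjust it and the process count for your setup):

    mpirun_rsh -np 64 -hostfile hosts MV2_ENABLE_AFFINITY=0 ./fvcom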

Let us know if disabling affinity helps here.

> And when I use the code with blocking send and receive ('mpi_send',
> 'mpi_recv'), the program stops at time step 13660, and at that point
> the processors stop running. The error information is as follows:

As you have indicated, there is a segmentation fault here. Is it
possible for us to know what application you are running, at what
scale, etc.? If you can send us a code snippet that reproduces this
error, it will be very helpful for debugging this further.
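
To give an idea of what helps most, here is a minimal sketch of the
shape of a useful reproducer. The message size is taken from the
MPI_Irecv failure in your earlier log; the program name and the rank
pairing are made up. As a side note, if both partners in an exchange
call mpi_send before mpi_recv, this ordering can block by itself once
messages exceed the eager threshold, which is one classic cause of
hangs with blocking send/receive:

    program exchange
      implicit none
      include 'mpif.h'
      ! message size from the failing MPI_Irecv (count=58138, MPI_REAL)
      integer, parameter :: n = 58138
      integer :: ierr, rank, partner, status(MPI_STATUS_SIZE)
      real :: sbuf(n), rbuf(n)

      call mpi_init(ierr)
      call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
      sbuf = real(rank)
      ! pair ranks 0-1, 2-3, ...; run with an even number of processes
      partner = ieor(rank, 1)

      ! Potentially unsafe ordering: both partners send first. For large
      ! messages this can block inside mpi_send itself; mpi_sendrecv (or
      ! mpi_isend/mpi_irecv followed by mpi_waitall) avoids the mutual
      ! wait.
      call mpi_send(sbuf, n, MPI_REAL, partner, 0, MPI_COMM_WORLD, ierr)
      call mpi_recv(rbuf, n, MPI_REAL, partner, 0, MPI_COMM_WORLD, &
                    status, ierr)

      call mpi_finalize(ierr)
    end program exchange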

> forrtl: severe (174): SIGSEGV, segmentation fault occurred
>
> Image              PC                Routine            Line        Source
> libifcore.so.5     00002B0534F042F1  Unknown               Unknown  Unknown
> libpthread.so.0    0000003E39E0E4C0  Unknown               Unknown  Unknown
> libifcore.so.5     00002B0534F042CA  Unknown               Unknown  Unknown
> libpthread.so.0    0000003E39E0E4C0  Unknown               Unknown  Unknown
>
> Stack trace terminated abnormally.
>
> forrtl: error (78): process killed (SIGTERM)
>
> Image              PC                Routine            Line        Source
> fvcom              00000000008E02E4  Unknown               Unknown  Unknown
> fvcom              00000000008D0BD8  Unknown               Unknown  Unknown
> fvcom              00000000008D08EE  Unknown               Unknown  Unknown
> fvcom              00000000009124FF  Unknown               Unknown  Unknown
> fvcom              00000000008C529B  Unknown               Unknown  Unknown
> fvcom              0000000000479ADF  Unknown               Unknown  Unknown
> fvcom              000000000072D02F  Unknown               Unknown  Unknown
> fvcom              0000000000711439  Unknown               Unknown  Unknown
> fvcom              00000000006E9145  Unknown               Unknown  Unknown
> fvcom              0000000000406F3C  Unknown               Unknown  Unknown
> libc.so.6          0000003E0081D974  Unknown               Unknown  Unknown
> fvcom              0000000000406E49  Unknown               Unknown  Unknown
>
> forrtl: error (78): process killed (SIGTERM)
>
> Image              PC                Routine            Line        Source
> fvcom              00000000008D5F63  Unknown               Unknown  Unknown
> fvcom              0000000000C34F10  Unknown               Unknown  Unknown
>
> forrtl: error (78): process killed (SIGTERM)
>
> Image              PC                Routine            Line        Source
> fvcom              00000000008E02E4  Unknown               Unknown  Unknown
> fvcom              00000000008D0BD8  Unknown               Unknown  Unknown
> fvcom              00000000008D08EE  Unknown               Unknown  Unknown
> fvcom              00000000009124FF  Unknown               Unknown  Unknown
> fvcom              00000000008C529B  Unknown               Unknown  Unknown
> fvcom              00000000004821CA  Unknown               Unknown  Unknown
> fvcom              0000000000481448  Unknown               Unknown  Unknown
> fvcom              000000000072D16A  Unknown               Unknown  Unknown
> fvcom              0000000000711439  Unknown               Unknown  Unknown
> fvcom              00000000006E9145  Unknown               Unknown  Unknown
> fvcom              0000000000406F3C  Unknown               Unknown  Unknown
> libc.so.6          00000036A1E1D974  Unknown               Unknown  Unknown
> fvcom              0000000000406E49  Unknown               Unknown  Unknown

Thanks,

DK

>
> Can you give me any advice on this phenomenon?
>
> Thanks very much.
>
> Yunfang Sun
>
> > Hi,
> >
> > Thanks for your report. You are using a very old version of MVAPICH2
> > (released in Feb '08). Please upgrade your installation to the latest
> > 1.5 release. There have been many enhancements and bug fixes over the
> > last two and a half years.
> >
> > Let us know if you still see this issue with the latest 1.5 release
> > (the branch version has a few more fixes after the release) and we
> > will be happy to look into it further.
> >
> > Thanks,
> >
> > DK
> >
> > On Thu, 26 Aug 2010, Yunfang Sun wrote:
> >
> >> Hi, all
> >>
> >> 	I am using MVAPICH2 1.0.2 on a cluster, and the program compiles
> >> without any problem.
> >>
> >> 	When I run the program on 64 processors, there is no problem in the
> >> early time steps. After about 20000 time steps, the program stops moving
> >> forward, but the processes are still running without crashing. I found
> >> that the blocking point is in the calls to 'mpi_send' and 'mpi_recv'.
> >>
> >> 	Also, the computed results before the blocking point are correct.
> >>
> >> 	I then changed the blocking send and receive ('mpi_send', 'mpi_recv')
> >> into the nonblocking send and receive ('mpi_isend', 'mpi_irecv'). At the
> >> same time step (21194), the program stops, and the error output is as
> >> follows:
> >>
> >> [0][ch3_rndvtransfer.c:110] Unknown protocol 0 type from rndv req to send
> >>
> >> Internal Error: invalid error code ffffffff (Ring Index out of range) in MPIDI_CH3_iStartRndvTransfer:156
> >>
> >> Fatal error in MPI_Irecv:
> >> Other MPI error, error stack:
> >> MPI_Irecv(144)...................: MPI_Irecv(buf=0x1da707c0, count=58138, MPI_REAL, src=6, tag=30222, MPI_COMM_WORLD, request=0x7fffe90c25ec) failed
> >> MPID_Irecv(124)..................:
> >> MPIDI_CH3_RndvSend(500)..........: failure occurred while attempting to send CTS packet
> >> MPIDI_CH3_iStartRndvTransfer(156): failure occurred while attempting to send CTS packet
> >>
> >> forrtl: error (78): process killed (SIGTERM)
> >>
> >> Image              PC                Routine            Line        Source
> >> libmpich.so        00002AE32CF9534D  Unknown               Unknown  Unknown
> >> libmpich.so        00002AE32CF90C3D  Unknown               Unknown  Unknown
> >> libmpich.so        00002AE32D0786F8  Unknown               Unknown  Unknown
> >> libmpich.so        00002AE32D0783B8  Unknown               Unknown  Unknown
> >> fvcom              0000000000463268  Unknown               Unknown  Unknown
> >> fvcom              0000000000462268  Unknown               Unknown  Unknown
> >> fvcom              0000000000666F18  Unknown               Unknown  Unknown
> >> fvcom              000000000071070D  Unknown               Unknown  Unknown
> >> fvcom              00000000006E7617  Unknown               Unknown  Unknown
> >> fvcom              0000000000404BCC  Unknown               Unknown  Unknown
> >> libc.so.6          00000037D7A1D974  Unknown               Unknown  Unknown
> >> fvcom              0000000000404AD9  Unknown               Unknown  Unknown
> >>
> >>
> >> 	The computed results are also correct before the crash.
> >> 	Any advice on how to solve this problem?
> >>
> >> Thanks very much!
> >>
> >> Yunfang Sun
> >>


