[mvapich-discuss] need advice for the program blocking problem
Dhabaleswar Panda
panda at cse.ohio-state.edu
Tue Aug 31 15:56:05 EDT 2010
Hi Yunfang,
> After installing MVAPICH2 1.5 and running the same case as before, I got
> different results:
Thanks for trying out the 1.5 version and getting back to us.
> When I use the code with nonblocking send and
> receive ('mpi_isend', 'mpi_irecv'), the program keeps running without any
> error, but it is about 50% slower than before.
Good to know that it now runs without errors; sorry to hear that it runs
so slowly. Can you give us some details about your cluster (number of
nodes, number of cores per node, and InfiniBand HCA type)? Are you
running a single job, or multiple jobs sharing the CPUs in a node? If it
is a multi-core cluster and you are running multiple jobs (within the
same application) sharing the CPUs in a node, you need to disable
affinity (the MV2_ENABLE_AFFINITY flag). Otherwise, multiple processes
may be mapped to the same CPU, and performance will degrade.
You can find more details on the MV2_ENABLE_AFFINITY flag in the 1.5
user guide:
http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.5.html#x1-10600011.17
Let us know if disabling affinity helps here.
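For example, if the job is launched with mpirun_rsh, affinity can be disabled per run by passing the environment variable on the command line (the hostfile name and executable here are placeholders; adjust for your launcher and site):

```shell
# Hypothetical 64-process launch; './hosts' and './fvcom' are placeholders.
# MV2_ENABLE_AFFINITY=0 turns off MVAPICH2's CPU affinity so that processes
# from jobs sharing a node are not pinned onto the same cores.
mpirun_rsh -np 64 -hostfile ./hosts MV2_ENABLE_AFFINITY=0 ./fvcom
```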
> And when I use the code with blocking send and
> receive ('mpi_send', 'mpi_recv'), the program stopped at time step
> 13660, at which point the processors stopped running.
> The error information is as follows:
As you have indicated, there is a segmentation fault here. Could you
tell us what application you are running, at what scale, and so on?
If you can send us a code snippet that reproduces this error, it will
be very helpful for debugging this further.
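For context, the kind of nonblocking exchange described above (replacing blocking send/receive with 'mpi_isend'/'mpi_irecv' plus a wait) can be sketched as follows. This is a minimal illustration only; the ranks, buffer sizes, and tags are placeholders, not code from FVCOM:

```fortran
! Minimal illustrative exchange between paired neighbor ranks.
! All buffer sizes, ranks, and tags are placeholders.
program exchange
  use mpi
  implicit none
  integer :: ierr, rank, nbr, reqs(2), stats(MPI_STATUS_SIZE, 2)
  real :: sendbuf(1024), recvbuf(1024)

  call mpi_init(ierr)
  call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
  nbr = ieor(rank, 1)          ! pair ranks 0<->1, 2<->3, ...
  sendbuf = real(rank)

  ! Post the receive first, then the send; neither call blocks, so the
  ! pair cannot stall the way two simultaneous blocking sends can once
  ! messages are too large for the eager protocol.
  call mpi_irecv(recvbuf, size(recvbuf), MPI_REAL, nbr, 0, &
                 MPI_COMM_WORLD, reqs(1), ierr)
  call mpi_isend(sendbuf, size(sendbuf), MPI_REAL, nbr, 0, &
                 MPI_COMM_WORLD, reqs(2), ierr)
  call mpi_waitall(2, reqs, stats, ierr)

  call mpi_finalize(ierr)
end program exchange
```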
> forrtl: severe (174): SIGSEGV, segmentation fault occurred
>
> Image            PC                Routine  Line     Source
> libifcore.so.5   00002B0534F042F1  Unknown  Unknown  Unknown
> libpthread.so.0  0000003E39E0E4C0  Unknown  Unknown  Unknown
> libifcore.so.5   00002B0534F042CA  Unknown  Unknown  Unknown
> libpthread.so.0  0000003E39E0E4C0  Unknown  Unknown  Unknown
>
> Stack trace terminated abnormally.
>
> forrtl: error (78): process killed (SIGTERM)
>
> Image            PC                Routine  Line     Source
> fvcom            00000000008E02E4  Unknown  Unknown  Unknown
> fvcom            00000000008D0BD8  Unknown  Unknown  Unknown
> fvcom            00000000008D08EE  Unknown  Unknown  Unknown
> fvcom            00000000009124FF  Unknown  Unknown  Unknown
> fvcom            00000000008C529B  Unknown  Unknown  Unknown
> fvcom            0000000000479ADF  Unknown  Unknown  Unknown
> fvcom            000000000072D02F  Unknown  Unknown  Unknown
> fvcom            0000000000711439  Unknown  Unknown  Unknown
> fvcom            00000000006E9145  Unknown  Unknown  Unknown
> fvcom            0000000000406F3C  Unknown  Unknown  Unknown
> libc.so.6        0000003E0081D974  Unknown  Unknown  Unknown
> fvcom            0000000000406E49  Unknown  Unknown  Unknown
>
> forrtl: error (78): process killed (SIGTERM)
>
> Image            PC                Routine  Line     Source
> fvcom            00000000008D5F63  Unknown  Unknown  Unknown
> fvcom            0000000000C34F10  Unknown  Unknown  Unknown
>
> forrtl: error (78): process killed (SIGTERM)
>
> Image            PC                Routine  Line     Source
> fvcom            00000000008E02E4  Unknown  Unknown  Unknown
> fvcom            00000000008D0BD8  Unknown  Unknown  Unknown
> fvcom            00000000008D08EE  Unknown  Unknown  Unknown
> fvcom            00000000009124FF  Unknown  Unknown  Unknown
> fvcom            00000000008C529B  Unknown  Unknown  Unknown
> fvcom            00000000004821CA  Unknown  Unknown  Unknown
> fvcom            0000000000481448  Unknown  Unknown  Unknown
> fvcom            000000000072D16A  Unknown  Unknown  Unknown
> fvcom            0000000000711439  Unknown  Unknown  Unknown
> fvcom            00000000006E9145  Unknown  Unknown  Unknown
> fvcom            0000000000406F3C  Unknown  Unknown  Unknown
> libc.so.6        00000036A1E1D974  Unknown  Unknown  Unknown
> fvcom            0000000000406E49  Unknown  Unknown  Unknown
Thanks,
DK
>
> Can you give me any advice on this phenomenon?
>
> Thanks very much.
>
> Yunfang Sun
>
>
> > Hi,
> >
> > Thanks for your report. You are using a very old version of MVAPICH2
> > (released in Feb '08). Please upgrade your installation to the latest
> > 1.5 release. There have been many enhancements and bug fixes over the
> > last two and a half years.
> >
> > Let us know if you see this issue with the latest 1.5 release (the
> > branch version has a few more fixes after the release), and we will
> > be happy to take a further look at it.
> >
> > Thanks,
> >
> > DK
> >
> > On Thu, 26 Aug 2010, Yunfang Sun wrote:
> >
> >> Hi, all
> >>
> >> I use MVAPICH2 1.0.2 on a cluster; the program compiles without any
> >> problem.
> >>
> >> When I run the program on 64 processors, there is no problem in the
> >> early time steps. After about 20000 time steps, the program stops
> >> moving forward, but the processors are still running without a
> >> crash. I found that the blocking happens in the calls to 'mpi_send'
> >> and 'mpi_recv'.
> >>
> >> The computed results before the blocking are correct.
> >>
> >> Then I changed the blocking send and receive ('mpi_send',
> >> 'mpi_recv') to the nonblocking send and receive ('mpi_isend',
> >> 'mpi_irecv'). At the same time step (21194) the program stops, and
> >> the error output is as follows:
> >>
> >> [0][ch3_rndvtransfer.c:110] Unknown protocol 0 type from rndv req to send
> >>
> >> Internal Error: invalid error code ffffffff (Ring Index out of range) in
> >> MPIDI_CH3_iStartRndvTransfer:156
> >>
> >> Fatal error in MPI_Irecv:
> >> Other MPI error, error stack:
> >> MPI_Irecv(144)...................: MPI_Irecv(buf=0x1da707c0, count=58138,
> >>   MPI_REAL, src=6, tag=30222, MPI_COMM_WORLD, request=0x7fffe90c25ec) failed
> >> MPID_Irecv(124)..................:
> >> MPIDI_CH3_RndvSend(500)..........: failure occurred while attempting to
> >>   send CTS packet
> >> MPIDI_CH3_iStartRndvTransfer(156): failure occurred while attempting to
> >>   send CTS packet
> >>
> >> forrtl: error (78): process killed (SIGTERM)
> >>
> >> Image        PC                Routine  Line     Source
> >> libmpich.so  00002AE32CF9534D  Unknown  Unknown  Unknown
> >> libmpich.so  00002AE32CF90C3D  Unknown  Unknown  Unknown
> >> libmpich.so  00002AE32D0786F8  Unknown  Unknown  Unknown
> >> libmpich.so  00002AE32D0783B8  Unknown  Unknown  Unknown
> >> fvcom        0000000000463268  Unknown  Unknown  Unknown
> >> fvcom        0000000000462268  Unknown  Unknown  Unknown
> >> fvcom        0000000000666F18  Unknown  Unknown  Unknown
> >> fvcom        000000000071070D  Unknown  Unknown  Unknown
> >> fvcom        00000000006E7617  Unknown  Unknown  Unknown
> >> fvcom        0000000000404BCC  Unknown  Unknown  Unknown
> >> libc.so.6    00000037D7A1D974  Unknown  Unknown  Unknown
> >> fvcom        0000000000404AD9  Unknown  Unknown  Unknown
> >>
> >>
> >>
> >> The computed results are also correct before the crash.
> >> Any advice to solve this problem?
> >>
> >> Thanks very much!
> >>
> >> Yunfang Sun
> >>
> >>
> >> _______________________________________________
> >> mvapich-discuss mailing list
> >> mvapich-discuss at cse.ohio-state.edu
> >> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> >>
> >
> >
>