[mvapich-discuss] Mvapich2-1.2 for OpenFabrics IB/iWARP : Job terminates with error

Dhabaleswar Panda panda at cse.ohio-state.edu
Fri Feb 20 08:17:16 EST 2009


Thanks for providing the details and pointer to the code. We will take a
look at it.

Can you also indicate which version of OFED you are using and the platform
details (Intel or AMD, and the HCA type)?
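
A minimal sketch of how these details might be collected on one compute node. It assumes a Linux node where the OFED tools `ofed_info` and `ibv_devinfo` are on the PATH; the helper names (`run_if_available`, `cpu_vendor`, `hca_types`) are hypothetical and are not part of OFED or MVAPICH2.

```python
import shutil
import subprocess


def run_if_available(cmd, timeout=10):
    """Run cmd and return its stdout, or None if the tool is missing or fails."""
    if shutil.which(cmd[0]) is None:
        return None
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired):
        return None
    return result.stdout.strip() if result.returncode == 0 else None


def cpu_vendor(cpuinfo_text):
    """Map the vendor_id field of /proc/cpuinfo text to 'Intel', 'AMD' or 'unknown'."""
    names = {"GenuineIntel": "Intel", "AuthenticAMD": "AMD"}
    for line in cpuinfo_text.splitlines():
        if line.startswith("vendor_id") and ":" in line:
            return names.get(line.split(":", 1)[1].strip(), "unknown")
    return "unknown"


def hca_types(devinfo_text):
    """Collect the hca_id values printed by ibv_devinfo."""
    ids = []
    for line in devinfo_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("hca_id") and ":" in stripped:
            ids.append(stripped.split(":", 1)[1].strip())
    return ids


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            print("CPU vendor:", cpu_vendor(f.read()))
    except OSError:
        print("CPU vendor: unknown (could not read /proc/cpuinfo)")
    # Assumption: 'ofed_info -s' prints the short OFED version string.
    print("OFED version:", run_if_available(["ofed_info", "-s"]) or "ofed_info not found")
    devinfo = run_if_available(["ibv_devinfo"])
    print("HCAs:", hca_types(devinfo) if devinfo else "ibv_devinfo not found")
```

Running this on one of the failing nodes (for example ibc0-16) and pasting the three printed lines into a reply should cover the OFED version, CPU vendor and HCA type.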

DK

On Fri, 20 Feb 2009, Vivek Gavane wrote:

> Sir,
>       I have tried different sets of nodes over several runs, and the same
> error is reported each time. When I tried a small number of cores (i.e. 8),
> the job never exited even though it had completed and the output file had
> been generated. The processes also kept showing 99.9% CPU usage after the
> complete output was generated.
>
> The application code I am using is MEME version meme3.0.3
> http://meme.nbcr.net/downloads/old_versions/
>
> I also installed the newer version of MEME, meme_4.1.0:
> http://meme.nbcr.net/downloads/
>
> It also gives the following error every time, on different sets of nodes:
> -----------------------------------
> Exit code -5 signaled from ibc0-27
> Killing remote processes...MPI process terminated unexpectedly
> DONE
> -----------------------------------
>
> The redirected output file of the application contains:
> -----------------------------
> cleanupSignal 15 received.
> -----------------------------
>
> Thanks.
> --
> Regards,
> Vivek Gavane
>
> Member Technical Staff
> Bioinformatics team,
> Scientific & Engineering Computing Group,
> National PARAM Supercomputing Facility,
> Centre for Development of Advanced Computing,
> Pune-411007.
>
> Phone:       +91 20 25704100 ext. 195
> Direct Line: +91 20 25704195
>
> On Thu, Feb 19, 2009, Dhabaleswar Panda <panda at cse.ohio-state.edu> said:
>
> > Vivek,
> >
> > Do you always see this error when you run this application? Do you see
> > it when you run your application on a different set of nodes? If it
> > always happens (irrespective of runs and nodes), would it be possible
> > for you to send us a code snippet that reproduces the problem? This will
> > help us investigate the issue further.
> >
> > Thanks,
> >
> > DK
> >
> >> Sir,
> >>     Thank you for the reply, but the cable and switch seem to be fine. Is
> >> there any other reason or solution for the errors? Also, the application
> >> program gives complete and correct output, apart from the errors at the
> >> end.
> >>
> >> Thanks.
> >> --
> >> Regards,
> >> Vivek Gavane
> >>
> >> On Tue, Feb 17, 2009, Dhabaleswar Panda <panda at cse.ohio-state.edu> said:
> >>
> >> > Code 12 is a timeout -- could be a bad cable/HCA/switch leaf. If the
> >> > system is really large then it could be congestion.
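
As a sketch of where "code 12" comes from: assuming the number in the abort message is the work-completion status from libibverbs (`enum ibv_wc_status` in `<infiniband/verbs.h>`), 12 corresponds to `IBV_WC_RETRY_EXC_ERR`, i.e. the transport retry counter was exceeded, which matches the timeout described above. The lookup table below is a hand-copied subset of that enum, and `describe_wc_status` is a hypothetical helper, not an MVAPICH2 or libibverbs function.

```python
# Subset of enum ibv_wc_status, copied by hand from <infiniband/verbs.h>.
IBV_WC_STATUS = {
    0: "IBV_WC_SUCCESS",
    1: "IBV_WC_LOC_LEN_ERR",
    2: "IBV_WC_LOC_QP_OP_ERR",
    3: "IBV_WC_LOC_EEC_OP_ERR",
    4: "IBV_WC_LOC_PROT_ERR",
    5: "IBV_WC_WR_FLUSH_ERR",
    6: "IBV_WC_MW_BIND_ERR",
    7: "IBV_WC_BAD_RESP_ERR",
    8: "IBV_WC_LOC_ACCESS_ERR",
    9: "IBV_WC_REM_INV_REQ_ERR",
    10: "IBV_WC_REM_ACCESS_ERR",
    11: "IBV_WC_REM_OP_ERR",
    12: "IBV_WC_RETRY_EXC_ERR",      # transport retry counter exceeded (timeout)
    13: "IBV_WC_RNR_RETRY_EXC_ERR",  # receiver-not-ready retry counter exceeded
}


def describe_wc_status(code):
    """Return the enum name for a completion status code, if it is in the table."""
    return IBV_WC_STATUS.get(code, "unknown status %d" % code)


if __name__ == "__main__":
    print(12, "->", describe_wc_status(12))
```

On a node with libibverbs installed, the C function `ibv_wc_status_str()` returns a readable string for the same status values.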
> >> >
> >> > Thanks,
> >> >
> >> > DK
> >> >
> >> > On Tue, 17 Feb 2009, Vivek Gavane wrote:
> >> >
> >> >> Hello,
> >> >>      I have mvapich2-1.2 compiled with the following options:
> >> >>
> >> >>
> >> >> ./configure --with-rdma=gen2 --enable-sharedlibs=gcc --enable-g=dbg
> >> >> --enable-debuginfo --with-ib-include=/opt/OFED/include
> >> >> --with-ib-libpath=/opt/OFED/lib64 --prefix=/home/apps/mvapich2-1.2
> >> >>
> >> >> After I submit a job, the job completes but the following errors are
> >> >> reported on the console:
> >> >>
> >> >> -------------------------------------------------------------
> >> >> send desc error
> >> >> Exit code -5 signaled from ibc0-16
> >> >> Killing remote processes...[14] Abort: [] Got completion with error 12,
> >> >> vendor code=81, dest rank=0
> >> >>  at line 553 in file ibv_channel_manager.c
> >> >> MPI process terminated unexpectedly
> >> >> DONE
> >> >> ------------------------------------------------------------
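
When comparing many failed runs, it can help to pull the fields out of the abort message automatically. The sketch below parses the console output quoted above (with the mail quote markers removed); the regular expression is written against this one sample, so it is an assumption that other MVAPICH2 versions print the same layout. `parse_abort` is a hypothetical helper name.

```python
import re

# Console output from the report above, without the mail quote prefixes.
SAMPLE = """send desc error
Exit code -5 signaled from ibc0-16
Killing remote processes...[14] Abort: [] Got completion with error 12,
vendor code=81, dest rank=0
 at line 553 in file ibv_channel_manager.c
MPI process terminated unexpectedly
DONE
"""

# The message wraps across lines, so whitespace is matched loosely.
ABORT_RE = re.compile(
    r"\[(?P<rank>\d+)\]\s+Abort:.*?Got completion with error\s+(?P<error>\d+),"
    r"\s*vendor code=(?P<vendor>\w+),\s*dest rank=(?P<dest>\d+)"
    r"\s*at line\s+(?P<line>\d+)\s+in file\s+(?P<file>\S+)",
    re.DOTALL,
)


def parse_abort(text):
    """Extract the fields of the first abort message in text, or return None."""
    match = ABORT_RE.search(text)
    if match is None:
        return None
    return {
        "rank": int(match.group("rank")),
        "error": int(match.group("error")),
        # Kept as a string: the sample does not show whether it is decimal or hex.
        "vendor_code": match.group("vendor"),
        "dest_rank": int(match.group("dest")),
        "line": int(match.group("line")),
        "file": match.group("file"),
    }


if __name__ == "__main__":
    print(parse_abort(SAMPLE))
```

Collecting the `rank` and `dest_rank` values over several runs shows whether the failures always involve the same pair of processes, which would point at a particular node or link.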
> >> >>
> >> >> In the redirected output file, the following errors are reported at the
> >> >> end:
> >> >> -----------------------------------------
> >> >> cleanupSignal 15 received.
> >> >> Signal 15 received.
> >> >> Signal 15 received.
> >> >> Signal 15 received.
> >> >> -----------------------------------------
> >> >>
> >> >> Does anyone know the reason for this?
> >> >>
> >> >> Thanks in advance.
> >> >> --
> >> >> Regards,
> >> >> Vivek Gavane
> >> >> _______________________________________________
> >> >> mvapich-discuss mailing list
> >> >> mvapich-discuss at cse.ohio-state.edu
> >> >> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> >> >>
> >> >
> >>
> >>
> >
>
>
>
>


