[mvapich-discuss] Mvapich2-1.2 for OpenFabrics IB/iWARP : Job terminates with error

Jonathan Perkins perkinjo at cse.ohio-state.edu
Mon Mar 2 13:17:41 EST 2009


Vivek:
We do not have an environment set up that can easily support the
installation of the MEME Suite.  Is there a simpler MPI program with
which this error can be reproduced?  That would greatly assist us in
debugging this issue.
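
For reference while a smaller reproducer is put together, the numeric
codes in the report quoted below can be decoded offline. The sketch
below is only a lookup table: it assumes the "error 12" printed by
ibv_channel_manager.c is a verbs work-completion status (enum
ibv_wc_status in <infiniband/verbs.h>), which is consistent with the
"timeout" reading given earlier in the thread, and that "Signal 15" is
the POSIX signal number. The helper names wc_status_name and
signal_name are made up for this illustration and are not part of
MVAPICH2 or libibverbs.

```python
import signal

# Values of `enum ibv_wc_status` as declared in <infiniband/verbs.h>.
# Assumption: the code printed in "Got completion with error N" is one
# of these values. Newer rdma-core releases define a few more entries.
IBV_WC_STATUS = {
    0: "IBV_WC_SUCCESS",
    1: "IBV_WC_LOC_LEN_ERR",
    2: "IBV_WC_LOC_QP_OP_ERR",
    3: "IBV_WC_LOC_EEC_OP_ERR",
    4: "IBV_WC_LOC_PROT_ERR",
    5: "IBV_WC_WR_FLUSH_ERR",
    6: "IBV_WC_MW_BIND_ERR",
    7: "IBV_WC_BAD_RESP_ERR",
    8: "IBV_WC_LOC_ACCESS_ERR",
    9: "IBV_WC_REM_INV_REQ_ERR",
    10: "IBV_WC_REM_ACCESS_ERR",
    11: "IBV_WC_REM_OP_ERR",
    12: "IBV_WC_RETRY_EXC_ERR",      # transport retry counter exceeded
    13: "IBV_WC_RNR_RETRY_EXC_ERR",
    14: "IBV_WC_LOC_RDD_VIOL_ERR",
    15: "IBV_WC_REM_INV_RD_REQ_ERR",
    16: "IBV_WC_REM_ABORT_ERR",
    17: "IBV_WC_INV_EECN_ERR",
    18: "IBV_WC_INV_EEC_STATE_ERR",
    19: "IBV_WC_FATAL_ERR",
    20: "IBV_WC_RESP_TIMEOUT_ERR",
    21: "IBV_WC_GENERAL_ERR",
}


def wc_status_name(code: int) -> str:
    """Return the symbolic name for a work-completion status code."""
    return IBV_WC_STATUS.get(code, f"unknown status {code}")


def signal_name(number: int) -> str:
    """Return the symbolic name for a signal number on this platform."""
    try:
        return signal.Signals(number).name
    except ValueError:
        return f"unknown signal {number}"


if __name__ == "__main__":
    print("completion error 12:", wc_status_name(12))
    print("signal 15:", signal_name(15))
```

Under these assumptions, error 12 reads as the sender exhausting its
transport retries toward rank 0, and signal 15 is SIGTERM, i.e. the
remaining processes being terminated during cleanup after the abort.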

On Fri, Feb 20, 2009 at 11:32:30AM +0530, Vivek Gavane wrote:
> Sir,
>       I have tried different sets of nodes over various runs, and the same
> error is reported. But when I tried a small number of cores, i.e. 8, the
> job never exited even though it was complete and the output file was
> generated. Also, the processes were showing 99.9% CPU usage even after
> the complete output was generated.
> 
> The application code I am using is MEME version meme3.0.3
> http://meme.nbcr.net/downloads/old_versions/
> 
> I also installed the newer MEME version, meme_4.1.0
> http://meme.nbcr.net/downloads/
> 
> It also gives the following error every time, on different sets of nodes:
> -----------------------------------
> Exit code -5 signaled from ibc0-27
> Killing remote processes...MPI process terminated unexpectedly
> DONE
> -----------------------------------
> 
> The redirected output file of the application contains:
> -----------------------------
> cleanupSignal 15 received.
> -----------------------------
> 
> Thanks.
> -- 
> Regards,
> Vivek Gavane
> 
> Member Technical Staff
> Bioinformatics team,
> Scientific & Engineering Computing Group,
> National PARAM Supercomputing Facility,
> Centre for Development of Advanced Computing,
> Pune-411007.
> 
> Phone:       +91 20 25704100 ext. 195
> Direct Line: +91 20 25704195
> 
> On Thu, Feb 19, 2009, Dhabaleswar Panda <panda at cse.ohio-state.edu> said:
> 
> > Vivek,
> > 
> > Do you always see this error when you run this application? Do you see
> > it when you run your application on a different set of nodes? If
> > this always happens (irrespective of runs and nodes), would it be possible
> > for you to send us a code snippet which reproduces this problem? This will
> > help us to investigate this issue further.
> > 
> > Thanks,
> > 
> > DK
> > 
> >> Sir,
> >>     Thank you for the reply, but the cable and switch seem to be fine. Is
> >> there any other reason/solution for the errors? Also, the application
> >> program is giving complete and correct output except for the errors at the
> >> end.
> >>
> >> Thanks.
> >> --
> >> Regards,
> >> Vivek Gavane
> >>
> >> On Tue, Feb 17, 2009, Dhabaleswar Panda <panda at cse.ohio-state.edu> said:
> >>
> >> > Code 12 is a timeout -- could be a bad cable/HCA/switch leaf. If the
> >> > system is really large then it could be congestion.
> >> >
> >> > Thanks,
> >> >
> >> > DK
> >> >
> >> > On Tue, 17 Feb 2009, Vivek Gavane wrote:
> >> >
> >> >> Hello,
> >> >>      I have mvapich2-1.2 compiled with the following options:
> >> >>
> >> >>
> >> >> ./configure --with-rdma=gen2 --enable-sharedlibs=gcc --enable-g=dbg
> >> >> --enable-debuginfo --with-ib-include=/opt/OFED/include
> >> >> --with-ib-libpath=/opt/OFED/lib64 --prefix=/home/apps/mvapich2-1.2
> >> >>
> >> >> After I submit a job, the job completes but the following errors are
> >> >> reported on the console:
> >> >>
> >> >> -------------------------------------------------------------
> >> >> send desc error
> >> >> Exit code -5 signaled from ibc0-16
> >> >> Killing remote processes...[14] Abort: [] Got completion with error 12,
> >> >> vendor code=81, dest rank=0
> >> >>  at line 553 in file ibv_channel_manager.c
> >> >> MPI process terminated unexpectedly
> >> >> DONE
> >> >> ------------------------------------------------------------
> >> >>
> >> >> And in the redirected output file, the following errors are reported at the
> >> >> end:
> >> >> -----------------------------------------
> >> >> cleanupSignal 15 received.
> >> >> Signal 15 received.
> >> >> Signal 15 received.
> >> >> Signal 15 received.
> >> >> -----------------------------------------
> >> >>
> >> >> Does anyone know the reason for this?
> >> >>
> >> >> Thanks in advance.
> >> >> --
> >> >> Regards,
> >> >> Vivek Gavane
> >> >> _______________________________________________
> >> >> mvapich-discuss mailing list
> >> >> mvapich-discuss at cse.ohio-state.edu
> >> >> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss

-- 
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo

