[mvapich-discuss] Mvapich2-1.2 for OpenFabrics IB/iWARP : Job terminates with error

Vivek Gavane vivekg at cdac.in
Fri Feb 20 08:53:03 EST 2009


Sir,
       I am using OFED 1.2.5 and the platform is AMD Opteron. We are using
an "MT47396 Infiniscale-III Mellanox Technologies" switch. The VERBS version
is 1.1.0.

Thanks.
-- 
Regards,
Vivek Gavane


On Fri, Feb 20, 2009, Dhabaleswar Panda <panda at cse.ohio-state.edu> said:

> Thanks for providing the details and pointer to the code. We will take a
> look at it.
> 
> Can you also indicate which version of OFED you are using and the platform
> details (Intel or AMD and HCA type).
> 
> DK
> 
> On Fri, 20 Feb 2009, Vivek Gavane wrote:
> 
>> Sir,
>>       I have tried different sets of nodes over various runs, and the same
>> error is reported. But when I tried a small number of cores, i.e. 8, the
>> job never exited even though it was complete and the output file was
>> generated. Also, the processes were showing 99.9% CPU usage even after
>> the complete output was generated.
>>
>> The application code I am using is MEME version meme3.0.3
>> http://meme.nbcr.net/downloads/old_versions/
>>
>> I also installed the newer version of MEME, meme_4.1.0
>> http://meme.nbcr.net/downloads/
>>
>> It also gives the following error every time on different sets of nodes:
>> -----------------------------------
>> Exit code -5 signaled from ibc0-27
>> Killing remote processes...MPI process terminated unexpectedly
>> DONE
>> -----------------------------------
>>
>> The redirected output file of the application contains:
>> -----------------------------
>> cleanupSignal 15 received.
>> -----------------------------
>>
>> Thanks.
>> --
>> Regards,
>> Vivek Gavane
>>
>> Member Technical Staff
>> Bioinformatics team,
>> Scientific & Engineering Computing Group,
>> National PARAM Supercomputing Facility,
>> Centre for Development of Advanced Computing,
>> Pune-411007.
>>
>> Phone:       +91 20 25704100 ext. 195
>> Direct Line: +91 20 25704195
>>
>> On Thu, Feb 19, 2009, Dhabaleswar Panda <panda at cse.ohio-state.edu> said:
>>
>> > Vivek,
>> >
>> > Do you see this error always when you run this application? Do you see
>> > this error when you run your application on different set of nodes? If
>> > this happens always (irrespective of runs and nodes), will it be possible
>> > for you to send us a code snippet which reproduces this problem. This will
>> > help us to investigate this issue further.
>> >
>> > Thanks,
>> >
>> > DK
>> >
>> >> Sir,
>> >>     Thank you for the reply, but the cable and switch seem to be fine. Is
>> >> there any other reason/solution for the errors? Also, the application
>> >> program is giving complete and correct output except for the errors at the
>> >> end.
>> >>
>> >> Thanks.
>> >> --
>> >> Regards,
>> >> Vivek Gavane
>> >>
>> >> On Tue, Feb 17, 2009, Dhabaleswar Panda <panda at cse.ohio-state.edu> said:
>> >>
>> >> > Code 12 is a timeout -- could be a bad cable/HCA/switch leaf. If the
>> >> > system is really large then it could be congestion.
>> >> >
>> >> > Thanks,
>> >> >
>> >> > DK
>> >> >
>> >> > On Tue, 17 Feb 2009, Vivek Gavane wrote:
>> >> >
>> >> >> Hello,
>> >> >>      I have mvapich2-1.2 compiled with the following options:
>> >> >>
>> >> >>
>> >> >> ./configure --with-rdma=gen2 --enable-sharedlibs=gcc --enable-g=dbg
>> >> >> --enable-debuginfo --with-ib-include=/opt/OFED/include
>> >> >> --with-ib-libpath=/opt/OFED/lib64 --prefix=/home/apps/mvapich2-1.2
>> >> >>
>> >> >> After I submit a job, the job completes but the following errors are
>> >> >> reported on the console:
>> >> >>
>> >> >> -------------------------------------------------------------
>> >> >> send desc error
>> >> >> Exit code -5 signaled from ibc0-16
>> >> >> Killing remote processes...[14] Abort: [] Got completion with error 12,
>> >> >> vendor code=81, dest rank=0
>> >> >>  at line 553 in file ibv_channel_manager.c
>> >> >> MPI process terminated unexpectedly
>> >> >> DONE
>> >> >> ------------------------------------------------------------
>> >> >>
>> >> >> In the redirected output file, the following errors are reported at the
>> >> >> end:
>> >> >> -----------------------------------------
>> >> >> cleanupSignal 15 received.
>> >> >> Signal 15 received.
>> >> >> Signal 15 received.
>> >> >> Signal 15 received.
>> >> >> -----------------------------------------
>> >> >>
>> >> >> Does anyone know the reason for this?
>> >> >>
>> >> >> Thanks in advance.
>> >> >> --
>> >> >> Regards,
>> >> >> Vivek Gavane
>> >> >> _______________________________________________
>> >> >> mvapich-discuss mailing list
>> >> >> mvapich-discuss at cse.ohio-state.edu
>> >> >> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>> >> >>
>> >> >
>> >>
>> >>
>> >
>>
>>
>>
>>
> 




