[mvapich-discuss] poll_or_block_event

Jonathan Perkins perkinjo at cse.ohio-state.edu
Mon Mar 14 14:45:14 EDT 2011


If things were running well before, it seems that something may have
changed in the background, either with the software or the hardware on
your system.

Regarding the backtrace question: when running applications it can be
useful to attach a debugger such as gdb to the process.  If an
application crashes due to something like a segmentation fault, a
core.9999 file (replace the 9999 with the pid of the failed program)
may be generated that you can use to inspect the state of the program
at the time it crashed.  Usually you'll need the program, and
potentially the libraries it's using, to be built with debugging
symbols for this to give you any meaningful information.

For more information on this please see
http://sourceware.org/gdb/onlinedocs/gdb.html.
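
As a quick illustration (using a made-up file, crashme.c, rather than
your actual application), a program built with debugging symbols will
give a readable backtrace from its core file:

    /*
     * crashme.c -- crashes on purpose to demonstrate core inspection.
     *
     * Build with symbols:    gcc -g -O0 -o crashme crashme.c
     * Allow core dumps:      ulimit -c unlimited
     * Run it, then inspect:  gdb ./crashme core.<pid>
     *                        (gdb) bt     <- prints the stack trace
     * The exact core file name depends on your kernel's core_pattern.
     */
    #include <stdlib.h>

    static void inner(void)
    {
        int *p = NULL;
        *p = 42;        /* deliberate segmentation fault */
    }

    static void outer(void)
    {
        inner();
    }

    int main(void)
    {
        outer();        /* with -g, "bt" shows main -> outer -> inner */
        return EXIT_SUCCESS;
    }

Without -g you would mostly see raw addresses and "??" frames in the
backtrace, which is why the debugging symbols matter.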

On Mon, Mar 14, 2011 at 2:12 PM, Jacob Harvey <jaharvey at chem.umass.edu> wrote:
> Jonathan,
>
> Thank you so much for your response. We are using mvapich2-1.4.1 and
> this problem happens with all of the programs that we run in parallel
> (right now just DL Poly and CPMD).
>
> What do you mean by back trace for the process? I'm not familiar with that.
>
> I'll definitely upgrade the version of mvapich, but in the meantime
> I'd like to narrow down where the problem is coming from. It seems odd
> that the cluster would be running just fine a week ago and now it's
> not, without anything really changing.
>
> I've contacted OSC about the problem as well but I haven't heard
> anything back yet.
>
> Jacob
>
> On Mon, Mar 14, 2011 at 12:34 PM, Jonathan Perkins
> <perkinjo at cse.ohio-state.edu> wrote:
>> Jacob:
>> Thanks for your note.  Let's try to narrow down this issue a bit further.
>>
>> What version of mvapich or mvapich2 are you using?  Does this problem
>> only happen with this application?  Are you able to get a back trace
>> from the process when this happens?
>>
>> You may also want to consider upgrading to mvapich2-1.6 if you're
>> not already using it.  With this version, you're able to use the
>> mpiexec (hydra) that ships with mvapich2 itself.  This can help
>> isolate the issue as well.  For more information on this please see
>> http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.6.html#x1-250005.2.2
>>
>> On Mon, Mar 14, 2011 at 11:18 AM, Jacob Harvey <jaharvey at chem.umass.edu> wrote:
>>> MVAPICH users,
>>>
>>> I'm running into a problem on our cluster that I don't really know
>>> much about. Basically what happens is that when you submit a
>>> calculation the job runs for some time and then randomly it appears
>>> to stop running (i.e. no more output is sent back from the
>>> executable). At that point, if you ssh to the node that was running
>>> the calculation, you will find that the executable is no longer
>>> running (not surprisingly). Upon killing the job I get a whole bunch
>>> of the following errors in the standard error file:
>>>
>>> mpiexec: Warning: poll_or_block_event: evt 58 task 24 on node001:
>>> remote system error.
>>>
>>> We are using the OSC mpiexec to launch the jobs from within PBS. I've
>>> looked around but haven't been able to find much related to this
>>> error. If anyone could provide any assistance it would be very much
>>> appreciated. I thank you in advance.
>>>
>>> Jacob
>>> _______________________________________________
>>> mvapich-discuss mailing list
>>> mvapich-discuss at cse.ohio-state.edu
>>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>>
>>>
>>
>>
>>
>> --
>> Jonathan Perkins
>> http://www.cse.ohio-state.edu/~perkinjo
>>
>
>



-- 
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo


