[mvapich-discuss] poll_or_block_event

Jacob Harvey jaharvey at chem.umass.edu
Mon Mar 14 14:12:37 EDT 2011


Jonathan,

Thank you so much for your response. We are using mvapich2-1.4.1 and
this problem happens with all of the programs that we run in parallel
(right now just DL Poly and CPMD).

What do you mean by back trace for the process? I'm not familiar with that.

I'll definitely upgrade the version of mvapich, but in the meantime
I'd like to narrow down where the problem is coming from. It seems odd
that the cluster would be running just fine a week ago and now it's not,
without anything really changing.

I've contacted OSC about the problem as well but I haven't heard
anything back yet.

Jacob

On Mon, Mar 14, 2011 at 12:34 PM, Jonathan Perkins
<perkinjo at cse.ohio-state.edu> wrote:
> Jacob:
> Thanks for your note.  Let's try to narrow down this issue a bit further.
>
> What version of mvapich or mvapich2 are you using?  Does this problem
> only happen with this application?  Are you able to get a back trace
> from the process when this happens?
>
> You may also want to consider upgrading to mvapich2-1.6 if you're
> not already using this.  With this version, you're able to use the
> mpiexec (hydra) that ships with mvapich2 itself.  This can help
> isolate the issue as well.  For more information on this please see
> http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.6.html#x1-250005.2.2
>
> On Mon, Mar 14, 2011 at 11:18 AM, Jacob Harvey <jaharvey at chem.umass.edu> wrote:
>> MVAPICH users,
>>
>> I'm running into a problem on our cluster that I don't really know
>> much about. Basically what happens is that when you submit a
>> calculation the job runs for some time and then randomly it appears to
>> stop running (i.e. no more output is sent back from the executable). At
>> that point if you ssh to the node that was running the calculation
>> you will find that the executable is no longer running (not
>> surprisingly). Upon killing the job I get a whole bunch of the
>> following errors in the standard error file:
>>
>> mpiexec: Warning: poll_or_block_event: evt 58 task 24 on node001:
>> remote system error.
>>
>> We are using the OSC mpiexec to launch the jobs from within PBS. I've
>> looked around but haven't been able to find much related to this
>> error. If anyone could provide any assistance it would be very much
>> appreciated. I thank you in advance.
>>
>> Jacob
>> _______________________________________________
>> mvapich-discuss mailing list
>> mvapich-discuss at cse.ohio-state.edu
>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>
>>
>
>
>
> --
> Jonathan Perkins
> http://www.cse.ohio-state.edu/~perkinjo
>


