[mvapich-discuss] poll_or_block_event

Wed Mar 16 13:34:06 EDT 2011

Hey Ben,

That's a good idea, thanks for that. I actually haven't been able to
reproduce the error reliably so perhaps you are correct. There are
some weird errors in the PBS log files that indicate that the
communication between nodes went wrong. Errors like the following
started to pop up in the server logs when the jobs started to fail:

03/11/2011 20:40:37;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::stream_eof,
connection to node008 is bad, remote service may be down, message may
be corrupt, or connection may have been dropped remotely (No error).
setting node state to down

03/11/2011 22:54:04;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::stream_eof,
connection to node010 is bad, remote service may be down, message may
be corrupt, or connection may have been dropped remotely (End of
File).  setting node state to down
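
For anyone who wants to check their own server log for the same pattern, here is a rough sketch in Python that pulls the timestamp and node name out of stream_eof entries like the two above. The regular expression is only an assumption based on those two entries, and the names find_stream_eof and SAMPLE are made up for illustration. Real log lines are presumably not wrapped the way they are in this email, but the pattern tolerates either form.

```python
import re
from collections import Counter

# Assumed layout, inferred only from the two entries quoted above:
#   MM/DD/YYYY HH:MM:SS;...;LOG_ERROR::stream_eof, connection to <node> is bad, ...
STREAM_EOF = re.compile(
    r"(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2});.*?"
    r"LOG_ERROR::stream_eof,\s+connection\s+to\s+(\S+)\s+is\s+bad"
)


def find_stream_eof(log_text):
    """Return (timestamp, node) pairs for every stream_eof entry found."""
    return STREAM_EOF.findall(log_text)


# Sample text copied from the entries above (a stand-in for a real log file).
SAMPLE = """\
03/11/2011 20:40:37;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::stream_eof,
connection to node008 is bad, remote service may be down, message may
be corrupt, or connection may have been dropped remotely (No error).
setting node state to down

03/11/2011 22:54:04;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::stream_eof,
connection to node010 is bad, remote service may be down, message may
be corrupt, or connection may have been dropped remotely (End of
File).  setting node state to down
"""

if __name__ == "__main__":
    hits = find_stream_eof(SAMPLE)
    for timestamp, node in hits:
        print(timestamp, node)
    # Count entries per node to see whether one node dominates.
    print(Counter(node for _, node in hits))
```

To use it on a real log, read the file contents and pass them to find_stream_eof; where the server log lives depends on the installation, so I have not hard-coded a path.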

Anyway, thanks again, Ben. Like I said, I'm probably going to chalk this
up as an aberration since I can't reproduce it reliably.

Jacob

On Wed, Mar 16, 2011 at 1:08 PM, Ben Truscott
<B.S.Truscott at bristol.ac.uk> wrote:
> Dear Jacob
>
> This has happened for me in the past when our PBS server has been
> overloaded, e.g. due to a very large volume of jobs being submitted rapidly
> by script. In this case, communications between the PBS server and MOM
> processes might time out, and if this happens the job will tend to be
> killed as PBS assumes the node has failed and tries to clean up the job
> processes. Although it's possible that you'd avoid the problem by using an
> mpiexec less tightly integrated with PBS, that workaround might not
> represent a particularly desirable trade-off overall. Barring PBS issues,
> I've generally found OSC mpiexec to work very well with MVAPICH2. In the
> first instance I'd recommend checking the PBS logs, both for the PBS
> server itself and for the MOMs on the affected nodes, for any sign of
> communication errors or time-outs.
>
> Best regards
>
> Ben Truscott
> School of Chemistry
> University of Bristol
>
>> MVAPICH users,
>>
>> I'm running into a problem on our cluster that I don't really know
>> much about. Basically, what happens is that when you submit a
>> calculation, the job runs for some time and then randomly it appears to
>> stop running (i.e. no more output is sent back from the executable). At
>> that point, if you ssh to the node that was running the calculation,
>> you will find that the executable is no longer running (not
>> surprisingly). Upon killing the job I get a whole bunch of the
>> following errors in the standard error file:
>>
>> mpiexec: Warning: poll_or_block_event: evt 58 task 24 on node001:
>> remote system error.
>>
>> We are using the OSC mpiexec to launch the jobs from within PBS. I've
>> looked around but haven't been able to find much related to this
>> error. If anyone could provide any assistance it would be very much
>> appreciated. I thank you in advance.
>>
>> Jacob
>
>