[mvapich-discuss] poll_or_block_event

Ben Truscott B.S.Truscott at bristol.ac.uk
Wed Mar 16 13:08:10 EDT 2011


Dear Jacob

I have seen this in the past when our PBS server was overloaded, e.g. by
a very large volume of jobs being submitted rapidly by script. In that
situation communications between the PBS server and the MOM processes can
time out. When they do, the job tends to be killed, because PBS assumes
the node has failed and tries to clean up the job's processes. You might
avoid the problem by using an mpiexec that is less tightly integrated
with PBS, but that workaround is probably not a desirable trade-off
overall. Barring PBS issues, I've generally found OSC mpiexec to work
very well with MVAPICH2. In the first instance I'd recommend checking the
PBS logs, both for the PBS server itself and for the MOMs on the affected
nodes, for any sign of communication errors or time-outs.
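
As a rough sketch of that log check, the Python below scans a directory of
log files for lines mentioning time-outs or communication problems. The
search pattern and the directory paths in the usage note are assumptions
for illustration only; the actual wording of PBS messages and the location
of the server and MOM logs depend on the PBS flavour and how it was
installed.

```python
import re
from pathlib import Path

# Assumed search terms: real PBS log wording varies by version and site,
# so adjust this pattern after looking at a few lines of your own logs.
SUSPECT = re.compile(r"time[- ]?out|timed out|communication", re.IGNORECASE)


def find_suspect_lines(lines, pattern=SUSPECT):
    """Return (line_number, text) pairs for lines that match the pattern."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if pattern.search(line)
    ]


def scan_log_dir(log_dir, pattern=SUSPECT):
    """Scan every regular file in log_dir; return {file name: matches}."""
    results = {}
    for path in sorted(Path(log_dir).iterdir()):
        if not path.is_file():
            continue
        # errors="replace" avoids failing on stray non-UTF-8 bytes in a log.
        with path.open(errors="replace") as handle:
            hits = find_suspect_lines(handle, pattern)
        if hits:
            results[path.name] = hits
    return results
```

It could be run once on the PBS server and once on each affected node, for
example scan_log_dir("/var/spool/torque/server_logs") and
scan_log_dir("/var/spool/torque/mom_logs"). Both paths are hypothetical
here, so substitute whatever your installation uses.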

Best regards

Ben Truscott
School of Chemistry
University of Bristol

> MVAPICH users,
>
> I'm running into a problem on our cluster that I don't really know
> much about. Basically what happens is that when you submit a
> calculation the job runs for some time and then randomly it appears to
> stop running (i.e. no more output is sent back from the executable). At
> that point if you ssh to the node that was running the calculation
> you will find that the executable is no longer running (not
> surprisingly). Upon killing the job I get a whole bunch of the
> following errors in the standard error file:
>
> mpiexec: Warning: poll_or_block_event: evt 58 task 24 on node001:
> remote system error.
>
> We are using the OSC mpiexec to launch the jobs from within PBS. I've
> looked around but haven't been able to find much related to this
> error. If anyone could provide any assistance it would be very much
> appreciated. I thank you in advance.
>
> Jacob


