[mvapich-discuss] PMGR_COLLECTIVE ERROR - pmgr_collective_mpispawn
Jaidev Sridhar
sridharj at cse.ohio-state.edu
Sun Apr 27 22:29:13 EDT 2008
Steve,
On Sunday 27 April 2008 09:26 PM, Steve Jones wrote:
> Hi.
>
> I'm receiving an error on a number of Intel MPI Benchmark (IMB) jobs
> that result in a PMGR_COLLECTIVE ERROR, shown below. The failure is not
> consistent: I can run the benchmark on a large number of nodes, and it
> seems to fail only on certain sets of nodes. Can you provide more detail
> on this error?
>
> I'm using MVAPICH 1.0gen2 OFED 1.2.5 on RHEL4 2.6.9-55.0.12
> The start command is $ mpirun_rsh -np 136 -hostfile $PBS_NODEFILE
> ./IMB-MPI1
>
> mpispawn.c:303 Unexpected exit status
This error message indicates that one of the processes terminated or was
unable to start. We catch this and kill the other processes, which is
what causes the later messages.
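Since the later PMGR_COLLECTIVE messages are only a side effect, the useful line in the output is the one that names the host that exited, which in the log quoted below reads "Exit code -1 signaled from COMPUTE-1-3". As a rough sketch, the following Python pulls those host names out of one or more saved job logs and counts them. The function name `failing_hosts` is made up for illustration, and the pattern is taken from this single log, so it is an assumption that other MVAPICH versions print the line the same way.

```python
import re
from collections import Counter

# Pattern for the line seen in the quoted job output, e.g.
# "Exit code -1 signaled from COMPUTE-1-3".
# Assumption: the wording is copied from this one log and may differ
# in other MVAPICH releases.
EXIT_LINE = re.compile(r"Exit code (-?\d+) signaled from (\S+)")


def failing_hosts(log_text):
    """Count how often each host is named in an 'Exit code' line.

    The PMGR_COLLECTIVE and 'Signal 15 received' lines are ignored,
    since they are produced when the remaining processes are killed.
    """
    counts = Counter()
    for match in EXIT_LINE.finditer(log_text):
        host = match.group(2)
        counts[host] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file names): python failing_hosts.py job1.out job2.out
    totals = Counter()
    for path in sys.argv[1:]:
        with open(path) as handle:
            totals.update(failing_hosts(handle.read()))
    for host, count in totals.most_common():
        print(host, count)
```

A host that shows up repeatedly across several failed runs is a reasonable first candidate to look at.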
Do you see a reason why some processes are failing to start? A faulty
node perhaps? You might want to try narrowing it down to the node(s)
that are causing this.
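One way to narrow it down is to split the node list into smaller groups and run the benchmark on each group separately. The sketch below only does the splitting. It assumes the node file has one host name per line and may repeat a host once per processor slot, as PBS node files commonly do. The helper names `unique_hosts` and `split_hosts` are hypothetical.

```python
def unique_hosts(nodefile_lines):
    """Return host names in first-seen order, dropping blanks and repeats."""
    seen = []
    for line in nodefile_lines:
        host = line.strip()
        if host and host not in seen:
            seen.append(host)
    return seen


def split_hosts(hosts, group_size):
    """Split the host list into consecutive groups of at most group_size."""
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    return [hosts[i:i + group_size] for i in range(0, len(hosts), group_size)]


if __name__ == "__main__":
    import os

    # Assumption: running inside a PBS job, where PBS_NODEFILE is set.
    with open(os.environ["PBS_NODEFILE"]) as handle:
        hosts = unique_hosts(handle)
    for index, group in enumerate(split_hosts(hosts, 8)):
        # Write each group to its own host file (file names are illustrative).
        with open("hosts.group%d" % index, "w") as out:
            out.write("\n".join(group) + "\n")
```

Each generated file can then be passed to mpirun_rsh with -hostfile and a matching -np. A group that fails on its own can be split again until the suspect node(s) remain.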
If you need any more help, do let us know.
-Jaidev
> Exit code -1 signaled from COMPUTE-1-3
> Killing remote processes...PMGR_COLLECTIVE ERROR: reading from (read()
> Success errno=0) @ file pmgr_collective_mpispawn.c:121
> PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file
> pmgr_collective_mpispawn.c:121
> PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file
> pmgr_collective_mpispawn.c:121
> PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file
> pmgr_collective_mpispawn.c:121
> PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file
> pmgr_collective_mpispawn.c:121
> PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file
> pmgr_collective_mpispawn.c:121
> PMGR_COLLECTIVE ERROR: unexpected value: received 0, expecting 7 @ file
> pmgr_collective_mpispawn.c:137
> PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file
> pmgr_collective_mpispawn.c:121
> PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file
> pmgr_collective_mpispawn.c:121
> PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file
> pmgr_collective_mpispawn.c:121
> PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file
> pmgr_collective_mpispawn.c:121
> PMGR_COLLECTIVE ERROR: unexpected value: received 0, expecting 7 @ file
> pmgr_collective_mpispawn.c:137
> PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file
> pmgr_collective_mpispawn.c:121
> PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file
> pmgr_collective_mpispawn.c:121
> PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file
> pmgr_collective_mpispawn.c:121
> PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file
> pmgr_collective_mpispawn.c:121
> PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file
> pmgr_collective_mpispawn.c:121PMGR_COLLECTIVE ERROR: reading from
> (read() Success errno=0) @ file pmgr_collective_mpispawn.c:121
> PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file
> pmgr_collective_mpispawn.c:121
> PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file
> pmgr_collective_mpispawn.c:121
> PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file
> pmgr_collective_mpispawn.c:121
> PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file
> pmgr_collective_mpispawn.c:121
> PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file
> pmgr_collective_mpispawn.c:121
> PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file
> pmgr_collective_mpispawn.c:121
> PMGR_COLLECTIVE ERROR: PMGR_COLLECTIVE ERROR: reading from (read()
> Success errno=0) @ file pmgr_collective_mpispawn.c:121
> PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file
> pmgr_collective_mpispawn.c:121
> PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file
> pmgr_collective_mpispawn.c:121
> reading from (read() Success errno=0) @ file pmgr_collective_mpispawn.c:121
>
> reading from (read() Success errno=0) @ file pmgr_collective_mpispawn.c:121
> reading from (read() Success errno=0) @ file pmgr_collective_mpispawn.c:121
> DONE
> Signal 15 received.
> Signal 15 received.
> Signal 15 received.
> Signal 15 received.
> Signal 15 received.
> Signal 15 received.
> Signal 15 received.
> Signal 15 received.
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>