[mvapich-discuss] PMGR_COLLECTIVE ERROR - pmgr_collective_mpispawn

Jaidev Sridhar sridharj at cse.ohio-state.edu
Sun Apr 27 22:29:13 EDT 2008


Steve,

On Sunday 27 April 2008 09:26 PM, Steve Jones wrote:
> Hi.
> 
> I'm receiving an error on a number of Intel MPI Benchmark (IMB) jobs 
> that result in a PMGR_COLLECTIVE ERROR, shown below. The failure is 
> intermittent: I'm able to run the benchmark on a large number of nodes, 
> and it seems to fail only on certain sets of nodes. Can you provide more 
> detail on this error?
> 
> I'm using MVAPICH 1.0gen2 OFED 1.2.5 on RHEL4 2.6.9-55.0.12
> The start command is:
> $ mpirun_rsh -np 136 -hostfile $PBS_NODEFILE ./IMB-MPI1
> 
> mpispawn.c:303 Unexpected exit status

This error message indicates that one of the processes either terminated 
unexpectedly or was unable to start. We catch this and kill the other 
processes, which is what caused the later messages.
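
As a rough sketch of how to pull the reporting node out of a saved run: 
the quoted output contains a line of the form "Exit code -1 signaled from 
COMPUTE-1-3", and the helper below lists every host named in such a line. 
The function name failing_hosts and the log file name are made up for 
illustration, and the pattern assumes the message format matches the 
output quoted in this thread.

```shell
# failing_hosts LOGFILE
# Print each host named in an "Exit code N signaled from HOST" line, once.
# Assumption: the launcher output was saved to LOGFILE beforehand, e.g.
#   mpirun_rsh -np 136 -hostfile $PBS_NODEFILE ./IMB-MPI1 > imb.log 2>&1
failing_hosts() {
    # -n with the p flag prints only the lines where the substitution matched.
    sed -n 's/.*Exit code [-0-9]* signaled from \([^ ]*\).*/\1/p' "$1" | sort -u
}
```

Running "failing_hosts imb.log" after a failed job gives a short list of 
nodes to look at first. The node that reports the exit code is only a 
starting point, not proof that the node itself is faulty.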

Do you see a reason why some processes are failing to start? A faulty 
node perhaps? You might want to try narrowing it down to the node(s) 
that are causing this.
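
One way to narrow it down, sketched here under the assumption that the 
compute nodes accept non-interactive ssh logins from the launch host, is 
to run a trivial remote command on every unique host in the hostfile and 
print the hosts where it fails. The function names probe_host and 
find_bad_hosts are hypothetical; replace the body of probe_host with 
whatever check suits your site (rsh, or a small single-node MPI job).

```shell
# probe_host HOST
# Succeeds if a trivial command can be run on HOST.
# BatchMode avoids hanging on a password prompt; ConnectTimeout bounds
# the wait on a node that is down.
probe_host() {
    ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" true
}

# find_bad_hosts HOSTFILE
# Print every unique host in HOSTFILE for which probe_host fails.
find_bad_hosts() {
    sort -u "$1" | while read -r host; do
        # Skip blank lines in the hostfile.
        [ -n "$host" ] || continue
        # Redirect stdin so the probe cannot swallow the rest of the host list.
        probe_host "$host" </dev/null >/dev/null 2>&1 || echo "$host"
    done
}
```

Inside a PBS job this would be called as: find_bad_hosts "$PBS_NODEFILE". 
A node that passes this check can still fail under MPI (for example 
because of an InfiniBand problem), so an empty result only rules out the 
simplest causes.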

If you need any more help, do let us know.

-Jaidev



> Exit code -1 signaled from COMPUTE-1-3
> Killing remote processes...PMGR_COLLECTIVE ERROR: reading from (read() 
> Success errno=0) @ file pmgr_collective_mpispawn.c:121
> PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file 
> pmgr_collective_mpispawn.c:121
> PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file 
> pmgr_collective_mpispawn.c:121
> PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file 
> pmgr_collective_mpispawn.c:121
> PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file 
> pmgr_collective_mpispawn.c:121
> PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file 
> pmgr_collective_mpispawn.c:121
> PMGR_COLLECTIVE ERROR: unexpected value: received 0, expecting 7 @ file 
> pmgr_collective_mpispawn.c:137
> PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file 
> pmgr_collective_mpispawn.c:121
> PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file 
> pmgr_collective_mpispawn.c:121
> PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file 
> pmgr_collective_mpispawn.c:121
> PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file 
> pmgr_collective_mpispawn.c:121
> PMGR_COLLECTIVE ERROR: unexpected value: received 0, expecting 7 @ file 
> pmgr_collective_mpispawn.c:137
> PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file 
> pmgr_collective_mpispawn.c:121
> PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file 
> pmgr_collective_mpispawn.c:121
> PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file 
> pmgr_collective_mpispawn.c:121
> PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file 
> pmgr_collective_mpispawn.c:121
> PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file 
> pmgr_collective_mpispawn.c:121PMGR_COLLECTIVE ERROR: reading from 
> (read() Success errno=0) @ file pmgr_collective_mpispawn.c:121
> PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file 
> pmgr_collective_mpispawn.c:121
> PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file 
> pmgr_collective_mpispawn.c:121
> PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file 
> pmgr_collective_mpispawn.c:121
> PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file 
> pmgr_collective_mpispawn.c:121
> PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file 
> pmgr_collective_mpispawn.c:121
> PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file 
> pmgr_collective_mpispawn.c:121
> PMGR_COLLECTIVE ERROR: PMGR_COLLECTIVE ERROR: reading from (read() 
> Success errno=0) @ file pmgr_collective_mpispawn.c:121
> PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file 
> pmgr_collective_mpispawn.c:121
> PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file 
> pmgr_collective_mpispawn.c:121
> reading from (read() Success errno=0) @ file pmgr_collective_mpispawn.c:121
> 
> reading from (read() Success errno=0) @ file pmgr_collective_mpispawn.c:121
> reading from (read() Success errno=0) @ file pmgr_collective_mpispawn.c:121
> DONE
> Signal 15 received.
> Signal 15 received.
> Signal 15 received.
> Signal 15 received.
> Signal 15 received.
> Signal 15 received.
> Signal 15 received.
> Signal 15 received.
> 


