[mvapich-discuss] PMGR_COLLECTIVE ERROR - pmgr_collective_mpispawn

Steve Jones stevejones at stanford.edu
Sun Apr 27 22:42:17 EDT 2008


>> I'm receiving an error on a number of Intel MPI Benchmark (IMB)
>> jobs that result in a PMGR_COLLECTIVE ERROR, shown below. The
>> failure is intermittent: I'm able to run the benchmark on a large
>> number of nodes, and it seems to error only on certain sets of
>> nodes. Can you provide more detail on this error?
>>
>> I'm using MVAPICH 1.0gen2 with OFED 1.2.5 on RHEL4 (kernel 2.6.9-55.0.12).
>> The start command is:
>> $ mpirun_rsh -np 136 -hostfile $PBS_NODEFILE ./IMB-MPI1
>>
>> mpispawn.c:303 Unexpected exit status
>
> This error message indicates that one of the processes terminated or
> was unable to start for some reason. We catch this and kill the other
> processes, which is what caused the later messages.
>
> Do you see a reason why some processes are failing to start? A faulty
> node perhaps? You might want to try narrowing it down to the node(s)
> that are causing this.
>
> If you need any more help, do let us know.
>
> -Jaidev

Hi Jaidev.

This makes sense: I've been able to locate a few nodes with
mismatched firmware. The job error rate has already decreased, and
I'm tracking down the remaining problem nodes.
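
For anyone else narrowing down bad nodes, here is a rough sketch of one
way to do it: pair each node from the PBS node file with a single
reference node and run a short two-process job on each pair. This is
only an illustration, not something from the MVAPICH tools. The helper
names (unique_hosts, build_command, run_ok, find_suspect_nodes), the
choice of the IMB PingPong benchmark, and the 300-second timeout are
all assumptions, and it presumes the reference node itself is healthy
(if it is not, every pair will fail).

```python
import subprocess


def unique_hosts(nodefile_lines):
    """Return host names in first-seen order.

    A PBS node file typically lists a host once per slot, so duplicates
    are dropped here.
    """
    seen = []
    for line in nodefile_lines:
        host = line.strip()
        if host and host not in seen:
            seen.append(host)
    return seen


def build_command(reference, host, exe="./IMB-MPI1", benchmark="PingPong"):
    """Build a two-process mpirun_rsh command for one pair of hosts.

    The executable and benchmark names are illustrative defaults.
    """
    return ["mpirun_rsh", "-np", "2", reference, host, exe, benchmark]


def run_ok(cmd, timeout=300):
    """Run cmd and report True only if it exits with status 0 in time."""
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def find_suspect_nodes(hosts, reference, runner=run_ok):
    """Return the hosts whose pairwise job with the reference node failed.

    `runner` is injectable so the logic can be exercised without a cluster.
    """
    suspects = []
    for host in hosts:
        if host == reference:
            continue
        if not runner(build_command(reference, host)):
            suspects.append(host)
    return suspects
```

Inside a job, one would read the file named by $PBS_NODEFILE, pass its
lines to unique_hosts(), pick a node believed to be good as the
reference, and print whatever find_suspect_nodes() returns. A failing
pair only says that something is wrong with that node or its link; the
firmware still has to be checked by hand.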

Thanks again for the sanity check.

Steve

