[mvapich-discuss] MPI code failure - help to diagnose

Vladimir Florinski vaf0001 at uah.edu
Wed Oct 19 14:10:43 EDT 2016


After deploying mvapich2 version 2.2 on our cluster, we found that MPI codes
refuse to run across multiple nodes. The installation uses slurm as the
process manager. Non-MPI codes run fine across any number of nodes, and MPI
codes run fine on a single node using any number of locally available cores
(16 in this case). However, MPI codes fail as soon as they span more than one
node. For example:

srun --mpi=pmi2 -n 16 ./a.out            runs OK
srun --mpi=pmi2 -n 17 ./a.out            fails

(the test code consists of MPI_Init() and MPI_Finalize() only; it is
reproduced in full after the log below). The error message is very generic,
so it offers little help:

srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 59.0 ON node62 CANCELLED AT 2016-10-19T12:49:10 ***
srun: error: node63: tasks 9-16: Exited with exit code 1
srun: error: node62: tasks 0-8: Killed
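
For completeness, here is the entire test program (a minimal sketch; apart
from the header and the return statement it is nothing but MPI_Init() and
MPI_Finalize()):

#include <mpi.h>

int main(int argc, char **argv)
{
    /* Nothing else happens between init and finalize; the job step
       is killed even with this empty program once it spans two nodes. */
    MPI_Init(&argc, &argv);
    MPI_Finalize();
    return 0;
}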

These facts seem to rule out a slurm problem and point instead to an issue
with InfiniBand. The fabric itself has been tested thoroughly and all
diagnostics completed without errors, and there is no firewall running. I am
rather out of ideas at this point and would welcome advice on troubleshooting
the problem.

Thanks,

-- 
Vladimir Florinski