[mvapich-discuss] MPI code failure - help to diagnose

Hari Subramoni subramoni.1 at osu.edu
Wed Oct 19 15:48:47 EDT 2016


Hi Vladimir,

Did you configure MVAPICH2 with SLURM support? Please refer to the
following section of the MVAPICH2 userguide for information on how to do
this.

http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.2-userguide.html#x1-100004.3.2
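
For example, a build with SLURM integration is typically configured along
these lines (the install prefix here is just a placeholder; please check the
userguide section above for the exact options for your setup):

  ./configure --with-pm=slurm --with-pmi=pmi2 --prefix=/opt/mvapich2-2.2
  make && make install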

Can you please send the output of mpiname -a? This will tell us how
MVAPICH2 was built.

To diagnose runtime issues and obtain a backtrace, you can rerun your program
with MV2_DEBUG_SHOW_BACKTRACE=2 set in the environment via the export command.
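
For example, with the failing case from your report (srun propagates the
exported environment to the tasks by default):

  export MV2_DEBUG_SHOW_BACKTRACE=2
  srun --mpi=pmi2 -n 17 ./a.out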

If MVAPICH2 was compiled with debugging options, this will produce a detailed
backtrace; otherwise the backtrace may be limited. To enable debugging
support, add "--enable-g=gdb --enable-fast=none" to the MVAPICH2 configure
line.
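
Combined with the SLURM options above, the configure step would then look
roughly like this (the prefix is again a placeholder):

  ./configure --with-pm=slurm --with-pmi=pmi2 \
              --enable-g=gdb --enable-fast=none \
              --prefix=/opt/mvapich2-2.2-debug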

Please refer to the following section of the MVAPICH2 userguide for
information on how to do this.

http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.2-userguide.html#x1-1310009.1.14

Regards,
Hari.

On Wed, Oct 19, 2016 at 2:10 PM, Vladimir Florinski <vaf0001 at uah.edu> wrote:

> On deploying MVAPICH2 version 2.2 on our cluster we found that MPI codes
> refuse to run across multiple nodes. The installation uses SLURM as the
> process manager. Non-MPI codes run OK across any number of nodes. MPI codes
> run OK on a single node using any number of locally available cores (16 in
> this case). However, MPI codes fail on more than one node. For example:
>
> srun --mpi=pmi2 -n 16 ./a.out            runs OK
> srun --mpi=pmi2 -n 17 ./a.out            fails
>
> (the code consists of MPI_Init() and MPI_Finalize() only). The message is
> very generic, so it offers little help:
>
> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> slurmstepd: error: *** STEP 59.0 ON node62 CANCELLED AT
> 2016-10-19T12:49:10 ***
> srun: error: node63: tasks 9-16: Exited with exit code 1
> srun: error: node62: tasks 0-8: Killed
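>
> For reference, the test program is nothing more than the MPI boilerplate,
> along these lines:
>
>   #include <mpi.h>
>
>   int main(int argc, char **argv)
>   {
>       MPI_Init(&argc, &argv);   /* start up MPI */
>       MPI_Finalize();           /* shut down MPI */
>       return 0;
>   }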
>
> The facts seem to eliminate a SLURM error and point to an issue with
> InfiniBand. That part has been tested thoroughly and all diagnostics
> completed without errors. There is no firewall running. I am rather out of
> ideas at this point and would welcome advice on troubleshooting the problem.
>
> Thanks,
>
> --
> Vladimir Florinski
>