[mvapich-discuss] mvapich 2.3 and slurm with pmi2

Mathias Anselmann mathias.anselmann at gmail.com
Tue Jul 24 09:51:36 EDT 2018


Hello,
I just wanted to try out the latest version of mvapich on our HPC cluster, which runs
Red Hat 7.3 with Slurm.
On the machine I have installed GCC 7.3 locally and compiled mvapich 2.3
(also locally) with:

--enable-shared --with-pm=slurm --with-pmi=pmi2
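
For reference, the build was just the stock configure/make sequence, roughly like this (the install prefix and compiler variables below are placeholders for my local setup):

./configure --prefix=$HOME/sw/mvapich2-2.3 \
    CC=gcc CXX=g++ FC=gfortran \
    --enable-shared --with-pm=slurm --with-pmi=pmi2
make -j 8
make install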

For testing purposes I start my program via srun (after allocating some
nodes via "salloc").
So, e.g.:

salloc --ntasks=224 --exclusive bash

and then:

srun -n 112 --mpi=pmi2 ./ProgramBinary

In srun I use half of the --ntasks value so that hyperthreading is not used.
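
In case it helps with reproducing this without my application, a minimal MPI hello world launched the same way should hit the same startup path (just a sketch, not my actual program):

cat > mpi_hello.c <<'EOF'
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    /* Plain init/rank/size/print, nothing application-specific */
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
EOF
mpicc mpi_hello.c -o mpi_hello
srun -n 112 --mpi=pmi2 ./mpi_hello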

With mvapich 2.2 this works like a charm: I tested it with up to
--ntasks=560 and -n 280, and my program starts within seconds and runs as
expected.
I used the same configure flags for mvapich 2.2 as I am using for 2.3 now.

With 2.3 I have the following issue:
srun works fine at smaller scale, but if I increase the number of nodes it
won't start my program.
With --ntasks=168 and -n 84 srun works like a charm, but with
--ntasks=224 and -n 112 srun does not seem to start the program at all. No
output appears, and I manually cancelled the job after 15 minutes of waiting.

Does anybody have a clue what's going on here?
If more info is needed, I can of course provide it.

Greetings,

Mathias