[mvapich-discuss] mvapich 2.3 and slurm with pmi2

Sourav Chakraborty chakraborty.52 at buckeyemail.osu.edu
Tue Jul 24 14:06:14 EDT 2018


Hi Mathias,

Thanks for your report. We were not able to reproduce this issue locally.
Could you please provide the following information so that we can debug
this further?

1. Does the hang happen only for this particular application, or for any
application (e.g., OMB)? Can you provide a stack trace of the application
by running pstack <pid>? (A rough sketch follows after question 3.)

2. What did you mean by "disable threading"? Is this a hyperthreaded
system? Is the application multi-threaded?

3. Does the system have multiple HCAs per node, or a heterogeneous
configuration (different CPUs/HCAs across nodes)?
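
A rough sketch of how the first two items could be checked on one of the
affected compute nodes (the binary name ProgramBinary and the exact grep
pattern are assumptions on my side; adjust as needed):

# 1: dump a stack trace of every rank of the hung job on this node
for pid in $(pgrep -f ProgramBinary); do
    echo "=== PID $pid ==="
    pstack "$pid"
done

# 2: check whether the node exposes hyperthreads
lscpu | grep -E 'Thread\(s\) per core|Core\(s\) per socket'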

Thanks,
Sourav



On Tue, Jul 24, 2018 at 9:52 AM Mathias Anselmann <
mathias.anselmann at gmail.com> wrote:

> Hello,
> I just wanted to try out the latest version of mvapich on our HPC
> system, which runs RedHat 7.3 with Slurm.
> On the machine I have a locally installed GCC 7.3, with which I
> compiled mvapich using:
>
> --enable-shared --with-pm=slurm --with-pmi=pmi2
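>
> For completeness, the full build was roughly as follows (the --prefix
> path is only a placeholder for my local install directory, not the
> exact one I used):
>
> ./configure --prefix=$HOME/sw/mvapich2-2.3 \
>     --enable-shared --with-pm=slurm --with-pmi=pmi2
> make -j && make install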
>
> For testing purposes I start my program via srun (after allocating
> some nodes via "salloc").
> So, e.g.:
>
> salloc --ntasks=224 --exclusive bash
>
> and then:
>
> srun -n 112 --mpi=pmi2 ./ProgramBinary
>
> In srun I use half of the --ntasks value to disable threading.
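> (The same effect could presumably also be achieved with srun's
> --hint=nomultithread option instead of halving the task count, e.g.
>
> srun -n 112 --hint=nomultithread --mpi=pmi2 ./ProgramBinary
>
> but halving --ntasks was the simplest approach for me.)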
>
> With mvapich 2.2 this works like a charm; I tested it up to
> --ntasks=560 and -n 280. My program starts within seconds and runs as
> expected.
> I used the same configure flags for mvapich 2.2 as for 2.3 now.
>
> With 2.3 I have the following issue:
> srun works fine at smaller scales, but if I increase the node count it
> won't start my program. With --ntasks=168 and -n 84 srun works like a
> charm, but if I go to --ntasks=224 and -n 112 srun does not seem to
> start the program. There is no output, and I manually cancelled the job
> after 15 minutes of waiting.
>
> Does anybody have a clue what's going on here?
> If more info is needed, I can of course provide it.
>
> Greetings,
>
> Mathias
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>