[mvapich-discuss] mvapich 2.3 and slurm with pmi2

Sourav Chakraborty chakraborty.52 at buckeyemail.osu.edu
Thu Jul 26 13:58:17 EDT 2018


Hi Mathias,

Thanks for providing the information. Can you try setting these two
parameters and see if it fixes the issue?

export MV2_USE_RDMA_CM=0
export MV2_USE_RING_STARTUP=0
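
For example (a minimal sketch, reusing the allocation and binary from the report below), they can also be set just for a single launch, since srun exports the calling environment to the tasks by default:

MV2_USE_RDMA_CM=0 MV2_USE_RING_STARTUP=0 srun -n 112 --mpi=pmi2 ./ProgramBinary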

Thanks,
Sourav


On Wed, Jul 25, 2018 at 8:01 AM Mathias Anselmann <
mathias.anselmann at gmail.com> wrote:

> Hello Sourav,
> I will try to answer your questions as well as I can:
>
> 1.)
> I downloaded OMB and ran the "osu_hello" program. Results:
> MVAPICH 2.2:
> No problem, if I salloc with 224 tasks and run "srun -n112 ./osu_hello"
> the output is:
>
> # OSU MPI Hello World Test v5.4.3
> This is a test with 112 processes
>
> MVAPICH 2.3:
> If I do the same with MVAPICH 2.3, it just hangs, like my own program does.
> The output of pstack <pid> for osu_hello is:
>
> $ pstack 13692
> Thread 5 (Thread 0x2b6c88d27700 (LWP 13696)):
> #0  0x00002b6c88ff6411 in sigwait () from /usr/lib64/libpthread.so.0
> #1  0x00000000005a6ed8 in _srun_signal_mgr (job_ptr=0x151ae10) at
> srun_job.c:1422
> #2  0x00002b6c88feedd5 in start_thread () from /usr/lib64/libpthread.so.0
> #3  0x00002b6c89301b3d in clone () from /usr/lib64/libc.so.6
> Thread 4 (Thread 0x2b6c8a639700 (LWP 13697)):
> #0  0x00002b6c892f6e9d in poll () from /usr/lib64/libc.so.6
> #1  0x000000000042a96a in _poll_internal (pfds=0x2b6c8c0009e0, nfds=2,
> shutdown_time=0) at eio.c:362
> #2  0x000000000042a737 in eio_handle_mainloop (eio=0x2b6c8c0008d0) at
> eio.c:326
> #3  0x00002b6c89c0eb6c in _agent (unused=0x0) at agent.c:327
> #4  0x00002b6c88feedd5 in start_thread () from /usr/lib64/libpthread.so.0
> #5  0x00002b6c89301b3d in clone () from /usr/lib64/libc.so.6
> Thread 3 (Thread 0x2b6c8a73a700 (LWP 13698)):
> #0  0x00002b6c892f6e9d in poll () from /usr/lib64/libc.so.6
> #1  0x000000000042a96a in _poll_internal (pfds=0x2b6c900008d0, nfds=3,
> shutdown_time=0) at eio.c:362
> #2  0x000000000042a737 in eio_handle_mainloop (eio=0x1520880) at eio.c:326
> #3  0x00000000005969b4 in _msg_thr_internal (arg=0x151b200) at
> step_launch.c:1053
> #4  0x00002b6c88feedd5 in start_thread () from /usr/lib64/libpthread.so.0
> #5  0x00002b6c89301b3d in clone () from /usr/lib64/libc.so.6
> Thread 2 (Thread 0x2b6c8a83b700 (LWP 13699)):
> #0  0x00002b6c892f6e9d in poll () from /usr/lib64/libc.so.6
> #1  0x000000000042a96a in _poll_internal (pfds=0x2b6c94000dc0, nfds=6,
> shutdown_time=0) at eio.c:362
> #2  0x000000000042a737 in eio_handle_mainloop (eio=0x1524010) at eio.c:326
> #3  0x0000000000592161 in _io_thr_internal (cio_arg=0x1523d90) at
> step_io.c:810
> #4  0x00002b6c88feedd5 in start_thread () from /usr/lib64/libpthread.so.0
> #5  0x00002b6c89301b3d in clone () from /usr/lib64/libc.so.6
> Thread 1 (Thread 0x2b6c88c24bc0 (LWP 13692)):
> #0  0x00002b6c88ff2945 in pthread_cond_wait@@GLIBC_2.3.2 () from
> /usr/lib64/libpthread.so.0
> #1  0x00000000005954f3 in slurm_step_launch_wait_finish (ctx=0x151b1b0) at
> step_launch.c:627
> #2  0x00002b6c89e282b5 in launch_p_step_wait (job=0x151ae10,
> got_alloc=false) at launch_slurm.c:698
> #3  0x000000000059c763 in launch_g_step_wait (job=0x151ae10,
> got_alloc=false) at launch.c:521
> #4  0x000000000042921a in srun (ac=5, av=0x7ffc6b4b7148) at srun.c:262
> #5  0x0000000000429cfe in main (argc=5, argv=0x7ffc6b4b7148) at
> srun.wrapper.c:17
>
>
> 2.) The system is hyperthreaded. The application doesn't support
> threading. If I start the app with as many processes as there are tasks,
> every processor ends up loaded on both its "real" core and its "virtual"
> (hyperthreaded) core. To avoid this I set -n to half the number of tasks,
> so every processor gets exactly one process and the full load stays on the
> physical core.
>
> 3.) No, the configuration is homogeneous.
>
> I hope that helps debugging.
>
> Greetings,
>
> Mathias
>
> On Tue, Jul 24, 2018 at 20:06 Sourav Chakraborty <
> chakraborty.52 at buckeyemail.osu.edu> wrote:
>
>> Hi Mathias,
>>
>> Thanks for your report. We were not able to reproduce this issue locally.
>> Can you please provide the following information so that we can debug
>> this further?
>>
>> 1. Does the hang happen only for this particular application, or for any
>> application (e.g. OMB)? Can you provide a stack trace of the application by
>> running pstack <pid>?
>>
>> 2. What did you mean by "disable threading"? Is this a hyperthreaded
>> system? Is the application multi-threaded?
>>
>> 3. Does the system have multiple HCAs per node or a heterogeneous
>> configuration (Different CPU/HCA across nodes)?
>>
>> Thanks,
>> Sourav
>>
>>
>>
>> On Tue, Jul 24, 2018 at 9:52 AM Mathias Anselmann <
>> mathias.anselmann at gmail.com> wrote:
>>
>>> Hello,
>>> I just wanted to try out the latest version of MVAPICH on our HPC
>>> system, which runs Red Hat 7.3 with Slurm.
>>> On the machine I have a locally installed GCC 7.3 and compiled MVAPICH
>>> locally with:
>>>
>>> --enable-shared --with-pm=slurm --with-pmi=pmi2
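>>>
>>> (For reference, a complete build along those lines might look roughly like
>>> this; the install prefix and compiler variables are illustrative and not
>>> taken from the original report:
>>>
>>> ./configure --prefix=$HOME/mvapich2-2.3 --enable-shared \
>>>     --with-pm=slurm --with-pmi=pmi2 CC=gcc CXX=g++
>>> make -j
>>> make install
>>> )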
>>>
>>> For testing purposes I start my program via srun (after allocating
>>> some nodes via "salloc").
>>> So, e.g.:
>>>
>>> salloc --ntasks=224 --exclusive bash
>>>
>>> and then:
>>>
>>> srun -n 112 --mpi=pmi2 ./ProgramBinary
>>>
>>> In srun I use half of the --ntasks argument to disable threading.
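>>>
>>> (A minimal sketch of that pattern, assuming a standard Slurm setup where
>>> salloc sets SLURM_NTASKS to the --ntasks value; this is illustrative and
>>> not part of the original report:
>>>
>>> salloc --ntasks=224 --exclusive bash
>>> srun -n $((SLURM_NTASKS / 2)) --mpi=pmi2 ./ProgramBinary
>>> )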
>>>
>>> With MVAPICH 2.2 this works like a charm; I tested it up to
>>> --ntasks=560 and -n 280. My program starts within seconds and runs as
>>> expected.
>>> I used the same configure flags for MVAPICH 2.2 as for 2.3.
>>>
>>> With 2.3 I have the following issue:
>>> srun works fine for smaller jobs, but if I increase the number of nodes it
>>> won't start my program. With --ntasks=168 and -n 84 srun works like a
>>> charm, but if I go for --ntasks=224 and -n 112 srun does not seem to start
>>> the program. No output appears, and I manually cancelled the job after 15
>>> minutes of waiting.
>>>
>>> Does anybody have a clue what's going on here?
>>> If more information is needed, I can of course provide it.
>>>
>>> Greetings,
>>>
>>> Mathias