[mvapich-discuss] slow PMI_Init using mvapich2 with slurm

Sourav Chakraborty chakraborty.52 at buckeyemail.osu.edu
Thu Mar 24 16:09:35 EDT 2016


Hi Dom,

Glad to know that your issue was resolved with these settings.

We are very interested in finding out why you are seeing such slow startup
times. On our local cluster, running osu_init with 96 processes per node
reports ~1s for completion of MPI_Init.

If you could provide the following details about your system, it would
help us identify the root cause.

MVAPICH2 configuration (mpiname -a)
CPU configuration (cat /proc/cpuinfo)
HCA configuration (ibv_devinfo)
Debug output from slurm (srun -vv --slurmd-debug=4 --mpi=pmi2 -N 1 -n 96
./osu_init)

Thanks,
Sourav


On Thu, Mar 24, 2016 at 2:11 PM, Dominikus Heinzeller <climbfuji at ymail.com>
wrote:

> Hi Sourav!
>
> Thank you so much - with these settings, the runs with 96 tasks do
> complete (with a startup time of about 35s, but given that most
> applications run for at least an hour or so, that will be fine for most
> users who don’t want to switch to mpich).
>
> Thanks again,
>
> Dom/
>
> On 24/03/2016, at 6:40 PM, Sourav Chakraborty <
> chakraborty.52 at buckeyemail.osu.edu> wrote:
>
> Hi Dominikus,
>
> Looks like the system is running out of shared memory space with 90+ processes
> per node. Can you please try exporting the following parameters:
>
> MV2_SMP_EAGERSIZE=4096 MV2_SMPI_LENGTH_QUEUE=16384
> MV2_SMP_NUM_SEND_BUFFER=8
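Putting the suggestion together, a minimal launch sketch might look as follows (the tuning values are the ones given above; the binary name `./osu_init` and the task count are illustrative placeholders, and the parameter meanings are paraphrased from the MVAPICH2 user guide):

```shell
# Shrink MVAPICH2's intra-node shared-memory footprint so 90+ processes
# fit on a single node (values as suggested above).
export MV2_SMP_EAGERSIZE=4096       # intra-node eager-message size limit
export MV2_SMPI_LENGTH_QUEUE=16384  # per-pair shared-memory queue size
export MV2_SMP_NUM_SEND_BUFFER=8    # number of internal send buffers

# Then launch through Slurm's PMI2 interface, e.g.:
#   srun --mpi=pmi2 -N 1 -n 96 ./osu_init
```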
>
> Thanks,
> Sourav
>
>
> On Thu, Mar 24, 2016 at 8:48 AM, Dominikus Heinzeller <climbfuji at ymail.com
> > wrote:
>
>> Ok, I have re-compiled SLURM 15.08.8 with the PMI extensions and then
>> mvapich2-2.2b with --with-pm=slurm --with-pmi=pmi2. The problem remains
>> exactly the same: srun --mpi=pmi2 -N1 -n90 works with a startup time of
>> about 30s, while with 91+ tasks it aborts with the error message:
>>
>> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
>> slurmstepd: error: *** STEP 2433.9 ON keal1 CANCELLED AT
>> 2016-03-24T13:46:01 ***
>>
>> Using the same SLURM version with the same compiler and mpich-3.1.4, I
>> get startup times of under 1s (using pmi1).
>>
>> For the time being, I don’t mind if the startup is slow, I just want to
>> be able to increase the timeout limit so that jobs can scale across the
>> entire node. PMI1 or 2 are both fine.
>>
>> Cheers, and thanks very much,
>>
>> Dom
>>
>> On 24/03/2016, at 8:36 AM, Dominikus Heinzeller <climbfuji at ymail.com>
>> wrote:
>>
>> Hi Sourav,
>>
>> Thanks for your feedback! I have tried option one with --with-pm=slurm
>> --with-pmi=pmi2 and had exactly the same problems. Timeout with 91+ tasks
>> on a single node, very slow startup with 90 tasks or less. I am now trying
>> to recompile with --with-pm=slurm --with-pmi=pmi1 but I doubt there will be
>> a difference.
>>
>> Option 2 takes a little more work. One of my colleagues also mentioned
>> that the error message talks about SHMEM - maybe something is going on
>> there.
>>
>> Interestingly, none of those problems occur with mpich-3.1.4.
>>
>> On 23/03/2016, at 11:03 PM, Sourav Chakraborty <
>> chakraborty.52 at buckeyemail.osu.edu> wrote:
>>
>> Hi Dominikus,
>>
>> Thanks for your note.
>>
>> To build MVAPICH2 with SLURM support, you should configure it using
>> --with-pm=slurm --with-pmi=pmi1 or --with-pmi=pmi2. Please note that if you
>> select pmi2, you'd have to use srun --mpi=pmi2 to launch your program.
>> Please refer to
>> http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.2a-userguide.html#x1-100004.3.2
>> for more details.
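>>
A build along those lines might look like the following sketch (the install path `$PREFIX` and the make parallelism are placeholders, not part of the thread):

```shell
# Configure MVAPICH2 to use Slurm as the process manager with the PMI2
# startup interface; $PREFIX is a placeholder install path.
./configure --prefix=$PREFIX \
            --with-pm=slurm \
            --with-pmi=pmi2
make -j 8
make install

# Jobs must then be started with the matching srun plugin, e.g.:
#   srun --mpi=pmi2 -N 1 -n 96 ./a.out
```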
>>
>> If you are able to modify your Slurm installation, you can also use the
>> extended PMI operations to further speed up the time to launch
>> applications. Please refer to
>> http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.2a-userguide.html#x1-110004.3.3
>> for details on how to use extended PMI for Slurm.
>>
>> Thanks,
>> Sourav
>>
>>
>> On Wed, Mar 23, 2016 at 4:21 PM, Dominikus Heinzeller <
>> climbfuji at ymail.com> wrote:
>>
>>> Hi all,
>>>
>>> I am having a problem with spawning a large number of MPI tasks on a node.
>>> My server consists of 4 sockets x 12 cores per socket x 2 threads per core
>>> = 96 procs.
>>>
>>> The slurm.conf contains the following line:
>>>
>>> NodeName=keal1  Procs=96 SocketsPerBoard=4 CoresPerSocket=12
>>> ThreadsPerCore=2 RealMemory=1031770 State=UNKNOWN
>>>
>>>
>>> - The system is a Redhat SL 7.2 system running
>>> kernel 3.10.0-327.10.1.el7.x86_64
>>> - slurm 15.08.2 (pre-compiled by the vendor, the system comes with gnu
>>> 4.8.5)
>>> - mpi library:  mvapich2-2.2b compiled with intel-15.0.4 (with-pm=none,
>>> with-pmi=slurm)
>>>
>>> srun -N1 -n90 myprogram works, but takes about 30s to get past MPI_Init
>>>
>>> srun -N1 -n91 myprogram aborts with:
>>>
>>> srun: defined options for program `srun'
>>> srun: --------------- ---------------------
>>> srun: user           : `heinzeller-d'
>>> srun: uid            : 9528
>>> srun: gid            : 945
>>> srun: cwd            : /home/heinzeller-d
>>> srun: ntasks         : 91 (set)
>>> srun: nodes          : 1 (set)
>>> srun: jobid          : 1867 (default)
>>> srun: partition      : default
>>> srun: profile        : `NotSet'
>>> srun: job name       : `sh'
>>> srun: reservation    : `(null)'
>>> srun: burst_buffer   : `(null)'
>>> srun: wckey          : `(null)'
>>> srun: cpu_freq_min   : 4294967294
>>> srun: cpu_freq_max   : 4294967294
>>> srun: cpu_freq_gov   : 4294967294
>>> srun: switches       : -1
>>> srun: wait-for-switches : -1
>>> srun: distribution   : unknown
>>> srun: cpu_bind       : default
>>> srun: mem_bind       : default
>>> srun: verbose        : 1
>>> srun: slurmd_debug   : 0
>>> srun: immediate      : false
>>> srun: label output   : false
>>> srun: unbuffered IO  : false
>>> srun: overcommit     : false
>>> srun: threads        : 60
>>> srun: checkpoint_dir : /var/slurm/checkpoint
>>> srun: wait           : 0
>>> srun: account        : (null)
>>> srun: comment        : (null)
>>> srun: dependency     : (null)
>>> srun: exclusive      : false
>>> srun: qos            : (null)
>>> srun: constraints    :
>>> srun: geometry       : (null)
>>> srun: reboot         : yes
>>> srun: rotate         : no
>>> srun: preserve_env   : false
>>> srun: network        : (null)
>>> srun: propagate      : NONE
>>> srun: prolog         : (null)
>>> srun: epilog         : (null)
>>> srun: mail_type      : NONE
>>> srun: mail_user      : (null)
>>> srun: task_prolog    : (null)
>>> srun: task_epilog    : (null)
>>> srun: multi_prog     : no
>>> srun: sockets-per-node  : -2
>>> srun: cores-per-socket  : -2
>>> srun: threads-per-core  : -2
>>> srun: ntasks-per-node   : -2
>>> srun: ntasks-per-socket : -2
>>> srun: ntasks-per-core   : -2
>>> srun: plane_size        : 4294967294
>>> srun: core-spec         : NA
>>> srun: power             :
>>> srun: sicp              : 0
>>> srun: remote command    : `./heat.exe'
>>> srun: launching 1867.7 on host keal1, 91 tasks: [0-90]
>>> srun: route default plugin loaded
>>> srun: Node keal1, 91 tasks started
>>> srun: Sent KVS info to 3 nodes, up to 33 tasks per node
>>> In: PMI_Abort(1, Fatal error in MPI_Init:
>>> Other MPI error, error stack:
>>> MPIR_Init_thread(514)..........:
>>> MPID_Init(365).................: channel initialization failed
>>> MPIDI_CH3_Init(495)............:
>>> MPIDI_CH3I_SHMEM_Helper_fn(908): ftruncate: Invalid argument
>>> )
>>> srun: Complete job step 1867.7 received
>>> slurmstepd: error: *** STEP 1867.7 ON keal1 CANCELLED AT
>>> 2016-03-22T17:12:49 ***
>>>
>>> To me, this looks like a timeout problem at about 30s of PMI_Init time.
>>> If so, how can I speed up the init?
>>>
>>> Timing for different number of tasks on that box:
>>>
>>> srun: Received task exit notification for 90 tasks (status=0x0000).
>>> srun: keal1: tasks 0-89: Completed
>>> real 0m32.962s
>>> user 0m0.022s
>>> sys 0m0.032s
>>>
>>> srun: Received task exit notification for 80 tasks (status=0x0000).
>>> srun: keal1: tasks 0-79: Completed
>>> real 0m26.755s
>>> user 0m0.016s
>>> sys 0m0.036s
>>>
>>> srun: Received task exit notification for 40 tasks (status=0x0000).
>>> srun: keal1: tasks 0-39: Completed
>>>
>>> real 0m12.810s
>>> user 0m0.014s
>>> sys 0m0.036s
>>>
>>> On a different node with 2 sockets x 10 cores per socket x 2 threads per
>>> core = 40 procs:
>>>
>>> srun: Received task exit notification for 40 tasks (status=0x0000).
>>> srun: kea05: tasks 0-39: Completed
>>> real 0m4.949s
>>> user 0m0.011s
>>> sys 0m0.012s
>>>
>>> Using mpich-3.1.4, I get PMI_Init times of less than 1.5s for 96 tasks -
>>> all else identical (even the mpi library compile options).
>>>
>>> Any suggestions which parameters to tweak?
>>>
>>> Thanks,
>>>
>>> Dom
>>>
>>> _______________________________________________
>>> mvapich-discuss mailing list
>>> mvapich-discuss at cse.ohio-state.edu
>>> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>>
>>>
>>
>>
>>
>
>
