[mvapich-discuss] slow PMI_Init using mvapich2 with slurm

Sourav Chakraborty chakraborty.52 at buckeyemail.osu.edu
Thu Mar 24 13:40:11 EDT 2016


Hi Dominikus,

It looks like the system is running out of shared memory space with 90+ processes
per node. Can you please try exporting the following parameters:

MV2_SMP_EAGERSIZE=4096 MV2_SMPI_LENGTH_QUEUE=16384 MV2_SMP_NUM_SEND_BUFFER=8
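
For example (assuming a bash-like shell and that srun propagates your
environment to the tasks, which is the default; heat.exe is the executable
from your earlier log):

    export MV2_SMP_EAGERSIZE=4096
    export MV2_SMPI_LENGTH_QUEUE=16384
    export MV2_SMP_NUM_SEND_BUFFER=8
    srun --mpi=pmi2 -N1 -n96 ./heat.exe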

Thanks,
Sourav


On Thu, Mar 24, 2016 at 8:48 AM, Dominikus Heinzeller <climbfuji at ymail.com>
wrote:

> OK, I have re-compiled SLURM 15.08.8 with the PMI extensions and then
> mvapich2-2.2b with --with-pm=slurm --with-pmi=pmi2. The problem remains
> exactly the same: srun --mpi=pmi2 -N1 -n90 works with a startup time of about
> 30s, while with 91+ tasks it aborts with the error message:
>
> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> slurmstepd: error: *** STEP 2433.9 ON keal1 CANCELLED AT
> 2016-03-24T13:46:01 ***
>
> Using the same SLURM version with the same compiler and mpich-3.1.4, I get
> startup times of under 1s (using pmi1).
>
> For the time being, I don’t mind if the startup is slow; I just want to be
> able to increase the timeout limit so that jobs can scale across the entire
> node. PMI1 or PMI2 are both fine.
>
> Cheers, and thanks very much,
>
> Dom
>
> On 24/03/2016, at 8:36 AM, Dominikus Heinzeller <climbfuji at ymail.com>
> wrote:
>
> Hi Sourav,
>
> Thanks for your feedback! I have tried option one with --with-pm=slurm
> --with-pmi=pmi2 and had exactly the same problems: a timeout with 91+ tasks
> on a single node and a very slow startup with 90 tasks or fewer. I am now
> recompiling with --with-pm=slurm --with-pmi=pmi1, but I doubt it will make
> a difference.
>
> Option 2 takes a little more work. One of my colleagues also mentioned
> that the error message talks about SHMEM - maybe something is going on
> there.
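>
> In case it really is a shared-memory limit, a quick check on the node would
> be something along these lines (the tmpfs mount point may differ on this
> system):
>
>     df -h /dev/shm
>     ls -l /dev/shm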
>
> Interestingly, none of those problems occur with mpich-3.1.4.
>
> On 23/03/2016, at 11:03 PM, Sourav Chakraborty <
> chakraborty.52 at buckeyemail.osu.edu> wrote:
>
> Hi Dominikus,
>
> Thanks for your note.
>
> To build MVAPICH2 with SLURM support, you should configure it using
> --with-pm=slurm --with-pmi=pmi1 or --with-pmi=pmi2. Please note that if you
> select pmi2, you'd have to use srun --mpi=pmi2 to launch your program.
> Please refer to
> http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.2a-userguide.html#x1-100004.3.2
> for more details.
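>
> A minimal sketch of that build and launch sequence (the install prefix and
> make options here are just placeholders, adjust to your installation):
>
>     ./configure --with-pm=slurm --with-pmi=pmi2 --prefix=/opt/mvapich2-2.2b
>     make -j 8 && make install
>     srun --mpi=pmi2 -N1 -n96 ./heat.exe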
>
> If you are able to modify your Slurm installation, you can also use the
> extended PMI operations to further speed up the time to launch
> applications. Please refer to
> http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.2a-userguide.html#x1-110004.3.3
> for details on how to use extended PMI for Slurm.
>
> Thanks,
> Sourav
>
>
> On Wed, Mar 23, 2016 at 4:21 PM, Dominikus Heinzeller <climbfuji at ymail.com
> > wrote:
>
>> Hi all,
>>
>> I am having a problem with spawning a large number of MPI tasks on a node.
>> My server consists of 4 sockets x 12 cores per socket x 2 threads per core
>> = 96 logical processors.
>>
>> The slurm.conf contains the following line:
>>
>> NodeName=keal1  Procs=96 SocketsPerBoard=4 CoresPerSocket=12
>> ThreadsPerCore=2 RealMemory=1031770 State=UNKNOWN
>>
>>
>> - The system is a Red Hat SL 7.2 system running
>> kernel 3.10.0-327.10.1.el7.x86_64
>> - slurm 15.08.2 (pre-compiled by the vendor; the system comes with gnu
>> 4.8.5)
>> - MPI library: mvapich2-2.2b compiled with intel-15.0.4 (--with-pm=none,
>> --with-pmi=slurm)
>>
>> srun -N1 -n90 myprogram works, but takes about 30s to get past MPI_Init
>>
>> srun -N1 -n91 myprogram aborts with:
>>
>> srun: defined options for program `srun'
>> srun: --------------- ---------------------
>> srun: user           : `heinzeller-d'
>> srun: uid            : 9528
>> srun: gid            : 945
>> srun: cwd            : /home/heinzeller-d
>> srun: ntasks         : 91 (set)
>> srun: nodes          : 1 (set)
>> srun: jobid          : 1867 (default)
>> srun: partition      : default
>> srun: profile        : `NotSet'
>> srun: job name       : `sh'
>> srun: reservation    : `(null)'
>> srun: burst_buffer   : `(null)'
>> srun: wckey          : `(null)'
>> srun: cpu_freq_min   : 4294967294
>> srun: cpu_freq_max   : 4294967294
>> srun: cpu_freq_gov   : 4294967294
>> srun: switches       : -1
>> srun: wait-for-switches : -1
>> srun: distribution   : unknown
>> srun: cpu_bind       : default
>> srun: mem_bind       : default
>> srun: verbose        : 1
>> srun: slurmd_debug   : 0
>> srun: immediate      : false
>> srun: label output   : false
>> srun: unbuffered IO  : false
>> srun: overcommit     : false
>> srun: threads        : 60
>> srun: checkpoint_dir : /var/slurm/checkpoint
>> srun: wait           : 0
>> srun: account        : (null)
>> srun: comment        : (null)
>> srun: dependency     : (null)
>> srun: exclusive      : false
>> srun: qos            : (null)
>> srun: constraints    :
>> srun: geometry       : (null)
>> srun: reboot         : yes
>> srun: rotate         : no
>> srun: preserve_env   : false
>> srun: network        : (null)
>> srun: propagate      : NONE
>> srun: prolog         : (null)
>> srun: epilog         : (null)
>> srun: mail_type      : NONE
>> srun: mail_user      : (null)
>> srun: task_prolog    : (null)
>> srun: task_epilog    : (null)
>> srun: multi_prog     : no
>> srun: sockets-per-node  : -2
>> srun: cores-per-socket  : -2
>> srun: threads-per-core  : -2
>> srun: ntasks-per-node   : -2
>> srun: ntasks-per-socket : -2
>> srun: ntasks-per-core   : -2
>> srun: plane_size        : 4294967294
>> srun: core-spec         : NA
>> srun: power             :
>> srun: sicp              : 0
>> srun: remote command    : `./heat.exe'
>> srun: launching 1867.7 on host keal1, 91 tasks: [0-90]
>> srun: route default plugin loaded
>> srun: Node keal1, 91 tasks started
>> srun: Sent KVS info to 3 nodes, up to 33 tasks per node
>> In: PMI_Abort(1, Fatal error in MPI_Init:
>> Other MPI error, error stack:
>> MPIR_Init_thread(514)..........:
>> MPID_Init(365).................: channel initialization failed
>> MPIDI_CH3_Init(495)............:
>> MPIDI_CH3I_SHMEM_Helper_fn(908): ftruncate: Invalid argument
>> )
>> srun: Complete job step 1867.7 received
>> slurmstepd: error: *** STEP 1867.7 ON keal1 CANCELLED AT
>> 2016-03-22T17:12:49 ***
>>
>> To me, this looks like a timeout problem at about 30s of PMI_Init time.
>> If so, how can I speed up the init?
>>
>> Timings for different numbers of tasks on that box:
>>
>> srun: Received task exit notification for 90 tasks (status=0x0000).
>> srun: keal1: tasks 0-89: Completed
>> real 0m32.962s
>> user 0m0.022s
>> sys 0m0.032s
>>
>> srun: Received task exit notification for 80 tasks (status=0x0000).
>> srun: keal1: tasks 0-79: Completed
>> real 0m26.755s
>> user 0m0.016s
>> sys 0m0.036s
>>
>> srun: Received task exit notification for 40 tasks (status=0x0000).
>> srun: keal1: tasks 0-39: Completed
>>
>> real 0m12.810s
>> user 0m0.014s
>> sys 0m0.036s
>>
>> On a different node with 2 sockets x 10 cores per socket x 2 threads per
>> core = 40 procs:
>>
>> srun: Received task exit notification for 40 tasks (status=0x0000).
>> srun: kea05: tasks 0-39: Completed
>> real 0m4.949s
>> user 0m0.011s
>> sys 0m0.012s
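>>
>> These numbers are wall-clock times for the whole job step, collected with
>> something along the lines of:
>>
>>     time srun -N1 -n90 ./heat.exe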
>>
>> Using mpich-3.1.4, I get PMI_Init times of less than 1.5s for 96 tasks,
>> all else identical (even the MPI library compile options).
>>
>> Any suggestions which parameters to tweak?
>>
>> Thanks,
>>
>> Dom
>>
>> _______________________________________________
>> mvapich-discuss mailing list
>> mvapich-discuss at cse.ohio-state.edu
>> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>
>>
>
>
>

