[mvapich-discuss] slow PMI_Init using mvapich2 with slurm

Dominikus Heinzeller climbfuji at ymail.com
Thu Mar 24 08:48:27 EDT 2016


Ok, I have re-compiled SLURM 15.08.8 with the PMI extensions and then mvapich2-2.2b with --with-pm=slurm --with-pmi=pmi2. The problem remains exactly the same: srun --mpi=pmi2 -N1 -n90 works with a startup time of about 30s, but with 91+ tasks it aborts with the error message:

srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 2433.9 ON keal1 CANCELLED AT 2016-03-24T13:46:01 ***
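
For completeness, the build and launch sequence was along these lines (a sketch only — the install prefix and program name are placeholders, and the PMI2 client library comes from SLURM's contribs tree as described in the MVAPICH2 user guide):

```shell
# 1. Build SLURM 15.08.8 and its PMI2 client library (contribs/pmi2).
cd slurm-15.08.8
./configure --prefix=/opt/slurm
make && make install
(cd contribs/pmi2 && make && make install)

# 2. Build MVAPICH2 2.2b against SLURM's PMI2 headers and library.
cd ../mvapich2-2.2b
./configure --with-pm=slurm --with-pmi=pmi2 \
    CFLAGS=-I/opt/slurm/include LDFLAGS=-L/opt/slurm/lib
make && make install

# 3. Launch through srun, selecting the PMI2 plugin explicitly.
srun --mpi=pmi2 -N1 -n90 ./myprogram
```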

Using the same SLURM version with the same compiler and mpich-3.1.4, I get startup times of under 1s (using pmi1).

For the time being, I don’t mind if the startup is slow; I just want to be able to increase the timeout limit so that jobs can scale across the entire node. PMI1 or PMI2 are both fine.
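
To locate where the ~30s wall is hit, a simple sweep over task counts can be used (a sketch — ./myprogram is a placeholder, and this obviously needs the Slurm cluster itself to run):

```shell
# Time end-to-end startup + run as the per-node task count grows;
# on this box the abort appears between 90 and 91 tasks.
for n in 40 80 90 91 96; do
    echo "== $n tasks =="
    time srun --mpi=pmi2 -N1 -n"$n" ./myprogram
done
```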

Cheers, and thanks very much,

Dom

> On 24/03/2016, at 8:36 AM, Dominikus Heinzeller <climbfuji at ymail.com> wrote:
> 
> Hi Sourav,
> 
> thanks for your feedback! I have tried option one with --with-pm=slurm --with-pmi=pmi2 and had exactly the same problems. Timeout with 91+ tasks on a single node, very slow startup with 90 tasks or less. I am now trying to recompile with --with-pm=slurm --with-pmi=pmi1 but I doubt there will be a difference.
> 
> Option 2 takes a little more work. One of my colleagues also mentioned that the error message talks about SHMEM - maybe something is going on there.
> 
> Interestingly, none of those problems occur with mpich-3.1.4.
> 
>> On 23/03/2016, at 11:03 PM, Sourav Chakraborty <chakraborty.52 at buckeyemail.osu.edu> wrote:
>> 
>> Hi Dominikus,
>> 
>> Thanks for your note.
>> 
>> To build MVAPICH2 with SLURM support, you should configure it using --with-pm=slurm --with-pmi=pmi1 or --with-pmi=pmi2. Please note that if you select pmi2, you'd have to use srun --mpi=pmi2 to launch your program. Please refer to http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.2a-userguide.html#x1-100004.3.2 for more details.
>> 
>> If you are able to modify your Slurm installation, you can also use the extended PMI operations to further speed up the time to launch applications. Please refer to http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.2a-userguide.html#x1-110004.3.3 for details on how to use extended PMI for Slurm.
>> 
>> Thanks,
>> Sourav
>> 
>> 
>> On Wed, Mar 23, 2016 at 4:21 PM, Dominikus Heinzeller <climbfuji at ymail.com> wrote:
>> Hi all,
>> 
>> I am having a problem spawning a large number of MPI tasks on a node. My server consists of 4 sockets x 12 cores per socket x 2 threads per core = 96 logical processors.
>> 
>> The slurm.conf contains the following line:
>> 
>> NodeName=keal1  Procs=96 SocketsPerBoard=4 CoresPerSocket=12 ThreadsPerCore=2 RealMemory=1031770 State=UNKNOWN
>> 
>> 
>> - The system is a Redhat SL 7.2 system running kernel 3.10.0-327.10.1.el7.x86_64
>> - slurm 15.08.2 (pre-compiled by the vendor, the system comes with gnu 4.8.5)
>> - mpi library:  mvapich2-2.2b compiled with intel-15.0.4 (--with-pm=none, --with-pmi=slurm)
>> 
>> srun -N1 -n90 myprogram works, but takes about 30s to get past MPI_Init
>> 
>> srun -N1 -n91 myprogram aborts with:
>> 
>> srun: defined options for program `srun'
>> srun: --------------- ---------------------
>> srun: user           : `heinzeller-d'
>> srun: uid            : 9528
>> srun: gid            : 945
>> srun: cwd            : /home/heinzeller-d
>> srun: ntasks         : 91 (set)
>> srun: nodes          : 1 (set)
>> srun: jobid          : 1867 (default)
>> srun: partition      : default
>> srun: profile        : `NotSet'
>> srun: job name       : `sh'
>> srun: reservation    : `(null)'
>> srun: burst_buffer   : `(null)'
>> srun: wckey          : `(null)'
>> srun: cpu_freq_min   : 4294967294
>> srun: cpu_freq_max   : 4294967294
>> srun: cpu_freq_gov   : 4294967294
>> srun: switches       : -1
>> srun: wait-for-switches : -1
>> srun: distribution   : unknown
>> srun: cpu_bind       : default
>> srun: mem_bind       : default
>> srun: verbose        : 1
>> srun: slurmd_debug   : 0
>> srun: immediate      : false
>> srun: label output   : false
>> srun: unbuffered IO  : false
>> srun: overcommit     : false
>> srun: threads        : 60
>> srun: checkpoint_dir : /var/slurm/checkpoint
>> srun: wait           : 0
>> srun: account        : (null)
>> srun: comment        : (null)
>> srun: dependency     : (null)
>> srun: exclusive      : false
>> srun: qos            : (null)
>> srun: constraints    :
>> srun: geometry       : (null)
>> srun: reboot         : yes
>> srun: rotate         : no
>> srun: preserve_env   : false
>> srun: network        : (null)
>> srun: propagate      : NONE
>> srun: prolog         : (null)
>> srun: epilog         : (null)
>> srun: mail_type      : NONE
>> srun: mail_user      : (null)
>> srun: task_prolog    : (null)
>> srun: task_epilog    : (null)
>> srun: multi_prog     : no
>> srun: sockets-per-node  : -2
>> srun: cores-per-socket  : -2
>> srun: threads-per-core  : -2
>> srun: ntasks-per-node   : -2
>> srun: ntasks-per-socket : -2
>> srun: ntasks-per-core   : -2
>> srun: plane_size        : 4294967294
>> srun: core-spec         : NA
>> srun: power             :
>> srun: sicp              : 0
>> srun: remote command    : `./heat.exe'
>> srun: launching 1867.7 on host keal1, 91 tasks: [0-90]
>> srun: route default plugin loaded
>> srun: Node keal1, 91 tasks started
>> srun: Sent KVS info to 3 nodes, up to 33 tasks per node
>> In: PMI_Abort(1, Fatal error in MPI_Init:
>> Other MPI error, error stack:
>> MPIR_Init_thread(514)..........:
>> MPID_Init(365).................: channel initialization failed
>> MPIDI_CH3_Init(495)............:
>> MPIDI_CH3I_SHMEM_Helper_fn(908): ftruncate: Invalid argument
>> )
>> srun: Complete job step 1867.7 received
>> slurmstepd: error: *** STEP 1867.7 ON keal1 CANCELLED AT 2016-03-22T17:12:49 ***
>> 
>> To me, this looks like a timeout at about 30s of PMI_Init time. If so, how can I speed up the init?
>> 
>> Timing for different number of tasks on that box:
>> 
>> srun: Received task exit notification for 90 tasks (status=0x0000).
>> srun: keal1: tasks 0-89: Completed
>> real	0m32.962s
>> user	0m0.022s
>> sys	0m0.032s
>> 
>> srun: Received task exit notification for 80 tasks (status=0x0000).
>> srun: keal1: tasks 0-79: Completed
>> real	0m26.755s
>> user	0m0.016s
>> sys	0m0.036s
>> 
>> srun: Received task exit notification for 40 tasks (status=0x0000).
>> srun: keal1: tasks 0-39: Completed
>> 
>> real	0m12.810s
>> user	0m0.014s
>> sys	0m0.036s
>> 
>> On a different node with 2 sockets x 10 cores per socket x 2 threads per core = 40 procs:
>> 
>> srun: Received task exit notification for 40 tasks (status=0x0000).
>> srun: kea05: tasks 0-39: Completed
>> real	0m4.949s
>> user	0m0.011s
>> sys	0m0.012s
>> 
>> Using mpich-3.1.4, I get PMI_Init times of less than 1.5s for 96 tasks - all else identical (even the mpi library compile options).
>> 
>> Any suggestions which parameters to tweak?
>> 
>> Thanks,
>> 
>> Dom
>> 
>> _______________________________________________
>> mvapich-discuss mailing list
>> mvapich-discuss at cse.ohio-state.edu
>> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>> 
>> 
> 
