[mvapich-discuss] slow PMI_Init using mvapich2 with slurm

Dominikus Heinzeller climbfuji at ymail.com
Thu Mar 24 03:36:46 EDT 2016


Hi Sourav,

Thanks for your feedback! I have tried option one with --with-pm=slurm --with-pmi=pmi2 and had exactly the same problems: a timeout with 91+ tasks on a single node and a very slow startup with 90 tasks or fewer. I am now recompiling with --with-pm=slurm --with-pmi=pmi1, but I doubt it will make a difference.
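
For completeness, the two builds look roughly like this (a sketch only; the install prefixes and compiler names below are placeholders, not our exact settings):

  # option one: PMI2 (must be launched with srun --mpi=pmi2)
  ./configure CC=icc CXX=icpc FC=ifort \
      --prefix=/opt/mvapich2-2.2b-pmi2 \
      --with-pm=slurm --with-pmi=pmi2
  make -j && make install

  # the PMI1 build I am compiling now
  ./configure CC=icc CXX=icpc FC=ifort \
      --prefix=/opt/mvapich2-2.2b-pmi1 \
      --with-pm=slurm --with-pmi=pmi1
  make -j && make install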

Option 2 takes a little more work. One of my colleagues also pointed out that the error message mentions SHMEM - maybe something is going wrong there.

Interestingly, none of those problems occur with mpich-3.1.4.

> On 23/03/2016, at 11:03 PM, Sourav Chakraborty <chakraborty.52 at buckeyemail.osu.edu> wrote:
> 
> Hi Dominikus,
> 
> Thanks for your note.
> 
> To build MVAPICH2 with SLURM support, you should configure it using --with-pm=slurm --with-pmi=pmi1 or --with-pmi=pmi2. Please note that if you select pmi2, you'd have to use srun --mpi=pmi2 to launch your program. Please refer to http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.2a-userguide.html#x1-100004.3.2 for more details.
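> 
> As a minimal sketch (the binary name is just a placeholder):
> 
>   # built with --with-pm=slurm --with-pmi=pmi1: a plain srun is enough
>   srun -N1 -n96 ./a.out
> 
>   # built with --with-pm=slurm --with-pmi=pmi2: request the PMI2 plugin explicitly
>   srun --mpi=pmi2 -N1 -n96 ./a.out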
> 
> If you are able to modify your Slurm installation, you can also use the extended PMI operations to further reduce application launch time. Please refer to http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.2a-userguide.html#x1-110004.3.3 for details on how to use extended PMI with Slurm.
> 
> Thanks,
> Sourav
> 
> 
> On Wed, Mar 23, 2016 at 4:21 PM, Dominikus Heinzeller <climbfuji at ymail.com> wrote:
> Hi all,
> 
> I am having a problem spawning a large number of MPI tasks on a single node. My server consists of 4 sockets x 12 cores per socket x 2 threads per core = 96 processors.
> 
> The slurm.conf contains the following line:
> 
> NodeName=keal1  Procs=96 SocketsPerBoard=4 CoresPerSocket=12 ThreadsPerCore=2 RealMemory=1031770 State=UNKNOWN
> 
> 
> - The system is a Redhat SL 7.2 system running kernel 3.10.0-327.10.1.el7.x86_64
> - slurm 15.08.2 (pre-compiled by the vendor, the system comes with gnu 4.8.5)
> - MPI library: mvapich2-2.2b compiled with intel-15.0.4 (--with-pm=none, --with-pmi=slurm)
> 
> srun -N1 -n90 myprogram works, but takes about 30s to get past MPI_Init.
> 
> srun -N1 -n91 myprogram aborts with:
> 
> srun: defined options for program `srun'
> srun: --------------- ---------------------
> srun: user           : `heinzeller-d'
> srun: uid            : 9528
> srun: gid            : 945
> srun: cwd            : /home/heinzeller-d
> srun: ntasks         : 91 (set)
> srun: nodes          : 1 (set)
> srun: jobid          : 1867 (default)
> srun: partition      : default
> srun: profile        : `NotSet'
> srun: job name       : `sh'
> srun: reservation    : `(null)'
> srun: burst_buffer   : `(null)'
> srun: wckey          : `(null)'
> srun: cpu_freq_min   : 4294967294
> srun: cpu_freq_max   : 4294967294
> srun: cpu_freq_gov   : 4294967294
> srun: switches       : -1
> srun: wait-for-switches : -1
> srun: distribution   : unknown
> srun: cpu_bind       : default
> srun: mem_bind       : default
> srun: verbose        : 1
> srun: slurmd_debug   : 0
> srun: immediate      : false
> srun: label output   : false
> srun: unbuffered IO  : false
> srun: overcommit     : false
> srun: threads        : 60
> srun: checkpoint_dir : /var/slurm/checkpoint
> srun: wait           : 0
> srun: account        : (null)
> srun: comment        : (null)
> srun: dependency     : (null)
> srun: exclusive      : false
> srun: qos            : (null)
> srun: constraints    :
> srun: geometry       : (null)
> srun: reboot         : yes
> srun: rotate         : no
> srun: preserve_env   : false
> srun: network        : (null)
> srun: propagate      : NONE
> srun: prolog         : (null)
> srun: epilog         : (null)
> srun: mail_type      : NONE
> srun: mail_user      : (null)
> srun: task_prolog    : (null)
> srun: task_epilog    : (null)
> srun: multi_prog     : no
> srun: sockets-per-node  : -2
> srun: cores-per-socket  : -2
> srun: threads-per-core  : -2
> srun: ntasks-per-node   : -2
> srun: ntasks-per-socket : -2
> srun: ntasks-per-core   : -2
> srun: plane_size        : 4294967294
> srun: core-spec         : NA
> srun: power             :
> srun: sicp              : 0
> srun: remote command    : `./heat.exe'
> srun: launching 1867.7 on host keal1, 91 tasks: [0-90]
> srun: route default plugin loaded
> srun: Node keal1, 91 tasks started
> srun: Sent KVS info to 3 nodes, up to 33 tasks per node
> In: PMI_Abort(1, Fatal error in MPI_Init:
> Other MPI error, error stack:
> MPIR_Init_thread(514)..........:
> MPID_Init(365).................: channel initialization failed
> MPIDI_CH3_Init(495)............:
> MPIDI_CH3I_SHMEM_Helper_fn(908): ftruncate: Invalid argument
> )
> srun: Complete job step 1867.7 received
> slurmstepd: error: *** STEP 1867.7 ON keal1 CANCELLED AT 2016-03-22T17:12:49 ***
> 
> To me, this looks like a timeout hitting at about 30s of PMI_Init time. If so, how can I speed up the init?
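> 
> For reference, the timings below are simply the shell's time builtin wrapped around srun, roughly like
> 
>   time srun -N1 -n90 ./heat.exe
> 
> so for the larger task counts the "real" time is dominated by MPI_Init rather than by the program itself.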
> 
> Timing for different number of tasks on that box:
> 
> srun: Received task exit notification for 90 tasks (status=0x0000).
> srun: keal1: tasks 0-89: Completed
> real	0m32.962s
> user	0m0.022s
> sys	0m0.032s
> 
> srun: Received task exit notification for 80 tasks (status=0x0000).
> srun: keal1: tasks 0-79: Completed
> real	0m26.755s
> user	0m0.016s
> sys	0m0.036s
> 
> srun: Received task exit notification for 40 tasks (status=0x0000).
> srun: keal1: tasks 0-39: Completed
> 
> real	0m12.810s
> user	0m0.014s
> sys	0m0.036s
> 
> On a different node with 2 sockets x 10 cores per socket x 2 threads per core = 40 procs:
> 
> srun: Received task exit notification for 40 tasks (status=0x0000).
> srun: kea05: tasks 0-39: Completed
> real	0m4.949s
> user	0m0.011s
> sys	0m0.012s
> 
> Using mpich-3.1.4, I get PMI_Init times of less than 1.5s for 96 tasks - all else being identical (even the MPI library compile options).
> 
> Any suggestions on which parameters to tweak?
> 
> Thanks,
> 
> Dom
> 
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> 
> 


