[mvapich-discuss] slow PMI_Init using mvapich2 with slurm

Dominikus Heinzeller climbfuji at ymail.com
Wed Mar 23 16:21:21 EDT 2016


Hi all,

I am having a problem spawning a large number of MPI tasks on a single node. My server consists of 4 sockets x 12 cores per socket x 2 threads per core = 96 procs.

The slurm.conf contains the following line:

NodeName=keal1  Procs=96 SocketsPerBoard=4 CoresPerSocket=12 ThreadsPerCore=2 RealMemory=1031770 State=UNKNOWN
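
For reference, the layout slurmd actually registered can be double-checked with the standard command below (output omitted here):

scontrol show node keal1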


- The system runs Scientific Linux 7.2 (Red Hat based) with kernel 3.10.0-327.10.1.el7.x86_64
- slurm 15.08.2 (pre-compiled by the vendor; the system ships with GNU GCC 4.8.5)
- MPI library: mvapich2-2.2b compiled with intel-15.0.4 (--with-pm=none, --with-pmi=slurm; a reconstructed configure line follows this list)
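
The build was configured roughly along these lines (a sketch reconstructed from the options above; the install prefix shown is just an example, and I am assuming the Intel compiler wrappers icc/icpc/ifort):

./configure CC=icc CXX=icpc FC=ifort \
            --with-pm=none --with-pmi=slurm \
            --prefix=/opt/mvapich2/2.2b-intel-15.0.4   # example prefix
make && make install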

srun -N1 -n90 myprogram works, but takes about 30 s to get past MPI_Init.

srun -N1 -n91 myprogram aborts with:

srun: defined options for program `srun'
srun: --------------- ---------------------
srun: user           : `heinzeller-d'
srun: uid            : 9528
srun: gid            : 945
srun: cwd            : /home/heinzeller-d
srun: ntasks         : 91 (set)
srun: nodes          : 1 (set)
srun: jobid          : 1867 (default)
srun: partition      : default
srun: profile        : `NotSet'
srun: job name       : `sh'
srun: reservation    : `(null)'
srun: burst_buffer   : `(null)'
srun: wckey          : `(null)'
srun: cpu_freq_min   : 4294967294
srun: cpu_freq_max   : 4294967294
srun: cpu_freq_gov   : 4294967294
srun: switches       : -1
srun: wait-for-switches : -1
srun: distribution   : unknown
srun: cpu_bind       : default
srun: mem_bind       : default
srun: verbose        : 1
srun: slurmd_debug   : 0
srun: immediate      : false
srun: label output   : false
srun: unbuffered IO  : false
srun: overcommit     : false
srun: threads        : 60
srun: checkpoint_dir : /var/slurm/checkpoint
srun: wait           : 0
srun: account        : (null)
srun: comment        : (null)
srun: dependency     : (null)
srun: exclusive      : false
srun: qos            : (null)
srun: constraints    :
srun: geometry       : (null)
srun: reboot         : yes
srun: rotate         : no
srun: preserve_env   : false
srun: network        : (null)
srun: propagate      : NONE
srun: prolog         : (null)
srun: epilog         : (null)
srun: mail_type      : NONE
srun: mail_user      : (null)
srun: task_prolog    : (null)
srun: task_epilog    : (null)
srun: multi_prog     : no
srun: sockets-per-node  : -2
srun: cores-per-socket  : -2
srun: threads-per-core  : -2
srun: ntasks-per-node   : -2
srun: ntasks-per-socket : -2
srun: ntasks-per-core   : -2
srun: plane_size        : 4294967294
srun: core-spec         : NA
srun: power             :
srun: sicp              : 0
srun: remote command    : `./heat.exe'
srun: launching 1867.7 on host keal1, 91 tasks: [0-90]
srun: route default plugin loaded
srun: Node keal1, 91 tasks started
srun: Sent KVS info to 3 nodes, up to 33 tasks per node
In: PMI_Abort(1, Fatal error in MPI_Init:
Other MPI error, error stack:
MPIR_Init_thread(514)..........:
MPID_Init(365).................: channel initialization failed
MPIDI_CH3_Init(495)............:
MPIDI_CH3I_SHMEM_Helper_fn(908): ftruncate: Invalid argument
)
srun: Complete job step 1867.7 received
slurmstepd: error: *** STEP 1867.7 ON keal1 CANCELLED AT 2016-03-22T17:12:49 ***

To me, this looks like a timeout problem that hits at about 30 s of PMI_Init time. If so, how can I speed up the init?
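
For anyone wanting to reproduce the measurement, a trivial program along these lines isolates the MPI_Init cost (a minimal sketch; the file name init_timer.c and the build/run lines are mine, the actual test program is the heat.exe shown above):

/* init_timer.c - measure how long MPI_Init takes.
 * Build:  mpicc -o init_timer init_timer.c
 * Run:    time srun -N1 -n90 ./init_timer
 * MPI_Wtime is unavailable before MPI_Init, so use clock_gettime.
 */
#include <mpi.h>
#include <stdio.h>
#include <time.h>

int main(int argc, char **argv)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);  /* wall clock before init */
    MPI_Init(&argc, &argv);               /* the slow call          */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        printf("MPI_Init took %.3f s\n",
               (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);

    MPI_Finalize();
    return 0;
}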

Timings for different numbers of tasks on that box:

srun: Received task exit notification for 90 tasks (status=0x0000).
srun: keal1: tasks 0-89: Completed
real	0m32.962s
user	0m0.022s
sys	0m0.032s

srun: Received task exit notification for 80 tasks (status=0x0000).
srun: keal1: tasks 0-79: Completed
real	0m26.755s
user	0m0.016s
sys	0m0.036s

srun: Received task exit notification for 40 tasks (status=0x0000).
srun: keal1: tasks 0-39: Completed
real	0m12.810s
user	0m0.014s
sys	0m0.036s

On a different node with 2 sockets x 10 cores per socket x 2 threads per core = 40 procs:

srun: Received task exit notification for 40 tasks (status=0x0000).
srun: kea05: tasks 0-39: Completed
real	0m4.949s
user	0m0.011s
sys	0m0.012s

Using mpich-3.1.4, I get PMI_Init times of less than 1.5 s for 96 tasks, all else being identical (even the MPI library compile options).

Any suggestions as to which parameters to tweak?

Thanks,

Dom