[mvapich-discuss] slow PMI_Init using mvapich2 with slurm

Dominikus Heinzeller climbfuji at ymail.com
Thu Mar 24 14:11:27 EDT 2016


Hi Sourav!

Thank you so much - with these settings, the runs with 96 tasks do complete (with a startup time of about 35s, but given that most applications run for at least an hour or so, that will be alright for most users who don’t want to switch to mpich).
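For reference, the full recipe that worked here (assuming the pmi2 build of mvapich2-2.2b described below; the program name is just a placeholder):

```shell
# MVAPICH2 shared-memory tuning from this thread; export before launching.
export MV2_SMP_EAGERSIZE=4096
export MV2_SMPI_LENGTH_QUEUE=16384
export MV2_SMP_NUM_SEND_BUFFER=8
# then launch as usual, e.g. (with a pmi2 build of mvapich2):
#   srun --mpi=pmi2 -N1 -n96 ./heat.exe
```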

Thanks again,

Dom/

> On 24/03/2016, at 6:40 PM, Sourav Chakraborty <chakraborty.52 at buckeyemail.osu.edu> wrote:
> 
> Hi Dominikus,
> 
> Looks like the system is running out of shared memory space with 90+ processes per node. Can you please try exporting the following parameters:
> 
> MV2_SMP_EAGERSIZE=4096 MV2_SMPI_LENGTH_QUEUE=16384 MV2_SMP_NUM_SEND_BUFFER=8
> 
> Thanks,
> Sourav
> 
> 
> On Thu, Mar 24, 2016 at 8:48 AM, Dominikus Heinzeller <climbfuji at ymail.com> wrote:
> Ok, I have re-compiled SLURM 15.08.8 with the PMI extensions and then mvapich2-2.2b with --with-pm=slurm --with-pmi=pmi2. The problem remains exactly the same: srun --mpi=pmi2 -N1 -n90 works with a startup time of about 30s, with 91+ tasks it aborts with the error message:
> 
> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> slurmstepd: error: *** STEP 2433.9 ON keal1 CANCELLED AT 2016-03-24T13:46:01 ***
> 
> Using the same SLURM version with the same compiler and mpich-3.1.4, I get startup times of under 1s (using pmi1).
> 
> For the time being, I don’t mind if the startup is slow, I just want to be able to increase the timeout limit so that jobs can scale across the entire node. PMI1 or 2 are both fine.
> 
> Cheers, and thanks very much,
> 
> Dom
> 
>> On 24/03/2016, at 8:36 AM, Dominikus Heinzeller <climbfuji at ymail.com> wrote:
>> 
>> Hi Sourav,
>> 
>> thanks for your feedback! I have tried option one with --with-pm=slurm --with-pmi=pmi2 and had exactly the same problems: timeout with 91+ tasks on a single node, very slow startup with 90 tasks or fewer. I am now trying to recompile with --with-pm=slurm --with-pmi=pmi1, but I doubt there will be a difference.
>> 
>> Option 2 takes a little more work. One of my colleagues also mentioned that the error message talks about SHMEM - maybe something is going on there.
>> 
>> Interestingly, none of those problems occur with mpich-3.1.4.
>> 
>>> On 23/03/2016, at 11:03 PM, Sourav Chakraborty <chakraborty.52 at buckeyemail.osu.edu> wrote:
>>> 
>>> Hi Dominikus,
>>> 
>>> Thanks for your note.
>>> 
>>> To build MVAPICH2 with SLURM support, you should configure it using --with-pm=slurm --with-pmi=pmi1 or --with-pmi=pmi2. Please note that if you select pmi2, you'd have to use srun --mpi=pmi2 to launch your program. Please refer to http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.2a-userguide.html#x1-100004.3.2 for more details.
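A build-and-launch sketch following that advice (the source directory, install prefix, and program name are placeholders; the configure flags are the ones quoted above):

```shell
# Configure MVAPICH2 2.2b against Slurm's process manager and PMI-2,
# per the user guide section linked above.  Paths are placeholders.
cd mvapich2-2.2b
./configure --with-pm=slurm --with-pmi=pmi2 --prefix=$HOME/sw/mvapich2-2.2b
make -j8 && make install
# A pmi2 build must be launched through srun's pmi2 plugin:
srun --mpi=pmi2 -N1 -n96 ./myprogram
```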
>>> 
>>> If you are able to modify your Slurm installation, you can also use the extended PMI operations to further speed up the time to launch applications. Please refer to http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.2a-userguide.html#x1-110004.3.3 for details on how to use extended PMI for Slurm.
>>> 
>>> Thanks,
>>> Sourav
>>> 
>>> 
>>> On Wed, Mar 23, 2016 at 4:21 PM, Dominikus Heinzeller <climbfuji at ymail.com> wrote:
>>> Hi all,
>>> 
>>> I am having a problem spawning a large number of MPI tasks on a single node. My server consists of 4 sockets x 12 cores per socket x 2 threads per core = 96 procs.
>>> 
>>> The slurm.conf contains the following line:
>>> 
>>> NodeName=keal1  Procs=96 SocketsPerBoard=4 CoresPerSocket=12 ThreadsPerCore=2 RealMemory=1031770 State=UNKNOWN
>>> 
>>> 
>>> - The system is a Redhat SL 7.2 system running kernel 3.10.0-327.10.1.el7.x86_64
>>> - slurm 15.08.2 (pre-compiled by the vendor, the system comes with gnu 4.8.5)
>>> - mpi library:  mvapich2-2.2b compiled with intel-15.0.4 (--with-pm=none, --with-pmi=slurm)
>>> 
>>> srun -N1 -n90 myprogram works, but takes about 30s to get past MPI_Init
>>> 
>>> srun -N1 -n91 myprogram aborts with:
>>> 
>>> srun: defined options for program `srun'
>>> srun: --------------- ---------------------
>>> srun: user           : `heinzeller-d'
>>> srun: uid            : 9528
>>> srun: gid            : 945
>>> srun: cwd            : /home/heinzeller-d
>>> srun: ntasks         : 91 (set)
>>> srun: nodes          : 1 (set)
>>> srun: jobid          : 1867 (default)
>>> srun: partition      : default
>>> srun: profile        : `NotSet'
>>> srun: job name       : `sh'
>>> srun: reservation    : `(null)'
>>> srun: burst_buffer   : `(null)'
>>> srun: wckey          : `(null)'
>>> srun: cpu_freq_min   : 4294967294
>>> srun: cpu_freq_max   : 4294967294
>>> srun: cpu_freq_gov   : 4294967294
>>> srun: switches       : -1
>>> srun: wait-for-switches : -1
>>> srun: distribution   : unknown
>>> srun: cpu_bind       : default
>>> srun: mem_bind       : default
>>> srun: verbose        : 1
>>> srun: slurmd_debug   : 0
>>> srun: immediate      : false
>>> srun: label output   : false
>>> srun: unbuffered IO  : false
>>> srun: overcommit     : false
>>> srun: threads        : 60
>>> srun: checkpoint_dir : /var/slurm/checkpoint
>>> srun: wait           : 0
>>> srun: account        : (null)
>>> srun: comment        : (null)
>>> srun: dependency     : (null)
>>> srun: exclusive      : false
>>> srun: qos            : (null)
>>> srun: constraints    :
>>> srun: geometry       : (null)
>>> srun: reboot         : yes
>>> srun: rotate         : no
>>> srun: preserve_env   : false
>>> srun: network        : (null)
>>> srun: propagate      : NONE
>>> srun: prolog         : (null)
>>> srun: epilog         : (null)
>>> srun: mail_type      : NONE
>>> srun: mail_user      : (null)
>>> srun: task_prolog    : (null)
>>> srun: task_epilog    : (null)
>>> srun: multi_prog     : no
>>> srun: sockets-per-node  : -2
>>> srun: cores-per-socket  : -2
>>> srun: threads-per-core  : -2
>>> srun: ntasks-per-node   : -2
>>> srun: ntasks-per-socket : -2
>>> srun: ntasks-per-core   : -2
>>> srun: plane_size        : 4294967294
>>> srun: core-spec         : NA
>>> srun: power             :
>>> srun: sicp              : 0
>>> srun: remote command    : `./heat.exe'
>>> srun: launching 1867.7 on host keal1, 91 tasks: [0-90]
>>> srun: route default plugin loaded
>>> srun: Node keal1, 91 tasks started
>>> srun: Sent KVS info to 3 nodes, up to 33 tasks per node
>>> In: PMI_Abort(1, Fatal error in MPI_Init:
>>> Other MPI error, error stack:
>>> MPIR_Init_thread(514)..........:
>>> MPID_Init(365).................: channel initialization failed
>>> MPIDI_CH3_Init(495)............:
>>> MPIDI_CH3I_SHMEM_Helper_fn(908): ftruncate: Invalid argument
>>> )
>>> srun: Complete job step 1867.7 received
>>> slurmstepd: error: *** STEP 1867.7 ON keal1 CANCELLED AT 2016-03-22T17:12:49 ***
>>> 
>>> To me, this looks like a timeout problem at about 30s of PMI_Init time. If so, I am wondering how to speed up the init?
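One reading of the stack trace above (note the failure is in MPIDI_CH3I_SHMEM_Helper_fn, not in a timeout path) is that the intra-node shared-memory region simply gets too large at 91+ tasks. A back-of-envelope sketch of that effect; the formula and the "default" constants below are hypothetical placeholders for illustration, NOT MVAPICH2's actual internals:

```python
# Rough model of a shared-memory region that keeps per-peer queues and
# send buffers for every pair of on-node processes.  Because every rank
# holds buffers for every peer, the total grows roughly with nprocs**2,
# so shrinking the per-pair buffers lets more ranks fit.
def approx_shm_bytes(nprocs, eagersize, length_queue, num_send_buffer):
    per_pair = length_queue + num_send_buffer * eagersize  # bytes per peer
    return nprocs * (nprocs - 1) * per_pair                # ~ nprocs**2 growth

gib = 2**30
default = approx_shm_bytes(96, 64 * 1024, 128 * 1024, 128)  # made-up "defaults"
tuned = approx_shm_bytes(96, 4096, 16384, 8)                # values suggested in this thread
print(f"default-ish: {default / gib:.1f} GiB, tuned: {tuned / gib:.3f} GiB")
```

Under these made-up constants the tuned settings shrink the region by two orders of magnitude, which is consistent with 90 tasks squeaking by and 91 failing.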
>>> 
>>> Timing for different number of tasks on that box:
>>> 
>>> srun: Received task exit notification for 90 tasks (status=0x0000).
>>> srun: keal1: tasks 0-89: Completed
>>> real	0m32.962s
>>> user	0m0.022s
>>> sys	0m0.032s
>>> 
>>> srun: Received task exit notification for 80 tasks (status=0x0000).
>>> srun: keal1: tasks 0-79: Completed
>>> real	0m26.755s
>>> user	0m0.016s
>>> sys	0m0.036s
>>> 
>>> srun: Received task exit notification for 40 tasks (status=0x0000).
>>> srun: keal1: tasks 0-39: Completed
>>> 
>>> real	0m12.810s
>>> user	0m0.014s
>>> sys	0m0.036s
>>> 
>>> On a different node with 2 sockets x 10 cores per socket x 2 threads per core = 40 procs:
>>> 
>>> srun: Received task exit notification for 40 tasks (status=0x0000).
>>> srun: kea05: tasks 0-39: Completed
>>> real	0m4.949s
>>> user	0m0.011s
>>> sys	0m0.012s
>>> 
>>> Using mpich-3.1.4, I get PMI_Init times of less than 1.5s for 96 tasks - all else identical (even the mpi library compile options).
>>> 
>>> Any suggestions which parameters to tweak?
>>> 
>>> Thanks,
>>> 
>>> Dom
>>> 
>>> _______________________________________________
>>> mvapich-discuss mailing list
>>> mvapich-discuss at cse.ohio-state.edu
>>> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>> 
>>> 
>> 
> 
> 
