[mvapich-discuss] MPI_Comm_spawn using SLURM & PMI2

John Donners john.donners at surfsara.nl
Tue Nov 10 07:25:23 EST 2015


Dear all,

I would like to run a Python script that uses mpi4py (3.0.0) to spawn
64 processes. The script is started from a SLURM batch job, but the
spawn fails. I get the following verbose output from SLURM:

slurmstepd: mpi/pmi2: got client request: 82 cmd=kvs-put;key=P0-businesscard;value=#RANK:00000000(000005aa:00026f81:00000001)#;
slurmstepd: mpi/pmi2: client_resp_send: 26 cmd=kvs-put-response;rc=0;
slurmstepd: mpi/pmi2: got client request: 229 cmd=spawn;ncmds=1;preputcount=1;ppkey0=tag#1$description#"#RANK:00000000(000005aa:00026f81:00000001)#"$;ppval0=;subcmd=/scratch/shared/donners.1871267/strati_p.51e;maxprocs=64;argc=1;argv0=/scratch/shared/donners.1871267/mpi4.py;
slurmstepd: mpi/pmi2: out client_req_parse_spawn
slurmstepd: mpi/pmi2: client_resp_send: 140 cmd=fullinit-response;rc=0;pmi-version=2;pmi-subversion=0;rank=12;size=64;appnum=-1;debugged=FALSE;pmiverbose=FALSE;spawner-jobid=1871267.0;
...
slurmstepd: mpi/pmi2: got client request: 64 cmd=kvs-put;key=MVAPICH2_0018;value=000005aa:00026f86:00026f88:;
slurmstepd: mpi/pmi2: client_resp_send: 26 cmd=kvs-put-response;rc=0;
..
slurmstepd: mpi/pmi2: client_resp_send: 140 cmd=fullinit-response;rc=0;pmi-version=2;pmi-subversion=0;rank=16;size=64;appnum=-1;debugged=FALSE;pmiverbose=FALSE;spawner-jobid=1871267.0;
..
slurmstepd: mpi/pmi2: got client request: 64 cmd=kvs-put;key=MVAPICH2_0016;value=000005ab:0001e30c:0001e30d:;
slurmstepd: mpi/pmi2: client_resp_send: 26 cmd=kvs-put-response;rc=0;
slurmstepd: mpi/pmi2: client_resp_send: 42 cmd=spawn-response;rc=0;jobid=1871267.0-1;
..
slurmstepd: mpi/pmi2: got client request: 65 cmd=kvs-get;jobid=1871267.0-1;srcid=-1;key=PARENT_ROOT_PORT_NAME;
slurmstepd: mpi/pmi2: client_resp_send: 38 cmd=kvs-get-response;rc=0;found=FALSE;
..
MPID_Init(476)..............: spawn process group was unable to obtain parent port name from the channel
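
A spawn request of this shape is what an mpi4py call along the
following lines would issue. This is only a minimal sketch, not the
actual script: the function name spawn_workers is hypothetical, and the
split into a command plus one script argument is an assumption inferred
from the subcmd and argv0 fields in the log.

```python
def spawn_workers(command, script, maxprocs=64):
    """Spawn `maxprocs` child processes running `command script`.

    Hypothetical helper: `command` and `script` stand in for the paths
    seen in the log (subcmd=... and argv0=...).
    """
    # Imported lazily so the module can be loaded without MPI present.
    from mpi4py import MPI

    # With PMI2, this call results in a cmd=spawn request to slurmstepd.
    child = MPI.COMM_SELF.Spawn(command, args=[script], maxprocs=maxprocs)
    return child

# Usage inside the batch job (assumed paths, matching the log above):
#   child = spawn_workers("/scratch/shared/donners.1871267/strati_p.51e",
#                         "/scratch/shared/donners.1871267/mpi4.py")
#   child.Disconnect()
```

The children never get as far as talking to the parent here: they abort
in MPID_Init because the kvs-get for PARENT_ROOT_PORT_NAME returns
found=FALSE.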

I've had a look through the code, and it looks like
PARENT_ROOT_PORT_NAME should be passed with the PMI2 spawn request
(look for cmd=spawn above) in the preput array
(preputcount/ppkey0/ppval0). In the request above, however, ppkey0
contains the value of the port (i.e. what should be stored under
PARENT_ROOT_PORT_NAME), while ppval0 is empty.
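
To make the mismatch easy to see, the logged spawn request can be split
into its fields. The sketch below is not part of SLURM or MVAPICH2; the
helper names parse_pmi2_message and preput_pairs are made up for
illustration, and it assumes that no field value contains ';', which
holds for the request logged above.

```python
def parse_pmi2_message(message):
    """Split a ';'-delimited PMI2 wire message into a dict of fields."""
    fields = {}
    for item in message.strip().split(";"):
        if not item:
            continue  # skip the empty piece after the trailing ';'
        key, _, value = item.partition("=")  # split on the first '=' only
        fields[key] = value
    return fields


def preput_pairs(fields):
    """Return the (ppkeyN, ppvalN) pairs announced by preputcount."""
    count = int(fields.get("preputcount", "0"))
    return [
        (fields.get(f"ppkey{i}", ""), fields.get(f"ppval{i}", ""))
        for i in range(count)
    ]


# The spawn request exactly as logged by slurmstepd above.
SPAWN_REQUEST = (
    'cmd=spawn;ncmds=1;preputcount=1;'
    'ppkey0=tag#1$description#"#RANK:00000000(000005aa:00026f81:00000001)#"$;'
    'ppval0=;subcmd=/scratch/shared/donners.1871267/strati_p.51e;maxprocs=64;'
    'argc=1;argv0=/scratch/shared/donners.1871267/mpi4.py;'
)

fields = parse_pmi2_message(SPAWN_REQUEST)
for key, value in preput_pairs(fields):
    print(f"preput key={key!r} value={value!r}")
    if key != "PARENT_ROOT_PORT_NAME":
        print("-> key is not PARENT_ROOT_PORT_NAME, so a later kvs-get "
              "for that name cannot find it")
```

The single preput pair has the port description as its key and an empty
string as its value, which matches the later kvs-get for
PARENT_ROOT_PORT_NAME returning found=FALSE.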

I've compiled the MVAPICH2 library (versions 2.1 and 2.2a) with the
following options:

CC=icc CXX=icpc FC=ifort F77=ifort ./configure --enable-g=none \
    --with-pmi=pmi2 --with-pm=slurm --enable-fortran=yes \
    --enable-threads=multiple --prefix=/home/donners/build/$pkg-intel \
    --disable-cxx --enable-shared

where I've used the Intel 16.0.0 compiler. We're running SLURM 14.11.9.

Could you help me get this running?

Cheers,
John

-- 
| John Donners | Senior Advisor | Operations, Support & Development | SURFsara | Science Park 140 | 1098 XG Amsterdam | The Netherlands |
T (31)6 19039023 | john.donners at surfsara.nl | www.surfsara.nl |

Available on | Mon | Tue | Wed | Thu | Fri


