[mvapich-discuss] Fwd: problem configuring mvapich2 with Slurm

Manuel Rodríguez Pascual manuel.rodriguez.pascual at gmail.com
Fri Nov 25 07:49:39 EST 2016


ok, things are getting a little weird now.

I experimented with your hostname suggestion, but it did not work
either. I then tried downgrading Slurm to version 15.08 and it DID
work, so it looks like something is broken there. Anyway, that is
something to discuss with the Slurm people, not here.

Now, regarding your tests, it seems clear that the problem arises when
MVAPICH2 tries to communicate between two nodes. Please find below the
output of:

-running a serial application (hostname) with one task on one node
-running a parallel application (helloWorldMPI) with one task on one node

-running two instances of a serial application on one node
-running a single instance of a parallel application with two tasks on one node

-running two instances of a serial application, one on each node
-running a single instance of a parallel application with two tasks,
one on each node <-- this one crashes
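
For reference, helloWorldMPI is just a standard MPI hello world along
these lines (a minimal sketch; the real source may differ slightly):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    /* with --mpi=pmi2, the PMI2 handshake with slurmstepd happens here */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);

    printf("Process %d of %d is on %s\n", rank, size, host);
    printf("Hello world from process %d of %d\n", rank, size);
    printf("Goodbye world from process %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}

Note that in the failing two-node run none of these printf lines appear,
so the crash apparently happens inside MPI_Init itself, around the
kvs-fence exchange visible in the two-node helloWorldMPI log below.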



-bash-4.2$ mpichversion
MVAPICH2 Version:     2.2
MVAPICH2 Release date: Thu Sep 08 22:00:00 EST 2016
MVAPICH2 Device:       ch3:mrail
MVAPICH2 configure:   --prefix=/home/localsoft/mvapich2
--disable-mcast --with-slurm=/home/localsoft/slurm --with-pmi=pmi2
--with-pm=slurm
MVAPICH2 CC:   gcc    -DNDEBUG -DNVALGRIND -O2
MVAPICH2 CXX: g++   -DNDEBUG -DNVALGRIND -O2
MVAPICH2 F77: gfortran   -O2
MVAPICH2 FC:   gfortran   -O2


-bash-4.2$ slurmd -V
slurm 15.08.12



-bash-4.2$ srun -n 1 --tasks-per-node=1 --mpi=pmi2 --slurmd-debug=4 hostname
slurmstepd: debug level = 6
slurmstepd: IO handler started pid=18416
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: Uncached user/gid: slurm/1001
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: task 0 (18422) started 2016-11-25T12:56:47
slurmstepd: task_p_pre_launch_priv: 986.0
slurmstepd: Uncached user/gid: slurm/1001
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: job_container none plugin loaded
slurmstepd: Reading cgroup.conf file /home/localsoft/slurm/etc/cgroup.conf
slurmstepd: unable to get cgroup '/cgroup/cpuset' entry
'/cgroup/cpuset/slurm/system' properties: No such file or directory
slurmstepd: unable to get cgroup '/cgroup/memory' entry
'/cgroup/memory/slurm/system' properties: No such file or directory
slurmstepd: Sending launch resp rc=0
slurmstepd: mpi type = (null)
slurmstepd: task_p_pre_launch: 986.0, task 0
slurmstepd: _set_limit: conf setrlimit RLIMIT_CPU no change in value:
18446744073709551615
slurmstepd: _set_limit: conf setrlimit RLIMIT_FSIZE no change in
value: 18446744073709551615
slurmstepd: _set_limit: conf setrlimit RLIMIT_DATA no change in value:
18446744073709551615
slurmstepd: _set_limit: conf setrlimit RLIMIT_STACK no change in value: 8388608
slurmstepd: _set_limit: conf setrlimit RLIMIT_CORE no change in value: 0
slurmstepd: _set_limit: conf setrlimit RLIMIT_RSS no change in value:
18446744073709551615
slurmstepd: _set_limit: RLIMIT_NPROC  : max:256553 cur:256553 req:4096
slurmstepd: _set_limit: conf setrlimit RLIMIT_NPROC succeeded
slurmstepd: _set_limit: RLIMIT_NOFILE : max:4096 cur:4096 req:1024
slurmstepd: _set_limit: conf setrlimit RLIMIT_NOFILE succeeded
slurmstepd: _set_limit: conf setrlimit RLIMIT_MEMLOCK no change in
value: 18446744073709551615
slurmstepd: _set_limit: conf setrlimit RLIMIT_AS no change in value:
18446744073709551615
slurmstepd: Handling REQUEST_STEP_UID
slurmstepd: Handling REQUEST_SIGNAL_CONTAINER
slurmstepd: _handle_signal_container for step=986.0 uid=0 signal=995
acme11.ciemat.es
slurmstepd: task 0 (18422) exited with exit code 0.
slurmstepd: task_p_post_term: 986.0, task 0
slurmstepd: Sending SIGKILL to pgid 18416
slurmstepd: Waiting for IO
slurmstepd: Closing debug channel




-bash-4.2$ srun -n 1 --tasks-per-node=1 --mpi=pmi2 --slurmd-debug=4
./helloWorldMPI
slurmstepd: debug level = 6
slurmstepd: IO handler started pid=18430
slurmstepd: Uncached user/gid: slurm/1001
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: task 0 (18436) started 2016-11-25T12:57:00
slurmstepd: task_p_pre_launch_priv: 987.0
slurmstepd: Uncached user/gid: slurm/1001
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: job_container none plugin loaded
slurmstepd: Reading cgroup.conf file /home/localsoft/slurm/etc/cgroup.conf
slurmstepd: unable to get cgroup '/cgroup/cpuset' entry
'/cgroup/cpuset/slurm/system' properties: No such file or directory
slurmstepd: unable to get cgroup '/cgroup/memory' entry
'/cgroup/memory/slurm/system' properties: No such file or directory
slurmstepd: Sending launch resp rc=0
slurmstepd: mpi type = (null)
slurmstepd: task_p_pre_launch: 987.0, task 0
slurmstepd: _set_limit: conf setrlimit RLIMIT_CPU no change in value:
18446744073709551615
slurmstepd: _set_limit: conf setrlimit RLIMIT_FSIZE no change in
value: 18446744073709551615
slurmstepd: _set_limit: conf setrlimit RLIMIT_DATA no change in value:
18446744073709551615
slurmstepd: _set_limit: conf setrlimit RLIMIT_STACK no change in value: 8388608
slurmstepd: _set_limit: conf setrlimit RLIMIT_CORE no change in value: 0
slurmstepd: _set_limit: conf setrlimit RLIMIT_RSS no change in value:
18446744073709551615
slurmstepd: _set_limit: RLIMIT_NPROC  : max:256553 cur:256553 req:4096
slurmstepd: _set_limit: conf setrlimit RLIMIT_NPROC succeeded
slurmstepd: _set_limit: RLIMIT_NOFILE : max:4096 cur:4096 req:1024
slurmstepd: _set_limit: conf setrlimit RLIMIT_NOFILE succeeded
slurmstepd: _set_limit: conf setrlimit RLIMIT_MEMLOCK no change in
value: 18446744073709551615
slurmstepd: _set_limit: conf setrlimit RLIMIT_AS no change in value:
18446744073709551615
slurmstepd: Handling REQUEST_STEP_UID
slurmstepd: Handling REQUEST_SIGNAL_CONTAINER
slurmstepd: _handle_signal_container for step=987.0 uid=0 signal=995
slurmstepd: mpi/pmi2: got client PMI1 init, version=2.0
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: mpi/pmi2: got client request: 53
cmd=fullinit;pmijobid=987.0;pmirank=0;threaded=FALSE;
slurmstepd: mpi/pmi2: client_resp_send: 114
cmd=fullinit-response;rc=0;pmi-version=2;pmi-subversion=0;rank=0;size=1;appnum=-1;debugged=FALSE;pmiverbose=FALSE;
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: mpi/pmi2: got client request: 13     cmd=finalize;
slurmstepd: mpi/pmi2: client_resp_send: 27    cmd=finalize-response;rc=0;
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd:     false, finalized
Process 0 of 1 is on acme11.ciemat.es
Hello world from process 0 of 1
Goodbye world from process 0 of 1
slurmstepd: task 0 (18436) exited with exit code 0.
slurmstepd: task_p_post_term: 987.0, task 0
slurmstepd: Sending SIGKILL to pgid 18430
slurmstepd: Waiting for IO
slurmstepd: Closing debug channel



-bash-4.2$ srun -n 2 --tasks-per-node=2 --mpi=pmi2 --slurmd-debug=4 hostname
slurmstepd: debug level = 6
slurmstepd: IO handler started pid=18493
slurmstepd: Uncached user/gid: slurm/1001
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: task_p_pre_launch_priv: 990.0
slurmstepd: task 0 (18499) started 2016-11-25T12:57:31
slurmstepd: Uncached user/gid: slurm/1001
slurmstepd: task 1 (18500) started 2016-11-25T12:57:31
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: task_p_pre_launch_priv: 990.0
slurmstepd: Uncached user/gid: slurm/1001
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: job_container none plugin loaded
slurmstepd: Reading cgroup.conf file /home/localsoft/slurm/etc/cgroup.conf
slurmstepd: unable to get cgroup '/cgroup/cpuset' entry
'/cgroup/cpuset/slurm/system' properties: No such file or directory
slurmstepd: unable to get cgroup '/cgroup/memory' entry
'/cgroup/memory/slurm/system' properties: No such file or directory
slurmstepd: Sending launch resp rc=0
slurmstepd: mpi type = (null)
slurmstepd: mpi type = (null)
slurmstepd: task_p_pre_launch: 990.0, task 0
slurmstepd: task_p_pre_launch: 990.0, task 1
slurmstepd: _set_limit: conf setrlimit RLIMIT_CPU no change in value:
18446744073709551615
slurmstepd: _set_limit: conf setrlimit RLIMIT_CPU no change in value:
18446744073709551615
slurmstepd: _set_limit: conf setrlimit RLIMIT_FSIZE no change in
value: 18446744073709551615
slurmstepd: _set_limit: conf setrlimit RLIMIT_FSIZE no change in
value: 18446744073709551615
slurmstepd: _set_limit: conf setrlimit RLIMIT_DATA no change in value:
18446744073709551615
slurmstepd: _set_limit: conf setrlimit RLIMIT_STACK no change in value: 8388608
slurmstepd: _set_limit: conf setrlimit RLIMIT_CORE no change in value: 0
slurmstepd: _set_limit: conf setrlimit RLIMIT_DATA no change in value:
18446744073709551615
slurmstepd: _set_limit: conf setrlimit RLIMIT_STACK no change in value: 8388608
slurmstepd: _set_limit: conf setrlimit RLIMIT_CORE no change in value: 0
slurmstepd: _set_limit: conf setrlimit RLIMIT_RSS no change in value:
18446744073709551615
slurmstepd: _set_limit: conf setrlimit RLIMIT_RSS no change in value:
18446744073709551615
slurmstepd: _set_limit: RLIMIT_NPROC  : max:256553 cur:256553 req:4096
slurmstepd: _set_limit: conf setrlimit RLIMIT_NPROC succeeded
slurmstepd: _set_limit: RLIMIT_NPROC  : max:256553 cur:256553 req:4096
slurmstepd: _set_limit: conf setrlimit RLIMIT_NPROC succeeded
slurmstepd: _set_limit: RLIMIT_NOFILE : max:4096 cur:4096 req:1024
slurmstepd: _set_limit: conf setrlimit RLIMIT_NOFILE succeeded
slurmstepd: _set_limit: conf setrlimit RLIMIT_MEMLOCK no change in
value: 18446744073709551615
slurmstepd: _set_limit: conf setrlimit RLIMIT_AS no change in value:
18446744073709551615
slurmstepd: _set_limit: RLIMIT_NOFILE : max:4096 cur:4096 req:1024
slurmstepd: _set_limit: conf setrlimit RLIMIT_NOFILE succeeded
slurmstepd: _set_limit: conf setrlimit RLIMIT_MEMLOCK no change in
value: 18446744073709551615
slurmstepd: _set_limit: conf setrlimit RLIMIT_AS no change in value:
18446744073709551615
acme11.ciemat.es
acme11.ciemat.es
slurmstepd: task 1 (18500) exited with exit code 0.
slurmstepd: Handling REQUEST_STEP_UID
slurmstepd: Handling REQUEST_SIGNAL_CONTAINER
slurmstepd: _handle_signal_container for step=990.0 uid=0 signal=995
slurmstepd: task_p_post_term: 990.0, task 1
slurmstepd: task 0 (18499) exited with exit code 0.
slurmstepd: task_p_post_term: 990.0, task 0
slurmstepd: No child processes
slurmstepd: Sending SIGKILL to pgid 18493
slurmstepd: Waiting for IO
slurmstepd: Closing debug channel





bash-4.2$ srun -n 2 --tasks-per-node=2 --mpi=pmi2 --slurmd-debug=4
./helloWorldMPI
slurmstepd: debug level = 6
slurmstepd: IO handler started pid=18508
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: Uncached user/gid: slurm/1001
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: task 0 (18514) started 2016-11-25T12:57:38
slurmstepd: task 1 (18515) started 2016-11-25T12:57:38
slurmstepd: task_p_pre_launch_priv: 991.0
slurmstepd: Uncached user/gid: slurm/1001
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: task_p_pre_launch_priv: 991.0
slurmstepd: Uncached user/gid: slurm/1001
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: job_container none plugin loaded
slurmstepd: Reading cgroup.conf file /home/localsoft/slurm/etc/cgroup.conf
slurmstepd: unable to get cgroup '/cgroup/cpuset' entry
'/cgroup/cpuset/slurm/system' properties: No such file or directory
slurmstepd: unable to get cgroup '/cgroup/memory' entry
'/cgroup/memory/slurm/system' properties: No such file or directory
slurmstepd: Sending launch resp rc=0
slurmstepd: mpi type = (null)
slurmstepd: mpi type = (null)
slurmstepd: task_p_pre_launch: 991.0, task 1
slurmstepd: task_p_pre_launch: 991.0, task 0
slurmstepd: _set_limit: conf setrlimit RLIMIT_CPU no change in value:
18446744073709551615
slurmstepd: _set_limit: conf setrlimit RLIMIT_FSIZE no change in
value: 18446744073709551615
slurmstepd: _set_limit: conf setrlimit RLIMIT_CPU no change in value:
18446744073709551615
slurmstepd: _set_limit: conf setrlimit RLIMIT_FSIZE no change in
value: 18446744073709551615
slurmstepd: _set_limit: conf setrlimit RLIMIT_DATA no change in value:
18446744073709551615
slurmstepd: _set_limit: conf setrlimit RLIMIT_STACK no change in value: 8388608
slurmstepd: _set_limit: conf setrlimit RLIMIT_DATA no change in value:
18446744073709551615
slurmstepd: _set_limit: conf setrlimit RLIMIT_STACK no change in value: 8388608
slurmstepd: _set_limit: conf setrlimit RLIMIT_CORE no change in value: 0
slurmstepd: _set_limit: conf setrlimit RLIMIT_RSS no change in value:
18446744073709551615
slurmstepd: _set_limit: conf setrlimit RLIMIT_CORE no change in value: 0
slurmstepd: _set_limit: conf setrlimit RLIMIT_RSS no change in value:
18446744073709551615
slurmstepd: _set_limit: RLIMIT_NPROC  : max:256553 cur:256553 req:4096
slurmstepd: _set_limit: conf setrlimit RLIMIT_NPROC succeeded
slurmstepd: _set_limit: RLIMIT_NPROC  : max:256553 cur:256553 req:4096
slurmstepd: _set_limit: conf setrlimit RLIMIT_NPROC succeeded
slurmstepd: _set_limit: RLIMIT_NOFILE : max:4096 cur:4096 req:1024
slurmstepd: _set_limit: conf setrlimit RLIMIT_NOFILE succeeded
slurmstepd: _set_limit: conf setrlimit RLIMIT_MEMLOCK no change in
value: 18446744073709551615
slurmstepd: _set_limit: RLIMIT_NOFILE : max:4096 cur:4096 req:1024
slurmstepd: _set_limit: conf setrlimit RLIMIT_NOFILE succeeded
slurmstepd: _set_limit: conf setrlimit RLIMIT_MEMLOCK no change in
value: 18446744073709551615
slurmstepd: _set_limit: conf setrlimit RLIMIT_AS no change in value:
18446744073709551615
slurmstepd: _set_limit: conf setrlimit RLIMIT_AS no change in value:
18446744073709551615
slurmstepd: Handling REQUEST_STEP_UID
slurmstepd: Handling REQUEST_SIGNAL_CONTAINER
slurmstepd: _handle_signal_container for step=991.0 uid=0 signal=995
slurmstepd: mpi/pmi2: got client PMI1 init, version=2.0
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: mpi/pmi2: got client request: 53
cmd=fullinit;pmijobid=991.0;pmirank=0;threaded=FALSE;
slurmstepd: mpi/pmi2: client_resp_send: 114
cmd=fullinit-response;rc=0;pmi-version=2;pmi-subversion=0;rank=0;size=2;appnum=-1;debugged=FALSE;pmiverbose=FALSE;
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: mpi/pmi2: got client request: 53
cmd=fullinit;pmijobid=991.0;pmirank=1;threaded=FALSE;
slurmstepd: mpi/pmi2: client_resp_send: 114
cmd=fullinit-response;rc=0;pmi-version=2;pmi-subversion=0;rank=1;size=2;appnum=-1;debugged=FALSE;pmiverbose=FALSE;
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: mpi/pmi2: got client request: 44
cmd=info-getjobattr;key=PMI_process_mapping;
slurmstepd: mpi/pmi2: client_resp_send: 68
cmd=info-getjobattr-response;rc=0;found=TRUE;value=(vector,(0,1,2));
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: mpi/pmi2: got client request: 44
cmd=info-getjobattr;key=PMI_process_mapping;
slurmstepd: mpi/pmi2: client_resp_send: 68
cmd=info-getjobattr-response;rc=0;found=TRUE;value=(vector,(0,1,2));
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: mpi/pmi2: got client request: 13     cmd=finalize;
slurmstepd: mpi/pmi2: client_resp_send: 27    cmd=finalize-response;rc=0;
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd:     false, finalized
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: mpi/pmi2: got client request: 13     cmd=finalize;
slurmstepd: mpi/pmi2: client_resp_send: 27    cmd=finalize-response;rc=0;
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd:     false, finalized
slurmstepd: mpi/pmi2: _task_readable
slurmstepd:     false, finalized
Process 0 of 2 is on acme11.ciemat.es
Hello world from process 0 of 2
Goodbye world from process 0 of 2
Process 1 of 2 is on acme11.ciemat.es
Hello world from process 1 of 2
Goodbye world from process 1 of 2
slurmstepd: task 0 (18514) exited with exit code 0.
slurmstepd: task_p_post_term: 991.0, task 0
slurmstepd: task 1 (18515) exited with exit code 0.
slurmstepd: task_p_post_term: 991.0, task 1
slurmstepd: No child processes
slurmstepd: Sending SIGKILL to pgid 18508
slurmstepd: Waiting for IO
slurmstepd: Closing debug channel



-bash-4.2$ srun -n 2 --tasks-per-node=1 --mpi=pmi2 --slurmd-debug=4 hostname
slurmstepd: debug level = 6
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: Uncached user/gid: slurm/1001
slurmstepd: debug level = 6
slurmstepd: IO handler started pid=23700
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: Uncached user/gid: slurm/1001
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: task 0 (18458) started 2016-11-25T12:57:09
slurmstepd: task_p_pre_launch_priv: 988.0
slurmstepd: task 1 (23706) started 2016-11-25T12:57:09
slurmstepd: Uncached user/gid: slurm/1001
slurmstepd: task_p_pre_launch_priv: 988.0
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: Uncached user/gid: slurm/1001
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: job_container none plugin loaded
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: Reading cgroup.conf file /home/localsoft/slurm/etc/cgroup.conf
slurmstepd: job_container none plugin loaded
slurmstepd: unable to get cgroup '/cgroup/cpuset' entry
'/cgroup/cpuset/slurm/system' properties: No such file or directory
slurmstepd: Reading cgroup.conf file /home/localsoft/slurm/etc/cgroup.conf
acme11.ciemat.es
slurmstepd: unable to get cgroup '/cgroup/memory' entry
'/cgroup/memory/slurm/system' properties: No such file or directory
slurmstepd: Sending launch resp rc=0
slurmstepd: unable to get cgroup '/cgroup/cpuset' entry
'/cgroup/cpuset/slurm/system' properties: No such file or directory
slurmstepd: unable to get cgroup '/cgroup/memory' entry
'/cgroup/memory/slurm/system' properties: No such file or directory
slurmstepd: Sending launch resp rc=0
slurmstepd: mpi type = (null)
slurmstepd: mpi type = (null)
slurmstepd: task_p_pre_launch: 988.0, task 0
slurmstepd: task_p_pre_launch: 988.0, task 1
slurmstepd: _set_limit: conf setrlimit RLIMIT_CPU no change in value:
18446744073709551615
slurmstepd: _set_limit: conf setrlimit RLIMIT_FSIZE no change in
value: 18446744073709551615
slurmstepd: _set_limit: conf setrlimit RLIMIT_CPU no change in value:
18446744073709551615
slurmstepd: _set_limit: conf setrlimit RLIMIT_DATA no change in value:
18446744073709551615
slurmstepd: _set_limit: conf setrlimit RLIMIT_STACK no change in value: 8388608
slurmstepd: _set_limit: conf setrlimit RLIMIT_FSIZE no change in
value: 18446744073709551615
slurmstepd: _set_limit: conf setrlimit RLIMIT_DATA no change in value:
18446744073709551615
acme12.ciemat.es
slurmstepd: _set_limit: conf setrlimit RLIMIT_CORE no change in value: 0
slurmstepd: _set_limit: conf setrlimit RLIMIT_RSS no change in value:
18446744073709551615
slurmstepd: _set_limit: conf setrlimit RLIMIT_STACK no change in value: 8388608
slurmstepd: _set_limit: RLIMIT_NPROC  : max:256553 cur:256553 req:4096
slurmstepd: _set_limit: conf setrlimit RLIMIT_NPROC succeeded
slurmstepd: _set_limit: conf setrlimit RLIMIT_CORE no change in value: 0
slurmstepd: _set_limit: RLIMIT_NOFILE : max:4096 cur:4096 req:1024
slurmstepd: _set_limit: conf setrlimit RLIMIT_NOFILE succeeded
slurmstepd: _set_limit: conf setrlimit RLIMIT_MEMLOCK no change in
value: 18446744073709551615
slurmstepd: _set_limit: conf setrlimit RLIMIT_RSS no change in value:
18446744073709551615
slurmstepd: _set_limit: conf setrlimit RLIMIT_AS no change in value:
18446744073709551615
slurmstepd: _set_limit: RLIMIT_NPROC  : max:256555 cur:256555 req:4096
slurmstepd: _set_limit: conf setrlimit RLIMIT_NPROC succeeded
slurmstepd: _set_limit: RLIMIT_NOFILE : max:4096 cur:4096 req:1024
slurmstepd: task 0 (18458) exited with exit code 0.
slurmstepd: _set_limit: conf setrlimit RLIMIT_NOFILE succeeded
slurmstepd: task_p_post_term: 988.0, task 0
slurmstepd: _set_limit: conf setrlimit RLIMIT_MEMLOCK no change in
value: 18446744073709551615
slurmstepd: _set_limit: conf setrlimit RLIMIT_AS no change in value:
18446744073709551615
slurmstepd: Handling REQUEST_STEP_UID
slurmstepd: Sending SIGKILL to pgid 18452
slurmstepd: Waiting for IO
slurmstepd: Closing debug channel
slurmstepd: Handling REQUEST_SIGNAL_CONTAINER
slurmstepd: _handle_signal_container for step=988.0 uid=0 signal=995
slurmstepd: task 1 (23706) exited with exit code 0.
slurmstepd: task_p_post_term: 988.0, task 0
slurmstepd: Sending SIGKILL to pgid 23700
slurmstepd: Waiting for IO
slurmstepd: Closing debug channel



-bash-4.2$ srun -n 2 --tasks-per-node=1 --mpi=pmi2 --slurmd-debug=4
./helloWorldMPI
slurmstepd: debug level = 6
slurmstepd: IO handler started pid=23714
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: Uncached user/gid: slurm/1001
slurmstepd: debug level = 6
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: Uncached user/gid: slurm/1001
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: task 1 (23720) started 2016-11-25T12:57:17
slurmstepd: task_p_pre_launch_priv: 989.0
slurmstepd: Uncached user/gid: slurm/1001
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: task_p_pre_launch_priv: 989.0
slurmstepd: task 0 (18474) started 2016-11-25T12:57:17
slurmstepd: Uncached user/gid: slurm/1001
slurmstepd: job_container none plugin loaded
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: Reading cgroup.conf file /home/localsoft/slurm/etc/cgroup.conf
slurmstepd: job_container none plugin loaded
slurmstepd: unable to get cgroup '/cgroup/cpuset' entry
'/cgroup/cpuset/slurm/system' properties: No such file or directory
slurmstepd: Reading cgroup.conf file /home/localsoft/slurm/etc/cgroup.conf
slurmstepd: unable to get cgroup '/cgroup/memory' entry
'/cgroup/memory/slurm/system' properties: No such file or directory
slurmstepd: Sending launch resp rc=0
slurmstepd: unable to get cgroup '/cgroup/cpuset' entry
'/cgroup/cpuset/slurm/system' properties: No such file or directory
slurmstepd: mpi type = (null)
slurmstepd: unable to get cgroup '/cgroup/memory' entry
'/cgroup/memory/slurm/system' properties: No such file or directory
slurmstepd: Sending launch resp rc=0
slurmstepd: task_p_pre_launch: 989.0, task 1
slurmstepd: mpi type = (null)
slurmstepd: _set_limit: conf setrlimit RLIMIT_CPU no change in value:
18446744073709551615
slurmstepd: task_p_pre_launch: 989.0, task 0
slurmstepd: _set_limit: conf setrlimit RLIMIT_FSIZE no change in
value: 18446744073709551615
slurmstepd: _set_limit: conf setrlimit RLIMIT_CPU no change in value:
18446744073709551615
slurmstepd: _set_limit: conf setrlimit RLIMIT_FSIZE no change in
value: 18446744073709551615
slurmstepd: _set_limit: conf setrlimit RLIMIT_DATA no change in value:
18446744073709551615
slurmstepd: _set_limit: conf setrlimit RLIMIT_DATA no change in value:
18446744073709551615
slurmstepd: _set_limit: conf setrlimit RLIMIT_STACK no change in value: 8388608
slurmstepd: _set_limit: conf setrlimit RLIMIT_STACK no change in value: 8388608
slurmstepd: _set_limit: conf setrlimit RLIMIT_CORE no change in value: 0
slurmstepd: _set_limit: conf setrlimit RLIMIT_RSS no change in value:
18446744073709551615
slurmstepd: _set_limit: conf setrlimit RLIMIT_CORE no change in value: 0
slurmstepd: _set_limit: RLIMIT_NPROC  : max:256553 cur:256553 req:4096
slurmstepd: _set_limit: conf setrlimit RLIMIT_NPROC succeeded
slurmstepd: _set_limit: RLIMIT_NOFILE : max:4096 cur:4096 req:1024
slurmstepd: _set_limit: conf setrlimit RLIMIT_NOFILE succeeded
slurmstepd: _set_limit: conf setrlimit RLIMIT_RSS no change in value:
18446744073709551615
slurmstepd: _set_limit: conf setrlimit RLIMIT_MEMLOCK no change in
value: 18446744073709551615
slurmstepd: _set_limit: conf setrlimit RLIMIT_AS no change in value:
18446744073709551615
slurmstepd: _set_limit: RLIMIT_NPROC  : max:256555 cur:256555 req:4096
slurmstepd: Handling REQUEST_STEP_UID
slurmstepd: _set_limit: conf setrlimit RLIMIT_NPROC succeeded
slurmstepd: Handling REQUEST_SIGNAL_CONTAINER
slurmstepd: _handle_signal_container for step=989.0 uid=0 signal=995
slurmstepd: _set_limit: RLIMIT_NOFILE : max:4096 cur:4096 req:1024
slurmstepd: mpi/pmi2: got client PMI1 init, version=2.0
slurmstepd: _set_limit: conf setrlimit RLIMIT_NOFILE succeeded
slurmstepd: _set_limit: conf setrlimit RLIMIT_MEMLOCK no change in
value: 18446744073709551615
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: _set_limit: conf setrlimit RLIMIT_AS no change in value:
18446744073709551615
slurmstepd: mpi/pmi2: got client request: 53
cmd=fullinit;pmijobid=989.0;pmirank=0;threaded=FALSE;
slurmstepd: Handling REQUEST_STEP_UID
slurmstepd: Handling REQUEST_SIGNAL_CONTAINER
slurmstepd: _handle_signal_container for step=989.0 uid=0 signal=995
slurmstepd: mpi/pmi2: got client PMI1 init, version=2.0
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: mpi/pmi2: got client request: 53
cmd=fullinit;pmijobid=989.0;pmirank=1;threaded=FALSE;
slurmstepd: mpi/pmi2: client_resp_send: 114
cmd=fullinit-response;rc=0;pmi-version=2;pmi-subversion=0;rank=1;size=2;appnum=-1;debugged=FALSE;pmiverbose=FALSE;
slurmstepd: mpi/pmi2: client_resp_send: 114
cmd=fullinit-response;rc=0;pmi-version=2;pmi-subversion=0;rank=0;size=2;appnum=-1;debugged=FALSE;pmiverbose=FALSE;
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: mpi/pmi2: got client request: 44
cmd=info-getjobattr;key=PMI_process_mapping;
slurmstepd: mpi/pmi2: client_resp_send: 68
cmd=info-getjobattr-response;rc=0;found=TRUE;value=(vector,(0,2,1));
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: mpi/pmi2: got client request: 44
cmd=info-getjobattr;key=PMI_process_mapping;
slurmstepd: mpi/pmi2: client_resp_send: 68
cmd=info-getjobattr-response;rc=0;found=TRUE;value=(vector,(0,2,1));
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: mpi/pmi2: got client request: 64
cmd=kvs-put;key=MVAPICH2_0000;value=0000000a:00004cae:00004caf:;
slurmstepd: mpi/pmi2: client_resp_send: 26    cmd=kvs-put-response;rc=0;
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: mpi/pmi2: got client request: 14     cmd=kvs-fence;
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: mpi/pmi2: got client request: 64
cmd=kvs-put;key=MVAPICH2_0001;value=00000009:0001ad87:0001ad88:;
slurmstepd: mpi/pmi2: client_resp_send: 26    cmd=kvs-put-response;rc=0;
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: mpi/pmi2: got client request: 14     cmd=kvs-fence;
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: mpi/pmi2: _tree_listen_read
slurmstepd: mpi/pmi2: client_resp_send: 28    cmd=kvs-fence-response;rc=0;
slurmstepd: mpi/pmi2: _tree_listen_read
slurmstepd: mpi/pmi2: client_resp_send: 28    cmd=kvs-fence-response;rc=0;
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: mpi/pmi2: got client request: 59
cmd=kvs-get;jobid=singleton_kvs;srcid=-1;key=MVAPICH2_0000;
slurmstepd: mpi/pmi2: client_resp_send: 71
cmd=kvs-get-response;rc=0;found=TRUE;value=0000000a:00004cae:00004caf:;
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: mpi/pmi2: got client request: 59
cmd=kvs-get;jobid=singleton_kvs;srcid=-1;key=MVAPICH2_0001;
slurmstepd: mpi/pmi2: got client request: 59
cmd=kvs-get;jobid=singleton_kvs;srcid=-1;key=MVAPICH2_0000;
slurmstepd: mpi/pmi2: client_resp_send: 71
cmd=kvs-get-response;rc=0;found=TRUE;value=0000000a:00004cae:00004caf:;
slurmstepd: mpi/pmi2: client_resp_send: 71
cmd=kvs-get-response;rc=0;found=TRUE;value=00000009:0001ad87:0001ad88:;
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: mpi/pmi2: got client request: 59
cmd=kvs-get;jobid=singleton_kvs;srcid=-1;key=MVAPICH2_0001;
slurmstepd: mpi/pmi2: got client request: 14     cmd=kvs-fence;
slurmstepd: mpi/pmi2: client_resp_send: 71
cmd=kvs-get-response;rc=0;found=TRUE;value=00000009:0001ad87:0001ad88:;
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: mpi/pmi2: got client request: 14     cmd=kvs-fence;
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: mpi/pmi2: _tree_listen_read
slurmstepd: mpi/pmi2: client_resp_send: 28    cmd=kvs-fence-response;rc=0;
slurmstepd: mpi/pmi2: _tree_listen_read
slurmstepd: mpi/pmi2: client_resp_send: 28    cmd=kvs-fence-response;rc=0;
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: mpi/pmi2: got client request: 14     cmd=kvs-fence;
slurmstepd: mpi/pmi2: got client request: 14     cmd=kvs-fence;
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: mpi/pmi2: _tree_listen_read
slurmstepd: mpi/pmi2: client_resp_send: 28    cmd=kvs-fence-response;rc=0;
slurmstepd: mpi/pmi2: _tree_listen_read
slurmstepd: mpi/pmi2: client_resp_send: 28    cmd=kvs-fence-response;rc=0;
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: mpi/pmi2: got client request: 14     cmd=kvs-fence;
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: mpi/pmi2: got client request: 14     cmd=kvs-fence;
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: mpi/pmi2: _tree_listen_read
slurmstepd: mpi/pmi2: client_resp_send: 28    cmd=kvs-fence-response;rc=0;
slurmstepd: mpi/pmi2: _tree_listen_read
slurmstepd: mpi/pmi2: client_resp_send: 28    cmd=kvs-fence-response;rc=0;
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: mpi/pmi2: got client request: 14     cmd=kvs-fence;
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: mpi/pmi2: got client request: 14     cmd=kvs-fence;
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: mpi/pmi2: _tree_listen_read
slurmstepd: mpi/pmi2: _tree_listen_read
slurmstepd: mpi/pmi2: client_resp_send: 28    cmd=kvs-fence-response;rc=0;
[acme12.ciemat.es:mpi_rank_1][error_sighandler] Caught error:
Segmentation fault (signal 11)
slurmstepd: mpi/pmi2: client_resp_send: 28    cmd=kvs-fence-response;rc=0;
[acme11.ciemat.es:mpi_rank_0][error_sighandler] Caught error:
Segmentation fault (signal 11)
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: task 0 (18474) exited. Killed by signal 11 (core dumped).
slurmstepd: task_p_post_term: 989.0, task 0
slurmstepd: task 1 (23720) exited. Killed by signal 11 (core dumped).
slurmstepd: task_p_post_term: 989.0, task 0
slurmstepd: Sending SIGKILL to pgid 18468
slurmstepd: Waiting for IO
slurmstepd: Closing debug channel
srun: error: acme11: task 0: Segmentation fault (core dumped)
slurmstepd: Sending SIGKILL to pgid 23714
slurmstepd: Waiting for IO
slurmstepd: Closing debug channel
srun: error: acme12: task 1: Segmentation fault (core dumped)




SLURMCTLD OUTPUT


slurmctld: debug2: sched: Processing RPC: REQUEST_RESOURCE_ALLOCATION
from uid=500
slurmctld: debug3: JobDesc: user_id=500 job_id=N/A partition=(null)
name=helloWorldMPI
slurmctld: debug3:    cpus=2-4294967294 pn_min_cpus=-1 core_spec=-1
slurmctld: debug3:    Nodes=1-[4294967294] Sock/Node=65534
Core/Sock=65534 Thread/Core=65534
slurmctld: debug3:    pn_min_memory_job=-1 pn_min_tmp_disk=-1
slurmctld: debug3:    immediate=0 features=(null) reservation=(null)
slurmctld: debug3:    req_nodes=(null) exc_nodes=(null) gres=(null)
slurmctld: debug3:    time_limit=-1--1 priority=-1 contiguous=0 shared=-1
slurmctld: debug3:    kill_on_node_fail=-1 script=(null)
slurmctld: debug3:    argv="./helloWorldMPI"
slurmctld: debug3:    stdin=(null) stdout=(null) stderr=(null)
slurmctld: debug3:    work_dir=/home/slurm/tests alloc_node:sid=acme31:11242
slurmctld: debug3:    sicp_mode=0 power_flags=
slurmctld: debug3:    resp_host=172.17.31.165 alloc_resp_port=40661
other_port=48105
slurmctld: debug3:    dependency=(null) account=(null) qos=(null) comment=(null)
slurmctld: debug3:    mail_type=0 mail_user=(null) nice=0 num_tasks=2
open_mode=0 overcommit=-1 acctg_freq=(null)
slurmctld: debug3:    network=(null) begin=Unknown cpus_per_task=-1
requeue=-1 licenses=(null)
slurmctld: debug3:    end_time= signal=0 at 0 wait_all_nodes=-1 cpu_freq=
slurmctld: debug3:    ntasks_per_node=1 ntasks_per_socket=-1 ntasks_per_core=-1
slurmctld: debug3:    mem_bind=65534:(null) plane_size:65534
slurmctld: debug3:    array_inx=(null)
slurmctld: debug3:    burst_buffer=(null)
slurmctld: debug3: found correct user
slurmctld: debug3: found correct association
slurmctld: debug3: found correct qos
slurmctld: debug3: before alteration asking for nodes 1-4294967294
cpus 2-4294967294
slurmctld: debug3: after alteration asking for nodes 1-4294967294 cpus
2-4294967294
slurmctld: debug2: found 8 usable nodes from config containing acme[11-14,21-24]
slurmctld: debug3: _pick_best_nodes: job 994 idle_nodes 8 share_nodes 8
slurmctld: debug3: powercapping: checking job 994 : skipped, capping disabled
slurmctld: debug2: sched: JobId=994 allocated resources: NodeList=acme[11-12]
slurmctld: sched: _slurm_rpc_allocate_resources JobId=994
NodeList=acme[11-12] usec=1340
slurmctld: debug3: Writing job id 994 to header record of job_state file
slurmctld: debug2: _slurm_rpc_job_ready(994)=3 usec=5
slurmctld: debug3: StepDesc: user_id=500 job_id=994 node_count=2-2
cpu_count=2 num_tasks=2
slurmctld: debug3:    cpu_freq_gov=4294967294 cpu_freq_max=4294967294
cpu_freq_min=4294967294 relative=65534 task_dist=0x1 plane=1
slurmctld: debug3:    node_list=(null)  constraints=(null)
slurmctld: debug3:    host=acme31 port=48845 name=helloWorldMPI
network=(null) exclusive=0
slurmctld: debug3:    checkpoint-dir=/home/localsoft/slurm/checkpoint
checkpoint_int=0
slurmctld: debug3:    mem_per_node=0 resv_port_cnt=65534 immediate=0 no_kill=0
slurmctld: debug3:    overcommit=0 time_limit=0 gres=(null)
slurmctld: debug:  Configuration for job 994 complete
slurmctld: debug3: step_layout cpus = 16 pos = 0
slurmctld: debug3: step_layout cpus = 16 pos = 1
slurmctld: debug:  laying out the 2 tasks on 2 hosts acme[11-12] dist 1
slurmctld: debug2: Processing RPC: REQUEST_COMPLETE_JOB_ALLOCATION
from uid=500, JobId=994 rc=139
slurmctld: job_complete: JobID=994 State=0x1 NodeCnt=2 WTERMSIG 11
slurmctld: debug2: Spawning RPC agent for msg_type SRUN_JOB_COMPLETE
slurmctld: debug2: got 1 threads to send out
slurmctld: debug2: Spawning RPC agent for msg_type SRUN_JOB_COMPLETE
slurmctld: debug2: got 1 threads to send out
slurmctld: debug2: Spawning RPC agent for msg_type REQUEST_TERMINATE_JOB
slurmctld: job_complete: JobID=994 State=0x8003 NodeCnt=2 done
slurmctld: debug2: _slurm_rpc_complete_job_allocation: JobID=994
State=0x8003 NodeCnt=2
slurmctld: debug2: got 1 threads to send out
slurmctld: debug2: Tree head got back 0 looking for 2
slurmctld: debug3: Tree sending to acme12
slurmctld: debug3: slurm_send_only_node_msg: sent 181
slurmctld: debug3: Tree sending to acme11
slurmctld: debug3: slurm_send_only_node_msg: sent 181
slurmctld: debug4: orig_timeout was 10000 we have 0 steps and a timeout of 10000
slurmctld: debug4: orig_timeout was 10000 we have 0 steps and a timeout of 10000
slurmctld: debug2: Tree head got back 1
slurmctld: debug2: Tree head got back 2
slurmctld: debug2: node_did_resp acme12
slurmctld: debug2: node_did_resp acme11
slurmctld: debug:  sched: Running job scheduler
slurmctld: debug3: Writing job id 994 to header record of job_state file
slurmctld: debug2: Testing job time limits and checkpoints
slurmctld: debug:  sched: Running job scheduler
slurmctld: debug2: Performing purge of old job records
slurmctld: debug2: Performing purge of old job records
slurmctld: debug2: purge_old_job: purged 1 old job records





SLURMD OUTPUT

slurmd: debug3: in the service_connection
slurmd: debug2: got this type of message 6001
slurmd: debug2: Processing RPC: REQUEST_LAUNCH_TASKS
slurmd: launch task 996.0 request from 500.1001 at 172.17.31.165 (port 37857)
slurmd: debug3: state for jobid 985: ctime:1480074710 revoked:0 expires:0
slurmd: debug3: state for jobid 986: ctime:1480075007 revoked:0 expires:0
slurmd: debug3: state for jobid 987: ctime:1480075020 revoked:0 expires:0
slurmd: debug3: state for jobid 988: ctime:1480075029 revoked:0 expires:0
slurmd: debug3: state for jobid 989: ctime:1480075037 revoked:0 expires:0
slurmd: debug3: state for jobid 990: ctime:1480075051 revoked:0 expires:0
slurmd: debug3: state for jobid 991: ctime:1480075058 revoked:0 expires:0
slurmd: debug3: state for jobid 992: ctime:1480075668
revoked:1480075668 expires:1480075668
slurmd: debug3: state for jobid 992: ctime:1480075668 revoked:0 expires:0
slurmd: debug3: state for jobid 993: ctime:1480075699
revoked:1480075699 expires:1480075699
slurmd: debug3: state for jobid 993: ctime:1480075699 revoked:0 expires:0
slurmd: debug3: state for jobid 994: ctime:1480075717
revoked:1480075717 expires:1480075717
slurmd: debug3: state for jobid 994: ctime:1480075717 revoked:0 expires:0
slurmd: debug3: state for jobid 995: ctime:1480075746
revoked:1480075746 expires:1480075746
slurmd: debug3: state for jobid 995: ctime:1480075746 revoked:0 expires:0
slurmd: debug:  Checking credential with 276 bytes of sig data
slurmd: debug:  task_p_slurmd_launch_request: 996.0 0
slurmd: _run_prolog: run job script took usec=10
slurmd: _run_prolog: prolog with lock for job 996 ran for 0 seconds
slurmd: debug3: _rpc_launch_tasks: call to _forkexec_slurmstepd
slurmd: debug3: slurmstepd rank 0 (acme11), parent rank -1 (NONE),
children 1, depth 0, max_depth 1
slurmd: debug3: _send_slurmstepd_init: call to getpwuid_r
slurmd: debug3: _send_slurmstepd_init: return from getpwuid_r
slurmd: debug3: _rpc_launch_tasks: return from _forkexec_slurmstepd
slurmd: debug:  task_p_slurmd_reserve_resources: 996 0
slurmd: debug3: in the service_connection
slurmd: debug2: got this type of message 6004
slurmd: debug2: Processing RPC: REQUEST_SIGNAL_TASKS
slurmd: debug:  _rpc_signal_tasks: sending signal 995 to step 996.0 flag 0
slurmd: debug3: in the service_connection
slurmd: debug2: got this type of message 5029
slurmd: debug3: Entering _rpc_forward_data, address:
/tmp/sock.pmi2.996.0, len: 106
slurmd: debug3: in the service_connection
slurmd: debug2: got this type of message 5029
slurmd: debug3: Entering _rpc_forward_data, address:
/tmp/sock.pmi2.996.0, len: 6
slurmd: debug3: in the service_connection
slurmd: debug2: got this type of message 5029
slurmd: debug3: Entering _rpc_forward_data, address:
/tmp/sock.pmi2.996.0, len: 6
slurmd: debug3: in the service_connection
slurmd: debug2: got this type of message 5029
slurmd: debug3: Entering _rpc_forward_data, address:
/tmp/sock.pmi2.996.0, len: 6
slurmd: debug3: in the service_connection
slurmd: debug2: got this type of message 5029
slurmd: debug3: Entering _rpc_forward_data, address:
/tmp/sock.pmi2.996.0, len: 6
slurmd: debug3: in the service_connection
slurmd: debug2: got this type of message 5016
slurmd: debug3: Entering _rpc_step_complete
slurmd: debug:  Entering stepd_completion, range_first = 1, range_last = 1
slurmd: debug3: in the service_connection
slurmd: debug2: got this type of message 6011
slurmd: debug2: Processing RPC: REQUEST_TERMINATE_JOB
slurmd: debug:  _rpc_terminate_job, uid = 500
slurmd: debug:  task_p_slurmd_release_resources: 996
slurmd: debug:  credential for job 996 revoked
slurmd: debug2: No steps in jobid 996 to send signal 18
slurmd: debug2: No steps in jobid 996 to send signal 15
slurmd: debug4: sent ALREADY_COMPLETE
slurmd: debug2: set revoke expiration for jobid 996 to 1480075907 UTS









2016-11-24 16:50 GMT+01:00 Sourav Chakraborty
<chakraborty.52 at buckeyemail.osu.edu>:
> Hi Manuel,
>
> Thanks for reporting the issue.
>
> Since Mvapich2 was configured with pmi2, the following way to launch jobs is
> correct:
> srun -n 2 --mpi=pmi2 ./helloMPI
>
> Can you please post the output of the following command? This will have more
> information to identify the issue.
> srun -n 2 --mpi=pmi2 --slurmd-debug=5 ./helloMPI
>
> Also, to identify if this is an Mvapich2 specific issue, can you please try
> running the following command?
> srun -n 2 --mpi=pmi2 hostname
>
> Thanks,
> Sourav
>
>
> On Thu, Nov 24, 2016 at 6:55 AM, Manuel Rodríguez Pascual
> <manuel.rodriguez.pascual at gmail.com> wrote:
>>
>> Hi all,
>>
>> I am trying to make MVAPICH2 work with Slurm, but I keep having some
>> issues. I know there are quite a lot of threads on the subject, but none of
>> them seems to solve my problem: Slurm is executing two serial jobs
>> instead of a single parallel one.
>>
>> Below I have included quite a lot of information about how I have
>> configured my cluster and the different tests that I have performed,
>> in case it helps.
>> ---
>> ---
>> COMPILATION
>>
>> --- Slurm  17.02.0-0pre2:
>> ./configure --prefix=/home/localsoft/slurm/
>>
>> slurm.conf:
>> MpiDefault=none
>>
>>
>> --- MVAPICH mvapich2-2.2
>>
>>
>> After quite a lot of different tests, I've been able to compile MVAPICH2
>> with the following environment and options (config.log is attached to this
>> mail):
>>
>> Environment vars:
>> LD_LIBRARY_PATH =
>> /usr/local/lib:/home/localsoft/slurm/lib:/home/localsoft/mvapich2/lib (and
>> some non related stuff)
>> MPICHLIB_LDFLAGS='-Wl,-rpath,/home/localsoft/slurm/lib
>> -Wl,-rpath,/home/localsoft/mvapich2/lib'
>>
>> Compilation:
>>
>> ./configure --prefix=/home/localsoft/mvapich2 --disable-mcast
>> --with-slurm=/home/localsoft/slurm --with-pmi=pmi2 --with-pm=slurm
>> --disable-romio
>>
>> Then, in every node of my cluster I have set LD_LIBRARY_PATH to the same
>> value.
>>
>> My code is compiled with:
>> mpicc  helloWorldMPI.c -o helloWorldMPI -L/home/localsoft/slurm/lib
>>
>>
>> ---
>> ---
>> EXECUTION:
>>
>> --serial jobs: OK
>> $ ./helloWorldMPI
>> Process 0 of 1 is on acme31.ciemat.es
>> Hello world from process 0 of 1
>> Goodbye world from process 0 of 1
>>
>>
>> $ srun   ./helloWorldMPI
>> Process 0 of 1 is on acme11.ciemat.es
>> Hello world from process 0 of 1
>> Goodbye world from process 0 of 1
>>
>> --parallel jobs: As you can see, Slurm is executing two serial jobs
>> instead of a single parallel one.
>>
>> $ srun -n 2   ./helloWorldMPI
>> Process 0 of 1 is on acme11.ciemat.es
>> Hello world from process 0 of 1
>> Goodbye world from process 0 of 1
>>
>> Process 0 of 1 is on acme11.ciemat.es
>> Hello world from process 0 of 1
>> Goodbye world from process 0 of 1
>>
>>
>> $ srun -n 2 --tasks-per-node=1   ./helloWorldMPI
>> Process 0 of 1 is on acme11.ciemat.es
>> Hello world from process 0 of 1
>> Goodbye world from process 0 of 1
>>
>> Process 0 of 1 is on acme12.ciemat.es
>> Hello world from process 0 of 1
>> Goodbye world from process 0 of 1
>>
>>
>> --different Slurm MPI types:
>>
>> $ srun -n 2 --mpi=none ./helloWorldMPI
>> Process 0 of 1 is on acme11.ciemat.es
>> Hello world from process 0 of 1
>>
>> Goodbye world from process 0 of 1
>> Process 0 of 1 is on acme11.ciemat.es
>> Hello world from process 0 of 1
>> Goodbye world from process 0 of 1
>>
>>
>> $ srun -n 2 --mpi=mvapich ./helloWorldMPI
>> Process 0 of 1 is on acme11.ciemat.es
>> Hello world from process 0 of 1
>> Goodbye world from process 0 of 1
>>
>> Process 0 of 1 is on acme11.ciemat.es
>> Hello world from process 0 of 1
>> Goodbye world from process 0 of 1
>>
>> $ srun -n 2 --mpi=pmi2 ./helloWorldMPI
>> srun: error: task 0 launch failed: Unspecified error
>> srun: error: task 1 launch failed: Unspecified error
>>
>>
>> ---
>> ---
>>
>> Any clue on what's wrong?
>>
>> Thanks for your help,
>>
>>
>> Manuel
>>
>>
>>
>>
>>
>> _______________________________________________
>> mvapich-discuss mailing list
>> mvapich-discuss at cse.ohio-state.edu
>> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>
>


