[mvapich-discuss] Fwd: problem configuring mvapich2 with Slurm

Manuel Rodríguez Pascual manuel.rodriguez.pascual at gmail.com
Mon Nov 28 07:24:15 EST 2016


Following up on the previous mail with a lot of debugging info (quoted below)...

Besides trying to solve this issue and finding any possible bug in the
newest version of mvapich2 (a task I am definitely willing to help
with), is there any combination of Slurm plus mvapich2 that is known to
be reliable and painless to install and manage?

Cheers,

Manuel

2016-11-25 13:49 GMT+01:00 Manuel Rodríguez Pascual
<manuel.rodriguez.pascual at gmail.com>:
> ok, things are getting a little weird now.
>
> I experimented with your hostname suggestion. It did not work either. I
> then tried downgrading Slurm to version 15.08 and it DID work, so it
> looks like something is broken there. Anyway, that's something to
> discuss with the Slurm people, not here.
>
> Now, regarding your tests, it seems clear that the problem arises when
> mvapich2 tries to communicate between two nodes. Please find below the
> output of:
>
> -running a serial application (hostname) on one node
> -running one task of a parallel application (helloWorldMPI) on one node
>
> -running two instances of a serial application on one node
> -running one instance of a parallel application with two tasks on one node
>
>
> -running two instances of a serial application, each one on a different node
> -running one instance of a parallel application with two tasks, each
> one on a different node <-- this one crashes
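>
> For reference, helloWorldMPI is just a standard MPI "hello world". A
> minimal sketch of what it is assumed to look like (not the exact source
> used in these runs, but it matches the per-rank output seen in the logs):
>
>   /* helloWorldMPI.c -- hypothetical reconstruction of the test program */
>   #include <stdio.h>
>   #include <mpi.h>
>
>   int main(int argc, char *argv[])
>   {
>       int rank, size, namelen;
>       char name[MPI_MAX_PROCESSOR_NAME];
>
>       MPI_Init(&argc, &argv);
>       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>       MPI_Comm_size(MPI_COMM_WORLD, &size);
>       MPI_Get_processor_name(name, &namelen);
>
>       /* These lines correspond to the per-rank output in the runs below */
>       printf("Process %d of %d is on %s\n", rank, size, name);
>       printf("Hello world from process %d of %d\n", rank, size);
>       printf("Goodbye world from process %d of %d\n", rank, size);
>
>       MPI_Finalize();
>       return 0;
>   }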
>
>
>
> -bash-4.2$ mpichversion
> MVAPICH2 Version:     2.2
> MVAPICH2 Release date: Thu Sep 08 22:00:00 EST 2016
> MVAPICH2 Device:       ch3:mrail
> MVAPICH2 configure:   --prefix=/home/localsoft/mvapich2
> --disable-mcast --with-slurm=/home/localsoft/slurm --with-pmi=pmi2
> --with-pm=slurm
> MVAPICH2 CC:   gcc    -DNDEBUG -DNVALGRIND -O2
> MVAPICH2 CXX: g++   -DNDEBUG -DNVALGRIND -O2
> MVAPICH2 F77: gfortran   -O2
> MVAPICH2 FC:   gfortran   -O2
>
>
> -bash-4.2$ slurmd -V
> slurm 15.08.12
>
>
>
> -bash-4.2$ srun -n 1 --tasks-per-node=1 --mpi=pmi2 --slurmd-debug=4 hostname
> slurmstepd: debug level = 6
> slurmstepd: IO handler started pid=18416
> slurmstepd: mpi/pmi2: _tree_listen_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: Uncached user/gid: slurm/1001
> slurmstepd: mpi/pmi2: _tree_listen_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: mpi/pmi2: _tree_listen_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: task 0 (18422) started 2016-11-25T12:56:47
> slurmstepd: task_p_pre_launch_priv: 986.0
> slurmstepd: Uncached user/gid: slurm/1001
> slurmstepd: mpi/pmi2: _tree_listen_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: job_container none plugin loaded
> slurmstepd: Reading cgroup.conf file /home/localsoft/slurm/etc/cgroup.conf
> slurmstepd: unable to get cgroup '/cgroup/cpuset' entry
> '/cgroup/cpuset/slurm/system' properties: No such file or directory
> slurmstepd: unable to get cgroup '/cgroup/memory' entry
> '/cgroup/memory/slurm/system' properties: No such file or directory
> slurmstepd: Sending launch resp rc=0
> slurmstepd: mpi type = (null)
> slurmstepd: task_p_pre_launch: 986.0, task 0
> slurmstepd: _set_limit: conf setrlimit RLIMIT_CPU no change in value:
> 18446744073709551615
> slurmstepd: _set_limit: conf setrlimit RLIMIT_FSIZE no change in
> value: 18446744073709551615
> slurmstepd: _set_limit: conf setrlimit RLIMIT_DATA no change in value:
> 18446744073709551615
> slurmstepd: _set_limit: conf setrlimit RLIMIT_STACK no change in value: 8388608
> slurmstepd: _set_limit: conf setrlimit RLIMIT_CORE no change in value: 0
> slurmstepd: _set_limit: conf setrlimit RLIMIT_RSS no change in value:
> 18446744073709551615
> slurmstepd: _set_limit: RLIMIT_NPROC  : max:256553 cur:256553 req:4096
> slurmstepd: _set_limit: conf setrlimit RLIMIT_NPROC succeeded
> slurmstepd: _set_limit: RLIMIT_NOFILE : max:4096 cur:4096 req:1024
> slurmstepd: _set_limit: conf setrlimit RLIMIT_NOFILE succeeded
> slurmstepd: _set_limit: conf setrlimit RLIMIT_MEMLOCK no change in
> value: 18446744073709551615
> slurmstepd: _set_limit: conf setrlimit RLIMIT_AS no change in value:
> 18446744073709551615
> slurmstepd: Handling REQUEST_STEP_UID
> slurmstepd: Handling REQUEST_SIGNAL_CONTAINER
> slurmstepd: _handle_signal_container for step=986.0 uid=0 signal=995
> acme11.ciemat.es
> slurmstepd: task 0 (18422) exited with exit code 0.
> slurmstepd: task_p_post_term: 986.0, task 0
> slurmstepd: Sending SIGKILL to pgid 18416
> slurmstepd: Waiting for IO
> slurmstepd: Closing debug channel
>
>
>
>
> -bash-4.2$ srun -n 1 --tasks-per-node=1 --mpi=pmi2 --slurmd-debug=4
> ./helloWorldMPI
> slurmstepd: debug level = 6
> slurmstepd: IO handler started pid=18430
> slurmstepd: Uncached user/gid: slurm/1001
> slurmstepd: mpi/pmi2: _tree_listen_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: mpi/pmi2: _tree_listen_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: mpi/pmi2: _tree_listen_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: task 0 (18436) started 2016-11-25T12:57:00
> slurmstepd: task_p_pre_launch_priv: 987.0
> slurmstepd: Uncached user/gid: slurm/1001
> slurmstepd: mpi/pmi2: _tree_listen_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: job_container none plugin loaded
> slurmstepd: Reading cgroup.conf file /home/localsoft/slurm/etc/cgroup.conf
> slurmstepd: unable to get cgroup '/cgroup/cpuset' entry
> '/cgroup/cpuset/slurm/system' properties: No such file or directory
> slurmstepd: unable to get cgroup '/cgroup/memory' entry
> '/cgroup/memory/slurm/system' properties: No such file or directory
> slurmstepd: Sending launch resp rc=0
> slurmstepd: mpi type = (null)
> slurmstepd: task_p_pre_launch: 987.0, task 0
> slurmstepd: _set_limit: conf setrlimit RLIMIT_CPU no change in value:
> 18446744073709551615
> slurmstepd: _set_limit: conf setrlimit RLIMIT_FSIZE no change in
> value: 18446744073709551615
> slurmstepd: _set_limit: conf setrlimit RLIMIT_DATA no change in value:
> 18446744073709551615
> slurmstepd: _set_limit: conf setrlimit RLIMIT_STACK no change in value: 8388608
> slurmstepd: _set_limit: conf setrlimit RLIMIT_CORE no change in value: 0
> slurmstepd: _set_limit: conf setrlimit RLIMIT_RSS no change in value:
> 18446744073709551615
> slurmstepd: _set_limit: RLIMIT_NPROC  : max:256553 cur:256553 req:4096
> slurmstepd: _set_limit: conf setrlimit RLIMIT_NPROC succeeded
> slurmstepd: _set_limit: RLIMIT_NOFILE : max:4096 cur:4096 req:1024
> slurmstepd: _set_limit: conf setrlimit RLIMIT_NOFILE succeeded
> slurmstepd: _set_limit: conf setrlimit RLIMIT_MEMLOCK no change in
> value: 18446744073709551615
> slurmstepd: _set_limit: conf setrlimit RLIMIT_AS no change in value:
> 18446744073709551615
> slurmstepd: Handling REQUEST_STEP_UID
> slurmstepd: Handling REQUEST_SIGNAL_CONTAINER
> slurmstepd: _handle_signal_container for step=987.0 uid=0 signal=995
> slurmstepd: mpi/pmi2: got client PMI1 init, version=2.0
> slurmstepd: mpi/pmi2: _tree_listen_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: mpi/pmi2: got client request: 53
> cmd=fullinit;pmijobid=987.0;pmirank=0;threaded=FALSE;
> slurmstepd: mpi/pmi2: client_resp_send: 114
> cmd=fullinit-response;rc=0;pmi-version=2;pmi-subversion=0;rank=0;size=1;appnum=-1;debugged=FALSE;pmiverbose=FALSE;
> slurmstepd: mpi/pmi2: _tree_listen_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: mpi/pmi2: got client request: 13     cmd=finalize;
> slurmstepd: mpi/pmi2: client_resp_send: 27    cmd=finalize-response;rc=0;
> slurmstepd: mpi/pmi2: _tree_listen_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd:     false, finalized
> Process 0 of 1 is on acme11.ciemat.es
> Hello world from process 0 of 1
> Goodbye world from process 0 of 1
> slurmstepd: task 0 (18436) exited with exit code 0.
> slurmstepd: task_p_post_term: 987.0, task 0
> slurmstepd: Sending SIGKILL to pgid 18430
> slurmstepd: Waiting for IO
> slurmstepd: Closing debug channel
>
>
>
> -bash-4.2$ srun -n 2 --tasks-per-node=2 --mpi=pmi2 --slurmd-debug=4 hostname
> slurmstepd: debug level = 6
> slurmstepd: IO handler started pid=18493
> slurmstepd: Uncached user/gid: slurm/1001
> slurmstepd: mpi/pmi2: _tree_listen_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: mpi/pmi2: _tree_listen_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: mpi/pmi2: _tree_listen_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: task_p_pre_launch_priv: 990.0
> slurmstepd: task 0 (18499) started 2016-11-25T12:57:31
> slurmstepd: Uncached user/gid: slurm/1001
> slurmstepd: task 1 (18500) started 2016-11-25T12:57:31
> slurmstepd: mpi/pmi2: _tree_listen_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: mpi/pmi2: _tree_listen_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: task_p_pre_launch_priv: 990.0
> slurmstepd: Uncached user/gid: slurm/1001
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: job_container none plugin loaded
> slurmstepd: Reading cgroup.conf file /home/localsoft/slurm/etc/cgroup.conf
> slurmstepd: unable to get cgroup '/cgroup/cpuset' entry
> '/cgroup/cpuset/slurm/system' properties: No such file or directory
> slurmstepd: unable to get cgroup '/cgroup/memory' entry
> '/cgroup/memory/slurm/system' properties: No such file or directory
> slurmstepd: Sending launch resp rc=0
> slurmstepd: mpi type = (null)
> slurmstepd: mpi type = (null)
> slurmstepd: task_p_pre_launch: 990.0, task 0
> slurmstepd: task_p_pre_launch: 990.0, task 1
> slurmstepd: _set_limit: conf setrlimit RLIMIT_CPU no change in value:
> 18446744073709551615
> slurmstepd: _set_limit: conf setrlimit RLIMIT_CPU no change in value:
> 18446744073709551615
> slurmstepd: _set_limit: conf setrlimit RLIMIT_FSIZE no change in
> value: 18446744073709551615
> slurmstepd: _set_limit: conf setrlimit RLIMIT_FSIZE no change in
> value: 18446744073709551615
> slurmstepd: _set_limit: conf setrlimit RLIMIT_DATA no change in value:
> 18446744073709551615
> slurmstepd: _set_limit: conf setrlimit RLIMIT_STACK no change in value: 8388608
> slurmstepd: _set_limit: conf setrlimit RLIMIT_CORE no change in value: 0
> slurmstepd: _set_limit: conf setrlimit RLIMIT_DATA no change in value:
> 18446744073709551615
> slurmstepd: _set_limit: conf setrlimit RLIMIT_STACK no change in value: 8388608
> slurmstepd: _set_limit: conf setrlimit RLIMIT_CORE no change in value: 0
> slurmstepd: _set_limit: conf setrlimit RLIMIT_RSS no change in value:
> 18446744073709551615
> slurmstepd: _set_limit: conf setrlimit RLIMIT_RSS no change in value:
> 18446744073709551615
> slurmstepd: _set_limit: RLIMIT_NPROC  : max:256553 cur:256553 req:4096
> slurmstepd: _set_limit: conf setrlimit RLIMIT_NPROC succeeded
> slurmstepd: _set_limit: RLIMIT_NPROC  : max:256553 cur:256553 req:4096
> slurmstepd: _set_limit: conf setrlimit RLIMIT_NPROC succeeded
> slurmstepd: _set_limit: RLIMIT_NOFILE : max:4096 cur:4096 req:1024
> slurmstepd: _set_limit: conf setrlimit RLIMIT_NOFILE succeeded
> slurmstepd: _set_limit: conf setrlimit RLIMIT_MEMLOCK no change in
> value: 18446744073709551615
> slurmstepd: _set_limit: conf setrlimit RLIMIT_AS no change in value:
> 18446744073709551615
> slurmstepd: _set_limit: RLIMIT_NOFILE : max:4096 cur:4096 req:1024
> slurmstepd: _set_limit: conf setrlimit RLIMIT_NOFILE succeeded
> slurmstepd: _set_limit: conf setrlimit RLIMIT_MEMLOCK no change in
> value: 18446744073709551615
> slurmstepd: _set_limit: conf setrlimit RLIMIT_AS no change in value:
> 18446744073709551615
> acme11.ciemat.es
> acme11.ciemat.es
> slurmstepd: task 1 (18500) exited with exit code 0.
> slurmstepd: Handling REQUEST_STEP_UID
> slurmstepd: Handling REQUEST_SIGNAL_CONTAINER
> slurmstepd: _handle_signal_container for step=990.0 uid=0 signal=995
> slurmstepd: task_p_post_term: 990.0, task 1
> slurmstepd: task 0 (18499) exited with exit code 0.
> slurmstepd: task_p_post_term: 990.0, task 0
> slurmstepd: No child processes
> slurmstepd: Sending SIGKILL to pgid 18493
> slurmstepd: Waiting for IO
> slurmstepd: Closing debug channel
>
>
>
>
>
> bash-4.2$ srun -n 2 --tasks-per-node=2 --mpi=pmi2 --slurmd-debug=4
> ./helloWorldMPI
> slurmstepd: debug level = 6
> slurmstepd: IO handler started pid=18508
> slurmstepd: mpi/pmi2: _tree_listen_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: Uncached user/gid: slurm/1001
> slurmstepd: mpi/pmi2: _tree_listen_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: mpi/pmi2: _tree_listen_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: task 0 (18514) started 2016-11-25T12:57:38
> slurmstepd: task 1 (18515) started 2016-11-25T12:57:38
> slurmstepd: task_p_pre_launch_priv: 991.0
> slurmstepd: Uncached user/gid: slurm/1001
> slurmstepd: mpi/pmi2: _tree_listen_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: task_p_pre_launch_priv: 991.0
> slurmstepd: Uncached user/gid: slurm/1001
> slurmstepd: mpi/pmi2: _tree_listen_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: job_container none plugin loaded
> slurmstepd: Reading cgroup.conf file /home/localsoft/slurm/etc/cgroup.conf
> slurmstepd: unable to get cgroup '/cgroup/cpuset' entry
> '/cgroup/cpuset/slurm/system' properties: No such file or directory
> slurmstepd: unable to get cgroup '/cgroup/memory' entry
> '/cgroup/memory/slurm/system' properties: No such file or directory
> slurmstepd: Sending launch resp rc=0
> slurmstepd: mpi type = (null)
> slurmstepd: mpi type = (null)
> slurmstepd: task_p_pre_launch: 991.0, task 1
> slurmstepd: task_p_pre_launch: 991.0, task 0
> slurmstepd: _set_limit: conf setrlimit RLIMIT_CPU no change in value:
> 18446744073709551615
> slurmstepd: _set_limit: conf setrlimit RLIMIT_FSIZE no change in
> value: 18446744073709551615
> slurmstepd: _set_limit: conf setrlimit RLIMIT_CPU no change in value:
> 18446744073709551615
> slurmstepd: _set_limit: conf setrlimit RLIMIT_FSIZE no change in
> value: 18446744073709551615
> slurmstepd: _set_limit: conf setrlimit RLIMIT_DATA no change in value:
> 18446744073709551615
> slurmstepd: _set_limit: conf setrlimit RLIMIT_STACK no change in value: 8388608
> slurmstepd: _set_limit: conf setrlimit RLIMIT_DATA no change in value:
> 18446744073709551615
> slurmstepd: _set_limit: conf setrlimit RLIMIT_STACK no change in value: 8388608
> slurmstepd: _set_limit: conf setrlimit RLIMIT_CORE no change in value: 0
> slurmstepd: _set_limit: conf setrlimit RLIMIT_RSS no change in value:
> 18446744073709551615
> slurmstepd: _set_limit: conf setrlimit RLIMIT_CORE no change in value: 0
> slurmstepd: _set_limit: conf setrlimit RLIMIT_RSS no change in value:
> 18446744073709551615
> slurmstepd: _set_limit: RLIMIT_NPROC  : max:256553 cur:256553 req:4096
> slurmstepd: _set_limit: conf setrlimit RLIMIT_NPROC succeeded
> slurmstepd: _set_limit: RLIMIT_NPROC  : max:256553 cur:256553 req:4096
> slurmstepd: _set_limit: conf setrlimit RLIMIT_NPROC succeeded
> slurmstepd: _set_limit: RLIMIT_NOFILE : max:4096 cur:4096 req:1024
> slurmstepd: _set_limit: conf setrlimit RLIMIT_NOFILE succeeded
> slurmstepd: _set_limit: conf setrlimit RLIMIT_MEMLOCK no change in
> value: 18446744073709551615
> slurmstepd: _set_limit: RLIMIT_NOFILE : max:4096 cur:4096 req:1024
> slurmstepd: _set_limit: conf setrlimit RLIMIT_NOFILE succeeded
> slurmstepd: _set_limit: conf setrlimit RLIMIT_MEMLOCK no change in
> value: 18446744073709551615
> slurmstepd: _set_limit: conf setrlimit RLIMIT_AS no change in value:
> 18446744073709551615
> slurmstepd: _set_limit: conf setrlimit RLIMIT_AS no change in value:
> 18446744073709551615
> slurmstepd: Handling REQUEST_STEP_UID
> slurmstepd: Handling REQUEST_SIGNAL_CONTAINER
> slurmstepd: _handle_signal_container for step=991.0 uid=0 signal=995
> slurmstepd: mpi/pmi2: got client PMI1 init, version=2.0
> slurmstepd: mpi/pmi2: _tree_listen_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: mpi/pmi2: got client request: 53
> cmd=fullinit;pmijobid=991.0;pmirank=0;threaded=FALSE;
> slurmstepd: mpi/pmi2: client_resp_send: 114
> cmd=fullinit-response;rc=0;pmi-version=2;pmi-subversion=0;rank=0;size=2;appnum=-1;debugged=FALSE;pmiverbose=FALSE;
> slurmstepd: mpi/pmi2: _tree_listen_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: mpi/pmi2: got client request: 53
> cmd=fullinit;pmijobid=991.0;pmirank=1;threaded=FALSE;
> slurmstepd: mpi/pmi2: client_resp_send: 114
> cmd=fullinit-response;rc=0;pmi-version=2;pmi-subversion=0;rank=1;size=2;appnum=-1;debugged=FALSE;pmiverbose=FALSE;
> slurmstepd: mpi/pmi2: _tree_listen_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: mpi/pmi2: got client request: 44
> cmd=info-getjobattr;key=PMI_process_mapping;
> slurmstepd: mpi/pmi2: client_resp_send: 68
> cmd=info-getjobattr-response;rc=0;found=TRUE;value=(vector,(0,1,2));
> slurmstepd: mpi/pmi2: _tree_listen_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: mpi/pmi2: got client request: 44
> cmd=info-getjobattr;key=PMI_process_mapping;
> slurmstepd: mpi/pmi2: client_resp_send: 68
> cmd=info-getjobattr-response;rc=0;found=TRUE;value=(vector,(0,1,2));
> slurmstepd: mpi/pmi2: _tree_listen_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: mpi/pmi2: got client request: 13     cmd=finalize;
> slurmstepd: mpi/pmi2: client_resp_send: 27    cmd=finalize-response;rc=0;
> slurmstepd: mpi/pmi2: _tree_listen_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd:     false, finalized
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: mpi/pmi2: got client request: 13     cmd=finalize;
> slurmstepd: mpi/pmi2: client_resp_send: 27    cmd=finalize-response;rc=0;
> slurmstepd: mpi/pmi2: _tree_listen_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd:     false, finalized
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd:     false, finalized
> Process 0 of 2 is on acme11.ciemat.es
> Hello world from process 0 of 2
> Goodbye world from process 0 of 2
> Process 1 of 2 is on acme11.ciemat.es
> Hello world from process 1 of 2
> Goodbye world from process 1 of 2
> slurmstepd: task 0 (18514) exited with exit code 0.
> slurmstepd: task_p_post_term: 991.0, task 0
> slurmstepd: task 1 (18515) exited with exit code 0.
> slurmstepd: task_p_post_term: 991.0, task 1
> slurmstepd: No child processes
> slurmstepd: Sending SIGKILL to pgid 18508
> slurmstepd: Waiting for IO
> slurmstepd: Closing debug channel
>
>
>
> -bash-4.2$ srun -n 2 --tasks-per-node=1 --mpi=pmi2 --slurmd-debug=4 hostname
> slurmstepd: debug level = 6
> slurmstepd: mpi/pmi2: _tree_listen_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: Uncached user/gid: slurm/1001
> slurmstepd: debug level = 6
> slurmstepd: IO handler started pid=23700
> slurmstepd: mpi/pmi2: _tree_listen_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: Uncached user/gid: slurm/1001
> slurmstepd: mpi/pmi2: _tree_listen_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: mpi/pmi2: _tree_listen_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: mpi/pmi2: _tree_listen_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: mpi/pmi2: _tree_listen_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: task 0 (18458) started 2016-11-25T12:57:09
> slurmstepd: task_p_pre_launch_priv: 988.0
> slurmstepd: task 1 (23706) started 2016-11-25T12:57:09
> slurmstepd: Uncached user/gid: slurm/1001
> slurmstepd: task_p_pre_launch_priv: 988.0
> slurmstepd: mpi/pmi2: _tree_listen_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: Uncached user/gid: slurm/1001
> slurmstepd: mpi/pmi2: _tree_listen_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: job_container none plugin loaded
> slurmstepd: mpi/pmi2: _tree_listen_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: Reading cgroup.conf file /home/localsoft/slurm/etc/cgroup.conf
> slurmstepd: job_container none plugin loaded
> slurmstepd: unable to get cgroup '/cgroup/cpuset' entry
> '/cgroup/cpuset/slurm/system' properties: No such file or directory
> slurmstepd: Reading cgroup.conf file /home/localsoft/slurm/etc/cgroup.conf
> acme11.ciemat.es
> slurmstepd: unable to get cgroup '/cgroup/memory' entry
> '/cgroup/memory/slurm/system' properties: No such file or directory
> slurmstepd: Sending launch resp rc=0
> slurmstepd: unable to get cgroup '/cgroup/cpuset' entry
> '/cgroup/cpuset/slurm/system' properties: No such file or directory
> slurmstepd: unable to get cgroup '/cgroup/memory' entry
> '/cgroup/memory/slurm/system' properties: No such file or directory
> slurmstepd: Sending launch resp rc=0
> slurmstepd: mpi type = (null)
> slurmstepd: mpi type = (null)
> slurmstepd: task_p_pre_launch: 988.0, task 0
> slurmstepd: task_p_pre_launch: 988.0, task 1
> slurmstepd: _set_limit: conf setrlimit RLIMIT_CPU no change in value:
> 18446744073709551615
> slurmstepd: _set_limit: conf setrlimit RLIMIT_FSIZE no change in
> value: 18446744073709551615
> slurmstepd: _set_limit: conf setrlimit RLIMIT_CPU no change in value:
> 18446744073709551615
> slurmstepd: _set_limit: conf setrlimit RLIMIT_DATA no change in value:
> 18446744073709551615
> slurmstepd: _set_limit: conf setrlimit RLIMIT_STACK no change in value: 8388608
> slurmstepd: _set_limit: conf setrlimit RLIMIT_FSIZE no change in
> value: 18446744073709551615
> slurmstepd: _set_limit: conf setrlimit RLIMIT_DATA no change in value:
> 18446744073709551615
> acme12.ciemat.es
> slurmstepd: _set_limit: conf setrlimit RLIMIT_CORE no change in value: 0
> slurmstepd: _set_limit: conf setrlimit RLIMIT_RSS no change in value:
> 18446744073709551615
> slurmstepd: _set_limit: conf setrlimit RLIMIT_STACK no change in value: 8388608
> slurmstepd: _set_limit: RLIMIT_NPROC  : max:256553 cur:256553 req:4096
> slurmstepd: _set_limit: conf setrlimit RLIMIT_NPROC succeeded
> slurmstepd: _set_limit: conf setrlimit RLIMIT_CORE no change in value: 0
> slurmstepd: _set_limit: RLIMIT_NOFILE : max:4096 cur:4096 req:1024
> slurmstepd: _set_limit: conf setrlimit RLIMIT_NOFILE succeeded
> slurmstepd: _set_limit: conf setrlimit RLIMIT_MEMLOCK no change in
> value: 18446744073709551615
> slurmstepd: _set_limit: conf setrlimit RLIMIT_RSS no change in value:
> 18446744073709551615
> slurmstepd: _set_limit: conf setrlimit RLIMIT_AS no change in value:
> 18446744073709551615
> slurmstepd: _set_limit: RLIMIT_NPROC  : max:256555 cur:256555 req:4096
> slurmstepd: _set_limit: conf setrlimit RLIMIT_NPROC succeeded
> slurmstepd: _set_limit: RLIMIT_NOFILE : max:4096 cur:4096 req:1024
> slurmstepd: task 0 (18458) exited with exit code 0.
> slurmstepd: _set_limit: conf setrlimit RLIMIT_NOFILE succeeded
> slurmstepd: task_p_post_term: 988.0, task 0
> slurmstepd: _set_limit: conf setrlimit RLIMIT_MEMLOCK no change in
> value: 18446744073709551615
> slurmstepd: _set_limit: conf setrlimit RLIMIT_AS no change in value:
> 18446744073709551615
> slurmstepd: Handling REQUEST_STEP_UID
> slurmstepd: Sending SIGKILL to pgid 18452
> slurmstepd: Waiting for IO
> slurmstepd: Closing debug channel
> slurmstepd: Handling REQUEST_SIGNAL_CONTAINER
> slurmstepd: _handle_signal_container for step=988.0 uid=0 signal=995
> slurmstepd: task 1 (23706) exited with exit code 0.
> slurmstepd: task_p_post_term: 988.0, task 0
> slurmstepd: Sending SIGKILL to pgid 23700
> slurmstepd: Waiting for IO
> slurmstepd: Closing debug channel
>
>
>
> -bash-4.2$ srun -n 2 --tasks-per-node=1 --mpi=pmi2 --slurmd-debug=4
> ./helloWorldMPI
> slurmstepd: debug level = 6
> slurmstepd: IO handler started pid=23714
> slurmstepd: mpi/pmi2: _tree_listen_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: Uncached user/gid: slurm/1001
> slurmstepd: debug level = 6
> slurmstepd: mpi/pmi2: _tree_listen_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: mpi/pmi2: _tree_listen_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: Uncached user/gid: slurm/1001
> slurmstepd: mpi/pmi2: _tree_listen_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: mpi/pmi2: _tree_listen_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: task 1 (23720) started 2016-11-25T12:57:17
> slurmstepd: task_p_pre_launch_priv: 989.0
> slurmstepd: Uncached user/gid: slurm/1001
> slurmstepd: mpi/pmi2: _tree_listen_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: mpi/pmi2: _tree_listen_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: mpi/pmi2: _tree_listen_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: task_p_pre_launch_priv: 989.0
> slurmstepd: task 0 (18474) started 2016-11-25T12:57:17
> slurmstepd: Uncached user/gid: slurm/1001
> slurmstepd: job_container none plugin loaded
> slurmstepd: mpi/pmi2: _tree_listen_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: Reading cgroup.conf file /home/localsoft/slurm/etc/cgroup.conf
> slurmstepd: job_container none plugin loaded
> slurmstepd: unable to get cgroup '/cgroup/cpuset' entry
> '/cgroup/cpuset/slurm/system' properties: No such file or directory
> slurmstepd: Reading cgroup.conf file /home/localsoft/slurm/etc/cgroup.conf
> slurmstepd: unable to get cgroup '/cgroup/memory' entry
> '/cgroup/memory/slurm/system' properties: No such file or directory
> slurmstepd: Sending launch resp rc=0
> slurmstepd: unable to get cgroup '/cgroup/cpuset' entry
> '/cgroup/cpuset/slurm/system' properties: No such file or directory
> slurmstepd: mpi type = (null)
> slurmstepd: unable to get cgroup '/cgroup/memory' entry
> '/cgroup/memory/slurm/system' properties: No such file or directory
> slurmstepd: Sending launch resp rc=0
> slurmstepd: task_p_pre_launch: 989.0, task 1
> slurmstepd: mpi type = (null)
> slurmstepd: _set_limit: conf setrlimit RLIMIT_CPU no change in value:
> 18446744073709551615
> slurmstepd: task_p_pre_launch: 989.0, task 0
> slurmstepd: _set_limit: conf setrlimit RLIMIT_FSIZE no change in
> value: 18446744073709551615
> slurmstepd: _set_limit: conf setrlimit RLIMIT_CPU no change in value:
> 18446744073709551615
> slurmstepd: _set_limit: conf setrlimit RLIMIT_FSIZE no change in
> value: 18446744073709551615
> slurmstepd: _set_limit: conf setrlimit RLIMIT_DATA no change in value:
> 18446744073709551615
> slurmstepd: _set_limit: conf setrlimit RLIMIT_DATA no change in value:
> 18446744073709551615
> slurmstepd: _set_limit: conf setrlimit RLIMIT_STACK no change in value: 8388608
> slurmstepd: _set_limit: conf setrlimit RLIMIT_STACK no change in value: 8388608
> slurmstepd: _set_limit: conf setrlimit RLIMIT_CORE no change in value: 0
> slurmstepd: _set_limit: conf setrlimit RLIMIT_RSS no change in value:
> 18446744073709551615
> slurmstepd: _set_limit: conf setrlimit RLIMIT_CORE no change in value: 0
> slurmstepd: _set_limit: RLIMIT_NPROC  : max:256553 cur:256553 req:4096
> slurmstepd: _set_limit: conf setrlimit RLIMIT_NPROC succeeded
> slurmstepd: _set_limit: RLIMIT_NOFILE : max:4096 cur:4096 req:1024
> slurmstepd: _set_limit: conf setrlimit RLIMIT_NOFILE succeeded
> slurmstepd: _set_limit: conf setrlimit RLIMIT_RSS no change in value:
> 18446744073709551615
> slurmstepd: _set_limit: conf setrlimit RLIMIT_MEMLOCK no change in
> value: 18446744073709551615
> slurmstepd: _set_limit: conf setrlimit RLIMIT_AS no change in value:
> 18446744073709551615
> slurmstepd: _set_limit: RLIMIT_NPROC  : max:256555 cur:256555 req:4096
> slurmstepd: Handling REQUEST_STEP_UID
> slurmstepd: _set_limit: conf setrlimit RLIMIT_NPROC succeeded
> slurmstepd: Handling REQUEST_SIGNAL_CONTAINER
> slurmstepd: _handle_signal_container for step=989.0 uid=0 signal=995
> slurmstepd: _set_limit: RLIMIT_NOFILE : max:4096 cur:4096 req:1024
> slurmstepd: mpi/pmi2: got client PMI1 init, version=2.0
> slurmstepd: _set_limit: conf setrlimit RLIMIT_NOFILE succeeded
> slurmstepd: _set_limit: conf setrlimit RLIMIT_MEMLOCK no change in
> value: 18446744073709551615
> slurmstepd: mpi/pmi2: _tree_listen_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: _set_limit: conf setrlimit RLIMIT_AS no change in value:
> 18446744073709551615
> slurmstepd: mpi/pmi2: got client request: 53
> cmd=fullinit;pmijobid=989.0;pmirank=0;threaded=FALSE;
> slurmstepd: Handling REQUEST_STEP_UID
> slurmstepd: Handling REQUEST_SIGNAL_CONTAINER
> slurmstepd: _handle_signal_container for step=989.0 uid=0 signal=995
> slurmstepd: mpi/pmi2: got client PMI1 init, version=2.0
> slurmstepd: mpi/pmi2: _tree_listen_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: mpi/pmi2: got client request: 53
> cmd=fullinit;pmijobid=989.0;pmirank=1;threaded=FALSE;
> slurmstepd: mpi/pmi2: client_resp_send: 114
> cmd=fullinit-response;rc=0;pmi-version=2;pmi-subversion=0;rank=1;size=2;appnum=-1;debugged=FALSE;pmiverbose=FALSE;
> slurmstepd: mpi/pmi2: client_resp_send: 114
> cmd=fullinit-response;rc=0;pmi-version=2;pmi-subversion=0;rank=0;size=2;appnum=-1;debugged=FALSE;pmiverbose=FALSE;
> slurmstepd: mpi/pmi2: _tree_listen_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: mpi/pmi2: _tree_listen_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: mpi/pmi2: got client request: 44
> cmd=info-getjobattr;key=PMI_process_mapping;
> slurmstepd: mpi/pmi2: client_resp_send: 68
> cmd=info-getjobattr-response;rc=0;found=TRUE;value=(vector,(0,2,1));
> slurmstepd: mpi/pmi2: _tree_listen_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: mpi/pmi2: got client request: 44
> cmd=info-getjobattr;key=PMI_process_mapping;
> slurmstepd: mpi/pmi2: client_resp_send: 68
> cmd=info-getjobattr-response;rc=0;found=TRUE;value=(vector,(0,2,1));
> slurmstepd: mpi/pmi2: _tree_listen_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: mpi/pmi2: got client request: 64
> cmd=kvs-put;key=MVAPICH2_0000;value=0000000a:00004cae:00004caf:;
> slurmstepd: mpi/pmi2: client_resp_send: 26    cmd=kvs-put-response;rc=0;
> slurmstepd: mpi/pmi2: _tree_listen_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: mpi/pmi2: got client request: 14     cmd=kvs-fence;
> slurmstepd: mpi/pmi2: _tree_listen_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: mpi/pmi2: got client request: 64
> cmd=kvs-put;key=MVAPICH2_0001;value=00000009:0001ad87:0001ad88:;
> slurmstepd: mpi/pmi2: client_resp_send: 26    cmd=kvs-put-response;rc=0;
> slurmstepd: mpi/pmi2: _tree_listen_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: mpi/pmi2: got client request: 14     cmd=kvs-fence;
> slurmstepd: mpi/pmi2: _tree_listen_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: mpi/pmi2: _tree_listen_read
> slurmstepd: mpi/pmi2: client_resp_send: 28    cmd=kvs-fence-response;rc=0;
> slurmstepd: mpi/pmi2: _tree_listen_read
> slurmstepd: mpi/pmi2: client_resp_send: 28    cmd=kvs-fence-response;rc=0;
> slurmstepd: mpi/pmi2: _tree_listen_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: mpi/pmi2: got client request: 59
> cmd=kvs-get;jobid=singleton_kvs;srcid=-1;key=MVAPICH2_0000;
> slurmstepd: mpi/pmi2: client_resp_send: 71
> cmd=kvs-get-response;rc=0;found=TRUE;value=0000000a:00004cae:00004caf:;
> slurmstepd: mpi/pmi2: _tree_listen_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: mpi/pmi2: _tree_listen_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: mpi/pmi2: got client request: 59
> cmd=kvs-get;jobid=singleton_kvs;srcid=-1;key=MVAPICH2_0001;
> slurmstepd: mpi/pmi2: got client request: 59
> cmd=kvs-get;jobid=singleton_kvs;srcid=-1;key=MVAPICH2_0000;
> slurmstepd: mpi/pmi2: client_resp_send: 71
> cmd=kvs-get-response;rc=0;found=TRUE;value=0000000a:00004cae:00004caf:;
> slurmstepd: mpi/pmi2: client_resp_send: 71
> cmd=kvs-get-response;rc=0;found=TRUE;value=00000009:0001ad87:0001ad88:;
> slurmstepd: mpi/pmi2: _tree_listen_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: mpi/pmi2: _tree_listen_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: mpi/pmi2: got client request: 59
> cmd=kvs-get;jobid=singleton_kvs;srcid=-1;key=MVAPICH2_0001;
> slurmstepd: mpi/pmi2: got client request: 14     cmd=kvs-fence;
> slurmstepd: mpi/pmi2: client_resp_send: 71
> cmd=kvs-get-response;rc=0;found=TRUE;value=00000009:0001ad87:0001ad88:;
> slurmstepd: mpi/pmi2: _tree_listen_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: mpi/pmi2: got client request: 14     cmd=kvs-fence;
> slurmstepd: mpi/pmi2: _tree_listen_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: mpi/pmi2: _tree_listen_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: mpi/pmi2: _tree_listen_read
> slurmstepd: mpi/pmi2: client_resp_send: 28    cmd=kvs-fence-response;rc=0;
> slurmstepd: mpi/pmi2: _tree_listen_read
> slurmstepd: mpi/pmi2: client_resp_send: 28    cmd=kvs-fence-response;rc=0;
> slurmstepd: mpi/pmi2: _tree_listen_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: mpi/pmi2: _tree_listen_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: mpi/pmi2: got client request: 14     cmd=kvs-fence;
> slurmstepd: mpi/pmi2: got client request: 14     cmd=kvs-fence;
> slurmstepd: mpi/pmi2: _tree_listen_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: mpi/pmi2: _tree_listen_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: mpi/pmi2: _tree_listen_read
> slurmstepd: mpi/pmi2: client_resp_send: 28    cmd=kvs-fence-response;rc=0;
> slurmstepd: mpi/pmi2: _tree_listen_read
> slurmstepd: mpi/pmi2: client_resp_send: 28    cmd=kvs-fence-response;rc=0;
> slurmstepd: mpi/pmi2: _tree_listen_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: mpi/pmi2: got client request: 14     cmd=kvs-fence;
> slurmstepd: mpi/pmi2: _tree_listen_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: mpi/pmi2: got client request: 14     cmd=kvs-fence;
> slurmstepd: mpi/pmi2: _tree_listen_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: mpi/pmi2: _tree_listen_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: mpi/pmi2: _tree_listen_read
> slurmstepd: mpi/pmi2: client_resp_send: 28    cmd=kvs-fence-response;rc=0;
> slurmstepd: mpi/pmi2: _tree_listen_read
> slurmstepd: mpi/pmi2: client_resp_send: 28    cmd=kvs-fence-response;rc=0;
> slurmstepd: mpi/pmi2: _tree_listen_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: mpi/pmi2: got client request: 14     cmd=kvs-fence;
> slurmstepd: mpi/pmi2: _tree_listen_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: mpi/pmi2: got client request: 14     cmd=kvs-fence;
> slurmstepd: mpi/pmi2: _tree_listen_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: mpi/pmi2: _tree_listen_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: mpi/pmi2: _tree_listen_read
> slurmstepd: mpi/pmi2: _tree_listen_read
> slurmstepd: mpi/pmi2: client_resp_send: 28    cmd=kvs-fence-response;rc=0;
> [acme12.ciemat.es:mpi_rank_1][error_sighandler] Caught error:
> Segmentation fault (signal 11)
> slurmstepd: mpi/pmi2: client_resp_send: 28    cmd=kvs-fence-response;rc=0;
> [acme11.ciemat.es:mpi_rank_0][error_sighandler] Caught error:
> Segmentation fault (signal 11)
> slurmstepd: mpi/pmi2: _tree_listen_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: mpi/pmi2: _tree_listen_readable
> slurmstepd: mpi/pmi2: _task_readable
> slurmstepd: task 0 (18474) exited. Killed by signal 11 (core dumped).
> slurmstepd: task_p_post_term: 989.0, task 0
> slurmstepd: task 1 (23720) exited. Killed by signal 11 (core dumped).
> slurmstepd: task_p_post_term: 989.0, task 0
> slurmstepd: Sending SIGKILL to pgid 18468
> slurmstepd: Waiting for IO
> slurmstepd: Closing debug channel
> srun: error: acme11: task 0: Segmentation fault (core dumped)
> slurmstepd: Sending SIGKILL to pgid 23714
> slurmstepd: Waiting for IO
> slurmstepd: Closing debug channel
> srun: error: acme12: task 1: Segmentation fault (core dumped)
>
>
>
>
> SLURMCTLD OUTPUT
>
>
> slurmctld: debug2: sched: Processing RPC: REQUEST_RESOURCE_ALLOCATION
> from uid=500
> slurmctld: debug3: JobDesc: user_id=500 job_id=N/A partition=(null)
> name=helloWorldMPI
> slurmctld: debug3:    cpus=2-4294967294 pn_min_cpus=-1 core_spec=-1
> slurmctld: debug3:    Nodes=1-[4294967294] Sock/Node=65534
> Core/Sock=65534 Thread/Core=65534
> slurmctld: debug3:    pn_min_memory_job=-1 pn_min_tmp_disk=-1
> slurmctld: debug3:    immediate=0 features=(null) reservation=(null)
> slurmctld: debug3:    req_nodes=(null) exc_nodes=(null) gres=(null)
> slurmctld: debug3:    time_limit=-1--1 priority=-1 contiguous=0 shared=-1
> slurmctld: debug3:    kill_on_node_fail=-1 script=(null)
> slurmctld: debug3:    argv="./helloWorldMPI"
> slurmctld: debug3:    stdin=(null) stdout=(null) stderr=(null)
> slurmctld: debug3:    work_dir=/home/slurm/tests alloc_node:sid=acme31:11242
> slurmctld: debug3:    sicp_mode=0 power_flags=
> slurmctld: debug3:    resp_host=172.17.31.165 alloc_resp_port=40661
> other_port=48105
> slurmctld: debug3:    dependency=(null) account=(null) qos=(null) comment=(null)
> slurmctld: debug3:    mail_type=0 mail_user=(null) nice=0 num_tasks=2
> open_mode=0 overcommit=-1 acctg_freq=(null)
> slurmctld: debug3:    network=(null) begin=Unknown cpus_per_task=-1
> requeue=-1 licenses=(null)
> slurmctld: debug3:    end_time= signal=0 at 0 wait_all_nodes=-1 cpu_freq=
> slurmctld: debug3:    ntasks_per_node=1 ntasks_per_socket=-1 ntasks_per_core=-1
> slurmctld: debug3:    mem_bind=65534:(null) plane_size:65534
> slurmctld: debug3:    array_inx=(null)
> slurmctld: debug3:    burst_buffer=(null)
> slurmctld: debug3: found correct user
> slurmctld: debug3: found correct association
> slurmctld: debug3: found correct qos
> slurmctld: debug3: before alteration asking for nodes 1-4294967294
> cpus 2-4294967294
> slurmctld: debug3: after alteration asking for nodes 1-4294967294 cpus
> 2-4294967294
> slurmctld: debug2: found 8 usable nodes from config containing acme[11-14,21-24]
> slurmctld: debug3: _pick_best_nodes: job 994 idle_nodes 8 share_nodes 8
> slurmctld: debug3: powercapping: checking job 994 : skipped, capping disabled
> slurmctld: debug2: sched: JobId=994 allocated resources: NodeList=acme[11-12]
> slurmctld: sched: _slurm_rpc_allocate_resources JobId=994
> NodeList=acme[11-12] usec=1340
> slurmctld: debug3: Writing job id 994 to header record of job_state file
> slurmctld: debug2: _slurm_rpc_job_ready(994)=3 usec=5
> slurmctld: debug3: StepDesc: user_id=500 job_id=994 node_count=2-2
> cpu_count=2 num_tasks=2
> slurmctld: debug3:    cpu_freq_gov=4294967294 cpu_freq_max=4294967294
> cpu_freq_min=4294967294 relative=65534 task_dist=0x1 plane=1
> slurmctld: debug3:    node_list=(null)  constraints=(null)
> slurmctld: debug3:    host=acme31 port=48845 name=helloWorldMPI
> network=(null) exclusive=0
> slurmctld: debug3:    checkpoint-dir=/home/localsoft/slurm/checkpoint
> checkpoint_int=0
> slurmctld: debug3:    mem_per_node=0 resv_port_cnt=65534 immediate=0 no_kill=0
> slurmctld: debug3:    overcommit=0 time_limit=0 gres=(null)
> slurmctld: debug:  Configuration for job 994 complete
> slurmctld: debug3: step_layout cpus = 16 pos = 0
> slurmctld: debug3: step_layout cpus = 16 pos = 1
> slurmctld: debug:  laying out the 2 tasks on 2 hosts acme[11-12] dist 1
> slurmctld: debug2: Processing RPC: REQUEST_COMPLETE_JOB_ALLOCATION
> from uid=500, JobId=994 rc=139
> slurmctld: job_complete: JobID=994 State=0x1 NodeCnt=2 WTERMSIG 11
> slurmctld: debug2: Spawning RPC agent for msg_type SRUN_JOB_COMPLETE
> slurmctld: debug2: got 1 threads to send out
> slurmctld: debug2: Spawning RPC agent for msg_type SRUN_JOB_COMPLETE
> slurmctld: debug2: got 1 threads to send out
> slurmctld: debug2: Spawning RPC agent for msg_type REQUEST_TERMINATE_JOB
> slurmctld: job_complete: JobID=994 State=0x8003 NodeCnt=2 done
> slurmctld: debug2: _slurm_rpc_complete_job_allocation: JobID=994
> State=0x8003 NodeCnt=2
> slurmctld: debug2: got 1 threads to send out
> slurmctld: debug2: Tree head got back 0 looking for 2
> slurmctld: debug3: Tree sending to acme12
> slurmctld: debug3: slurm_send_only_node_msg: sent 181
> slurmctld: debug3: Tree sending to acme11
> slurmctld: debug3: slurm_send_only_node_msg: sent 181
> slurmctld: debug4: orig_timeout was 10000 we have 0 steps and a timeout of 10000
> slurmctld: debug4: orig_timeout was 10000 we have 0 steps and a timeout of 10000
> slurmctld: debug2: Tree head got back 1
> slurmctld: debug2: Tree head got back 2
> slurmctld: debug2: node_did_resp acme12
> slurmctld: debug2: node_did_resp acme11
> slurmctld: debug:  sched: Running job scheduler
> slurmctld: debug3: Writing job id 994 to header record of job_state file
> slurmctld: debug2: Testing job time limits and checkpoints
> slurmctld: debug:  sched: Running job scheduler
> slurmctld: debug2: Performing purge of old job records
> slurmctld: debug2: Performing purge of old job records
> slurmctld: debug2: purge_old_job: purged 1 old job records
>
>
>
>
>
> SLURMD OUTPUT
>
> slurmd: debug3: in the service_connection
> slurmd: debug2: got this type of message 6001
> slurmd: debug2: Processing RPC: REQUEST_LAUNCH_TASKS
> slurmd: launch task 996.0 request from 500.1001 at 172.17.31.165 (port 37857)
> slurmd: debug3: state for jobid 985: ctime:1480074710 revoked:0 expires:0
> slurmd: debug3: state for jobid 986: ctime:1480075007 revoked:0 expires:0
> slurmd: debug3: state for jobid 987: ctime:1480075020 revoked:0 expires:0
> slurmd: debug3: state for jobid 988: ctime:1480075029 revoked:0 expires:0
> slurmd: debug3: state for jobid 989: ctime:1480075037 revoked:0 expires:0
> slurmd: debug3: state for jobid 990: ctime:1480075051 revoked:0 expires:0
> slurmd: debug3: state for jobid 991: ctime:1480075058 revoked:0 expires:0
> slurmd: debug3: state for jobid 992: ctime:1480075668
> revoked:1480075668 expires:1480075668
> slurmd: debug3: state for jobid 992: ctime:1480075668 revoked:0 expires:0
> slurmd: debug3: state for jobid 993: ctime:1480075699
> revoked:1480075699 expires:1480075699
> slurmd: debug3: state for jobid 993: ctime:1480075699 revoked:0 expires:0
> slurmd: debug3: state for jobid 994: ctime:1480075717
> revoked:1480075717 expires:1480075717
> slurmd: debug3: state for jobid 994: ctime:1480075717 revoked:0 expires:0
> slurmd: debug3: state for jobid 995: ctime:1480075746
> revoked:1480075746 expires:1480075746
> slurmd: debug3: state for jobid 995: ctime:1480075746 revoked:0 expires:0
> slurmd: debug:  Checking credential with 276 bytes of sig data
> slurmd: debug:  task_p_slurmd_launch_request: 996.0 0
> slurmd: _run_prolog: run job script took usec=10
> slurmd: _run_prolog: prolog with lock for job 996 ran for 0 seconds
> slurmd: debug3: _rpc_launch_tasks: call to _forkexec_slurmstepd
> slurmd: debug3: slurmstepd rank 0 (acme11), parent rank -1 (NONE),
> children 1, depth 0, max_depth 1
> slurmd: debug3: _send_slurmstepd_init: call to getpwuid_r
> slurmd: debug3: _send_slurmstepd_init: return from getpwuid_r
> slurmd: debug3: _rpc_launch_tasks: return from _forkexec_slurmstepd
> slurmd: debug:  task_p_slurmd_reserve_resources: 996 0
> slurmd: debug3: in the service_connection
> slurmd: debug2: got this type of message 6004
> slurmd: debug2: Processing RPC: REQUEST_SIGNAL_TASKS
> slurmd: debug:  _rpc_signal_tasks: sending signal 995 to step 996.0 flag 0
> slurmd: debug3: in the service_connection
> slurmd: debug2: got this type of message 5029
> slurmd: debug3: Entering _rpc_forward_data, address:
> /tmp/sock.pmi2.996.0, len: 106
> slurmd: debug3: in the service_connection
> slurmd: debug2: got this type of message 5029
> slurmd: debug3: Entering _rpc_forward_data, address:
> /tmp/sock.pmi2.996.0, len: 6
> slurmd: debug3: in the service_connection
> slurmd: debug2: got this type of message 5029
> slurmd: debug3: Entering _rpc_forward_data, address:
> /tmp/sock.pmi2.996.0, len: 6
> slurmd: debug3: in the service_connection
> slurmd: debug2: got this type of message 5029
> slurmd: debug3: Entering _rpc_forward_data, address:
> /tmp/sock.pmi2.996.0, len: 6
> slurmd: debug3: in the service_connection
> slurmd: debug2: got this type of message 5029
> slurmd: debug3: Entering _rpc_forward_data, address:
> /tmp/sock.pmi2.996.0, len: 6
> slurmd: debug3: in the service_connection
> slurmd: debug2: got this type of message 5016
> slurmd: debug3: Entering _rpc_step_complete
> slurmd: debug:  Entering stepd_completion, range_first = 1, range_last = 1
> slurmd: debug3: in the service_connection
> slurmd: debug2: got this type of message 6011
> slurmd: debug2: Processing RPC: REQUEST_TERMINATE_JOB
> slurmd: debug:  _rpc_terminate_job, uid = 500
> slurmd: debug:  task_p_slurmd_release_resources: 996
> slurmd: debug:  credential for job 996 revoked
> slurmd: debug2: No steps in jobid 996 to send signal 18
> slurmd: debug2: No steps in jobid 996 to send signal 15
> slurmd: debug4: sent ALREADY_COMPLETE
> slurmd: debug2: set revoke expiration for jobid 996 to 1480075907 UTS
>
>
>
>
>
>
>
>
>
> 2016-11-24 16:50 GMT+01:00 Sourav Chakraborty
> <chakraborty.52 at buckeyemail.osu.edu>:
>> Hi Manuel,
>>
>> Thanks for reporting the issue.
>>
>> Since Mvapich2 was configured with pmi2, the following way to launch jobs is
>> correct:
>> srun -n 2 --mpi=pmi2 ./helloMPI
>>
>> Can you please post the output of the following command? This will have more
>> information to identify the issue.
>> srun -n 2 --mpi=pmi2 --slurmd-debug=5 ./helloMPI
>>
>> Also, to identify if this is an Mvapich2 specific issue, can you please try
>> running the following command?
>> srun -n 2 --mpi=pmi2 hostname
>>
>> Thanks,
>> Sourav
>>
>>
>> On Thu, Nov 24, 2016 at 6:55 AM, Manuel Rodríguez Pascual
>> <manuel.rodriguez.pascual at gmail.com> wrote:
>>>
>>> Hi all,
>>>
>>> I am trying to make mvapich2 work with Slurm, but I keep having some
>>> issues. I know there are quite a lot of threads on the subject, but none
>>> of them seems to solve my problem: Slurm is executing two serial jobs
>>> instead of a single parallel one.
>>>
>>> Below I have included quite a lot of information about how I have
>>> configured my cluster and the different tests that I have performed, in
>>> case it helps.
>>> ---
>>> ---
>>> COMPILATION
>>>
>>> --- Slurm  17.02.0-0pre2:
>>> ./configure --prefix=/home/localsoft/slurm/
>>>
>>> slurm.conf:
>>> MpiDefault=none
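>>>
>>> (Note: with MpiDefault=none, every srun invocation needs an explicit
>>> --mpi=pmi2. Assuming the mpi/pmi2 plugin was built together with Slurm,
>>> a slurm.conf sketch that makes PMI2 the default instead would be:
>>>
>>> MpiDefault=pmi2
>>>
>>> so that a plain "srun -n 2 ./helloWorldMPI" also goes through PMI2.)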
>>>
>>>
>>> --- MVAPICH mvapich2-2.2
>>>
>>>
>>> After quite a lot of different tests, I've been able to compile mvapich2
>>> with the following environment and options (config.log is attached to this
>>> mail):
>>>
>>> Environment vars:
>>> LD_LIBRARY_PATH =
>>> /usr/local/lib:/home/localsoft/slurm/lib:/home/localsoft/mvapich2/lib (and
>>> some unrelated stuff)
>>> MPICHLIB_LDFLAGS='-Wl,-rpath,/home/localsoft/slurm/lib
>>> -Wl,-rpath,/home/localsoft/mvapich2/lib'
>>>
>>> Compilation:
>>>
>>> ./configure --prefix=/home/localsoft/mvapich2 --disable-mcast
>>> --with-slurm=/home/localsoft/slurm --with-pmi=pmi2 --with-pm=slurm
>>> --disable-romio
>>>
>>> Then, on every node of my cluster I have set LD_LIBRARY_PATH to the same
>>> value.
>>>
>>> My code is compiled with:
>>> mpicc  helloWorldMPI.c -o helloWorldMPI -L/home/localsoft/slurm/lib
>>>
>>>
>>> ---
>>> ---
>>> EXECUTION:
>>>
>>> --serial jobs: OK
>>> $ ./helloWorldMPI
>>> Process 0 of 1 is on acme31.ciemat.es
>>> Hello world from process 0 of 1
>>> Goodbye world from process 0 of 1
>>>
>>>
>>> $ srun   ./helloWorldMPI
>>> Process 0 of 1 is on acme11.ciemat.es
>>> Hello world from process 0 of 1
>>> Goodbye world from process 0 of 1
>>>
>>> --parallel jobs: As you can see, Slurm is executing two serial jobs
>>> instead of a single parallel one.
>>>
>>> $ srun -n 2   ./helloWorldMPI
>>> Process 0 of 1 is on acme11.ciemat.es
>>> Hello world from process 0 of 1
>>> Goodbye world from process 0 of 1
>>>
>>> Process 0 of 1 is on acme11.ciemat.es
>>> Hello world from process 0 of 1
>>> Goodbye world from process 0 of 1
>>>
>>>
>>> $ srun -n 2 --tasks-per-node=1   ./helloWorldMPI
>>> Process 0 of 1 is on acme11.ciemat.es
>>> Hello world from process 0 of 1
>>> Goodbye world from process 0 of 1
>>>
>>> Process 0 of 1 is on acme12.ciemat.es
>>> Hello world from process 0 of 1
>>> Goodbye world from process 0 of 1
>>>
>>>
>>> --different Slurm MPI types:
>>>
>>> $ srun -n 2 --mpi=none ./helloWorldMPI
>>> Process 0 of 1 is on acme11.ciemat.es
>>> Hello world from process 0 of 1
>>>
>>> Goodbye world from process 0 of 1
>>> Process 0 of 1 is on acme11.ciemat.es
>>> Hello world from process 0 of 1
>>> Goodbye world from process 0 of 1
>>>
>>>
>>> $ srun -n 2 --mpi=mvapich ./helloWorldMPI
>>> Process 0 of 1 is on acme11.ciemat.es
>>> Hello world from process 0 of 1
>>> Goodbye world from process 0 of 1
>>>
>>> Process 0 of 1 is on acme11.ciemat.es
>>> Hello world from process 0 of 1
>>> Goodbye world from process 0 of 1
>>>
>>> $ srun -n 2 --mpi=pmi2 ./helloWorldMPI
>>> srun: error: task 0 launch failed: Unspecified error
>>> srun: error: task 1 launch failed: Unspecified error
>>>
>>>
>>> ---
>>> ---
>>>
>>> Any clue on what's wrong?
>>>
>>> Thanks for your help,
>>>
>>>
>>> Manuel
>>>
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> mvapich-discuss mailing list
>>> mvapich-discuss at cse.ohio-state.edu
>>> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>>
>>


