[mvapich-discuss] Hydra initialization timeout on mvapich2 1.6
davidr at ressman.org
Wed Apr 6 16:54:19 EDT 2011
Hello all,
I'm running into a problem with hydra on mvapich2-1.6, and I'm sure
I'm doing something wrong, but I can't figure out what it is. Here's
what the test environment currently looks like:
2 hosts, computea and computeb, running:
ubuntu 10.04 LTS
ofed 1.5.2_2
2 IP interfaces per host, hosta (GigE) and hosta-ib0 (ConnectX-2 QDR IB)
mvapich2-1.6, configured with:
--enable-sharedlibs=gcc \
--with-pm=hydra \
--enable-f77 \
--enable-fc \
--enable-cxx \
--enable-romio
I have a simple mpi_hello program, and my hosts file looks like:
computea:1
computeb:1
When I run the following command:
mpiexec -iface ib0 -v -f /tmp/hostfile -n 2 ./mpi_hello
I get the Hydra initialization debug messages, and then the job hangs
indefinitely. I can see that the mpi_hello processes have been started
on both nodes, along with hydra_pmi_proxy. If I set MPIEXEC_TIMEOUT,
the job times out and exits properly. I don't see anything
particularly useful in the debug output below, but I'm not familiar
at all with Hydra, so I doubt very much I'd be able to recognize the
problem.
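In case it helps, one thing I've been doing to rule out basic connectivity problems is probing TCP reachability between the nodes with a small Python sketch (the host and port here are just taken from the control-port string Hydra prints below, computea:32853 - adjust for your own run, since the port changes each time):

```python
import socket

def tcp_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    # Hypothetical values copied from one mpiexec run; the control port
    # is chosen dynamically, so read it from the debug output each time.
    print(tcp_reachable("computea", 32853))
```

Running this from computeb against computea's control port (and the reverse) should at least show whether plain TCP between the chosen interfaces works while the job is hung.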
What is most confusing is that precisely the same setup works
perfectly on mpich2 1.3.2p1 (over Ethernet). What did I break in
mvapich2?
Thanks!
The output follows:
-- BEGIN MPIEXEC OUTPUT --
mpiexec options:
----------------
Base path: /usr/local/pkg/software/modules_repo/mvapich2/1.6/bin/
Launcher: (null)
Debug level: 1
Enable X: -1
Global environment:
-------------------
MPIEXEC_TIMEOUT=20
MODULE_VERSION_STACK=3.2.7
MANPATH=/usr/local/pkg/software/modules_repo/mvapich2/1.6/share/man:/usr/man
TERM=xterm
SHELL=/bin/bash
HISTSIZE=1000
SSH_CLIENT=192.168.1018.100 44940 22
LIBRARY_PATH=/usr/local/pkg/software/modules_repo/mvapich2/1.6/lib
OLDPWD=/home/myuser
SSH_TTY=/dev/pts/0
USER=myuser
LD_LIBRARY_PATH=/usr/local/pkg/software/modules_repo/mvapich2/1.6/lib
CPATH=/usr/local/pkg/software/modules_repo/mvapich2/1.6/include
MODULE_VERSION=3.2.7
MAIL=/var/mail/myuser
PATH=/usr/local/pkg/software/modules_repo/mvapich2/1.6/bin:/usr/local/bin:/usr/bin:/bin:/sbin:/usr/sbin:/usr/local/sbin
PWD=/home/myuser/jobs/mpi_hello
_LMFILES_=/physical/gpfs/oak-hpc/home_01/grid_software/modules_modulefiles/mpi/mvapich2/1.6
LANG=en_US
MODULEPATH=/mnt/nfs/GRID_SOFTWARE/modules_modulefiles:/physical/gpfs/oak-hpc/home_01/grid_software/modules_modulefiles
LOADEDMODULES=mpi/mvapich2/1.6
PS1=\h \t \W [\!/$?] \$
HISTCONTROL=ignoreboth
PS2= >
SHLVL=1
HOME=/home/myuser
MODULE_VERSION=3.2.7
BASH_ENV=~/.bashrc
LOGNAME=myuser
SSH_CONNECTION=192.168.1018.100 44940 192.168.100.210 22
MODULESHOME=/usr/Modules/3.2.7
INCLUDE=/usr/local/pkg/software/modules_repo/mvapich2/1.6/include
HISTFILE=/home/myuser/.bash_history.d/history.computea
module=() { eval `/usr/Modules/$MODULE_VERSION/bin/modulecmd bash $*` }
_=/usr/local/pkg/software/modules_repo/mvapich2/1.6/bin/mpiexec
Hydra internal environment:
---------------------------
GFORTRAN_UNBUFFERED_PRECONNECTED=y
Proxy information:
*********************
Proxy ID: 1
-----------------
Proxy name: computea
Process count: 1
Proxy exec list:
....................
Exec: ./mpi_hello; Process count: 1
Proxy ID: 2
-----------------
Proxy name: computeb
Process count: 1
Proxy exec list:
....................
Exec: ./mpi_hello; Process count: 1
==================================================================================================
[mpiexec at computea] Timeout set to 20 (-1 means infinite)
[mpiexec at computea] Got a control port string of computea:32853
Proxy launch args:
/usr/local/pkg/software/modules_repo/mvapich2/1.6/bin/hydra_pmi_proxy
--control-port computea:32853 --debug --demux poll --pgid 0 --proxy-id
[mpiexec at computea] PMI FD: (null); PMI PORT: (null); PMI ID/RANK: -1
Arguments being passed to proxy 0:
--version 1.6rc3 --interface-env-name MPICH_INTERFACE_HOSTNAME
--hostname computea --global-core-map 0,1,1 --filler-process-map 0,1,1
--global-process-count 2 --auto-cleanup 1 --pmi-rank -1 --pmi-kvsname
kvs_11662_0 --pmi-process-mapping (vector,(0,2,1)) --ckpoint-num -1
--global-inherited-env 35 'MPIEXEC_TIMEOUT=20'
'MODULE_VERSION_STACK=3.2.7'
'MANPATH=/usr/local/pkg/software/modules_repo/mvapich2/1.6/share/man:/usr/man'
'TERM=xterm' 'SHELL=/bin/bash' 'HISTSIZE=1000'
'SSH_CLIENT=192.168.1018.100 44940 22'
'LIBRARY_PATH=/usr/local/pkg/software/modules_repo/mvapich2/1.6/lib'
'OLDPWD=/home/myuser' 'SSH_TTY=/dev/pts/0' 'USER=myuser'
'LD_LIBRARY_PATH=/usr/local/pkg/software/modules_repo/mvapich2/1.6/lib'
'CPATH=/usr/local/pkg/software/modules_repo/mvapich2/1.6/include'
'MODULE_VERSION=3.2.7' 'MAIL=/var/mail/myuser'
'PATH=/usr/local/pkg/software/modules_repo/mvapich2/1.6/bin:/usr/local/bin:/usr/bin:/bin:/sbin:/usr/sbin:/usr/local/sbin'
'PWD=/home/myuser/jobs/mpi_hello' '_LMFILES_
=/physical/gpfs/oak-hpc/home_01/grid_software/modules_modulefiles/mpi/mvapich2/1.6'
'LANG=en_US' 'MODULEPATH=/mnt/nfs/GRID_SOFTWARE/modules_modulefiles:/physical/gpfs/oak-hpc/home_01/grid_software/modules_modulefiles'
'LOADEDMODULES=mpi/mvapich2/1.6' 'PS1=\h \t \W [\!/$?] \$ '
'HISTCONTROL=ignoreboth' 'PS2= > ' 'SHLVL=1' 'HOME=/home/myuser'
'MODULE_VERSION=3.2.7' 'BASH_ENV=~/.bashrc' 'LOGNAME=myuser'
'SSH_CONNECTION=192.168.1018.100 44940 192.168.100.210 22'
'MODULESHOME=/usr/Modules/3.2.7'
'INCLUDE=/usr/local/pkg/software/modules_repo/mvapich2/1.6/include'
'HISTFILE=/home/myuser/.bash_history.d/history.computea' 'module=() {
eval `/usr/Modules/$MODULE_VERSION/bin/modulecmd bash $*` }'
'_=/usr/local/pkg/software/modules_repo/mvapich2/1.6/bin/mpiexec'
--global-user-env 0 --global-system-env 1
'GFORTRAN_UNBUFFERED_PRECONNECTED=y' --proxy-core-count 1 --exec
--exec-appnum 0 --exec-proc-count 1 --exec-local-env 0 --exec-wdir
/home/myuser/jobs/mpi_hello --exec-args 1 ./mpi_hello
[mpiexec at computea] PMI FD: (null); PMI PORT: (null); PMI ID/RANK: -1
Arguments being passed to proxy 1:
--version 1.6rc3 --interface-env-name MPICH_INTERFACE_HOSTNAME
--hostname computeb --global-core-map 1,1,0 --filler-process-map 1,1,0
--global-process-count 2 --auto-cleanup 1 --pmi-rank -1 --pmi-kvsname
kvs_11662_0 --pmi-process-mapping (vector,(0,2,1)) --ckpoint-num -1
--global-inherited-env 35 'MPIEXEC_TIMEOUT=20'
'MODULE_VERSION_STACK=3.2.7'
'MANPATH=/usr/local/pkg/software/modules_repo/mvapich2/1.6/share/man:/usr/man'
'TERM=xterm' 'SHELL=/bin/bash' 'HISTSIZE=1000'
'SSH_CLIENT=192.168.1018.100 44940 22'
'LIBRARY_PATH=/usr/local/pkg/software/modules_repo/mvapich2/1.6/lib'
'OLDPWD=/home/myuser' 'SSH_TTY=/dev/pts/0' 'USER=myuser'
'LD_LIBRARY_PATH=/usr/local/pkg/software/modules_repo/mvapich2/1.6/lib'
'CPATH=/usr/local/pkg/software/modules_repo/mvapich2/1.6/include'
'MODULE_VERSION=3.2.7' 'MAIL=/var/mail/myuser'
'PATH=/usr/local/pkg/software/modules_repo/mvapich2/1.6/bin:/usr/local/bin:/usr/bin:/bin:/sbin:/usr/sbin:/usr/local/sbin'
'PWD=/home/myuser/jobs/mpi_hello' '_LMFILES_
=/physical/gpfs/oak-hpc/home_01/grid_software/modules_modulefiles/mpi/mvapich2/1.6'
'LANG=en_US' 'MODULEPATH=/mnt/nfs/GRID_SOFTWARE/modules_modulefiles:/physical/gpfs/oak-hpc/home_01/grid_software/modules_modulefiles'
'LOADEDMODULES=mpi/mvapich2/1.6' 'PS1=\h \t \W [\!/$?] \$ '
'HISTCONTROL=ignoreboth' 'PS2= > ' 'SHLVL=1' 'HOME=/home/myuser'
'MODULE_VERSION=3.2.7' 'BASH_ENV=~/.bashrc' 'LOGNAME=myuser'
'SSH_CONNECTION=192.168.1018.100 44940 192.168.100.210 22'
'MODULESHOME=/usr/Modules/3.2.7'
'INCLUDE=/usr/local/pkg/software/modules_repo/mvapich2/1.6/include'
'HISTFILE=/home/myuser/.bash_history.d/history.computea' 'module=() {
eval `/usr/Modules/$MODULE_VERSION/bin/modulecmd bash $*` }'
'_=/usr/local/pkg/software/modules_repo/mvapich2/1.6/bin/mpiexec'
--global-user-env 0 --global-system-env 1
'GFORTRAN_UNBUFFERED_PRECONNECTED=y' --proxy-core-count 1 --exec
--exec-appnum 0 --exec-proc-count 1 --exec-local-env 0 --exec-wdir
/home/myuser/jobs/mpi_hello --exec-args 1 ./mpi_hello
[mpiexec at computea] Launch arguments:
/usr/local/pkg/software/modules_repo/mvapich2/1.6/bin/hydra_pmi_proxy
--control-port computea:32853 --debug --demux poll --pgid 0 --proxy-id
0 [mpiexec at computea] Launch arguments: /usr/bin/ssh -x computeb
"/usr/local/pkg/software/modules_repo/mvapich2/1.6/bin/hydra_pmi_proxy"
--control-port computea:32853 --debug --demux poll --pgid 0 --proxy-id
1 [proxy:0:0 at computea] got pmi command (from 0): init
pmi_version=1 pmi_subversion=1
[proxy:0:0 at computea] PMI response: cmd=response_to_init pmi_version=1
pmi_subversion=1 rc=0 [proxy:0:0 at computea] got pmi command (from 0):
get_maxes
[proxy:0:0 at computea] PMI response: cmd=maxes kvsname_max=256
keylen_max=64 vallen_max=1024 [proxy:0:0 at computea] got pmi command
(from 0): get_appnum
[proxy:0:0 at computea] PMI response: cmd=appnum appnum=0
[proxy:0:0 at computea] got pmi command (from 0): get_my_kvsname
[proxy:0:0 at computea] PMI response: cmd=my_kvsname kvsname=kvs_11662_0
[proxy:0:0 at computea] got pmi command (from 0): get_my_kvsname
[proxy:0:0 at computea] PMI response: cmd=my_kvsname kvsname=kvs_11662_0
[proxy:0:0 at computea] got pmi command (from 0): get
kvsname=kvs_11662_0 key=PMI_process_mapping [proxy:0:0 at computea] PMI
response: cmd=get_result rc=0 msg=success value=(vector,(0,2,1))
[proxy:0:0 at computea] got pmi command (from 0): put
kvsname=kvs_11662_0 key=MVAPICH2_0000 value=00000146:004c0051:004c0052:
[proxy:0:0 at computea] we don't understand this command put; forwarding
upstream [mpiexec at computea] [pgid: 0] got PMI command: cmd=put
kvsname=kvs_11662_0 key=MVAPICH2_0000
value=00000146:004c0051:004c0052:
[mpiexec at computea] PMI response to fd 6 pid 0: cmd=put_result rc=0
msg=success [proxy:0:0 at computea] we don't understand the response
put_result; forwarding downstream [proxy:0:0 at computea] got pmi command
(from 0): barrier_in
[proxy:0:0 at computea] forwarding command (cmd=barrier_in) upstream
[mpiexec at computea] [pgid: 0] got PMI command: cmd=barrier_in
[proxy:0:1 at computeb] got pmi command (from 4): init
pmi_version=1 pmi_subversion=1
[proxy:0:1 at computeb] PMI response: cmd=response_to_init pmi_version=1
pmi_subversion=1 rc=0 [proxy:0:1 at computeb] got pmi command (from 4):
get_maxes
[proxy:0:1 at computeb] PMI response: cmd=maxes kvsname_max=256
keylen_max=64 vallen_max=1024 [proxy:0:1 at computeb] got pmi command
(from 4): get_appnum
[proxy:0:1 at computeb] PMI response: cmd=appnum appnum=0
[proxy:0:1 at computeb] got pmi command (from 4): get_my_kvsname
[proxy:0:1 at computeb] PMI response: cmd=my_kvsname kvsname=kvs_11662_0
[proxy:0:1 at computeb] got pmi command (from 4): get_my_kvsname
[proxy:0:1 at computeb] PMI response: cmd=my_kvsname kvsname=kvs_11662_0
[proxy:0:1 at computeb] got pmi command (from 4): get
kvsname=kvs_11662_0 key=PMI_process_mapping [proxy:0:1 at computeb] PMI
response: cmd=get_result rc=0 msg=success value=(vector,(0,2,1))
[mpiexec at computea] [pgid: 0] got PMI command: cmd=put
kvsname=kvs_11662_0 key=MVAPICH2_0001
value=00000147:00580051:00580052:
[mpiexec at computea] PMI response to fd 7 pid 4: cmd=put_result rc=0
msg=success [proxy:0:1 at computeb] got pmi command (from 4): put
kvsname=kvs_11662_0 key=MVAPICH2_0001 value=00000147:00580051:00580052:
[proxy:0:1 at computeb] we don't understand this command put; forwarding
upstream [proxy:0:1 at computeb] we don't understand the response
put_result; forwarding downstream [mpiexec at computea] [pgid: 0] got PMI
command: cmd=barrier_in [mpiexec at computea] PMI response to fd 6 pid 4:
cmd=barrier_out [mpiexec at computea] PMI response to fd 7 pid 4:
cmd=barrier_out [proxy:0:0 at computea] PMI response: cmd=barrier_out
[proxy:0:1 at computeb] got pmi command (from 4): barrier_in
[proxy:0:1 at computeb] forwarding command (cmd=barrier_in) upstream
[proxy:0:0 at computea] got pmi command (from 0): get
kvsname=kvs_11662_0 key=MVAPICH2_0001
[mpiexec at computea] [pgid: 0] got PMI command: cmd=get
kvsname=kvs_11662_0 key=MVAPICH2_0001 [mpiexec at computea] PMI response
to fd 6 pid 0: cmd=get_result rc=0 msg=success
value=00000147:00580051:00580052:
[proxy:0:0 at computea] forwarding command (cmd=get kvsname=kvs_11662_0
key=MVAPICH2_0001) upstream [proxy:0:0 at computea] we don't understand
the response get_result; forwarding downstream [proxy:0:0 at computea]
got pmi command (from 0): get
kvsname=kvs_11662_0 key=MVAPICH2_0001
[proxy:0:0 at computea] forwarding command (cmd=get kvsname=kvs_11662_0
key=MVAPICH2_0001) upstream [proxy:0:1 at computeb] PMI response:
cmd=barrier_out [mpiexec at computea] [pgid: 0] got PMI command: cmd=get
kvsname=kvs_11662_0 key=MVAPICH2_0001 [mpiexec at computea] PMI response
to fd 6 pid 0: cmd=get_result rc=0 msg=success
value=00000147:00580051:00580052:
[mpiexec at computea] [pgid: 0] got PMI command: cmd=get
kvsname=kvs_11662_0 key=MVAPICH2_0000 [mpiexec at computea] PMI response
to fd 7 pid 4: cmd=get_result rc=0 msg=success
value=00000146:004c0051:004c0052:
[proxy:0:0 at computea] we don't understand the response get_result;
forwarding downstream [proxy:0:1 at computeb] got pmi command (from 4):
get
kvsname=kvs_11662_0 key=MVAPICH2_0000
[proxy:0:1 at computeb] forwarding command (cmd=get kvsname=kvs_11662_0
key=MVAPICH2_0000) upstream [proxy:0:0 at computea] got pmi command (from
0): barrier_in
[proxy:0:0 at computea] forwarding command (cmd=barrier_in) upstream
[mpiexec at computea] [pgid: 0] got PMI command: cmd=barrier_in
[proxy:0:1 at computeb] we don't understand the response get_result;
forwarding downstream [mpiexec at computea] [pgid: 0] got PMI command:
cmd=get kvsname=kvs_11662_0 key=MVAPICH2_0000 [mpiexec at computea] PMI
response to fd 7 pid 4: cmd=get_result rc=0 msg=success
value=00000146:004c0051:004c0052:
[proxy:0:1 at computeb] got pmi command (from 4): get
kvsname=kvs_11662_0 key=MVAPICH2_0000
[proxy:0:1 at computeb] forwarding command (cmd=get kvsname=kvs_11662_0
key=MVAPICH2_0000) upstream [proxy:0:1 at computeb] we don't understand
the response get_result; forwarding downstream [mpiexec at computea]
[pgid: 0] got PMI command: cmd=barrier_in [mpiexec at computea] PMI
response to fd 6 pid 4: cmd=barrier_out [mpiexec at computea] PMI
response to fd 7 pid 4: cmd=barrier_out [proxy:0:1 at computeb] got pmi
command (from 4): barrier_in
[proxy:0:1 at computeb] forwarding command (cmd=barrier_in) upstream
[proxy:0:0 at computea] PMI response: cmd=barrier_out
[proxy:0:1 at computeb] PMI response: cmd=barrier_out
[proxy:0:0 at computea] got pmi command (from 0): barrier_in
[proxy:0:0 at computea] forwarding command (cmd=barrier_in) upstream
[mpiexec at computea] [pgid: 0] got PMI command: cmd=barrier_in
[mpiexec at computea] [pgid: 0] got PMI command: cmd=barrier_in
[mpiexec at computea] PMI response to fd 6 pid 4: cmd=barrier_out
[mpiexec at computea] PMI response to fd 7 pid 4: cmd=barrier_out
[proxy:0:0 at computea] PMI response: cmd=barrier_out
[proxy:0:1 at computeb] got pmi command (from 4): barrier_in
[proxy:0:1 at computeb] forwarding command (cmd=barrier_in) upstream
[proxy:0:1 at computeb] PMI response: cmd=barrier_out
[proxy:0:0 at computea] got pmi command (from 0): barrier_in
[proxy:0:0 at computea] forwarding command (cmd=barrier_in) upstream
[mpiexec at computea] [pgid: 0] got PMI command: cmd=barrier_in
[mpiexec at computea] [pgid: 0] got PMI command: cmd=barrier_in
[mpiexec at computea] PMI response to fd 6 pid 4: cmd=barrier_out
[mpiexec at computea] PMI response to fd 7 pid 4: cmd=barrier_out
[proxy:0:0 at computea] PMI response: cmd=barrier_out
[proxy:0:1 at computeb] got pmi command (from 4): barrier_in
[proxy:0:1 at computeb] forwarding command (cmd=barrier_in) upstream
[proxy:0:1 at computeb] PMI response: cmd=barrier_out
[proxy:0:0 at computea] got pmi command (from 0): barrier_in
[proxy:0:0 at computea] forwarding command (cmd=barrier_in) upstream
[mpiexec at computea] [pgid: 0] got PMI command: cmd=barrier_in
[mpiexec at computea] [pgid: 0] got PMI command: cmd=barrier_in
[mpiexec at computea] PMI response to fd 6 pid 4: cmd=barrier_out
[mpiexec at computea] PMI response to fd 7 pid 4: cmd=barrier_out
[proxy:0:0 at computea] PMI response: cmd=barrier_out
[proxy:0:1 at computeb] got pmi command (from 4): barrier_in
[proxy:0:1 at computeb] forwarding command (cmd=barrier_in) upstream
[proxy:0:1 at computeb] PMI response: cmd=barrier_out [mpiexec at computea]
[pgid: 0] got PMI command: cmd=barrier_in [proxy:0:1 at computeb] got pmi
command (from 4): barrier_in
[proxy:0:1 at computeb] forwarding command (cmd=barrier_in) upstream
[proxy:0:0 at computea] got pmi command (from 0): barrier_in
[proxy:0:0 at computea] forwarding command (cmd=barrier_in) upstream
[mpiexec at computea] [pgid: 0] got PMI command: cmd=barrier_in
[mpiexec at computea] PMI response to fd 6 pid 0: cmd=barrier_out
[mpiexec at computea] PMI response to fd 7 pid 0: cmd=barrier_out
[proxy:0:0 at computea] PMI response: cmd=barrier_out
[proxy:0:1 at computeb] PMI response: cmd=barrier_out [mpiexec at computea]
[pgid: 0] got PMI command: cmd=barrier_in [proxy:0:1 at computeb] got pmi
command (from 4): barrier_in
[proxy:0:1 at computeb] forwarding command (cmd=barrier_in) upstream
[proxy:0:0 at computea] got pmi command (from 0): barrier_in
[proxy:0:0 at computea] forwarding command (cmd=barrier_in) upstream
[mpiexec at computea] [pgid: 0] got PMI command: cmd=barrier_in
[mpiexec at computea] PMI response to fd 6 pid 0: cmd=barrier_out
[mpiexec at computea] PMI response to fd 7 pid 0: cmd=barrier_out
[proxy:0:0 at computea] PMI response: cmd=barrier_out
[proxy:0:0 at computea] got pmi command (from 0): barrier_in
[proxy:0:0 at computea] forwarding command (cmd=barrier_in) upstream
[mpiexec at computea] [pgid: 0] got PMI command: cmd=barrier_in
[proxy:0:1 at computeb] PMI response: cmd=barrier_out [mpiexec at computea]
[pgid: 0] got PMI command: cmd=barrier_in [mpiexec at computea] PMI
response to fd 6 pid 4: cmd=barrier_out [mpiexec at computea] PMI
response to fd 7 pid 4: cmd=barrier_out [proxy:0:0 at computea] PMI
response: cmd=barrier_out [proxy:0:1 at computeb] got pmi command (from
4): barrier_in
[proxy:0:1 at computeb] forwarding command (cmd=barrier_in) upstream
[proxy:0:0 at computea] got pmi command (from 0): barrier_in
[proxy:0:0 at computea] forwarding command (cmd=barrier_in) upstream
[mpiexec at computea] [pgid: 0] got PMI command: cmd=barrier_in
[proxy:0:1 at computeb] PMI response: cmd=barrier_out [mpiexec at computea]
[pgid: 0] got PMI command: cmd=barrier_in [mpiexec at computea] PMI
response to fd 6 pid 4: cmd=barrier_out [mpiexec at computea] PMI
response to fd 7 pid 4: cmd=barrier_out [proxy:0:0 at computea] PMI
response: cmd=barrier_out [proxy:0:1 at computeb] got pmi command (from
4): barrier_in
[proxy:0:1 at computeb] forwarding command (cmd=barrier_in) upstream
[proxy:0:1 at computeb] PMI response: cmd=barrier_out
< at this point, output stops until the job times out or I Ctrl-C it >
-- END MPIEXEC OUTPUT --