[mvapich-discuss] timeout problems with mpiexec and bash

Andy Wettstein ajw at illinois.edu
Mon Apr 11 14:38:06 EDT 2011


Hello,

I've been having problems launching a 2000+ core job with mpiexec 0.84
and mvapich2 1.6 when bash is the shell. We're running Scientific
Linux 6 (aka RHEL 6).

I get errors like this:

[unset]: connect failed with timeout
[unset]: Unable to connect to taub511 on 39404
Fatal error in MPI_Init_thread:
Other MPI error, error stack:
MPIR_Init_thread(413): Initialization failed
MPID_Init(203).......: channel initialization failed
MPID_Init(514).......: PMI_Init returned -1


The machines we are using have 12 cores each. Right now I'm launching on
192 x 12, so 2304 cores total.

Smaller core counts seem to work OK. For instance, a 1200-core job just
launched fine. Switching the shell to tcsh also lets me launch these
jobs; I haven't yet seen this job fail to start under tcsh.

I'll attach the environment and limits that are set for these jobs.
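
For anyone who wants to compare the two attached dumps without reading
them side by side, here is a minimal sketch in Python. It assumes each
attachment has been saved as plain text; the function names
parse_env_dump and diff_env are made up for this example and are not
part of mpiexec, mvapich2, or Torque. It keeps only NAME=value lines
(so the Torque prologue and the limits tables are skipped) and strips
the single quotes that bash's `set` puts around some values, so that
quoting alone does not show up as a difference.

```python
import re

# Matches NAME=value lines as printed by `env` (tcsh) or `set` (bash).
ENV_LINE = re.compile(r"^([A-Za-z_][A-Za-z0-9_]*)=(.*)$")


def parse_env_dump(text):
    """Return a dict of the NAME=value lines in a dump, ignoring other lines."""
    env = {}
    for line in text.splitlines():
        match = ENV_LINE.match(line)
        if not match:
            continue  # prologue, limits tables, warnings, blank lines
        name, value = match.groups()
        # bash's `set` single-quotes some values; drop one matching pair.
        if len(value) >= 2 and value[0] == value[-1] == "'":
            value = value[1:-1]
        env[name] = value
    return env


def diff_env(a, b):
    """Return {name: (value_in_a, value_in_b)} for names whose values differ.

    A name missing from one side is reported with None on that side.
    """
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }
```

Calling diff_env(parse_env_dump(tcsh_text), parse_env_dump(bash_text))
on the contents of the two attachments lists every variable that
differs between the tcsh and bash runs, for example SHELL and
LD_LIBRARY_PATH.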

I asked on the mpiexec mailing list, and they believe I must be
hitting some timeout in the mvapich2 startup code.

If you need any more info, just let me know.

Thanks
andy


-- 
andy wettstein
unix administrator
department of physics
university of illinois at urbana-champaign

-------------- next part --------------
----------------------------------------
Begin Torque Prologue (Fri Apr  8 16:57:50 2011)
Job ID:           398
Username:         ajw
Group:            ajw
Job Name:         tcsh-mvapich2.submit
Limits:           ncpus=1,neednodes=193:ppn=12,nodes=193:ppn=12,walltime=01:00:00
Job Queue:        secondary
Account:          Unknown
Nodes:            taub257 taub258 taub259 taub260 taub261 taub262 taub263 taub264 taub265 taub266 taub267 taub268 taub269 taub270 taub271 taub272 taub321 taub322 taub323 taub324 taub325 taub326 taub327 taub328 taub331 taub332 taub333 taub334 taub335 taub336 taub337 taub338 taub339 taub341 taub342 taub343 taub345 taub346 taub347 taub348 taub349 taub350 taub351 taub352 taub354 taub355 taub356 taub357 taub358 taub359 taub360 taub361 taub362 taub363 taub364 taub365 taub366 taub367 taub368 taub369 taub371 taub372 taub373 taub374 taub375 taub376 taub377 taub378 taub379 taub380 taub381 taub382 taub383 taub384 taub385 taub386 taub387 taub388 taub389 taub390 taub391 taub392 taub393 taub394 taub395 taub396 taub397 taub398 taub399 taub400 taub401 taub402 taub403 taub404 taub405 taub406 taub407 taub408 taub409 taub411 taub412 taub413 taub414 taub415 taub416 taub417 taub418 taub419 taub420 taub421 taub422 taub423 taub424 taub425 taub426 taub428 taub429 taub430 taub431 taub432 taub433 taub434 taub435 taub436 taub438 taub440 taub441 taub442 taub443 taub444 taub445 taub446 taub447 taub448 taub449 taub450 taub451 taub452 taub453 taub454 taub455 taub456 taub457 taub458 taub459 taub460 taub461 taub462 taub463 taub464 taub465 taub467 taub468 taub469 taub470 taub471 taub472 taub473 taub474 taub475 taub476 taub477 taub478 taub479 taub480 taub481 taub482 taub485 taub486 taub487 taub488 taub489 taub490 taub491 taub492 taub493 taub494 taub495 taub496 taub497 taub498 taub499 taub500 taub501 taub502 taub504 taub505 taub506 taub507 taub508 taub509 taub510 taub511 
End Torque Prologue
----------------------------------------
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
soft limits
cputime      unlimited
filesize     unlimited
datasize     unlimited
stacksize    10240 kbytes
coredumpsize 0 kbytes
memoryuse    unlimited
vmemoryuse   unlimited
descriptors  65536 
memorylocked unlimited
maxproc      1024 

hard limits
cputime      unlimited
filesize     unlimited
datasize     unlimited
stacksize    unlimited
coredumpsize unlimited
memoryuse    unlimited
vmemoryuse   unlimited
descriptors  65536 
memorylocked unlimited
maxproc      193002 
umask
2

PATH=/usr/local/mpiexec/bin:/usr/local/mvapich2-1.6-gcc/bin:/usr/local/vim-7.3/bin:/bin:/usr/bin:/usr/lpp/mmfs/bin:/usr/lpp/mmfs/bin
PBS_O_QUEUE=secondary
PBS_O_HOME=/home/ajw
PBS_O_LANG=C
PBS_O_LOGNAME=ajw
PBS_O_PATH=/usr/local/mpiexec/bin:/usr/local/mvapich2-1.6-gccdebug/bin:/usr/local/torque/bin:/usr/local/vim-7.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/usr/lpp/mmfs/bin:/usr/local/moab-6.0.2.s1324/bin:/home/ajw/bin
PBS_O_MAIL=/var/spool/mail/ajw
PBS_O_SHELL=/bin/bash
PBS_O_HOST=taubh2
PBS_SERVER=taubm1.campuscluster.illinois.edu
PBS_O_WORKDIR=/home/ajw/scratch/mt-III-TEST
HOME=/home/ajw
LOGNAME=ajw
PBS_JOBNAME=tcsh-mvapich2.submit
PBS_JOBID=398.taubm1.campuscluster.illinois.edu
PBS_QUEUE=secondary
SHELL=/bin/tcsh
USER=ajw
PBS_JOBCOOKIE=21AA054E62744DDE6627D2EB7DA2860C
PBS_NODENUM=0
PBS_TASKNUM=1
PBS_MOMPORT=15003
PBS_NODEFILE=/var/spool/torque/aux//398.taubm1.campuscluster.illinois.edu
PBS_GPUFILE=/var/spool/torque/aux//398.taubm1.campuscluster.illinois.edugpu
PBS_NUM_NODES=193
PBS_NUM_PPN=12
PBS_VERSION=TORQUE-3.0.1-snap.201104061727
PBS_VNODENUM=0
PBS_ENVIRONMENT=PBS_BATCH
ENVIRONMENT=BATCH
HOSTTYPE=x86_64-linux
VENDOR=unknown
OSTYPE=linux
MACHTYPE=x86_64
SHLVL=2
PWD=/home/ajw
GROUP=ajw
HOST=taub511
MAIL=/var/spool/mail/ajw
HOSTNAME=taub511
LS_COLORS=
G_BROKEN_FILENAMES=1
LESSOPEN=|/usr/bin/lesspipe.sh %s
MODULESHOME=/usr/share/Modules
MODULEPATH=/usr/share/Modules/modulefiles:/etc/modulefiles:/usr/apps/modulefiles:/usr/local/modulefiles
LOADEDMODULES=vim/7.3:mvapich2/mpiexec:mvapich2/1.6-gcc
LD_LIBRARY_PATH=/usr/local/mvapich2-1.6-gcc/lib
MANPATH=/usr/local/mpiexec/man:/usr/local/mvapich2-1.6-gcc/share/man:/usr/local/vim-7.3/share/man:/usr/man
_LMFILES_=/usr/local/modulefiles/vim/7.3:/usr/local/modulefiles/mvapich2/mpiexec:/usr/local/modulefiles/mvapich2/1.6-gcc
-------------- next part --------------
----------------------------------------
Begin Torque Prologue (Fri Apr  8 16:08:47 2011)
Job ID:           391
Username:         ajw
Group:            ajw
Job Name:         namd-mvapich2.submit
Limits:           ncpus=1,neednodes=193:ppn=12,nodes=193:ppn=12,walltime=00:01:30
Job Queue:        secondary
Account:          Unknown
Nodes:            taub257 taub258 taub259 taub260 taub261 taub262 taub263 taub264 taub265 taub266 taub267 taub268 taub269 taub270 taub271 taub272 taub321 taub322 taub323 taub324 taub325 taub326 taub327 taub328 taub331 taub332 taub333 taub334 taub335 taub336 taub337 taub338 taub339 taub341 taub342 taub343 taub345 taub346 taub347 taub348 taub349 taub350 taub351 taub352 taub354 taub355 taub356 taub357 taub358 taub359 taub360 taub361 taub362 taub363 taub364 taub365 taub366 taub367 taub368 taub369 taub371 taub372 taub373 taub374 taub375 taub376 taub377 taub378 taub379 taub380 taub381 taub382 taub383 taub384 taub385 taub386 taub387 taub388 taub389 taub390 taub391 taub392 taub393 taub394 taub395 taub396 taub397 taub398 taub399 taub400 taub401 taub402 taub403 taub404 taub405 taub406 taub407 taub408 taub409 taub411 taub412 taub413 taub414 taub415 taub416 taub417 taub418 taub419 taub420 taub421 taub422 taub423 taub424 taub425 taub426 taub428 taub429 taub430 taub431 taub432 taub433 taub434 taub435 taub436 taub438 taub440 taub441 taub442 taub443 taub444 taub445 taub446 taub447 taub448 taub449 taub450 taub451 taub452 taub453 taub454 taub455 taub456 taub457 taub458 taub459 taub460 taub461 taub462 taub463 taub464 taub465 taub467 taub468 taub469 taub470 taub471 taub472 taub473 taub474 taub475 taub476 taub477 taub478 taub479 taub480 taub481 taub482 taub485 taub486 taub487 taub488 taub489 taub490 taub491 taub492 taub493 taub494 taub495 taub496 taub497 taub498 taub499 taub500 taub501 taub502 taub504 taub505 taub506 taub507 taub508 taub509 taub510 taub511 
End Torque Prologue
----------------------------------------
soft limits
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 193002
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 65536
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1024
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

hard limits
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 193002
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 65536
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 193002
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
BASH=/bin/bash
BASHOPTS=cmdhist:extquote:force_fignore:hostcomplete:interactive_comments:progcomp:promptvars:sourcepath
BASH_ALIASES=()
BASH_ARGC=()
BASH_ARGV=()
BASH_CMDS=()
BASH_LINENO=([0]="0")
BASH_SOURCE=()
BASH_VERSINFO=([0]="4" [1]="1" [2]="2" [3]="1" [4]="release" [5]="x86_64-koji-linux-gnu")
BASH_VERSION='4.1.2(1)-release'
CVS_RSH=/usr/bin/ssh
DIRSTACK=()
ENVIRONMENT=BATCH
EUID=394298
GROUPS=()
G_BROKEN_FILENAMES=1
HISTCONTROL=ignoredups
HISTSIZE=1000
HOME=/home/ajw
HOSTNAME=taub511
HOSTTYPE=x86_64
IFS=$' \t\n'
LD_LIBRARY_PATH=/usr/local/mvapich2-1.5-gcc/lib:/usr/local/torque/lib:/usr/local/moab-6.0.2.s1324/lib
LESSOPEN='|/usr/bin/lesspipe.sh %s'
LOADEDMODULES=vim/7.3:torque/3.0:moab/6.0.2:env/taub:mvapich2/mpiexec:mvapich2/1.5-gcc
LOGNAME=ajw
MACHTYPE=x86_64-koji-linux-gnu
MAIL=/var/spool/mail/ajw
MANPATH=/usr/local/mpiexec/man:/usr/local/mvapich2-1.5-gcc/share/man:/usr/local/torque/man:/usr/local/sysman:/usr/local/vim-7.3/share/man:/usr/man:/usr/local/moab-6.0.2.s1324/man
MODULEPATH=/usr/share/Modules/modulefiles:/etc/modulefiles:/usr/apps/modulefiles:/usr/local/modulefiles
MODULESHOME=/usr/share/Modules
OPTERR=1
OPTIND=1
OSTYPE=linux-gnu
PATH=/usr/local/mpiexec/bin:/usr/local/mvapich2-1.5-gcc/bin:/usr/local/torque/bin:/usr/local/vim-7.3/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/usr/lpp/mmfs/bin:/usr/local/moab-6.0.2.s1324/bin:/home/ajw/bin
PBS_ENVIRONMENT=PBS_BATCH
PBS_GPUFILE=/var/spool/torque/aux//391.taubm1.campuscluster.illinois.edugpu
PBS_JOBCOOKIE=CED19645D31285870482F3CA0AA726F3
PBS_JOBID=391.taubm1.campuscluster.illinois.edu
PBS_JOBNAME=namd-mvapich2.submit
PBS_MOMPORT=15003
PBS_NODEFILE=/var/spool/torque/aux//391.taubm1.campuscluster.illinois.edu
PBS_NODENUM=0
PBS_NUM_NODES=193
PBS_NUM_PPN=12
PBS_O_HOME=/home/ajw
PBS_O_HOST=taubh2
PBS_O_LANG=C
PBS_O_LOGNAME=ajw
PBS_O_MAIL=/var/spool/mail/ajw
PBS_O_PATH=/usr/local/mpiexec/bin:/usr/local/mvapich2-1.5-gcc/bin:/usr/local/torque/bin:/usr/local/vim-7.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/usr/lpp/mmfs/bin:/usr/local/moab-6.0.2.s1324/bin:/home/ajw/bin
PBS_O_QUEUE=secondary
PBS_O_SHELL=/bin/bash
PBS_O_WORKDIR=/home/ajw/scratch/mt-III-TEST
PBS_QUEUE=secondary
PBS_SERVER=taubm1.campuscluster.illinois.edu
PBS_TASKNUM=1
PBS_VERSION=TORQUE-3.0.1-snap.201104061727
PBS_VNODENUM=0
PIPESTATUS=([0]="0")
PPID=14574
PS4='+ '
PWD=/home/ajw
SHELL=/bin/bash
SHELLOPTS=braceexpand:hashall:interactive-comments
SHLVL=2
TERM=dumb
TMPDIR=/tmp
UID=394298
USER=ajw
_=-Ha
_LMFILES_=/usr/local/modulefiles/vim/7.3:/usr/local/modulefiles/torque/3.0:/usr/local/modulefiles/moab/6.0.2:/usr/local/modulefiles/env/taub:/usr/local/modulefiles/mvapich2/mpiexec:/usr/local/modulefiles/mvapich2/1.5-gcc

