[mvapich-discuss] mvapich2 1.6 cannot run job on many nodes

worldeb at ukr.net
Fri Jul 15 02:34:29 EDT 2011


 Hi Jonathan,

I have 8 cores per node. Half of the nodes have 16 GB of RAM and half have 32 GB.
The CPUs are:
Intel(R) Xeon(R) E5410  @ 2.33GHz
Intel(R) Xeon(R) E5472  @ 3.00GHz
Intel(R) Xeon(R) E5620  @ 2.40GHz
OFED 1.3.1-rc2 and CentOS 5 with kernel 2.6.18-53.1.21.el5.

ulimit -a on all the nodes:
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
max nice                        (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 139264
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
max rt priority                 (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 139264
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
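
For reference, this is roughly how such limits can be double-checked on every node; the hostfile path and format (one hostname per line) are just examples:

# loop over the unique hostnames in a hostfile and print the
# open-files and max-locked-memory limits seen by a remote shell
for host in $(sort -u HOSTFILE); do
    echo "== $host =="
    ssh "$host" 'ulimit -n; ulimit -l'
done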

I have the same problem with both the gcc and the Intel 10.1 compilers.

Thanks,
Egor.

> I've used the same configuration options but I have not been
> able to reproduce this problem. I've used a varying number of cores
> (focusing on 321 and 512 cores), while running cpi and osu_mbw_mr with
> mpirun_rsh and hydra (mpiexec). Perhaps there is some missing
> information I need to reproduce this. How many cores per machine are
> you using? Perhaps a certain machine triggers the problem. Can you
> tell us what cpu and how much memory each machine has? Thanks in
> advance.
> 
> 2011/7/14 <worldeb at ukr.net>:
> >
> >  Hi folks,
> >
> > mvapich2-1.6-r4751
> > gcc (GCC) 4.1.2 20070626 (Red Hat 4.1.2-14)
> > InfiniBand: Mellanox Technologies MT25204
> > torque 2.1.8
> >
> > ./configure --prefix=/usr/mpi/gcc/mvapich2-1.6.0 --enable-f77 --enable-f90 --enable-cxx --enable-debuginfo --enable-smpcoll --enable-async-progress --enable-threads=default --with-hwloc --with-device=ch3:nemesis:ib --enable-sharedlibs=gcc --enable-romio
> >
> > I cannot run jobs on many nodes (for example >320 cores), whether I submit them through the batch system with the OSU mpiexec or the native mpiexec, or launch them directly with mpiexec.hydra or mpirun_rsh.
> > Actually this number of 320 cores is not fixed; it changes from time to time, but on fewer nodes mpirun_rsh always submits jobs successfully.
> >
> > I have only been testing with simple codes like "hello world" on each CPU, or with cpi from the examples and the osu_benchmarks.
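> >
> > For example, building and launching one of these tests goes roughly like this (the install prefix matches the configure line above, HOSTFILE is just a placeholder, and the exact mpirun_rsh options may differ):
> >
> > # compile the cpi example that ships with the MVAPICH2 source tree
> > /usr/mpi/gcc/mvapich2-1.6.0/bin/mpicc -o cpi examples/cpi.c
> > # launch with the hydra process manager
> > /usr/mpi/gcc/mvapich2-1.6.0/bin/mpiexec.hydra -n 321 -f HOSTFILE ./cpi
> > # or launch the same binary with mpirun_rsh
> > /usr/mpi/gcc/mvapich2-1.6.0/bin/mpirun_rsh -np 321 -hostfile HOSTFILE ./cpi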
> >
> > The errors look like this:
> >
> > mpiexec.hydra -n 321 -f HOSTFILE ./test_mvapich2_gcc-1.6.0
> >
> > Fatal error in MPI_Init: Internal MPI error!, error stack:
> > MPIR_Init_thread(413): Initialization failed
> > (unknown)(): Internal MPI error!
> > =====================================================================================
> > =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> > =   EXIT CODE: 256
> > =   CLEANING UP REMAINING PROCESSES
> > =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> > =====================================================================================
> > [proxy:0:0 at node01] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:906): assert (!closed) failed
> > [proxy:0:0 at node01] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
> > [proxy:0:0 at node01] main (./pm/pmiserv/pmip.c:214): demux engine error waiting for event
> > [mpiexec at head] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:70): one of the processes terminated badly; aborting
> > [mpiexec at head] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
> > [mpiexec at head] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:199): launcher returned error waiting for completion
> > [mpiexec at head] main (./ui/mpich/mpiexec.c:385): process manager error waiting for completion
> >
> >
> > I have no problem with the same codes when they are compiled with the latest Open MPI over IB and run on all the nodes.
> >
> > Any suggestions as to what the problem is and how to solve it?
> >
> > Thanks in advance,
> > Egor.
> > _______________________________________________
> > mvapich-discuss mailing list
> > mvapich-discuss at cse.ohio-state.edu
> > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> >
> >
> 
> 
> 
> -- 
> Jonathan Perkins
> http://www.cse.ohio-state.edu/~perkinjo
> 
> 

