[mvapich-discuss] mvapich2 1.6 cannot run job on many nodes

Jonathan Perkins perkinjo at cse.ohio-state.edu
Thu Jul 14 23:13:42 EDT 2011


Hello.  I've used the same configuration options but have not been
able to reproduce this problem.  I've tried a varying number of cores
(focusing on 321 and 512), running cpi and osu_mbw_mr with both
mpirun_rsh and hydra (mpiexec).  Perhaps I'm missing some information
needed to reproduce this.  How many cores per machine are you using?
Perhaps a particular machine triggers the problem.  Can you tell us
what CPU and how much memory each machine has?  Thanks in advance.
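
For reference, the invocations were roughly of the following form (the
hostfile name and binary locations here are placeholders, not the exact
paths used):

mpirun_rsh -np 512 -hostfile hosts ./cpi
mpirun_rsh -np 512 -hostfile hosts ./osu_mbw_mr
mpiexec -n 512 -f hosts ./osu_mbw_mr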

2011/7/14  <worldeb at ukr.net>:
>
>  Hi folks,
>
> mvapich2-1.6-r4751
> gcc (GCC) 4.1.2 20070626 (Red Hat 4.1.2-14)
> InfiniBand: Mellanox Technologies MT25204
> torque 2.1.8
>
> ./configure --prefix=/usr/mpi/gcc/mvapich2-1.6.0 --enable-f77 --enable-f90 --enable-cxx --enable-debuginfo --enable-smpcoll --enable-async-progress --enable-threads=default --with-hwloc --with-device=ch3:nemesis:ib --enable-sharedlibs=gcc --enable-romio
>
> We cannot run jobs on many nodes (for example >320 cores), whether they are submitted through the batch system with the OSU mpiexec or the native mpiexec, or launched directly with mpiexec.hydra or mpirun_rsh.
> The 320-core threshold is not fixed; it changes from time to time, but on fewer nodes mpirun_rsh definitely starts jobs successfully.
>
> So far I have only tried simple codes, such as a "hello world" on each CPU, as well as cpi from the examples and the osu_benchmarks.
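>
> The "hello world" test is just a minimal MPI program, roughly like this (a sketch of what runs on every rank):
>
> #include <mpi.h>
> #include <stdio.h>
>
> int main(int argc, char **argv)
> {
>     int rank, size;
>
>     /* Initialize MPI and report this rank's position in MPI_COMM_WORLD */
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>     printf("hello world from rank %d of %d\n", rank, size);
>     MPI_Finalize();
>     return 0;
> }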
>
> Errors are like:
>
> mpiexec.hydra -n 321 -f HOSTFILE ./test_mvapich2_gcc-1.6.0
>
> Fatal error in MPI_Init: Internal MPI error!, error stack:
> MPIR_Init_thread(413): Initialization failed
> (unknown)(): Internal MPI error!
> =====================================================================================
> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> =   EXIT CODE: 256
> =   CLEANING UP REMAINING PROCESSES
> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> =====================================================================================
> [proxy:0:0 at node01] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:906): assert (!closed) failed
> [proxy:0:0 at node01] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
> [proxy:0:0 at node01] main (./pm/pmiserv/pmip.c:214): demux engine error waiting for event
> [mpiexec at head] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:70): one of the processes terminated badly; aborting
> [mpiexec at head] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
> [mpiexec at head] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:199): launcher returned error waiting for completion
> [mpiexec at head] main (./ui/mpich/mpiexec.c:385): process manager error waiting for completion
>
>
> The same codes compiled with the latest Open MPI (with IB) run on all nodes without any problem.
>
> Any suggestions on what the problem might be and how to solve it?
>
> Thanks in advance,
> Egor.
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>
>



-- 
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo


