[mvapich-discuss] mvapich2 1.6 cannot run job on many nodes

worldeb at ukr.net
Thu Jul 14 13:02:34 EDT 2011


 Hi folks,

mvapich2-1.6-r4751
gcc (GCC) 4.1.2 20070626 (Red Hat 4.1.2-14)
InfiniBand: Mellanox Technologies MT25204
torque 2.1.8

./configure --prefix=/usr/mpi/gcc/mvapich2-1.6.0 --enable-f77 --enable-f90 --enable-cxx --enable-debuginfo --enable-smpcoll --enable-async-progress --enable-threads=default --with-hwloc --with-device=ch3:nemesis:ib --enable-sharedlibs=gcc --enable-romio

Jobs cannot be run on many nodes (for example, more than 320 cores), whether submitted through the batch system (with mpiexec from OSU or the native mpiexec) or launched directly with mpiexec.hydra or mpirun_rsh.
The 320-core threshold is not fixed; it changes from run to run, but mpirun_rsh does launch jobs successfully on smaller node counts.

I have tried only simple codes: a "hello world" that prints from each CPU, as well as cpi from the examples and the osu_benchmarks.

The errors look like this:

mpiexec.hydra -n 321 -f HOSTFILE ./test_mvapich2_gcc-1.6.0

Fatal error in MPI_Init: Internal MPI error!, error stack:
MPIR_Init_thread(413): Initialization failed
(unknown)(): Internal MPI error!
=====================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 256
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
=====================================================================================
[proxy:0:0 at node01] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:906): assert (!closed) failed
[proxy:0:0 at node01] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:0 at node01] main (./pm/pmiserv/pmip.c:214): demux engine error waiting for event
[mpiexec at head] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:70): one of the processes terminated badly; aborting
[mpiexec at head] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec at head] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:199): launcher returned error waiting for completion
[mpiexec at head] main (./ui/mpich/mpiexec.c:385): process manager error waiting for completion


The same codes compiled with the latest OpenMPI (with IB support) run without problems on all nodes.

Any suggestions as to what the problem might be, and how to solve it, would be appreciated.

Thanks in advance,
Egor.

