[mvapich-discuss] mvapich2 1.6 cannot run job on many nodes
worldeb at ukr.net
Thu Jul 14 13:02:34 EDT 2011
Hi folks,
mvapich2-1.6-r4751
gcc (GCC) 4.1.2 20070626 (Red Hat 4.1.2-14)
InfiniBand: Mellanox Technologies MT25204
torque 2.1.8
./configure --prefix=/usr/mpi/gcc/mvapich2-1.6.0 --enable-f77 --enable-f90 --enable-cxx --enable-debuginfo --enable-smpcoll --enable-async-progress --enable-threads=default --with-hwloc --with-device=ch3:nemesis:ib --enable-sharedlibs=gcc --enable-romio
I cannot run jobs on many nodes (for example, >320 cores), either through the batch system using the OSU mpiexec or the native mpiexec, or by submitting them directly with mpiexec.hydra or mpirun_rsh.
Actually, this number of 320 cores is not fixed; it changes from time to time, but mpirun_rsh does submit jobs successfully on fewer nodes.
I have tried only simple codes, such as a "hello world" on each CPU, as well as cpi from the examples and the osu_benchmarks.
The errors look like this:
mpiexec.hydra -n 321 -f HOSTFILE ./test_mvapich2_gcc-1.6.0
Fatal error in MPI_Init: Internal MPI error!, error stack:
MPIR_Init_thread(413): Initialization failed
(unknown)(): Internal MPI error!
=====================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= EXIT CODE: 256
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
=====================================================================================
[proxy:0:0 at node01] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:906): assert (!closed) failed
[proxy:0:0 at node01] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:0 at node01].ac.at] main (./pm/pmiserv/pmip.c:214): demux engine error waiting for event
[mpiexec at head] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:70): one of the processes terminated badly; aborting
[mpiexec at head] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec at head] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:199): launcher returned error waiting for completion
[mpiexec at head] main (./ui/mpich/mpiexec.c:385): process manager error waiting for completion
I have no problem with the same codes when compiled with the latest Open MPI over IB and run on all nodes.
Any suggestions as to what the problem is and how to solve it?
Thanks in advance,
Egor.