[mvapich-discuss] mvapich2 1.6 cannot run job on many nodes

Jonathan Perkins perkinjo at cse.ohio-state.edu
Fri Jul 15 10:01:38 EDT 2011


Thanks for your note. Good to know that MVAPICH2 1.6 is working fine.
MVAPICH2 1.0.3 is quite old and is no longer supported. The error code
seems to point to an IB-related error. You also appear to be running an
older OFED. Please update OFED to the latest stable version and use it
with the latest MVAPICH2 (1.6) release. In your earlier e-mail you also
indicated that you are using the Nemesis interface of MVAPICH2. You can
use the CH3-Gen2 interface to get the best performance, features, and
scalability.

Below is a link to the section on building the CH3-Gen2 interface from
the MVAPICH2 1.6 user guide.

http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.6.html#x1-100004.4
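
As a rough sketch (the install prefix, flag selection, and node names
below are placeholders based on your earlier configure line and on my
reading of that section, so please verify them against the guide), the
rebuild and a quick retest of the suspect node could look like this:

    # Build the OFA-IB-CH3 (gen2) interface instead of ch3:nemesis:ib.
    # The prefix and the set of flags here are illustrative only.
    ./configure --prefix=/usr/mpi/gcc/mvapich2-1.6.0-gen2 \
        --with-device=ch3:mrail --with-rdma=gen2 \
        --enable-f77 --enable-f90 --enable-cxx \
        --enable-sharedlibs=gcc --enable-romio
    make && make install

    # Re-run the bidirectional bandwidth test between the suspect node
    # and a known-good node with the new build (node names are placeholders).
    /usr/mpi/gcc/mvapich2-1.6.0-gen2/bin/mpirun_rsh -np 2 node01 node02 ./osu_bibw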

2011/7/15  <worldeb at ukr.net>:
>
>  Hi again,
>
> it seems I found the problem, or at least localized it.
> There is one node which produces these errors when I submit a job to it.
> In the case of the cpi code, it happens when this node is in a list of more than 8 nodes.
> For other codes it happens with just two nodes (for mvapich2 1.6 as well as 1.0.3).
> All codes work well when a job runs only on this problem node.
>
> I tried to exercise this node and the network using the OSU benchmarks, for example osu_bibw:
>
> mvapich2-1.6 passed without any problems.
> mvapich2-1.0.3 with osu mpiexec & torque shows these errors:
> send desc error
> [0] Abort: [] Got completion with error 9, vendor code=8a, dest rank=1
> at line 519 in file ibv_channel_manager.c
> [1] Abort: Got FATAL event 3
> at line 796 in file ibv_channel_manager.c
>
> mvapich1-1.0.1 passed without any errors
> openmpi-1.4.3 with -mca btl self,openib passed without errors.
>
> So, is it an IB problem? If so, why does it happen only with mvapich2?
>
> I tested the IB card on this node with the standard ib_rdma/read/send_bw/lat tools and it seems to work.
>
> Thanks,
> Egor.
>
>> I have 8 cores per node. Half of the nodes have 16 GB RAM and half have 32 GB.
>> The CPUs are:
>> Intel(R) Xeon(R) E5410 @ 2.33GHz
>> Intel(R) Xeon(R) E5472 @ 3.00GHz
>> Intel(R) Xeon(R) E5620 @ 2.40GHz
>> OFED 1.3.1-rc2 and CentOS 5 with kernel 2.6.18-53.1.21.el5.
>>
>> ulimit -a on all nodes:
>> core file size (blocks, -c) 0
>> data seg size (kbytes, -d) unlimited
>> max nice (-e) 0
>> file size (blocks, -f) unlimited
>> pending signals (-i) 139264
>> max locked memory (kbytes, -l) unlimited
>> max memory size (kbytes, -m) unlimited
>> open files (-n) 1024
>> pipe size (512 bytes, -p) 8
>> POSIX message queues (bytes, -q) 819200
>> max rt priority (-r) 0
>> stack size (kbytes, -s) 10240
>> cpu time (seconds, -t) unlimited
>> max user processes (-u) 139264
>> virtual memory (kbytes, -v) unlimited
>> file locks (-x) unlimited
>>
>> I have the same problem with both the gcc and the intel 10.1 compilers.
>>
>> Thanks,
>> Egor.
>>
>> > I've used the same configuration options but I have not been
>> > able to reproduce this problem. I've used varying numbers of cores
>> > (focusing on 321 and 512 cores), while running cpi and osu_mbw_mr with
>> > mpirun_rsh and hydra (mpiexec). Perhaps there is some missing
>> > information I need to reproduce this. How many cores per machine are
>> > you using? Perhaps a certain machine triggers the problem. Can you
>> > tell us what cpu and how much memory each machine has? Thanks in
>> > advance.
>> >
>> > 2011/7/14 <worldeb at ukr.net>:
>> > >
>> > > Hi folks,
>> > >
>> > > mvapich2-1.6-r4751
>> > > gcc (GCC) 4.1.2 20070626 (Red Hat 4.1.2-14)
>> > > InfiniBand: Mellanox Technologies MT25204
>> > > torque 2.1.8
>> > >
>> > > ./configure --prefix=/usr/mpi/gcc/mvapich2-1.6.0 --enable-f77 --enable-f90 --enable-cxx --enable-debuginfo --enable-smpcoll --enable-async-progress --enable-threads=default --with-hwloc --with-device=ch3:nemesis:ib --enable-sharedlibs=gcc --enable-romio
>> > >
>> > > I cannot run jobs on many nodes (for example >320 cores), whether using the batch system with the OSU mpiexec or the native mpiexec, or submitting them directly with mpiexec.hydra or mpirun_rsh.
>> > > Actually this number of 320 cores is not fixed; it changes from time to time, but mpirun_rsh does submit jobs successfully on fewer nodes.
>> > >
>> > > I am testing only with simple codes like "hello world" on each cpu, or with cpi from the examples, or the osu_benchmarks.
>> > >
>> > > The errors look like this:
>> > >
>> > > mpiexec.hydra -n 321 -f HOSTFILE ./test_mvapich2_gcc-1.6.0
>> > >
>> > > Fatal error in MPI_Init: Internal MPI error!, error stack:
>> > > MPIR_Init_thread(413): Initialization failed
>> > > (unknown)(): Internal MPI error!
>> > > =====================================================================================
>> > > = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
>> > > = EXIT CODE: 256
>> > > = CLEANING UP REMAINING PROCESSES
>> > > = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>> > > =====================================================================================
>> > > [proxy:0:0 at node01] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:906): assert (!closed) failed
>> > > [proxy:0:0 at node01] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
>> > > [proxy:0:0 at node01] main (./pm/pmiserv/pmip.c:214): demux engine error waiting for event
>> > > [mpiexec at head] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:70): one of the processes terminated badly; aborting
>> > > [mpiexec at head] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
>> > > [mpiexec at head] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:199): launcher returned error waiting for completion
>> > > [mpiexec at head] main (./ui/mpich/mpiexec.c:385): process manager error waiting for completion
>> > >
>> > >
>> > > I have no problem with the same codes compiled with the latest openmpi with IB and run on all the nodes.
>> > >
>> > > Any suggestions as to what the problem is and how to solve it?
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>
>



-- 
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo


