[mvapich-discuss] MPI over hybrid InfiniBand cards
MICHAEL S DAVIS
msdavis at s383.jpl.nasa.gov
Fri Apr 13 18:59:50 EDT 2012
Thanks for the replies.
On version 1.5.1p1 I was using mpirun, which was a PBS wrapper that
called mpirun_rsh.
On the latest 1.7 and on 1.8rc1 I used mpiexec, which calls
mpiexec.hydra and seems to work correctly with PBS. I looked at the
OSC version of mpiexec, but it looked very old, so I assumed it was
no longer maintained.
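For reference, this is roughly how the jobs are launched under PBS
(node counts and paths here are illustrative; Hydra appears to pick
up the PBS node list on its own):
#PBS -l select=2:ncpus=12:mpiprocs=12
cd $PBS_O_WORKDIR
/opt/sys/mvapich2/1.7/intel/bin/mpiexec -np 24 ./a.out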
/opt/sys/mvapich2/1.7/intel/bin # ./mpiname -a
MVAPICH2 1.7 2012-02-13 [1.7 r5225] ch3:mrail
Compilation
CC: icc -fpic -g -DNDEBUG -DNVALGRIND -O2
CXX: icc -g -DNDEBUG -DNVALGRIND -O2
F77: ifort -fpic -g -O2
FC: ifort -g -O2
Configuration
--prefix=/var/tmp/mvapich2-intel-1.7/opt/sys/mvapich2/1.7/intel
--enable-f77 --enable-fc --enable-cxx --enable-romio
--enable-threads=multiple --with-rdma=gen2 --enable-g=dbg
I also turned on traceback to look for problems within the code, but
the error was still just "MPI_Init failed" with no extra information.
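I can also rebuild 1.7 with the debug options you listed; the
configure line would be something like this (the install prefix here
is illustrative):
./configure --prefix=/opt/sys/mvapich2/1.7-dbg/intel \
    --enable-g=all --enable-error-checking=all --enable-fast=none
make && make install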
I will try the mpirun_rsh line you suggested, but the system is running
at 100% so it may take some time to get some results.
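For the record, this is the command I plan to try, adapted from your
example (our hostfile comes from PBS, so the exact names may differ):
$mpirun_rsh -np 4 -hostfile $PBS_NODEFILE MV2_IBA_HCA=mlx4_0 MV2_DEFAULT_PORT=1 ./a.out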
thanks again
Mike
Devendar Bureddy wrote:
> Hi Mike
>
> - Are you using mpirun_rsh? If not, can you try mpirun_rsh,
> passing the MV2_IBA_HCA and MV2_DEFAULT_PORT run-time parameters
> on the command line?
>
> example:
> $mpirun_rsh -np 4 -hostfile hosts
> MV2_IBA_HCA=mlx4_0 MV2_DEFAULT_PORT=1 ./a.out
>
> - Can you list all the runtime parameters you are using?
>
> - Can you please compile 1.7 or 1.8 with the debug options
> (--enable-g=all --enable-error-checking=all --enable-fast=none)
> and run again? This should show a backtrace when it fails in
> MPI_Init.
>
> - In the hanging case, can you attach gdb to rank 0 and get the
> backtrace? This information should help to debug the issue further.
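>
> example (attach on the node where rank 0 runs; <pid> is a
> placeholder for the rank-0 process id):
> $gdb -p <pid>
> (gdb) thread apply all bt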
>
> -Devendar
>
> On Fri, Apr 13, 2012 at 4:41 PM, MICHAEL S DAVIS
> <msdavis at s383.jpl.nasa.gov> wrote:
>
> Hello,
>
> We have just upgraded our SGI ICE 8400 EX from 768 cores to over
> 1800 cores. The old system had a Mellanox Technologies MT26428
> InfiniBand card, which looks like this with ibstat:
>
> r1i0n0:~ # ibstat
> CA 'mlx4_0'
> CA type: MT26428
> Number of ports: 2
> Firmware version: 2.7.0
> Hardware version: b0
> Node GUID: 0x003048fffff09498
> System image GUID: 0x003048fffff0949b
> Port 1:
> State: Active
> Physical state: LinkUp
> Rate: 40
> Base lid: 53
> LMC: 0
> SM lid: 84
> Capability mask: 0x02510868
> Port GUID: 0x003048fffff09499
> Link layer: IB
> Port 2:
> State: Active
> Physical state: LinkUp
> Rate: 40
> Base lid: 282
> LMC: 0
> SM lid: 76
> Capability mask: 0x02510868
> Port GUID: 0x003048fffff0949a
> Link layer: IB
> r1i0n0:~ #
> The new cards have a different chipset and are supposed to be the
> same, but they look like this when we run ibstat:
> r2i0n0:~ # ibstat
> CA 'mlx4_0'
> CA type: MT26428
> Number of ports: 1
> Firmware version: 2.7.200
> Hardware version: b0
> Node GUID: 0x003048fffff4f18c
> System image GUID: 0x003048fffff4f18f
> Port 1:
> State: Active
> Physical state: LinkUp
> Rate: 40
> Base lid: 146
> LMC: 0
> SM lid: 84
> Capability mask: 0x02510868
> Port GUID: 0x003048fffff4f18d
> Link layer: IB
> CA 'mlx4_1'
> CA type: MT26428
> Number of ports: 1
> Firmware version: 2.7.200
> Hardware version: b0
> Node GUID: 0x003048fffff4f188
> System image GUID: 0x003048fffff4f18b
> Port 1:
> State: Active
> Physical state: LinkUp
> Rate: 40
> Base lid: 220
> LMC: 0
> SM lid: 76
> Capability mask: 0x02510868
> Port GUID: 0x003048fffff4f189
> Link layer: IB
> r2i0n0:~ #
> Instead of having one card (mlx4_0) with two ports, the new cards
> look like two cards with one port each (mlx4_0 and mlx4_1).
>
> I have been using mvapich2 1.5.1p1 for over a year. Anything
> compiled and run entirely on the old cards or entirely on the new
> cards works, but jobs that run on a combination of the two either
> fail with "MPI_Init failed" or run forever.
>
> I have tried the latest mvapich2 1.7 and mvapich2 1.8, and neither
> one seems to work. Software compiled with SGI's MPT or Open MPI
> seems to work across the cards.
>
> I also tried forcing the MPI runs onto mlx4_0 port 1 with the
> following environment variables:
> setenv MV2_IBA_HCA mlx4_0
> setenv MV2_DEFAULT_PORT 1
>
> But it doesn't seem to work.
>
> Any ideas about what I could be doing wrong, or what I might try
> in order to fix this problem, would be greatly appreciated.
>
> thanks
> Mike
>
>
> --
> Devendar