[mvapich-discuss] MPI over hybrid infiniband cards

Devendar Bureddy bureddy at cse.ohio-state.edu
Fri Apr 13 18:16:57 EDT 2012


Hi Mike

- Are you using mpirun_rsh? If not, can you try with mpirun_rsh, passing the
MV2_IBA_HCA and MV2_DEFAULT_PORT run-time parameters on the command line?

example:
$mpirun_rsh -np 4 -hostfile hosts MV2_IBA_HCA=mlx4_0 MV2_DEFAULT_PORT=1
./a.out
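
If needed, the same form can be used to try the other HCA that ibstat
reports on the new nodes (mlx4_1 below is only taken from your ibstat
output, for illustration):
$mpirun_rsh -np 4 -hostfile hosts MV2_IBA_HCA=mlx4_1 MV2_DEFAULT_PORT=1
./a.out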

- Can you list all of the runtime parameters you are using?
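
example (lists the MV2_ variables set in your job environment; parameters
passed only on the mpirun_rsh command line will not show up here):
$env | grep MV2_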

- Can you please compile 1.7 or 1.8 with the debug options (--enable-g=all
--enable-error-checking=all --enable-fast=none) and run again? This should
show a backtrace when it fails in MPI_Init.
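
example (a minimal sketch of such a debug build; the source directory and
install prefix are just placeholders):
$cd mvapich2-1.8
$./configure --prefix=/path/to/debug-install --enable-g=all \
  --enable-error-checking=all --enable-fast=none
$make && make install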

- In the hanging case, can you attach gdb to rank 0 and get the backtrace?
This information should help us debug the issue further.
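
example (replace <pid> with the process id of rank 0 on its node):
$gdb -p <pid>
(gdb) bt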

-Devendar

On Fri, Apr 13, 2012 at 4:41 PM, MICHAEL S DAVIS
<msdavis at s383.jpl.nasa.gov> wrote:

> Hello,
>
> We have just upgraded our SGI ICE 8400 EX from 768 cores to over 1800
> cores.  The old system had an InfiniBand Mellanox Technologies MT26428
> card, which looks like this with ibstat:
>
> r1i0n0:~ # ibstat
> CA 'mlx4_0'
>       CA type: MT26428
>       Number of ports: 2
>       Firmware version: 2.7.0
>       Hardware version: b0
>       Node GUID: 0x003048fffff09498
>       System image GUID: 0x003048fffff0949b
>       Port 1:
>               State: Active
>               Physical state: LinkUp
>               Rate: 40
>               Base lid: 53
>               LMC: 0
>               SM lid: 84
>               Capability mask: 0x02510868
>               Port GUID: 0x003048fffff09499
>               Link layer: IB
>       Port 2:
>               State: Active
>               Physical state: LinkUp
>               Rate: 40
>               Base lid: 282
>               LMC: 0
>               SM lid: 76
>               Capability mask: 0x02510868
>               Port GUID: 0x003048fffff0949a
>               Link layer: IB
> r1i0n0:~ #
> The new cards have a different chipset but are supposed to be the same;
> however, they look like this when we run ibstat:
> r2i0n0:~ # ibstat
> CA 'mlx4_0'
>       CA type: MT26428
>       Number of ports: 1
>       Firmware version: 2.7.200
>       Hardware version: b0
>       Node GUID: 0x003048fffff4f18c
>       System image GUID: 0x003048fffff4f18f
>       Port 1:
>               State: Active
>               Physical state: LinkUp
>               Rate: 40
>               Base lid: 146
>               LMC: 0
>               SM lid: 84
>               Capability mask: 0x02510868
>               Port GUID: 0x003048fffff4f18d
>               Link layer: IB
> CA 'mlx4_1'
>       CA type: MT26428
>       Number of ports: 1
>       Firmware version: 2.7.200
>       Hardware version: b0
>       Node GUID: 0x003048fffff4f188
>       System image GUID: 0x003048fffff4f18b
>       Port 1:
>               State: Active
>               Physical state: LinkUp
>               Rate: 40
>               Base lid: 220
>               LMC: 0
>               SM lid: 76
>               Capability mask: 0x02510868
>               Port GUID: 0x003048fffff4f189
>               Link layer: IB
> r2i0n0:~ #
> Instead of having one card (mlx4_0) with 2 ports, the new card looks like
> two cards with one port each (mlx4_0 and mlx4_1).
>
> I have been using mvapich2 1.5.1p1 for over a year, and anything compiled
> and run on the old cards or the new cards works, but jobs run on a
> combination of the two either fail with "MPI INIT Failed" or run forever.
>
> I have tried the latest mvapich2 1.7 and mvapich2 1.8, and neither one
> seems to work.  Software compiled with SGI's MPT or openmpi seems to work
> across the cards.
>
> I also tried forcing the MPI jobs to use mlx4_0 port 1 with the following
> environment variables:
> setenv          MV2_IBA_HCA         mlx4_0
> setenv          MV2_DEFAULT_PORT    1
>
> But it doesn't seem to work.
>
> Any ideas about what I could be doing wrong, or what I might try to fix
> this problem, would be greatly appreciated.
>
> thanks
> Mike
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>



-- 
Devendar

