[mvapich-discuss] MPI over hybrid infiniband cards

Steve Heistand steve.heistand at nasa.gov
Fri Apr 13 17:26:42 EDT 2012


It may be unrelated, but we have found that, at least on our cluster, running on more than ~1000 cores
makes codes hang if they are started with mpiexec. mpirun_rsh works fine, though.
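
To try the mpirun_rsh route with the two settings quoted below, here is a rough sketch that only builds the command line and does not launch anything. It assumes the usual mpirun_rsh form of -np, -hostfile, then NAME=VALUE pairs ahead of the executable. The process count, the hostfile name "hosts", and the program "./a.out" are made-up placeholders, not values from this thread.

```python
from typing import Dict, List


def mpirun_rsh_argv(nprocs: int, hostfile: str, program: str,
                    env: Dict[str, str]) -> List[str]:
    """Build an mpirun_rsh argument list with NAME=VALUE settings.

    Assumption: mpirun_rsh accepts environment settings as NAME=VALUE
    tokens placed between its own options and the executable.
    """
    argv = ["mpirun_rsh", "-np", str(nprocs), "-hostfile", hostfile]
    # Each setting becomes one NAME=VALUE token, in insertion order.
    argv += [f"{name}={value}" for name, value in env.items()]
    argv.append(program)
    return argv


# The two variables come from the quoted message; everything else is a placeholder.
settings = {"MV2_IBA_HCA": "mlx4_0", "MV2_DEFAULT_PORT": "1"}
print(" ".join(mpirun_rsh_argv(1024, "hosts", "./a.out", settings)))
```

Passing the list to subprocess.run would start the job. Whether this gets around the mixed-card failure is untested.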

steve

On 04/13/2012 01:41 PM, MICHAEL S DAVIS wrote:
> Hello,
> 
> We have just upgraded our SGI ICE 8400 EX from 768 cores to over 1800 
> cores.  The old system had an InfiniBand: Mellanox Technologies MT26428 
> card which looks like this with ibstat:
> 
> r1i0n0:~ # ibstat
> CA 'mlx4_0'
>         CA type: MT26428
>         Number of ports: 2
>         Firmware version: 2.7.0
>         Hardware version: b0
>         Node GUID: 0x003048fffff09498
>         System image GUID: 0x003048fffff0949b
>         Port 1:
>                 State: Active
>                 Physical state: LinkUp
>                 Rate: 40
>                 Base lid: 53
>                 LMC: 0
>                 SM lid: 84
>                 Capability mask: 0x02510868
>                 Port GUID: 0x003048fffff09499
>                 Link layer: IB
>         Port 2:
>                 State: Active
>                 Physical state: LinkUp
>                 Rate: 40
>                 Base lid: 282
>                 LMC: 0
>                 SM lid: 76
>                 Capability mask: 0x02510868
>                 Port GUID: 0x003048fffff0949a
>                 Link layer: IB
> r1i0n0:~ #   
> 
> The new cards have a different chipset and are supposed to be the same, but 
> look like this when we run ibstat:
> r2i0n0:~ # ibstat
> CA 'mlx4_0'
>         CA type: MT26428
>         Number of ports: 1
>         Firmware version: 2.7.200
>         Hardware version: b0
>         Node GUID: 0x003048fffff4f18c
>         System image GUID: 0x003048fffff4f18f
>         Port 1:
>                 State: Active
>                 Physical state: LinkUp
>                 Rate: 40
>                 Base lid: 146
>                 LMC: 0
>                 SM lid: 84
>                 Capability mask: 0x02510868
>                 Port GUID: 0x003048fffff4f18d
>                 Link layer: IB
> CA 'mlx4_1'
>         CA type: MT26428
>         Number of ports: 1
>         Firmware version: 2.7.200
>         Hardware version: b0
>         Node GUID: 0x003048fffff4f188
>         System image GUID: 0x003048fffff4f18b
>         Port 1:
>                 State: Active
>                 Physical state: LinkUp
>                 Rate: 40
>                 Base lid: 220
>                 LMC: 0
>                 SM lid: 76
>                 Capability mask: 0x02510868
>                 Port GUID: 0x003048fffff4f189
>                 Link layer: IB
> r2i0n0:~ # 
> 
> Instead of having one card (mlx4_0) with 2 ports, the new cards look like 
> two cards with one port each (mlx4_0 and mlx4_1).
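
A quick way to see which nodes have which layout is to reduce each node's ibstat output to a map of CA name to port count and compare the maps. The sketch below assumes the output format matches the two listings above. The function name ports_per_ca is made up for illustration, and the sample strings are shortened copies of those listings.

```python
import re
from typing import Dict, Optional


def ports_per_ca(ibstat_text: str) -> Dict[str, int]:
    """Map each CA name in ibstat output to its reported port count."""
    layout: Dict[str, int] = {}
    current: Optional[str] = None
    for raw in ibstat_text.splitlines():
        # Tolerate mail-quote prefixes ("> ") and indentation.
        line = raw.lstrip("> \t").strip()
        ca = re.match(r"CA '([^']+)'", line)
        if ca:
            current = ca.group(1)
            layout[current] = 0
            continue
        ports = re.match(r"Number of ports:\s*(\d+)", line)
        if ports and current is not None:
            layout[current] = int(ports.group(1))
    return layout


OLD = """CA 'mlx4_0'
        CA type: MT26428
        Number of ports: 2
"""
NEW = """CA 'mlx4_0'
        CA type: MT26428
        Number of ports: 1
CA 'mlx4_1'
        CA type: MT26428
        Number of ports: 1
"""

print(ports_per_ca(OLD))  # {'mlx4_0': 2}
print(ports_per_ca(NEW))  # {'mlx4_0': 1, 'mlx4_1': 1}
```

Nodes whose maps differ are the ones a single job would be mixing.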
> 
> I have been using mvapich2 1.5.1p1 for over a year, and anything compiled 
> and run on only the old cards or only the new cards works, but jobs that run 
> on a combination of the two either fail with MPI INIT Failed or run forever.
> 
> I have tried the latest mvapich2 1.7 and mvapich2 1.8, and neither one seems 
> to work.  Software compiled with SGI's MPT or openmpi seems to work 
> across the cards.
> 
> I also tried forcing the MPI runs on mlx4_0 port 1 with the following 
> environment variables
> setenv          MV2_IBA_HCA         mlx4_0
> setenv          MV2_DEFAULT_PORT    1
> 
> But it doesn't seem to work.
> 
> Any ideas what I could be doing wrong or what I might try to fix this 
> problem would be greatly appreciated.
> 
> thanks
> Mike

-- 
************************************************************************
 Steve Heistand                           NASA Ames Research Center
 SciCon Group                             Mail Stop 258-6
 steve.heistand at nasa.gov  (650) 604-4369  Moffett Field, CA 94035-1000
************************************************************************
 "Any opinions expressed are those of our alien overlords, not my own."
