[mvapich-discuss] MPI over hybrid infiniband cards
Steve Heistand
steve.heistand at nasa.gov
Fri Apr 13 17:26:42 EDT 2012
it may be unrelated, but we have found that, at least on our cluster, jobs running over ~1000 cores
will hang if they are started with mpiexec. mpirun_rsh works fine, though.
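as a rough sketch (the hostfile, process count, and binary below are just placeholders, not from your setup), the switch is simply a matter of launching through mpirun_rsh; it also accepts VAR=value pairs before the binary, so the HCA/port choice can be set per run:

mpirun_rsh -np 1024 -hostfile ./hosts ./a.out
mpirun_rsh -np 1024 -hostfile ./hosts MV2_IBA_HCA=mlx4_0 MV2_DEFAULT_PORT=1 ./a.out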
steve
On 04/13/2012 01:41 PM, MICHAEL S DAVIS wrote:
> Hello,
>
> We have just upgraded our SGI ICE 8400 EX from 768 cores to over 1800
> cores. The old system had an InfiniBand Mellanox Technologies MT26428
> card, which looks like this with ibstat:
>
> r1i0n0:~ # ibstat
> CA 'mlx4_0'
> CA type: MT26428
> Number of ports: 2
> Firmware version: 2.7.0
> Hardware version: b0
> Node GUID: 0x003048fffff09498
> System image GUID: 0x003048fffff0949b
> Port 1:
> State: Active
> Physical state: LinkUp
> Rate: 40
> Base lid: 53
> LMC: 0
> SM lid: 84
> Capability mask: 0x02510868
> Port GUID: 0x003048fffff09499
> Link layer: IB
> Port 2:
> State: Active
> Physical state: LinkUp
> Rate: 40
> Base lid: 282
> LMC: 0
> SM lid: 76
> Capability mask: 0x02510868
> Port GUID: 0x003048fffff0949a
> Link layer: IB
> r1i0n0:~ #
>
> The new cards have a different chipset and are supposed to be the same,
> but they look like this when we run ibstat:
> r2i0n0:~ # ibstat
> CA 'mlx4_0'
> CA type: MT26428
> Number of ports: 1
> Firmware version: 2.7.200
> Hardware version: b0
> Node GUID: 0x003048fffff4f18c
> System image GUID: 0x003048fffff4f18f
> Port 1:
> State: Active
> Physical state: LinkUp
> Rate: 40
> Base lid: 146
> LMC: 0
> SM lid: 84
> Capability mask: 0x02510868
> Port GUID: 0x003048fffff4f18d
> Link layer: IB
> CA 'mlx4_1'
> CA type: MT26428
> Number of ports: 1
> Firmware version: 2.7.200
> Hardware version: b0
> Node GUID: 0x003048fffff4f188
> System image GUID: 0x003048fffff4f18b
> Port 1:
> State: Active
> Physical state: LinkUp
> Rate: 40
> Base lid: 220
> LMC: 0
> SM lid: 76
> Capability mask: 0x02510868
> Port GUID: 0x003048fffff4f189
> Link layer: IB
> r2i0n0:~ #
>
> Instead of having one card (mlx4_0) with 2 ports, the new cards look like
> two cards with one port each (mlx4_0 and mlx4_1).
>
> I have been using mvapich2 1.5.1p1 for over a year, and anything compiled
> and run on only the old cards or only the new cards works, but jobs run on
> a combination of the two either fail in MPI_Init or run forever.
>
> I have tried the latest mvapich2 1.7 and mvapich2 1.8, and neither one
> seems to work. Software compiled with SGI's MPT or OpenMPI seems to work
> across the cards.
>
> I also tried forcing the MPI runs onto mlx4_0 port 1 with the following
> environment variables:
> setenv MV2_IBA_HCA mlx4_0
> setenv MV2_DEFAULT_PORT 1
>
> But it doesn't seem to work.
>
> Any ideas about what I could be doing wrong, or what I might try in order
> to fix this problem, would be greatly appreciated.
>
> thanks
> Mike
>
--
************************************************************************
Steve Heistand NASA Ames Research Center
SciCon Group Mail Stop 258-6
steve.heistand at nasa.gov (650) 604-4369 Moffett Field, CA 94035-1000
************************************************************************
"Any opinions expressed are those of our alien overlords, not my own."