[mvapich-discuss] MPI over hybrid infiniband cards
MICHAEL S DAVIS
msdavis at s383.jpl.nasa.gov
Fri Apr 13 16:41:47 EDT 2012
Hello,
We have just upgraded our SGI ICE 8400 EX from 768 cores to over 1800
cores. The old system had a Mellanox Technologies MT26428 InfiniBand
card, which looks like this with ibstat:
r1i0n0:~ # ibstat
CA 'mlx4_0'
    CA type: MT26428
    Number of ports: 2
    Firmware version: 2.7.0
    Hardware version: b0
    Node GUID: 0x003048fffff09498
    System image GUID: 0x003048fffff0949b
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 40
        Base lid: 53
        LMC: 0
        SM lid: 84
        Capability mask: 0x02510868
        Port GUID: 0x003048fffff09499
        Link layer: IB
    Port 2:
        State: Active
        Physical state: LinkUp
        Rate: 40
        Base lid: 282
        LMC: 0
        SM lid: 76
        Capability mask: 0x02510868
        Port GUID: 0x003048fffff0949a
        Link layer: IB
r1i0n0:~ #
The new cards have a different chipset and are supposed to be the same,
but they look like this when we run ibstat:
r2i0n0:~ # ibstat
CA 'mlx4_0'
    CA type: MT26428
    Number of ports: 1
    Firmware version: 2.7.200
    Hardware version: b0
    Node GUID: 0x003048fffff4f18c
    System image GUID: 0x003048fffff4f18f
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 40
        Base lid: 146
        LMC: 0
        SM lid: 84
        Capability mask: 0x02510868
        Port GUID: 0x003048fffff4f18d
        Link layer: IB
CA 'mlx4_1'
    CA type: MT26428
    Number of ports: 1
    Firmware version: 2.7.200
    Hardware version: b0
    Node GUID: 0x003048fffff4f188
    System image GUID: 0x003048fffff4f18b
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 40
        Base lid: 220
        LMC: 0
        SM lid: 76
        Capability mask: 0x02510868
        Port GUID: 0x003048fffff4f189
        Link layer: IB
r2i0n0:~ #
Instead of one card (mlx4_0) with two ports, the new cards look like
two cards with one port each (mlx4_0 and mlx4_1).
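A quick way to compare the two layouts side by side (a rough sketch,
assuming tcsh and passwordless ssh to the blades; ibstat -l just lists
the CA names on each node) is:

foreach node (r1i0n0 r2i0n0)
    echo $node
    ssh $node ibstat -l
end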
I have been using mvapich2 1.5.1p1 for over a year, and anything compiled
and run entirely on the old cards or entirely on the new cards works, but
jobs that run on a combination of the two either fail in MPI_Init or run
forever.
I have also tried the latest mvapich2 1.7 and mvapich2 1.8, and neither
one seems to work. Software compiled with SGI's MPT or Open MPI does seem
to work across the cards.
I also tried forcing the MPI runs onto mlx4_0 port 1 with the following
environment variables:
setenv MV2_IBA_HCA mlx4_0
setenv MV2_DEFAULT_PORT 1
But that doesn't seem to work either.
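(In case it matters: with the mpirun_rsh launcher the MV2 parameters have
to be passed on the command line to reach the remote processes, so the
equivalent run looks roughly like the sketch below, where the hostfile
and binary names are just placeholders.)

mpirun_rsh -np 16 -hostfile ./hosts MV2_IBA_HCA=mlx4_0 MV2_DEFAULT_PORT=1 ./a.out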
Any ideas about what I could be doing wrong, or suggestions for what I
might try to fix this problem, would be greatly appreciated.
thanks
Mike