[mvapich-discuss] MPI over hybrid infiniband cards

Mehmet mbelgin at gmail.com
Fri Apr 13 18:00:30 EDT 2012


Did you guys try mpiexec 0.84 from OSC (
http://www.osc.edu/~djohnson/mpiexec/index.php) instead? In our experience,
that version can run some codes that hang with mpiexec.hydra.
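
A minimal sketch of the two launch lines being compared, assuming a PBS job
script; the install path, process count, and binary name are illustrative
only, not taken from anyone's actual setup:

    # OSC mpiexec reads the node allocation directly from PBS
    /opt/osc-mpiexec/bin/mpiexec -n 384 ./my_mpi_app

    # roughly equivalent mpiexec.hydra launch, the kind that hangs for some codes
    mpiexec.hydra -n 384 -f $PBS_NODEFILE ./my_mpi_app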

-Mehmet


On Fri, Apr 13, 2012 at 5:26 PM, Steve Heistand <steve.heistand at nasa.gov> wrote:

> it may be unrelated, but we have found that, at least on our cluster, running
> over ~1000 cores will make codes hang if they are started with mpiexec.
> mpirun_rsh works fine though.
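>
> A minimal sketch of the two launchers being compared, assuming SSH startup
> and a plain hostfile; the process count, hostfile, and binary name are
> illustrative only:
>
>     # mpirun_rsh, which works for us at scale
>     mpirun_rsh -ssh -np 1024 -hostfile ./hosts ./my_mpi_app
>
>     # mpiexec (hydra) launch of the kind that hangs past ~1000 cores
>     mpiexec -n 1024 -f ./hosts ./my_mpi_app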
>
> steve
>
> On 04/13/2012 01:41 PM, MICHAEL S DAVIS wrote:
> > Hello,
> >
> > We have just upgraded our SGI ICE 8400 EX from 768 cores to over 1800
> > cores.  The old system had an InfiniBand Mellanox Technologies MT26428
> > card, which looks like this with ibstat:
> >
> > r1i0n0:~ # ibstat
> > CA 'mlx4_0'
> >         CA type: MT26428
> >         Number of ports: 2
> >         Firmware version: 2.7.0
> >         Hardware version: b0
> >         Node GUID: 0x003048fffff09498
> >         System image GUID: 0x003048fffff0949b
> >         Port 1:
> >                 State: Active
> >                 Physical state: LinkUp
> >                 Rate: 40
> >                 Base lid: 53
> >                 LMC: 0
> >                 SM lid: 84
> >                 Capability mask: 0x02510868
> >                 Port GUID: 0x003048fffff09499
> >                 Link layer: IB
> >         Port 2:
> >                 State: Active
> >                 Physical state: LinkUp
> >                 Rate: 40
> >                 Base lid: 282
> >                 LMC: 0
> >                 SM lid: 76
> >                 Capability mask: 0x02510868
> >                 Port GUID: 0x003048fffff0949a
> >                 Link layer: IB
> > r1i0n0:~ #
> >
> > The new cards have a different chipset and are supposed to be the same,
> > but they look like this when we run ibstat:
> > r2i0n0:~ # ibstat
> > CA 'mlx4_0'
> >         CA type: MT26428
> >         Number of ports: 1
> >         Firmware version: 2.7.200
> >         Hardware version: b0
> >         Node GUID: 0x003048fffff4f18c
> >         System image GUID: 0x003048fffff4f18f
> >         Port 1:
> >                 State: Active
> >                 Physical state: LinkUp
> >                 Rate: 40
> >                 Base lid: 146
> >                 LMC: 0
> >                 SM lid: 84
> >                 Capability mask: 0x02510868
> >                 Port GUID: 0x003048fffff4f18d
> >                 Link layer: IB
> > CA 'mlx4_1'
> >         CA type: MT26428
> >         Number of ports: 1
> >         Firmware version: 2.7.200
> >         Hardware version: b0
> >         Node GUID: 0x003048fffff4f188
> >         System image GUID: 0x003048fffff4f18b
> >         Port 1:
> >                 State: Active
> >                 Physical state: LinkUp
> >                 Rate: 40
> >                 Base lid: 220
> >                 LMC: 0
> >                 SM lid: 76
> >                 Capability mask: 0x02510868
> >                 Port GUID: 0x003048fffff4f189
> >                 Link layer: IB
> > r2i0n0:~ #
> >
> > Instead of having one card (mlx4_0) with two ports, the new cards look like
> > two cards with one port each (mlx4_0 and mlx4_1).
> >
> > I have been using mvapich2 1.5.1p1 for over a year, and anything compiled
> > and run entirely on the old cards or entirely on the new cards works, but
> > jobs that run on a combination of the two either fail with an MPI_Init
> > error or run forever.
> >
> > I have tried the latest mvapich2 1.7 and mvapich2 1.8, and neither one
> > seems to work.  Software compiled with SGI's MPT or OpenMPI seems to work
> > across the cards.
> >
> > I also tried forcing the MPI runs onto mlx4_0 port 1 with the following
> > environment variables:
> > setenv          MV2_IBA_HCA         mlx4_0
> > setenv          MV2_DEFAULT_PORT    1
> >
> > But it doesn't seem to work.
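> >
> > (For reference, a minimal sketch of the same settings passed on an
> > mpirun_rsh command line, which accepts VAR=VALUE pairs before the binary;
> > the hostfile, process count, and binary name here are illustrative only:)
> >
> > mpirun_rsh -ssh -np 64 -hostfile ./hosts \
> >         MV2_IBA_HCA=mlx4_0 MV2_DEFAULT_PORT=1 ./my_mpi_app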
> >
> > Any ideas about what I could be doing wrong, or what I might try in order
> > to fix this problem, would be greatly appreciated.
> >
> > thanks
> > Mike
> >
>
> --
> ************************************************************************
>  Steve Heistand                           NASA Ames Research Center
>  SciCon Group                             Mail Stop 258-6
>  steve.heistand at nasa.gov  (650) 604-4369  Moffett Field, CA 94035-1000
> ************************************************************************
>  "Any opinions expressed are those of our alien overlords, not my own."
>
>
>
>


-- 
=========================================
Mehmet Belgin, Ph.D. (mehmet.belgin at oit.gatech.edu)
Scientific Computing Consultant | OIT - Academic and Research Technologies
Georgia Institute of Technology
258 Fourth Street, Rich Building, Room 326
Atlanta, GA  30332-0700
Office: (404) 385-0665