[mvapich-discuss] MPI over hybrid infiniband cards
Mehmet
mbelgin at gmail.com
Fri Apr 13 18:00:30 EDT 2012
Did you guys try mpiexec 0.84 by OSC (
http://www.osc.edu/~djohnson/mpiexec/index.php) instead? In our experience,
this version can run some codes that hang with mpiexec.hydra.
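In case it helps while you try different launchers, here is a minimal sketch (the rank count, hostfile, and binary name are placeholders, not from this thread) that combines the two workarounds discussed below: pinning MVAPICH2 to a single HCA/port with Mike's MV2_IBA_HCA / MV2_DEFAULT_PORT variables, and launching with mpirun_rsh rather than mpiexec.hydra as Steve suggests. mpirun_rsh also accepts VAR=value pairs on its command line and forwards them to all ranks:

```shell
# Hedged sketch, not a verified fix: pin MVAPICH2 to one HCA/port so the
# dual-HCA nodes (mlx4_0 + mlx4_1) look like the old single-card nodes.
# "./a.out", "-np 1800", and "./hosts" are placeholders.

MV2_IBA_HCA=mlx4_0       # use only the first HCA on the new nodes
MV2_DEFAULT_PORT=1       # and only its first port
export MV2_IBA_HCA MV2_DEFAULT_PORT
echo "pinned: HCA=$MV2_IBA_HCA port=$MV2_DEFAULT_PORT"

# The launch itself would then look something like:
#   mpirun_rsh -np 1800 -hostfile ./hosts \
#       MV2_IBA_HCA=mlx4_0 MV2_DEFAULT_PORT=1 ./a.out
```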
-Mehmet
On Fri, Apr 13, 2012 at 5:26 PM, Steve Heistand <steve.heistand at nasa.gov> wrote:
> it may be unrelated, but we have found that, at least on our cluster,
> running over ~1000 cores will make codes hang if they are started with
> mpiexec. mpirun_rsh works fine though.
>
> steve
>
> On 04/13/2012 01:41 PM, MICHAEL S DAVIS wrote:
> > Hello,
> >
> > We have just upgraded our SGI ICE 8400 EX from 768 cores to over 1800
> > cores. The old system had an InfiniBand Mellanox Technologies MT26428
> > card, which looks like this with ibstat:
> >
> > r1i0n0:~ # ibstat
> > CA 'mlx4_0'
> > CA type: MT26428
> > Number of ports: 2
> > Firmware version: 2.7.0
> > Hardware version: b0
> > Node GUID: 0x003048fffff09498
> > System image GUID: 0x003048fffff0949b
> > Port 1:
> > State: Active
> > Physical state: LinkUp
> > Rate: 40
> > Base lid: 53
> > LMC: 0
> > SM lid: 84
> > Capability mask: 0x02510868
> > Port GUID: 0x003048fffff09499
> > Link layer: IB
> > Port 2:
> > State: Active
> > Physical state: LinkUp
> > Rate: 40
> > Base lid: 282
> > LMC: 0
> > SM lid: 76
> > Capability mask: 0x02510868
> > Port GUID: 0x003048fffff0949a
> > Link layer: IB
> > r1i0n0:~ #
> >
> > The new cards have a different chipset and are supposed to be the same,
> > but look like this when we run ibstat:
> > r2i0n0:~ # ibstat
> > CA 'mlx4_0'
> > CA type: MT26428
> > Number of ports: 1
> > Firmware version: 2.7.200
> > Hardware version: b0
> > Node GUID: 0x003048fffff4f18c
> > System image GUID: 0x003048fffff4f18f
> > Port 1:
> > State: Active
> > Physical state: LinkUp
> > Rate: 40
> > Base lid: 146
> > LMC: 0
> > SM lid: 84
> > Capability mask: 0x02510868
> > Port GUID: 0x003048fffff4f18d
> > Link layer: IB
> > CA 'mlx4_1'
> > CA type: MT26428
> > Number of ports: 1
> > Firmware version: 2.7.200
> > Hardware version: b0
> > Node GUID: 0x003048fffff4f188
> > System image GUID: 0x003048fffff4f18b
> > Port 1:
> > State: Active
> > Physical state: LinkUp
> > Rate: 40
> > Base lid: 220
> > LMC: 0
> > SM lid: 76
> > Capability mask: 0x02510868
> > Port GUID: 0x003048fffff4f189
> > Link layer: IB
> > r2i0n0:~ #
> >
> > Instead of having one card (mlx4_0) with 2 ports, the new cards look
> > like two cards with one port each (mlx4_0 and mlx4_1).
> >
> > I have been using mvapich2 1.5.1p1 for over a year, and anything compiled
> > and run entirely on the old cards or entirely on the new cards works, but
> > jobs that run on a combination of the two either fail with MPI_INIT failed
> > or run forever.
> >
> > I have also tried the latest mvapich2 1.7 and mvapich2 1.8, and neither
> > one seems to work. Software compiled with SGI's MPT or openmpi seems to
> > work across the cards.
> >
> > I also tried forcing the MPI runs onto mlx4_0 port 1 with the following
> > environment variables:
> > setenv MV2_IBA_HCA mlx4_0
> > setenv MV2_DEFAULT_PORT 1
> >
> > But that doesn't seem to work either.
> >
> > Any ideas what I could be doing wrong or what I might try to fix this
> > problem would be greatly appreciated.
> >
> > thanks
> > Mike
> >
> > _______________________________________________
> > mvapich-discuss mailing list
> > mvapich-discuss at cse.ohio-state.edu
> > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>
> --
> ************************************************************************
> Steve Heistand NASA Ames Research Center
> SciCon Group Mail Stop 258-6
> steve.heistand at nasa.gov (650) 604-4369 Moffett Field, CA 94035-1000
> ************************************************************************
> "Any opinions expressed are those of our alien overlords, not my own."
>
>
--
=========================================
Mehmet Belgin, Ph.D. (mehmet.belgin at oit.gatech.edu)
Scientific Computing Consultant | OIT - Academic and Research Technologies
Georgia Institute of Technology
258 Fourth Street, Rich Building, Room 326
Atlanta, GA 30332-0700
Office: (404) 385-0665