[mvapich-discuss] perhaps odd behavior..
Hari Subramoni
subramoni.1 at osu.edu
Fri Jan 9 16:41:45 EST 2015
Thank you Steve. I will wait for the debug information from you.
Thx,
Hari.
On Fri, Jan 9, 2015 at 4:08 PM, Steve Heistand <steve.heistand at nasa.gov>
wrote:
> By default and on most systems I don't specify any HCA/port info;
> on the new systems I do give MV2_NUM_HCAS=2 & MV2_NUM_PORTS=1, yes.
>
> We have several thousand of the dual IB adapter machines all running MPT
> in dual rail mode just fine, so I doubt there is any bad hardware.
> Also, our primary network traffic is over one IB port and IO/Lustre
> traffic is always on the other IB port. That all works as well.
>
> But mvapich2 doesn't like them unless it is told the HCA/port info.
>
> working fine: ibv_devinfo
> hca_id: mlx4_0
> transport: InfiniBand (0)
> fw_ver: 2.31.5910
> node_guid: f452:1403:0022:8380
> sys_image_guid: f452:1403:0022:8383
> vendor_id: 0x02c9
> vendor_part_id: 4099
> hw_ver: 0x0
> board_id: SGI__2669_00X
> phys_port_cnt: 2
> port: 1
> state: PORT_ACTIVE (4)
> max_mtu: 4096 (5)
> active_mtu: 4096 (5)
> sm_lid: 1
> port_lid: 40927
> port_lmc: 0x00
> link_layer: InfiniBand
>
> port: 2
> state: PORT_ACTIVE (4)
> max_mtu: 4096 (5)
> active_mtu: 4096 (5)
> sm_lid: 20384
> port_lid: 40981
> port_lmc: 0x00
> link_layer: InfiniBand
>
> machines needing info: ibv_devinfo
> hca_id: mlx4_1
> transport: InfiniBand (0)
> fw_ver: 2.31.5910
> node_guid: f452:1403:005b:bd28
> sys_image_guid: f452:1403:005b:bd2b
> vendor_id: 0x02c9
> vendor_part_id: 4099
> hw_ver: 0x0
> board_id: SGI__2573_00X_1
> phys_port_cnt: 1
> port: 1
> state: PORT_ACTIVE (4)
> max_mtu: 4096 (5)
> active_mtu: 4096 (5)
> sm_lid: 20384
> port_lid: 40825
> port_lmc: 0x00
> link_layer: InfiniBand
>
> hca_id: mlx4_0
> transport: InfiniBand (0)
> fw_ver: 2.31.5910
> node_guid: f452:1403:005b:bd20
> sys_image_guid: f452:1403:005b:bd23
> vendor_id: 0x02c9
> vendor_part_id: 4099
> hw_ver: 0x0
> board_id: SGI__2573_00X_0
> phys_port_cnt: 1
> port: 1
> state: PORT_ACTIVE (4)
> max_mtu: 4096 (5)
> active_mtu: 4096 (5)
> sm_lid: 1
> port_lid: 40767
> port_lmc: 0x00
> link_layer: InfiniBand
>
>
> I will reconfigure the mvapich2 build and rerun shortly.
>
> steve
>
>
> On 01/09/2015 12:27 PM, Hari Subramoni wrote:
> > Hello Steve,
> >
> > By default MVAPICH2 will identify all available HCAs and use the first port
> > on these HCAs for communication. However, it will only use one port unless
> > the user explicitly states that MVAPICH2 should use more than one port by
> > setting the "MV2_NUM_PORTS" environment variable.
> >
> > From your e-mail, I'm assuming that things are running fine with
> > "MV2_NUM_HCAS=2 MV2_NUM_PORTS=1" - is this correct?
> >
> > At this point, I'm guessing that on the system where things are failing,
> > there is a bad HCA.
> >
> > Could you please give us the output of ibv_devinfo on the system where things
> > are passing and on the system where things are failing? Also, could you
> > please configure MVAPICH2 in debug mode (--enable-g=dbg --enable-fast=none)
> > and run it with "MV2_SHOW_ENV_INFO=1 MV2_DEBUG_SHOW_BACKTRACE=1" and send
> > us the output?
> >
> > Regards,
> > Hari.
> >
> > On Fri, Jan 9, 2015 at 3:02 PM, Steve Heistand <steve.heistand at nasa.gov>
> > wrote:
> >
> > so we have the latest mvapich build:
> >
> > MVAPICH2 2.1rc1 Thu Dec 18 20:00:00 EDT 2014 ch3:mrail
> >
> > Compilation
> > CC: icc -fpic -m64 -DNDEBUG -DNVALGRIND -O2
> > CXX: icpc -fpic -m64 -DNDEBUG -DNVALGRIND -O2
> > F77: ifort -L/lib -L/lib -m64 -fpic -O2
> > FC: ifort -m64 -fpic -O2
> >
> > Configuration
> > --with-device=ch3:mrail --with-rdma=gen2 CC=icc CXX=icpc F77=ifort
> > FC=ifort CFLAGS=-fpic
> > -m64 CXXFLAGS=-fpic -m64 FFLAGS=-m64 -fpic FCFLAGS=-m64 -fpic
> > --enable-f77 --enable-fc
> > --enable-cxx --enable-romio --enable-threads=default --with-hwloc
> > -disable-multi-aliases
> > -enable-xrc=no -enable-hybrid --prefix=XXX --with-file-system=lustre
> >
> > It was compiled on, and for the most part run on, machines that have 1 IB
> > card with dual ports. This is all fine so far.
> > However, when we run on a system that has dual cards, each with a single
> > port, the job dies at startup.
> >
> > If I tell it that the system is dual HCA, single port via environment
> > variables, it runs fine.
> >
> > I'm at this point unsure if it actually uses both ports on either
> > configuration.
> >
> > I would have thought it would have probed the hardware to figure out what
> > setup it had when it tried to bond to the multiple ports.
> >
> > Unless it's actually crashing in the probe section of the mpi_init
> > routines...
> >
> > thoughts?
> >
> > thanks
> >
> > s
> >
> >
> >> _______________________________________________
> >> mvapich-discuss mailing list
> >> mvapich-discuss at cse.ohio-state.edu
> >> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> >>
> >
>
> --
> ************************************************************************
> Steve Heistand NASA Ames Research Center
> SciCon Group Mail Stop 258-6
> steve.heistand at nasa.gov (650) 604-4369 Moffett Field, CA 94035-1000
> ************************************************************************
> "Any opinions expressed are those of our alien overlords, not my own."
>
> # For Remedy #
> #Action: Resolve #
> #Resolution: Resolved #
> #Reason: No Further Action Required #
> #Tier1: User Code #
> #Tier2: Other #
> #Tier3: Assistance #
> #Notification: None #
>
>
>
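For anyone comparing the two node types quoted above, here is a rough sketch of how the ibv_devinfo listings could be summarized into an HCA count and a ports-per-HCA count. The function names (parse_ibv_devinfo, active_layout) and the abbreviated sample strings are made up for illustration; only the field names (hca_id, port, state) come from the output in this thread. Whether the resulting numbers should be passed as MV2_NUM_HCAS / MV2_NUM_PORTS is an assumption based on what Steve reported working, not something confirmed here.

```python
import re

# Abbreviated copies of the two ibv_devinfo listings quoted in this thread.
SINGLE_HCA_DUAL_PORT = """
hca_id: mlx4_0
phys_port_cnt: 2
port: 1
state: PORT_ACTIVE (4)
port_lid: 40927
port: 2
state: PORT_ACTIVE (4)
port_lid: 40981
"""

DUAL_HCA_SINGLE_PORT = """
hca_id: mlx4_1
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
port_lid: 40825
hca_id: mlx4_0
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
port_lid: 40767
"""


def parse_ibv_devinfo(text):
    """Map each hca_id to a list of (port_number, state) tuples."""
    hcas = {}
    current_hca = None
    current_port = None
    for raw in text.splitlines():
        # Tolerate mail quoting ("> ") and leading indentation.
        line = raw.lstrip("> \t").strip()
        m = re.match(r"hca_id:\s*(\S+)", line)
        if m:
            current_hca = m.group(1)
            hcas[current_hca] = []
            current_port = None
            continue
        # "port:" only; lines such as "port_lid:" do not match.
        m = re.match(r"port:\s*(\d+)", line)
        if m and current_hca is not None:
            current_port = int(m.group(1))
            continue
        m = re.match(r"state:\s*(\S+)", line)
        if m and current_hca is not None and current_port is not None:
            hcas[current_hca].append((current_port, m.group(1)))
            current_port = None
    return hcas


def active_layout(hcas):
    """Return (HCAs with an active port, max active ports on any one HCA)."""
    active = {
        hca: [port for port, state in ports if state == "PORT_ACTIVE"]
        for hca, ports in hcas.items()
    }
    num_hcas = sum(1 for ports in active.values() if ports)
    max_ports = max((len(ports) for ports in active.values()), default=0)
    return num_hcas, max_ports


if __name__ == "__main__":
    for name, text in [("single HCA, dual port", SINGLE_HCA_DUAL_PORT),
                       ("dual HCA, single port", DUAL_HCA_SINGLE_PORT)]:
        print(name, active_layout(parse_ibv_devinfo(text)))
```

On the dual-card nodes this reports two HCAs with one active port each, which lines up with the MV2_NUM_HCAS=2 and MV2_NUM_PORTS=1 settings mentioned above. It says nothing about why startup fails without them.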