[mvapich-discuss] perhaps odd behavior..

Hari Subramoni subramoni.1 at osu.edu
Fri Jan 9 16:41:45 EST 2015


Thank you Steve. I will wait for the debug information from you.

Thx,
Hari.

On Fri, Jan 9, 2015 at 4:08 PM, Steve Heistand <steve.heistand at nasa.gov>
wrote:

> by default and on most systems I don't specify any HCA/PORT info,
> on the new systems I do give MV2_NUM_HCAS=2 & MV2_NUM_PORTS=1 yes.
>
> we have several thousand of the dual IB adapter machines all running MPT
> in dual rail mode just fine. so I doubt any bad hardware.
> also our primary network traffic is over one IB port and IO/lustre
> traffic is always on the other IB port. that all works as well.
>
> but mvapich2 doesn't like them unless I give it the hca/port info.
>
> working fine: ibv_devinfo
> hca_id: mlx4_0
>         transport:                      InfiniBand (0)
>         fw_ver:                         2.31.5910
>         node_guid:                      f452:1403:0022:8380
>         sys_image_guid:                 f452:1403:0022:8383
>         vendor_id:                      0x02c9
>         vendor_part_id:                 4099
>         hw_ver:                         0x0
>         board_id:                       SGI__2669_00X
>         phys_port_cnt:                  2
>                 port:   1
>                         state:                  PORT_ACTIVE (4)
>                         max_mtu:                4096 (5)
>                         active_mtu:             4096 (5)
>                         sm_lid:                 1
>                         port_lid:               40927
>                         port_lmc:               0x00
>                         link_layer:             InfiniBand
>
>                 port:   2
>                         state:                  PORT_ACTIVE (4)
>                         max_mtu:                4096 (5)
>                         active_mtu:             4096 (5)
>                         sm_lid:                 20384
>                         port_lid:               40981
>                         port_lmc:               0x00
>                         link_layer:             InfiniBand
>
> machines needing info: ibv_devinfo
> hca_id: mlx4_1
>         transport:                      InfiniBand (0)
>         fw_ver:                         2.31.5910
>         node_guid:                      f452:1403:005b:bd28
>         sys_image_guid:                 f452:1403:005b:bd2b
>         vendor_id:                      0x02c9
>         vendor_part_id:                 4099
>         hw_ver:                         0x0
>         board_id:                       SGI__2573_00X_1
>         phys_port_cnt:                  1
>                 port:   1
>                         state:                  PORT_ACTIVE (4)
>                         max_mtu:                4096 (5)
>                         active_mtu:             4096 (5)
>                         sm_lid:                 20384
>                         port_lid:               40825
>                         port_lmc:               0x00
>                         link_layer:             InfiniBand
>
> hca_id: mlx4_0
>         transport:                      InfiniBand (0)
>         fw_ver:                         2.31.5910
>         node_guid:                      f452:1403:005b:bd20
>         sys_image_guid:                 f452:1403:005b:bd23
>         vendor_id:                      0x02c9
>         vendor_part_id:                 4099
>         hw_ver:                         0x0
>         board_id:                       SGI__2573_00X_0
>         phys_port_cnt:                  1
>                 port:   1
>                         state:                  PORT_ACTIVE (4)
>                         max_mtu:                4096 (5)
>                         active_mtu:             4096 (5)
>                         sm_lid:                 1
>                         port_lid:               40767
>                         port_lmc:               0x00
>                         link_layer:             InfiniBand
>
>
> I will reconfigure the mvapich2 build and rerun shortly.
>
> steve
>
>
> On 01/09/2015 12:27 PM, Hari Subramoni wrote:
> > Hello Steve,
> >
> > By default MVAPICH2 will identify all available HCAs and use the first port
> > on these HCAs for communication. However, it will only use one port unless
> > the user explicitly states that MVAPICH2 should use more than one port by
> > setting the "MV2_NUM_PORTS" environment variable.
> >
> > From your e-mail, I'm assuming that things are running fine with
> > "MV2_NUM_HCAS=2 MV2_NUM_PORTS=1" - is this correct?
> >
> > At this point, I'm guessing that on the system where things are failing,
> > there is a bad HCA.
> >
> > Could you please give us the output of ibv_devinfo on the system where things
> > are passing and on the system where things are failing? Also, could you
> > please configure MVAPICH2 in debug mode (--enable-g=dbg --enable-fast=none)
> > and run it with "MV2_SHOW_ENV_INFO=1 MV2_DEBUG_SHOW_BACKTRACE=1" and send
> > us the output?
> >
> > Regards,
> > Hari.
> >
> > On Fri, Jan 9, 2015 at 3:02 PM, Steve Heistand <steve.heistand at nasa.gov>
> > wrote:
> >
> > so we have the latest mvapich build:
> >
> > MVAPICH2 2.1rc1 Thu Dec 18 20:00:00 EDT 2014 ch3:mrail
> >
> > Compilation
> > CC: icc -fpic -m64   -DNDEBUG -DNVALGRIND -O2
> > CXX: icpc -fpic -m64  -DNDEBUG -DNVALGRIND -O2
> > F77: ifort -L/lib -L/lib -m64 -fpic  -O2
> > FC: ifort -m64 -fpic  -O2
> >
> > Configuration
> > --with-device=ch3:mrail --with-rdma=gen2 CC=icc CXX=icpc F77=ifort
> > FC=ifort CFLAGS=-fpic
> > -m64 CXXFLAGS=-fpic -m64 FFLAGS=-m64 -fpic FCFLAGS=-m64 -fpic
> > --enable-f77 --enable-fc
> > --enable-cxx --enable-romio --enable-threads=default --with-hwloc
> > -disable-multi-aliases
> > -enable-xrc=no -enable-hybrid --prefix=XXX --with-file-system=lustre
> >
> > it was compiled on, and for the most part run on, machines that have 1 IB
> > card with dual ports. This is all fine so far.
> > However when we run on a system that has dual cards each with a single
> > port the job dies at startup.
> >
> > If I tell it that the system is dual hca single port via environment
> > variables it runs fine.
> >
> > I'm at this point unsure if it actually uses both ports on either
> > configuration.
> >
> > I would have thought it would have probed the hardware to figure out what
> > setup it had when it tried to bond to the multiple ports.
> >
> > unless it's actually crashing in the probe section of the mpi_init
> > routines...
> >
> > thoughts?
> >
> > thanks
> >
> > s
> >
> >
> >> _______________________________________________
> >> mvapich-discuss mailing list
> >> mvapich-discuss at cse.ohio-state.edu
> >> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> >>
> >
>
> --
> ************************************************************************
>  Steve Heistand                           NASA Ames Research Center
>  SciCon Group                             Mail Stop 258-6
>  steve.heistand at nasa.gov  (650) 604-4369  Moffett Field, CA 94035-1000
> ************************************************************************
>  "Any opinions expressed are those of our alien overlords, not my own."
>
>
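
When comparing the two ibv_devinfo listings quoted above, it can help to reduce each one to "which HCAs, how many ports each, what state". The following is a minimal Python sketch of that, not anything shipped with MVAPICH2 or OFED. The function names summarize_hcas and workaround_env are hypothetical, the parser assumes only the "hca_id:" and "state:" lines in the layout shown in this thread, and the suggested variables simply mirror the workaround Steve reported (MV2_NUM_HCAS set to the HCA count, MV2_NUM_PORTS=1); they are not an official recommendation.

```python
import re


def summarize_hcas(devinfo_text):
    """Map each hca_id to the list of its port states, in listing order.

    Assumes ibv_devinfo-style text as quoted in this thread; leading
    mail-quote markers ("> ") and indentation are stripped first.
    """
    hcas = {}
    current = None
    for raw in devinfo_text.splitlines():
        line = raw.lstrip("> \t").rstrip()
        m = re.match(r"hca_id:\s*(\S+)", line)
        if m:
            current = m.group(1)
            hcas[current] = []
            continue
        m = re.match(r"state:\s*(\S+)", line)
        if m and current is not None:
            hcas[current].append(m.group(1))
    return hcas


def workaround_env(hcas):
    """Return the environment settings reported to work in this thread.

    Assumption: only the multi-HCA layout needed explicit settings; the
    single-HCA, dual-port machines ran with no HCA/port variables set.
    """
    if len(hcas) > 1:
        return {"MV2_NUM_HCAS": str(len(hcas)), "MV2_NUM_PORTS": "1"}
    return {}


# Abbreviated copy of the "machines needing info" listing from this thread.
SAMPLE = """\
> hca_id: mlx4_1
>         phys_port_cnt:                  1
>                 port:   1
>                         state:                  PORT_ACTIVE (4)
> hca_id: mlx4_0
>         phys_port_cnt:                  1
>                 port:   1
>                         state:                  PORT_ACTIVE (4)
"""

if __name__ == "__main__":
    found = summarize_hcas(SAMPLE)
    print(found)
    print(workaround_env(found))
```

Feeding it the full output of ibv_devinfo from a failing node and from a working node gives a quick side-by-side of the two hardware layouts before digging into the debug-build output.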