[mvapich-discuss] perhaps odd behavior..
Steve Heistand
steve.heistand at nasa.gov
Fri Jan 9 16:08:54 EST 2015
By default, and on most systems, I don't specify any HCA/port info.
On the new systems I do set MV2_NUM_HCAS=2 and MV2_NUM_PORTS=1, yes.
We have several thousand of the dual IB adapter machines, all running MPT
in dual-rail mode just fine, so I doubt there is any bad hardware.
Also, our primary network traffic goes over one IB port and IO/Lustre
traffic is always on the other IB port. That all works as well.
But mvapich2 doesn't like them unless it is told the HCA/port info.
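
To make the workaround concrete, here is a minimal Python sketch of how the two variables could be set for a launch. The helper names `rail_env` and `launch` are made up for illustration, the `mpiexec -n 4 ./app` command in the comment is a placeholder rather than the actual job, and the sketch assumes the launcher forwards the caller's environment to the ranks.

```python
import os
import subprocess


def rail_env(num_hcas, num_ports, base=None):
    """Return a copy of the environment with the MVAPICH2 rail settings added.

    rail_env is a hypothetical helper; only the two MV2_* variable names
    come from this thread.
    """
    env = dict(os.environ if base is None else base)
    env["MV2_NUM_HCAS"] = str(num_hcas)
    env["MV2_NUM_PORTS"] = str(num_ports)
    return env


def launch(cmd, env):
    """Run an MPI launch command with the given environment (not called here)."""
    return subprocess.run(cmd, env=env, check=True)


# Dual-HCA, single-port nodes: the setting reported to work in this thread.
dual_hca_env = rail_env(num_hcas=2, num_ports=1, base={})
# Placeholder usage: launch(["mpiexec", "-n", "4", "./app"], rail_env(2, 1))
```

Keeping the settings in one helper makes it easy to switch between node types without editing job scripts by hand.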
Working fine: ibv_devinfo
hca_id: mlx4_0
    transport: InfiniBand (0)
    fw_ver: 2.31.5910
    node_guid: f452:1403:0022:8380
    sys_image_guid: f452:1403:0022:8383
    vendor_id: 0x02c9
    vendor_part_id: 4099
    hw_ver: 0x0
    board_id: SGI__2669_00X
    phys_port_cnt: 2
        port: 1
            state: PORT_ACTIVE (4)
            max_mtu: 4096 (5)
            active_mtu: 4096 (5)
            sm_lid: 1
            port_lid: 40927
            port_lmc: 0x00
            link_layer: InfiniBand
        port: 2
            state: PORT_ACTIVE (4)
            max_mtu: 4096 (5)
            active_mtu: 4096 (5)
            sm_lid: 20384
            port_lid: 40981
            port_lmc: 0x00
            link_layer: InfiniBand
Machines needing info: ibv_devinfo
hca_id: mlx4_1
    transport: InfiniBand (0)
    fw_ver: 2.31.5910
    node_guid: f452:1403:005b:bd28
    sys_image_guid: f452:1403:005b:bd2b
    vendor_id: 0x02c9
    vendor_part_id: 4099
    hw_ver: 0x0
    board_id: SGI__2573_00X_1
    phys_port_cnt: 1
        port: 1
            state: PORT_ACTIVE (4)
            max_mtu: 4096 (5)
            active_mtu: 4096 (5)
            sm_lid: 20384
            port_lid: 40825
            port_lmc: 0x00
            link_layer: InfiniBand
hca_id: mlx4_0
    transport: InfiniBand (0)
    fw_ver: 2.31.5910
    node_guid: f452:1403:005b:bd20
    sys_image_guid: f452:1403:005b:bd23
    vendor_id: 0x02c9
    vendor_part_id: 4099
    hw_ver: 0x0
    board_id: SGI__2573_00X_0
    phys_port_cnt: 1
        port: 1
            state: PORT_ACTIVE (4)
            max_mtu: 4096 (5)
            active_mtu: 4096 (5)
            sm_lid: 1
            port_lid: 40767
            port_lmc: 0x00
            link_layer: InfiniBand
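
The two layouts above can be told apart mechanically. The following is a rough Python sketch, not anything MVAPICH2 does internally: it parses ibv_devinfo-style text, lists the active ports per HCA, and derives candidate MV2_NUM_HCAS / MV2_NUM_PORTS values. The function names and the abbreviated sample are illustrative, and the rule "ports = smallest active-port count across HCAs" is an assumption chosen so that the dual-HCA case yields the 2/1 setting reported to work here.

```python
# Abbreviated copy of the dual-HCA listing from this message.
DUAL_HCA_SAMPLE = """\
hca_id: mlx4_1
    node_guid: f452:1403:005b:bd28
    phys_port_cnt: 1
        port: 1
            state: PORT_ACTIVE (4)
            port_lid: 40825
hca_id: mlx4_0
    node_guid: f452:1403:005b:bd20
    phys_port_cnt: 1
        port: 1
            state: PORT_ACTIVE (4)
            port_lid: 40767
"""


def parse_ibv_devinfo(text):
    """Return {hca_id: [port numbers whose state is PORT_ACTIVE]}."""
    hcas = {}
    current_hca = None
    current_port = None
    for raw in text.splitlines():
        # Split on the first colon only, since GUID values contain colons.
        key, sep, value = raw.strip().partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "hca_id":
            current_hca = value
            hcas[current_hca] = []
            current_port = None
        elif key == "port" and current_hca is not None:
            current_port = int(value)
        elif key == "state" and current_port is not None:
            if value.startswith("PORT_ACTIVE"):
                hcas[current_hca].append(current_port)
    return hcas


def suggest_mv2_env(active_ports):
    """Derive candidate settings (an assumed heuristic, see lead-in)."""
    usable = {hca: ports for hca, ports in active_ports.items() if ports}
    if not usable:
        raise ValueError("no HCA with an active port found")
    return {
        "MV2_NUM_HCAS": str(len(usable)),
        "MV2_NUM_PORTS": str(min(len(p) for p in usable.values())),
    }


suggested = suggest_mv2_env(parse_ibv_devinfo(DUAL_HCA_SAMPLE))
```

For the single-HCA, dual-port listing the same heuristic would suggest one HCA and two ports; whether two ports is what one wants there is a separate question, since those machines already run without any setting.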
I will reconfigure the mvapich2 build and rerun shortly.
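
For the record, a small Python sketch of what the debug rebuild and rerun requested in the quoted message amount to. The configure flags and the two MV2_* debug variables are taken from that message; the helper names and the shortened argument list in the example are illustrative only.

```python
import shlex

# Flags and variables as requested in the quoted reply.
DEBUG_CONFIGURE_FLAGS = ["--enable-g=dbg", "--enable-fast=none"]
DEBUG_RUN_ENV = {"MV2_SHOW_ENV_INFO": "1", "MV2_DEBUG_SHOW_BACKTRACE": "1"}


def debug_configure_cmd(existing_args):
    """Build a ./configure argument list with the debug flags appended.

    Any earlier --enable-g= / --enable-fast= settings are dropped so the
    debug values are the only ones present.
    """
    kept = [
        a for a in existing_args
        if not a.startswith(("--enable-g=", "--enable-fast="))
    ]
    return ["./configure"] + kept + DEBUG_CONFIGURE_FLAGS


def env_prefix(env):
    """Render VAR=value pairs suitable for prefixing a shell command."""
    return " ".join(f"{k}={shlex.quote(v)}" for k, v in env.items())


# Shortened, illustrative argument list (the real build uses more options).
cmd = debug_configure_cmd(["--with-device=ch3:mrail", "--with-rdma=gen2"])
prefix = env_prefix(DEBUG_RUN_ENV)
```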
steve
On 01/09/2015 12:27 PM, Hari Subramoni wrote:
> Hello Steve,
>
> By default MVAPICH2 will identify all available HCAs and use the first port
> on these HCAs for communication. However, it will only use one port unless
> the user explicitly states that MVAPICH2 should use more than one port by
> setting the "MV2_NUM_PORTS" environment variable.
>
> From your e-mail, I'm assuming that things are running fine with
> "MV2_NUM_HCAS=2 MV2_NUM_PORTS=1" - is this correct?
>
> At this point, I'm guessing that on the system where things are failing,
> there is a bad HCA.
>
> Could you please give us the output of ibv_devinfo on the system where things
> are passing and on the system where things are failing? Also, could you
> please configure MVAPICH2 in debug mode (--enable-g=dbg --enable-fast=none)
> and run it with "MV2_SHOW_ENV_INFO=1 MV2_DEBUG_SHOW_BACKTRACE=1" and send
> us the output?
>
> Regards,
> Hari.
>
> On Fri, Jan 9, 2015 at 3:02 PM, Steve Heistand <steve.heistand at nasa.gov>
> wrote:
>
> so we have the latest mvapich build:
>
> MVAPICH2 2.1rc1 Thu Dec 18 20:00:00 EDT 2014 ch3:mrail
>
> Compilation
> CC: icc -fpic -m64 -DNDEBUG -DNVALGRIND -O2
> CXX: icpc -fpic -m64 -DNDEBUG -DNVALGRIND -O2
> F77: ifort -L/lib -L/lib -m64 -fpic -O2
> FC: ifort -m64 -fpic -O2
>
> Configuration
> --with-device=ch3:mrail --with-rdma=gen2 CC=icc CXX=icpc F77=ifort FC=ifort
> CFLAGS=-fpic -m64 CXXFLAGS=-fpic -m64 FFLAGS=-m64 -fpic FCFLAGS=-m64 -fpic
> --enable-f77 --enable-fc --enable-cxx --enable-romio --enable-threads=default
> --with-hwloc -disable-multi-aliases -enable-xrc=no -enable-hybrid
> --prefix=XXX --with-file-system=lustre
>
> it was compiled on, and for the most part run on, machines that have 1 IB
> card with dual ports. This is all fine so far.
> However, when we run on a system that has dual cards, each with a single
> port, the job dies at startup.
>
> If I tell it via environment variables that the system is dual HCA,
> single port, it runs fine.
>
> I'm at this point unsure if it actually uses both ports on either
> configuration.
>
> I would have thought it would have probed the hardware to figure out what
> setup it had when it tried to bond to the multiple ports.
>
> unless it's actually crashing in the probe section of the mpi_init
> routines...
>
> thoughts?
>
> thanks
>
> s
>
>
--
************************************************************************
Steve Heistand NASA Ames Research Center
SciCon Group Mail Stop 258-6
steve.heistand at nasa.gov (650) 604-4369 Moffett Field, CA 94035-1000
************************************************************************
"Any opinions expressed are those of our alien overlords, not my own."