[mvapich-discuss] perhaps odd behavior..

Steve Heistand steve.heistand at nasa.gov
Fri Jan 9 16:08:54 EST 2015


by default and on most systems I don't specify any HCA/PORT info;
on the new systems I do give MV2_NUM_HCAS=2 & MV2_NUM_PORTS=1, yes.
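
For reference, a minimal sketch of how the MV2_NUM_HCAS=2 and MV2_NUM_PORTS=1
settings might be passed at launch. It assumes the mpirun_rsh launcher, which
takes VAR=value pairs on its command line; the process count, the hostfile
name "hosts", the program "./a.out", and the helper name build_launch_command
are placeholders, not details from this thread.

```python
def build_launch_command(nprocs, hostfile, program, mv2_env):
    """Build an mpirun_rsh argument list with MV2_* settings on the command line.

    Assumption: the job is started with mpirun_rsh, which accepts VAR=value
    pairs placed between the launcher options and the program name.
    """
    cmd = ["mpirun_rsh", "-np", str(nprocs), "-hostfile", hostfile]
    cmd += [f"{name}={value}" for name, value in mv2_env.items()]
    cmd.append(program)
    return cmd


# Settings reported to work on the dual-HCA, single-port nodes.
dual_hca_env = {"MV2_NUM_HCAS": 2, "MV2_NUM_PORTS": 1}

# Placeholder process count, hostfile, and program name.
command = build_launch_command(4, "hosts", "./a.out", dual_hca_env)
print(" ".join(command))
```

The returned list can be handed to subprocess.run(command) on a node where
mpirun_rsh is on the PATH.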

we have several thousand of the dual IB adapter machines, all running MPT
in dual rail mode just fine, so I doubt there is any bad hardware.
also our primary network traffic is over one IB port and IO/lustre
traffic is always on the other IB port. that all works as well.

but mvapich2 doesn't like them unless I give it the hca/port info.

ibv_devinfo on the machines that work fine:
hca_id:	mlx4_0
	transport:			InfiniBand (0)
	fw_ver:				2.31.5910
	node_guid:			f452:1403:0022:8380
	sys_image_guid:			f452:1403:0022:8383
	vendor_id:			0x02c9
	vendor_part_id:			4099
	hw_ver:				0x0
	board_id:			SGI__2669_00X
	phys_port_cnt:			2
		port:	1
			state:			PORT_ACTIVE (4)
			max_mtu:		4096 (5)
			active_mtu:		4096 (5)
			sm_lid:			1
			port_lid:		40927
			port_lmc:		0x00
			link_layer:		InfiniBand

		port:	2
			state:			PORT_ACTIVE (4)
			max_mtu:		4096 (5)
			active_mtu:		4096 (5)
			sm_lid:			20384
			port_lid:		40981
			port_lmc:		0x00
			link_layer:		InfiniBand

ibv_devinfo on the machines that need the hca/port info:
hca_id:	mlx4_1
	transport:			InfiniBand (0)
	fw_ver:				2.31.5910
	node_guid:			f452:1403:005b:bd28
	sys_image_guid:			f452:1403:005b:bd2b
	vendor_id:			0x02c9
	vendor_part_id:			4099
	hw_ver:				0x0
	board_id:			SGI__2573_00X_1
	phys_port_cnt:			1
		port:	1
			state:			PORT_ACTIVE (4)
			max_mtu:		4096 (5)
			active_mtu:		4096 (5)
			sm_lid:			20384
			port_lid:		40825
			port_lmc:		0x00
			link_layer:		InfiniBand

hca_id:	mlx4_0
	transport:			InfiniBand (0)
	fw_ver:				2.31.5910
	node_guid:			f452:1403:005b:bd20
	sys_image_guid:			f452:1403:005b:bd23
	vendor_id:			0x02c9
	vendor_part_id:			4099
	hw_ver:				0x0
	board_id:			SGI__2573_00X_0
	phys_port_cnt:			1
		port:	1
			state:			PORT_ACTIVE (4)
			max_mtu:		4096 (5)
			active_mtu:		4096 (5)
			sm_lid:			1
			port_lid:		40767
			port_lmc:		0x00
			link_layer:		InfiniBand
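
The two listings above differ mainly in how many hca_id blocks there are and
how many active ports each has. A minimal sketch of pulling that out of
ibv_devinfo text follows; the function names and the trimmed SAMPLE string are
illustrative, and reading the resulting counts as MV2_NUM_HCAS / MV2_NUM_PORTS
values is an assumption based on the settings that worked here, not documented
MVAPICH2 behavior.

```python
import re


def summarize_hcas(devinfo_text):
    """Return {hca_id: [active port numbers]} parsed from ibv_devinfo text."""
    hcas = {}
    current_hca = None
    current_port = None
    for raw in devinfo_text.splitlines():
        line = raw.strip()
        m = re.match(r"hca_id:\s*(\S+)", line)
        if m:
            current_hca = m.group(1)
            hcas[current_hca] = []
            current_port = None
            continue
        # Matches "port: 1" but not "port_lid:" or "port_lmc:".
        m = re.match(r"port:\s*(\d+)", line)
        if m and current_hca is not None:
            current_port = int(m.group(1))
            continue
        m = re.match(r"state:\s*(\S+)", line)
        if m and current_port is not None and m.group(1) == "PORT_ACTIVE":
            hcas[current_hca].append(current_port)
    return hcas


def describe_layout(hcas):
    """Return (HCAs with an active port, largest active-port count per HCA)."""
    active = [ports for ports in hcas.values() if ports]
    return len(active), max((len(ports) for ports in active), default=0)


# Trimmed version of the dual-card listing; only the parsed fields are kept.
SAMPLE = """\
hca_id: mlx4_1
    phys_port_cnt: 1
        port: 1
            state: PORT_ACTIVE (4)
            port_lid: 40825
hca_id: mlx4_0
    phys_port_cnt: 1
        port: 1
            state: PORT_ACTIVE (4)
            port_lid: 40767
"""

num_hcas, ports_per_hca = describe_layout(summarize_hcas(SAMPLE))
print(f"HCAs with an active port: {num_hcas}, active ports per HCA: {ports_per_hca}")
```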


I will reconfigure the mvapich2 build and rerun shortly.

steve


On 01/09/2015 12:27 PM, Hari Subramoni wrote:
> Hello Steve,
> 
> By default MVAPICH2 will identify all available HCAs and use the first port
> on these HCAs for communication. However, it will only use one port unless
> the user explicitly states that MVAPICH2 should use more than one port by
> setting the "MV2_NUM_PORTS" environment variable.
> 
> From your e-mail, I'm assuming that things are running fine with
> "MV2_NUM_HCAS=2 MV2_NUM_PORTS=1" - is this correct?
> 
> At this point, I'm guessing that on the system where things are failing,
> there is a bad HCA.
> 
> Could you please give us the output of ibv_devinfo on the system where
> things are passing and on the system where things are failing? Also, could you
> please configure MVAPICH2 in debug mode (--enable-g=dbg --enable-fast=none)
> and run it with "MV2_SHOW_ENV_INFO=1 MV2_DEBUG_SHOW_BACKTRACE=1" and send
> us the output?
> 
> Regards,
> Hari.
> 
> On Fri, Jan 9, 2015 at 3:02 PM, Steve Heistand <steve.heistand at nasa.gov>
> wrote:
> 
> so we have the latest mvapich build:
> 
> MVAPICH2 2.1rc1 Thu Dec 18 20:00:00 EDT 2014 ch3:mrail
> 
> Compilation
> CC: icc -fpic -m64   -DNDEBUG -DNVALGRIND -O2
> CXX: icpc -fpic -m64  -DNDEBUG -DNVALGRIND -O2
> F77: ifort -L/lib -L/lib -m64 -fpic  -O2
> FC: ifort -m64 -fpic  -O2
> 
> Configuration
> --with-device=ch3:mrail --with-rdma=gen2 CC=icc CXX=icpc F77=ifort
> FC=ifort CFLAGS=-fpic
> -m64 CXXFLAGS=-fpic -m64 FFLAGS=-m64 -fpic FCFLAGS=-m64 -fpic
> --enable-f77 --enable-fc
> --enable-cxx --enable-romio --enable-threads=default --with-hwloc
> -disable-multi-aliases
> -enable-xrc=no -enable-hybrid --prefix=XXX --with-file-system=lustre
> 
> it was compiled on and for the most part run on machines that have 1 IB
> card with dual
> ports. This is all fine so far.
> However when we run on a system that has dual cards each with a single
> port the job dies
> at startup.
> 
> If I tell it that the system is dual hca single port via environment
> variables it runs fine.
> 
> At this point I'm unsure whether it actually uses both ports on either
> configuration.
> 
> I would have thought it would have probed the hardware to figure out what
> set up
> it had when it tried to bond to the multiple ports.
> 
> unless it's actually crashing in the probe section of the mpi_init
> routines...
> 
> thoughts?
> 
> thanks
> 
> s
> 
> 

-- 
************************************************************************
 Steve Heistand                           NASA Ames Research Center
 SciCon Group                             Mail Stop 258-6
 steve.heistand at nasa.gov  (650) 604-4369  Moffett Field, CA 94035-1000
************************************************************************
 "Any opinions expressed are those of our alien overlords, not my own."


