[mvapich-discuss] MVAPICH2 error [Channel Initialization failed]

Barve, Saurabh FORNATL, IN, Contractor, DCS sbarve at nps.edu
Wed May 4 21:47:05 EDT 2011


Hello,

I'm trying to run MM5 in parallel using MVAPICH2 on a Linux system with an
InfiniBand network. As a trial, I'm running the job on a single local node.
I use the following command to launch it:

------------
mpiexec -np 16 -hostfile machines ./mm5.mpp
------------


The contents of the host file "machines" are simply:

------------

head
------------




I get the following error when I execute the command above:

------------

[ib_vbuf.c 257] Cannot register vbuf region
Internal Error: invalid error code ffffffff (Ring Index out of range) in MPID_nem_ib_init:419
Fatal error in MPI_Init: Internal MPI error!, error stack:
MPIR_Init_thread(458):
MPID_Init(274).......: channel initialization failed
MPIDI_CH3_Init(38)...:
MPID_nem_init(234)...:
MPID_nem_ib_init(419): Failed to allocate memory
[ib_vbuf.c 257] Cannot register vbuf region
Internal Error: invalid error code ffffffff (Ring Index out of range) in MPID_nem_ib_init:419
Fatal error in MPI_Init: Internal MPI error!, error stack:
MPIR_Init_thread(458):
MPID_Init(274).......: channel initialization failed
MPIDI_CH3_Init(38)...:
MPID_nem_init(234)...:
MPID_nem_ib_init(419): Failed to allocate memory

=====================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 256
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
=====================================================================================
------------





I'm running the job on Oracle Linux 6.0:
------------

[sbarve@head bin]# uname -a
Linux head 2.6.32-100.28.11.el6.x86_64 #1 SMP Wed Apr 13 12:42:21 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux

------------





My MVAPICH2 configuration is as follows:
------------

[sbarve@head bin]# ./mpich2version
MPICH2 Version:    	1.7a
MPICH2 Release date:	Tue Apr 19 12:51:14 EDT 2011
MPICH2 Device:    	ch3:nemesis
MPICH2 configure: 	--enable-echo --enable-error-messages=all
--enable-error-checking=all --enable-g=all --enable-check-compiler-flags
--enable-f77 --enable-fc --enable-cxx --enable-rsh --enable-romio
--enable-rdma-cm --with-device=ch3:nemesis:ib --with-pm=hydra:mpirun
--with-pmi=simple --enable-smpcoll --enable-mpe --enable-threads=default
--enable-base-cache --with-mpe --with-dapl-include=/usr/include
--with-dapl-lib=/usr/lib64 --with-ib-include=/usr/include
--with-ib-libpath=/usr/lib64 --prefix=/work/sbarve/mvapich2/intel
MPICH2 CC: 	icc -O3 -xSSSE3 -ip -no-prec-div   -g
MPICH2 CXX: 	icpc -O3 -xSSSE3 -ip -no-prec-div  -g
MPICH2 F77: 	ifort -O3 -xSSSE3 -ip -no-prec-div  -g
MPICH2 FC: 	ifort -O3 -xSSSE3 -ip -no-prec-div  -g
------------



Intel Compiler build: Version 12.0 Build 20110309



Here is the information about my QLogic QLE7340 InfiniBand HCA:
------------

[sbarve@head bin]# ibv_devinfo
hca_id:	qib0
	transport:			InfiniBand (0)
	fw_ver:				0.0.0
	node_guid:			0011:7500:0078:a556
	sys_image_guid:			0011:7500:0078:a556
	vendor_id:			0x1175
	vendor_part_id:			29474
	hw_ver:				0x2
	board_id:			InfiniPath_QLE7340
	phys_port_cnt:			1
		port:	1
			state:			PORT_ACTIVE (4)
			max_mtu:		4096 (5)
			active_mtu:		2048 (4)
			sm_lid:			1
			port_lid:		1
			port_lmc:		0x00
			link_layer:		IB
------------





I have set the stack size to unlimited:
------------

[sbarve@head bin]# ulimit -s
unlimited
------------



I saw in a related thread that I should set the 'max memory size' to be
unlimited as well, but the OS would not allow me to do it as a non-root
user.
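
A minimal sketch for listing the soft and hard limits in effect for the shell that launches the job. The `report_limit` helper is a hypothetical name used only for illustration. Including the locked-memory limit (`ulimit -l`) is an assumption: registering InfiniBand buffers pins memory, so that limit may be more relevant to "Cannot register vbuf region" than 'max memory size' (`ulimit -m`).

```shell
# Print the soft and hard limit for one resource flag of the shell's
# `ulimit` builtin. report_limit is a hypothetical helper name.
report_limit() {
    flag="$1"
    label="$2"
    printf '%-24s soft=%s hard=%s\n' "$label" \
        "$(ulimit -S "-$flag")" "$(ulimit -H "-$flag")"
}

report_limit s "stack size (KB)"
# Assumption: the locked-memory limit is the one that governs
# how much memory the HCA driver may pin for registration.
report_limit l "max locked memory (KB)"
report_limit m "max memory size (KB)"
```

A non-root user can raise a soft limit only up to the hard limit, so a hard value other than `unlimited` would explain why the shell refuses the change.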




When I try to run the job with the "mpirun_rsh -ssh" command, I get almost
the same error:
------------

[ib_vbuf.c 257] Cannot register vbuf region
Internal Error: invalid error code ffffffff (Ring Index out of range) in MPID_nem_ib_init:419
Fatal error in MPI_Init: Internal MPI error!, error stack:
MPIR_Init_thread(458):
MPID_Init(274).......: channel initialization failed
MPIDI_CH3_Init(38)...:
MPID_nem_init(234)...:
MPID_nem_ib_init(419): Failed to allocate memory
MPI process (rank: 6) terminated unexpectedly on head
Exit code -5 signaled from head
forrtl: error (69): process interrupted (SIGINT)
forrtl: error (69): process interrupted (SIGINT)
forrtl: error (69): process interrupted (SIGINT)
forrtl: error (69): process interrupted (SIGINT)
forrtl: error (69): process interrupted (SIGINT)
forrtl: error (69): process interrupted (SIGINT)
forrtl: error (69): process interrupted (SIGINT)
forrtl: error (69): process interrupted (SIGINT)
forrtl: error (69): process interrupted (SIGINT)
forrtl: error (69): process interrupted (SIGINT)
forrtl: error (69): process interrupted (SIGINT)
forrtl: error (69): process interrupted (SIGINT)
forrtl: error (69): process interrupted (SIGINT)
forrtl: error (69): process interrupted (SIGINT)
Image              PC                Routine            Line        Source
libpthread.so.0    000000396A20C163  Unknown               Unknown  Unknown
libipathverbs-rdm  00002B5D14B9717F  Unknown               Unknown  Unknown
mm5.mpp            00000000005F29CA  Unknown               Unknown  Unknown
mm5.mpp            00000000005F2E65  Unknown               Unknown  Unknown
mm5.mpp            00000000005E576C  Unknown               Unknown  Unknown
mm5.mpp            00000000005DC5C2  Unknown               Unknown  Unknown
mm5.mpp            0000000000601607  Unknown               Unknown  Unknown
mm5.mpp            00000000005AE8AD  Unknown               Unknown  Unknown
mm5.mpp            000000000055F963  Unknown               Unknown  Unknown
mm5.mpp            000000000055E902  Unknown               Unknown  Unknown
mm5.mpp            000000000050F38D  Unknown               Unknown  Unknown
mm5.mpp            000000000050BE14  Unknown               Unknown  Unknown
mm5.mpp            00000000004E8DA1  Unknown               Unknown  Unknown
mm5.mpp            0000000000457644  Unknown               Unknown  Unknown
mm5.mpp            0000000000405EEC  Unknown               Unknown  Unknown
libc.so.6          000000396961EC5D  Unknown               Unknown  Unknown
mm5.mpp            0000000000405DE9  Unknown               Unknown  Unknown
forrtl: error (69): process interrupted (SIGINT)
head: Connection refused

------------



The 'connection refused' cannot be due to SSH, since I have password-less
key-based authentication set up for the server.
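
A minimal sketch for confirming that a non-interactive SSH login works for every host in the hostfile, which is one way to rule SSH in or out. The `check_ssh` function and the `SSH_CMD` override are hypothetical names introduced here, and the sketch assumes the hostfile "machines" holds one plain hostname per line.

```shell
# Report whether a non-interactive SSH login to the given host succeeds.
# SSH_CMD is a hypothetical override hook for dry runs; it defaults to ssh.
check_ssh() {
    host="$1"
    # BatchMode=yes makes ssh fail instead of prompting for a password.
    if ${SSH_CMD:-ssh} -o BatchMode=yes -o ConnectTimeout=5 "$host" true \
            >/dev/null 2>&1; then
        echo "$host: non-interactive ssh OK"
    else
        echo "$host: non-interactive ssh FAILED"
        return 1
    fi
}

# "machines" is the hostfile used in the mpiexec command; skip blank lines.
if [ -f machines ]; then
    grep -v '^[[:space:]]*$' machines | while read -r h; do
        # Redirect stdin so ssh does not consume the rest of the host list.
        check_ssh "$h" </dev/null
    done
fi
```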


Should I be using the "ch3:nemesis:ib" device when building MVAPICH2? I
have tried the "ch3:psm" device, but that produced different errors.
Should I be using a different version of MVAPICH2? Are there special
compile flags I should be using? Currently, I'm only linking against the
"-lfmpich -lmpich" libraries.
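
A minimal sketch for comparing the hand-written link flags against what the MPI Fortran compiler wrapper would use. The `show_link_line` function is a hypothetical helper, and the default directory assumes the wrappers were installed under `bin` inside the `--prefix` shown in the configure line above.

```shell
# Print the underlying compile/link command of the mpif90 wrapper.
# show_link_line is a hypothetical helper; the default directory is an
# assumption derived from the --prefix in the configure line.
show_link_line() {
    wrapper="${1:-/work/sbarve/mvapich2/intel/bin}/mpif90"
    if [ -x "$wrapper" ]; then
        # -show prints the command the wrapper would run without running it.
        "$wrapper" -show
    else
        echo "wrapper not found: $wrapper"
    fi
}

show_link_line
```

Any libraries or paths in that output that are missing from the MM5 link line would be candidates to add.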


Thanks,
Saurabh

====================================

Saurabh Barve 
sbarve at nps.edu




