[mvapich-discuss] MVAPICH2 error [Channel Initialization failed]
Barve, Saurabh FORNATL, IN, Contractor, DCS
sbarve at nps.edu
Wed May 4 21:47:05 EDT 2011
Hello,
I'm trying to run MM5 in parallel using MVAPICH2 on a Linux system with an
InfiniBand network. As a trial, I'm running the job on the local node only,
using the following command:
------------
mpiexec -np 16 -hostfile machines ./mm5.mpp
------------
The contents of the hostfile "machines" are simply:
------------
head
------------
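Incidentally, I believe Hydra's hostfile format also accepts a per-host
process count, so the following should be an equivalent way to place all 16
ranks on the head node (my assumption from the Hydra documentation):
------------
head:16
------------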
I get the following error when I execute the command above:
------------
[ib_vbuf.c 257] Cannot register vbuf region
Internal Error: invalid error code ffffffff (Ring Index out of range) in MPID_nem_ib_init:419
Fatal error in MPI_Init: Internal MPI error!, error stack:
MPIR_Init_thread(458):
MPID_Init(274).......: channel initialization failed
MPIDI_CH3_Init(38)...:
MPID_nem_init(234)...:
MPID_nem_ib_init(419): Failed to allocate memory
[ib_vbuf.c 257] Cannot register vbuf region
Internal Error: invalid error code ffffffff (Ring Index out of range) in MPID_nem_ib_init:419
Fatal error in MPI_Init: Internal MPI error!, error stack:
MPIR_Init_thread(458):
MPID_Init(274).......: channel initialization failed
MPIDI_CH3_Init(38)...:
MPID_nem_init(234)...:
MPID_nem_ib_init(419): Failed to allocate memory
=====================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= EXIT CODE: 256
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
=====================================================================================
------------
I'm running the job on Oracle Linux 6.0:
------------
[sbarve@head bin]# uname -a
Linux head 2.6.32-100.28.11.el6.x86_64 #1 SMP Wed Apr 13 12:42:21 EDT 2011
x86_64 x86_64 x86_64 GNU/Linux
------------
My MVAPICH2 configuration is as follows:
------------
[sbarve@head bin]# ./mpich2version
MPICH2 Version: 1.7a
MPICH2 Release date: Tue Apr 19 12:51:14 EDT 2011
MPICH2 Device: ch3:nemesis
MPICH2 configure: --enable-echo --enable-error-messages=all
--enable-error-checking=all --enable-g=all --enable-check-compiler-flags
--enable-f77 --enable-fc --enable-cxx --enable-rsh --enable-romio
--enable-rdma-cm --with-device=ch3:nemesis:ib --with-pm=hydra:mpirun
--with-pmi=simple --enable-smpcoll --enable-mpe --enable-threads=default
--enable-base-cache --with-mpe --with-dapl-include=/usr/include
--with-dapl-lib=/usr/lib64 --with-ib-include=/usr/include
--with-ib-libpath=/usr/lib64 --prefix=/work/sbarve/mvapich2/intel
MPICH2 CC: icc -O3 -xSSSE3 -ip -no-prec-div -g
MPICH2 CXX: icpc -O3 -xSSSE3 -ip -no-prec-div -g
MPICH2 F77: ifort -O3 -xSSSE3 -ip -no-prec-div -g
MPICH2 FC: ifort -O3 -xSSSE3 -ip -no-prec-div -g
------------
Intel Compiler build: Version 12.0 Build 20110309
Here is the information about my QLogic QLE7340 InfiniBand HCA:
------------
[sbarve@head bin]# ibv_devinfo
hca_id: qib0
        transport:                      InfiniBand (0)
        fw_ver:                         0.0.0
        node_guid:                      0011:7500:0078:a556
        sys_image_guid:                 0011:7500:0078:a556
        vendor_id:                      0x1175
        vendor_part_id:                 29474
        hw_ver:                         0x2
        board_id:                       InfiniPath_QLE7340
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             2048 (4)
                        sm_lid:                 1
                        port_lid:               1
                        port_lmc:               0x00
                        link_layer:             IB
------------
I have set the stack size to unlimited:
------------
[sbarve@head bin]# ulimit -s
unlimited
------------
I saw in a related thread that I should also set the 'max memory size' limit
to unlimited, but the OS would not allow me to do so as a non-root user.
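From what I can tell, the limit that actually matters for registering IB
memory is 'max locked memory' (ulimit -l), and raising it persistently
requires an administrator to edit /etc/security/limits.conf. This is my
understanding of what that change would look like (values follow the
MVAPICH2 user guide's suggestion; I cannot apply this myself):
------------
# /etc/security/limits.conf
# allow all users to lock unlimited memory for RDMA registration
*    soft    memlock    unlimited
*    hard    memlock    unlimited
------------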
When I try to run the job with the "mpirun_rsh -ssh" launcher instead (my
invocation is sketched after the log below), I get almost the same error:
------------
[ib_vbuf.c 257] Cannot register vbuf region
Internal Error: invalid error code ffffffff (Ring Index out of range) in MPID_nem_ib_init:419
Fatal error in MPI_Init: Internal MPI error!, error stack:
MPIR_Init_thread(458):
MPID_Init(274).......: channel initialization failed
MPIDI_CH3_Init(38)...:
MPID_nem_init(234)...:
MPID_nem_ib_init(419): Failed to allocate memory
MPI process (rank: 6) terminated unexpectedly on head
Exit code -5 signaled from head
forrtl: error (69): process interrupted (SIGINT)
[... the line above repeated 13 more times ...]
Image              PC                Routine            Line        Source
libpthread.so.0    000000396A20C163  Unknown            Unknown     Unknown
libipathverbs-rdm  00002B5D14B9717F  Unknown            Unknown     Unknown
mm5.mpp            00000000005F29CA  Unknown            Unknown     Unknown
mm5.mpp            00000000005F2E65  Unknown            Unknown     Unknown
mm5.mpp            00000000005E576C  Unknown            Unknown     Unknown
mm5.mpp            00000000005DC5C2  Unknown            Unknown     Unknown
mm5.mpp            0000000000601607  Unknown            Unknown     Unknown
mm5.mpp            00000000005AE8AD  Unknown            Unknown     Unknown
mm5.mpp            000000000055F963  Unknown            Unknown     Unknown
mm5.mpp            000000000055E902  Unknown            Unknown     Unknown
mm5.mpp            000000000050F38D  Unknown            Unknown     Unknown
mm5.mpp            000000000050BE14  Unknown            Unknown     Unknown
mm5.mpp            00000000004E8DA1  Unknown            Unknown     Unknown
mm5.mpp            0000000000457644  Unknown            Unknown     Unknown
mm5.mpp            0000000000405EEC  Unknown            Unknown     Unknown
libc.so.6          000000396961EC5D  Unknown            Unknown     Unknown
mm5.mpp            0000000000405DE9  Unknown            Unknown     Unknown
forrtl: error (69): process interrupted (SIGINT)
head: Connection refused
------------
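As noted above, the mpirun_rsh command I used was roughly the following
(reconstructed from memory, so treat the exact option order as approximate):
------------
mpirun_rsh -ssh -np 16 -hostfile machines ./mm5.mpp
------------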
The 'Connection refused' message cannot be due to SSH itself, since I have
passwordless, key-based authentication set up for the server.
Should I be using the "ch3:nemesis:ib" device when compiling MVAPICH2? I
have tried the "ch3:psm" device, but that produced different errors.
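In case it helps diagnose that attempt, my understanding is that a PSM build
for this QLogic adapter would be configured along these lines (a sketch
only; I may be missing options that my nemesis build used):
------------
./configure --prefix=/work/sbarve/mvapich2/intel \
    --with-device=ch3:psm
------------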
Should I be using a different version of MVAPICH2? Are there special
compile flags I should be using? Currently, I'm linking in only the
"-lfmpich -lmpich" libraries.
Thanks,
Saurabh
====================================
Saurabh Barve
sbarve at nps.edu