[mvapich-discuss] MVAPICH2 error [Channel Initialization failed]

Sayantan Sur surs at cse.ohio-state.edu
Wed May 4 23:59:02 EDT 2011


Hi Saurabh,

It looks like you are trying to use QLogic adapters. Could you please
try the ch3:psm interface? You mention that you had some errors with
it. What were they?

Please refer to our user guide to learn about using the CH3 PSM interface.

http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.7_alpha.html#x1-160004.7
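
For reference, a PSM build is usually configured with something along
these lines (the PSM include/lib paths below are placeholders for your
system; please check the guide section above for the exact options):

------------
./configure --prefix=/work/sbarve/mvapich2/intel \
            --with-device=ch3:psm \
            --with-psm-include=/usr/include \
            --with-psm=/usr/lib64
make && make install
------------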

Thanks.

On Wed, May 4, 2011 at 9:47 PM, Barve, Saurabh FORNATL, IN,
Contractor, DCS <sbarve at nps.edu> wrote:
> Hello,
>
> I'm trying to run MM5 in parallel using MVAPICH2 on a Linux cluster with an
> InfiniBand network. As a trial, I'm running the job on a single local node.
> I use the following command to run the job:
>
> ------------
> mpiexec -np 16 -hostfile machines ./mm5.mpp
> ------------
>
>
> The contents of the host file "machines" are simply:
>
> ------------
>
> head
> ------------
>
>
>
>
> I get the following error when I execute the command above:
>
> ------------
>
> [ib_vbuf.c 257] Cannot register vbuf region
> Internal Error: invalid error code ffffffff (Ring Index out of range) in
> MPID_nem_ib_init:419
> Fatal error in MPI_Init: Internal MPI error!, error stack:
> MPIR_Init_thread(458):
> MPID_Init(274).......: channel initialization failed
> MPIDI_CH3_Init(38)...:
> MPID_nem_init(234)...:
> MPID_nem_ib_init(419): Failed to allocate memory
> [ib_vbuf.c 257] Cannot register vbuf region
> Internal Error: invalid error code ffffffff (Ring Index out of range) in
> MPID_nem_ib_init:419
> Fatal error in MPI_Init: Internal MPI error!, error stack:
> MPIR_Init_thread(458):
> MPID_Init(274).......: channel initialization failed
> MPIDI_CH3_Init(38)...:
> MPID_nem_init(234)...:
> MPID_nem_ib_init(419): Failed to allocate memory
>
> =====================================================================================
> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> =   EXIT CODE: 256
> =   CLEANING UP REMAINING PROCESSES
> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> =====================================================================================
> ------------
>
>
>
>
>
> I'm running the job on Oracle Linux 6.0:
> ------------
>
> [sbarve@head bin]# uname -a
> Linux head 2.6.32-100.28.11.el6.x86_64 #1 SMP Wed Apr 13 12:42:21 EDT 2011
> x86_64 x86_64 x86_64 GNU/Linux
>
> ------------
>
>
>
>
>
> My MVAPICH2 configuration is as follows:
> ------------
>
> [sbarve@head bin]# ./mpich2version
> MPICH2 Version:         1.7a
> MPICH2 Release date:    Tue Apr 19 12:51:14 EDT 2011
> MPICH2 Device:          ch3:nemesis
> MPICH2 configure:       --enable-echo --enable-error-messages=all
> --enable-error-checking=all --enable-g=all --enable-check-compiler-flags
> --enable-f77 --enable-fc --enable-cxx --enable-rsh --enable-romio
> --enable-rdma-cm --with-device=ch3:nemesis:ib --with-pm=hydra:mpirun
> --with-pmi=simple --enable-smpcoll --enable-mpe --enable-threads=default
> --enable-base-cache --with-mpe --with-dapl-include=/usr/include
> --with-dapl-lib=/usr/lib64 --with-ib-include=/usr/include
> --with-ib-libpath=/usr/lib64 --prefix=/work/sbarve/mvapich2/intel
> MPICH2 CC:      icc -O3 -xSSSE3 -ip -no-prec-div   -g
> MPICH2 CXX:     icpc -O3 -xSSSE3 -ip -no-prec-div  -g
> MPICH2 F77:     ifort -O3 -xSSSE3 -ip -no-prec-div  -g
> MPICH2 FC:      ifort -O3 -xSSSE3 -ip -no-prec-div  -g
> ------------
>
>
>
> Intel Compiler build: Version 12.0 Build 20110309
>
>
>
> Here is the information about my QLogic QLE7340 InfiniBand HCA:
> ------------
>
> [sbarve@head bin]# ibv_devinfo
> hca_id: qib0
>        transport:                      InfiniBand (0)
>        fw_ver:                         0.0.0
>        node_guid:                      0011:7500:0078:a556
>        sys_image_guid:                 0011:7500:0078:a556
>        vendor_id:                      0x1175
>        vendor_part_id:                 29474
>        hw_ver:                         0x2
>        board_id:                       InfiniPath_QLE7340
>        phys_port_cnt:                  1
>                port:   1
>                        state:                  PORT_ACTIVE (4)
>                        max_mtu:                4096 (5)
>                        active_mtu:             2048 (4)
>                        sm_lid:                 1
>                        port_lid:               1
>                        port_lmc:               0x00
>                        link_layer:             IB
> ------------
>
>
>
>
>
> I have set the stack size to unlimited:
> ------------
>
> [sbarve@head bin]# ulimit -s
> unlimited
> ------------
>
>
>
> I saw in a related thread that I should set the 'max memory size' to be
> unlimited as well, but the OS would not allow me to do it as a non-root
> user.
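>
> (Incidentally, my understanding is that the limit that usually matters
> for InfiniBand memory registration is the locked-memory limit shown by
> 'ulimit -l', rather than the stack or max memory size. A root
> administrator would typically raise it in /etc/security/limits.conf
> with entries like:
>
> ------------
>
> # allow unlimited pinned (locked) memory for IB registration
> * soft memlock unlimited
> * hard memlock unlimited
> ------------
>
> but I cannot change that without root access.)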
>
>
>
>
> When I try to run the job with the "mpirun_rsh -ssh" command, I get almost
> the same error:
> ------------
>
> [ib_vbuf.c 257] Cannot register vbuf region
> Internal Error: invalid error code ffffffff (Ring Index out of range) in
> MPID_nem_ib_init:419
> Fatal error in MPI_Init: Internal MPI error!, error stack:
> MPIR_Init_thread(458):
> MPID_Init(274).......: channel initialization failed
> MPIDI_CH3_Init(38)...:
> MPID_nem_init(234)...:
> MPID_nem_ib_init(419): Failed to allocate memory
> MPI process (rank: 6) terminated unexpectedly on head
> Exit code -5 signaled from head
> forrtl: error (69): process interrupted (SIGINT)
> forrtl: error (69): process interrupted (SIGINT)
> forrtl: error (69): process interrupted (SIGINT)
> forrtl: error (69): process interrupted (SIGINT)
> forrtl: error (69): process interrupted (SIGINT)
> forrtl: error (69): process interrupted (SIGINT)
> forrtl: error (69): process interrupted (SIGINT)
> forrtl: error (69): process interrupted (SIGINT)
> forrtl: error (69): process interrupted (SIGINT)
> forrtl: error (69): process interrupted (SIGINT)
> forrtl: error (69): process interrupted (SIGINT)
> forrtl: error (69): process interrupted (SIGINT)
> forrtl: error (69): process interrupted (SIGINT)
> forrtl: error (69): process interrupted (SIGINT)
> Image              PC                Routine            Line        Source
>
> libpthread.so.0    000000396A20C163  Unknown               Unknown  Unknown
> libipathverbs-rdm  00002B5D14B9717F  Unknown               Unknown  Unknown
> mm5.mpp            00000000005F29CA  Unknown               Unknown  Unknown
> mm5.mpp            00000000005F2E65  Unknown               Unknown  Unknown
> mm5.mpp            00000000005E576C  Unknown               Unknown  Unknown
> mm5.mpp            00000000005DC5C2  Unknown               Unknown  Unknown
> mm5.mpp            0000000000601607  Unknown               Unknown  Unknown
> mm5.mpp            00000000005AE8AD  Unknown               Unknown  Unknown
> mm5.mpp            000000000055F963  Unknown               Unknown  Unknown
> mm5.mpp            000000000055E902  Unknown               Unknown  Unknown
> mm5.mpp            000000000050F38D  Unknown               Unknown  Unknown
> mm5.mpp            000000000050BE14  Unknown               Unknown  Unknown
> mm5.mpp            00000000004E8DA1  Unknown               Unknown  Unknown
> mm5.mpp            0000000000457644  Unknown               Unknown  Unknown
> mm5.mpp            0000000000405EEC  Unknown               Unknown  Unknown
> libc.so.6          000000396961EC5D  Unknown               Unknown  Unknown
> mm5.mpp            0000000000405DE9  Unknown               Unknown  Unknown
> forrtl: error (69): process interrupted (SIGINT)
> head: Connection refused
>
> ------------
>
>
>
> The 'connection refused' cannot be due to SSH, since I have password-less
> key-based authentication set up for the server.
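>
> As a sanity check, a non-interactive command such as the following runs
> without prompting when key-based authentication is working:
>
> ------------
>
> ssh head true && echo "ssh to head OK"
> ------------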
>
>
> Should I be using the "ch3:nemesis:ib" device for compiling MVAPICH2? I
> have tried using the "ch3:psm" device, but that threw up different errors.
> Should I be using a different version of MVAPICH2? Are there special
> compile flags I should be using? Currently, I'm only linking in the
> "-lfmpich -lmpich" libraries.
>
>
> Thanks,
> Saurabh
>
> ====================================
>
> Saurabh Barve
> sbarve at nps.edu
>
>
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>
>



-- 
Sayantan Sur

Research Scientist
Department of Computer Science
http://www.cse.ohio-state.edu/~surs


