[mvapich-discuss] MVAPICH2 error [Channel Initialization failed]

Barve, Saurabh FORNATL, IN, Contractor, DCS sbarve at nps.edu
Thu May 5 02:09:53 EDT 2011


Following your suggestion, I compiled MVAPICH2 for the "ch3:psm" device. I kept the rest of the compilation options unchanged; I simply replaced "ch3:nemesis:ib" with "ch3:psm".
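
The configure line is sketched below; apart from the device string, every flag is the one from my earlier message (same compilers, --enable-* options, library paths, and prefix).

------------
# sketch: the earlier configure line with only the device swapped to ch3:psm
./configure --with-device=ch3:psm --with-pm=hydra:mpirun --with-pmi=simple \
    --prefix=/work/sbarve/mvapich2/intel
# (all remaining flags unchanged from the ch3:nemesis:ib build)
------------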

With MVAPICH2 compiled for the "ch3:psm" device, the MM5 run starts, and the "rsl.out.*" and "rsl.error.*" files are created. However, after the model prints the initial conditions, no further output appears. At the step where the data processing starts, (a) the "rsl.out.*" files stop updating, and (b) no MM5 output files are written. I've let the job sit like this for as long as 30-35 minutes before killing it.

I've tried starting the job with both 'mpiexec' and 'mpirun_rsh'. In both cases, "top" shows the multiple instances of the MM5 binary in the Running (R) state, while (a) the 'mpiexec' and 'mpi_hydra_proxy' processes (for the 'mpiexec' launch) and (b) the 'mpirun_rsh' and 'mpispawn' processes (for the 'mpirun_rsh' launch) are shown as Sleeping (S).
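
For reference, this is roughly how I was checking the process states on the node:

------------
# illustrative: list the launcher and MM5 processes with their scheduling states (R/S)
ps -u sbarve -o pid,stat,comm | egrep 'mm5|mpiexec|hydra|mpirun_rsh|mpispawn'
------------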

Thanks,
Saurabh
=========================================
Saurabh Barve
sbarve at nps.edu

________________________________________
From: sayantan.sur at gmail.com [sayantan.sur at gmail.com] on behalf of Sayantan Sur [surs at cse.ohio-state.edu]
Sent: Wednesday, May 04, 2011 8:59 PM
To: Barve, Saurabh FORNATL, IN, Contractor, DCS
Cc: mvapich-discuss at cse.ohio-state.edu
Subject: Re: [mvapich-discuss] MVAPICH2 error [Channel Initialization failed]

Hi Saurabh,

It looks like you are trying to use Qlogic adapters. Could you please
use the ch3:psm interface? You mention that you had some errors with
that. What were they?

Please refer to our user guide to learn about using the CH3 PSM interface.

http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.7_alpha.html#x1-160004.7

Thanks.

On Wed, May 4, 2011 at 9:47 PM, Barve, Saurabh FORNATL, IN,
Contractor, DCS <sbarve at nps.edu> wrote:
> Hello,
>
> I'm trying to run MM5 in parallel using MVAPICH2 on a Linux system with an
> InfiniBand network. As a trial, I'm running the job on a single local node.
> I use the following command to run the job:
>
> ------------
> mpiexec -np 16 -hostfile machines ./mm5.mpp
> ------------
>
>
> The contents of the host file "machines" are simply:
>
> ------------
>
> head
> ------------
>
>
>
>
> I get the following error when I execute the command above:
>
> ------------
>
> [ib_vbuf.c 257] Cannot register vbuf region
> Internal Error: invalid error code ffffffff (Ring Index out of range) in
> MPID_nem_ib_init:419
> Fatal error in MPI_Init: Internal MPI error!, error stack:
> MPIR_Init_thread(458):
> MPID_Init(274).......: channel initialization failed
> MPIDI_CH3_Init(38)...:
> MPID_nem_init(234)...:
> MPID_nem_ib_init(419): Failed to allocate memory
> [ib_vbuf.c 257] Cannot register vbuf region
> Internal Error: invalid error code ffffffff (Ring Index out of range) in
> MPID_nem_ib_init:419
> Fatal error in MPI_Init: Internal MPI error!, error stack:
> MPIR_Init_thread(458):
> MPID_Init(274).......: channel initialization failed
> MPIDI_CH3_Init(38)...:
> MPID_nem_init(234)...:
> MPID_nem_ib_init(419): Failed to allocate memory
>
> =====================================================================================
> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> =   EXIT CODE: 256
> =   CLEANING UP REMAINING PROCESSES
> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> =====================================================================================
> ------------
>
>
>
>
>
> I'm running the job on an Oracle Linux 6.0 operating system:
> ------------
>
> [sbarve at head bin]# uname -a
> Linux head 2.6.32-100.28.11.el6.x86_64 #1 SMP Wed Apr 13 12:42:21 EDT 2011
> x86_64 x86_64 x86_64 GNU/Linux
>
> ------------
>
>
>
>
>
> My MVAPICH2 configuration is as follows:
> ------------
>
> [sbarve at head bin]# ./mpich2version
> MPICH2 Version:         1.7a
> MPICH2 Release date:    Tue Apr 19 12:51:14 EDT 2011
> MPICH2 Device:          ch3:nemesis
> MPICH2 configure:       --enable-echo --enable-error-messages=all
> --enable-error-checking=all --enable-g=all --enable-check-compiler-flags
> --enable-f77 --enable-fc --enable-cxx --enable-rsh --enable-romio
> --enable-rdma-cm --with-device=ch3:nemesis:ib --with-pm=hydra:mpirun
> --with-pmi=simple --enable-smpcoll --enable-mpe --enable-threads=default
> --enable-base-cache --with-mpe --with-dapl-include=/usr/include
> --with-dapl-lib=/usr/lib64 --with-ib-include=/usr/include
> --with-ib-libpath=/usr/lib64 --prefix=/work/sbarve/mvapich2/intel
> MPICH2 CC:      icc -O3 -xSSSE3 -ip -no-prec-div   -g
> MPICH2 CXX:     icpc -O3 -xSSSE3 -ip -no-prec-div  -g
> MPICH2 F77:     ifort -O3 -xSSSE3 -ip -no-prec-div  -g
> MPICH2 FC:      ifort -O3 -xSSSE3 -ip -no-prec-div  -g
> ------------
>
>
>
> Intel Compiler build: Version 12.0 Build 20110309
>
>
>
> Here is the information about my QLogic QLE7340 Infiniband HCA:
> ------------
>
> [sbarve at head bin]# ibv_devinfo
> hca_id: qib0
>        transport:                      InfiniBand (0)
>        fw_ver:                         0.0.0
>        node_guid:                      0011:7500:0078:a556
>        sys_image_guid:                 0011:7500:0078:a556
>        vendor_id:                      0x1175
>        vendor_part_id:                 29474
>        hw_ver:                         0x2
>        board_id:                       InfiniPath_QLE7340
>        phys_port_cnt:                  1
>                port:   1
>                        state:                  PORT_ACTIVE (4)
>                        max_mtu:                4096 (5)
>                        active_mtu:             2048 (4)
>                        sm_lid:                 1
>                        port_lid:               1
>                        port_lmc:               0x00
>                        link_layer:             IB
> ------------
>
>
>
>
>
> I have set the stack size to unlimited:
> ------------
>
> [sbarve at head bin]# ulimit -s
>
> unlimited
> ------------
>
>
>
> I saw in a related thread that I should set the 'max memory size' to be
> unlimited as well, but the OS would not allow me to do it as a non-root
> user.
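>
> For reference, my understanding is that such limits are normally raised for all
> users through an entry in /etc/security/limits.conf, which needs root. A sketch of
> what I believe that entry would look like (the locked-memory limit is the one that
> matters for registering InfiniBand memory):
>
> ------------
>
> # illustrative limits.conf entries (need root to apply); memlock = max locked memory
> *    soft    memlock    unlimited
> *    hard    memlock    unlimited
> ------------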
>
>
>
>
> When I try to run the job with the "mpirun_rsh -ssh" command, I get almost
> the same error:
> ------------
>
> [ib_vbuf.c 257] Cannot register vbuf region
> Internal Error: invalid error code ffffffff (Ring Index out of range) in
> MPID_nem_ib_init:419
> Fatal error in MPI_Init: Internal MPI error!, error stack:
> MPIR_Init_thread(458):
> MPID_Init(274).......: channel initialization failed
> MPIDI_CH3_Init(38)...:
> MPID_nem_init(234)...:
> MPID_nem_ib_init(419): Failed to allocate memory
> MPI process (rank: 6) terminated unexpectedly on head
> Exit code -5 signaled from head
> forrtl: error (69): process interrupted (SIGINT)
> forrtl: error (69): process interrupted (SIGINT)
> forrtl: error (69): process interrupted (SIGINT)
> forrtl: error (69): process interrupted (SIGINT)
> forrtl: error (69): process interrupted (SIGINT)
> forrtl: error (69): process interrupted (SIGINT)
> forrtl: error (69): process interrupted (SIGINT)
> forrtl: error (69): process interrupted (SIGINT)
> forrtl: error (69): process interrupted (SIGINT)
> forrtl: error (69): process interrupted (SIGINT)
> forrtl: error (69): process interrupted (SIGINT)
> forrtl: error (69): process interrupted (SIGINT)
> forrtl: error (69): process interrupted (SIGINT)
> forrtl: error (69): process interrupted (SIGINT)
> Image              PC                Routine            Line        Source
>
> libpthread.so.0    000000396A20C163  Unknown               Unknown  Unknown
> libipathverbs-rdm  00002B5D14B9717F  Unknown               Unknown  Unknown
> mm5.mpp            00000000005F29CA  Unknown               Unknown  Unknown
> mm5.mpp            00000000005F2E65  Unknown               Unknown  Unknown
> mm5.mpp            00000000005E576C  Unknown               Unknown  Unknown
> mm5.mpp            00000000005DC5C2  Unknown               Unknown  Unknown
> mm5.mpp            0000000000601607  Unknown               Unknown  Unknown
> mm5.mpp            00000000005AE8AD  Unknown               Unknown  Unknown
> mm5.mpp            000000000055F963  Unknown               Unknown  Unknown
> mm5.mpp            000000000055E902  Unknown               Unknown  Unknown
> mm5.mpp            000000000050F38D  Unknown               Unknown  Unknown
> mm5.mpp            000000000050BE14  Unknown               Unknown  Unknown
> mm5.mpp            00000000004E8DA1  Unknown               Unknown  Unknown
> mm5.mpp            0000000000457644  Unknown               Unknown  Unknown
> mm5.mpp            0000000000405EEC  Unknown               Unknown  Unknown
> libc.so.6          000000396961EC5D  Unknown               Unknown  Unknown
> mm5.mpp            0000000000405DE9  Unknown               Unknown  Unknown
> forrtl: error (69): process interrupted (SIGINT)
> head: Connection refused
>
> ------------
>
>
>
> The 'connection refused' cannot be due to SSH, since I have password-less
> key-based authentication set up for the server.
>
>
> Should I be using the "ch3:nemesis:ib" device for compiling MVAPICH2? I
> have tried using the "ch3:psm" device, but that threw up different errors.
> Should I be using a different version of MVAPICH2? Are there special
> compile flags I should be using? Currently, I'm only linking in the
> "-lfmpich -lmpich" libraries.
>
>
> Thanks,
> Saurabh
>
> ====================================
>
> Saurabh Barve
> sbarve at nps.edu
>
>
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>
>



--
Sayantan Sur

Research Scientist
Department of Computer Science
http://www.cse.ohio-state.edu/~surs


