[mvapich-discuss] MVAPICH2 error [Channel Initialization failed]

Barve, Saurabh FORNATL, IN, Contractor, DCS sbarve at nps.edu
Thu May 5 13:36:31 EDT 2011


Looks like you are right.

I tried running the two benchmark programs and I got segmentation fault
errors. For the purpose of the tests, I had MVAPICH2 compiled with the
"ch3:psm" device.

First I tried:
------------
mpirun_rsh -np 2 head head ./osu_latency
------------

And then:
------------
mpirun_rsh -np 2 head head ./osu_bw
------------



For both those commands, I got the following errors:

------------

[mpi_rank_0] Caught error: Segmentation fault
[mpi_rank_1] Caught error: Segmentation fault
------------


There are a ton of these errors; they scroll through the screen.


In "/var/log/secure", I see the following messages:
------------

head sshd: Accepted publickey for user sbarve from 192.168.1.101 port 55226 ssh2
head sshd: pam_unix(sshd:session): session opened for user sbarve by (uid=0)
head sshd: Received disconnect from 192.168.1.101: 11: disconnected by user
------------



The IP address 192.168.1.101 is the address of my Ethernet interface. I
have an IPoIB address set up for my IB interface: 10.0.1.101. That
address doesn't show up in the logs.
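
One way to double-check which address the node name resolves to (a
quick sketch, assuming standard Linux tools and that the IPoIB
interface is named "ib0"):

------------
# address(es) the node name "head" resolves to
getent hosts head

# compare against the IPoIB interface address
ip addr show ib0
------------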


-Saurabh
====================================

Saurabh Barve 
Digital Consulting Services (DCS)
Supporting the Department of Meteorology
Naval Postgraduate School
831-656-3396 
sbarve at nps.edu




On 5/5/11 6:04 AM, "Jonathan Perkins" <perkinjo at cse.ohio-state.edu> wrote:

>Hi, according to your initial email I think it's possible that you're
>facing issues with the amount of lockable memory available to the
>library.  Have you tried to run anything simple such as the OSU
>benchmarks?
>
>These should be installed under
>/work/sbarve/mvapich2/intel/libexec/osu-micro-benchmarks/.
>
>Please see
>http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.7_alpha.html#x1-630007
>for more information on running the OSU benchmarks.
>
>If these fail, you may have a system setup issue.  For more information
>on setting the max lockable memory, point your SysAdmin to the
>following link:
>http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.7_alpha.html#x1-1030009.3.3
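>
>As a rough sketch of what that section describes (assuming PAM-based
>limits; the exact values are a SysAdmin call), the locked-memory limit
>is usually raised in /etc/security/limits.conf:
>
>------------
># /etc/security/limits.conf -- allow all users to lock unlimited memory
>*    soft    memlock    unlimited
>*    hard    memlock    unlimited
>------------
>
>A fresh login is needed before the new limits take effect.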
>
>On Thu, May 5, 2011 at 2:09 AM, Barve, Saurabh FORNATL, IN,
>Contractor, DCS <sbarve at nps.edu> wrote:
>> Following your suggestion, I compiled MVAPICH2 for the "ch3:psm"
>>device. I kept the rest of the compilation options unchanged; I simply
>>replaced "ch3:nemesis:ib" with "ch3:psm".
>>
>> When I have MVAPICH2 compiled with the "ch3:psm" device, the MM5 run
>>starts, and the "rsl.out.*" and "rsl.error.*" files are written out.
>>However, after the model prints its initial conditions, no further
>>output appears: at the step where data processing starts, there are
>>(a) no more updates to the "rsl.out.*" files, and (b) no MM5 output
>>files written. I've observed this for as long as 30-35 minutes, after
>>which I kill the job.
>>
>> I've tried using both 'mpiexec' and 'mpirun_rsh' to start the job. In
>>both cases, "top" shows the multiple instances of the MM5 binary as
>>Running, while (a) the 'mpiexec' and 'mpi_hydra_proxy' processes for
>>"mpiexec", and (b) the 'mpirun_rsh' and 'mpispawn' processes for
>>"mpirun_rsh", show Sleeping (S) status.
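>>
>> (For the record, here is a quick way I snapshot those states outside
>>of "top" -- assuming procps 'ps' and the process names above:)
>>
>> ------------
>> # STAT column: R = running, S = sleeping
>> ps -o pid,stat,comm -C mm5.mpp,mpirun_rsh,mpispawn
>> ------------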
>>
>> Thanks,
>> Saurabh
>> =========================================
>> Saurabh Barve
>> sbarve at nps.edu
>>
>> ________________________________________
>> From: sayantan.sur at gmail.com [sayantan.sur at gmail.com] on behalf of
>>Sayantan Sur [surs at cse.ohio-state.edu]
>> Sent: Wednesday, May 04, 2011 8:59 PM
>> To: Barve, Saurabh FORNATL, IN, Contractor, DCS
>> Cc: mvapich-discuss at cse.ohio-state.edu
>> Subject: Re: [mvapich-discuss] MVAPICH2 error [Channel Initialization
>>failed]
>>
>> Hi Saurabh,
>>
>> It looks like you are trying to use QLogic adapters. Could you please
>> use the ch3:psm interface? You mention that you had some errors with
>> that. What were they?
>>
>> Please refer to our user guide to learn about using the CH3 PSM
>>interface.
>>
>> 
>>http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.7_alpha.html#x1-160004.7
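>>
>> As a minimal sketch of what the guide describes (the paths below are
>> placeholders for your own setup):
>>
>> ------------
>> ./configure --with-device=ch3:psm --prefix=/your/install/prefix
>> make && make install
>> ------------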
>>
>> Thanks.
>>
>> On Wed, May 4, 2011 at 9:47 PM, Barve, Saurabh FORNATL, IN,
>> Contractor, DCS <sbarve at nps.edu> wrote:
>>> Hello,
>>>
>>> I'm trying to run MM5 in parallel using MVAPICH2 on a Linux cluster
>>>with an InfiniBand network. As a trial, I'm running the job on a
>>>single local node.
>>> I use the following command to run the job:
>>>
>>> ------------
>>> mpiexec -np 16 -hostfile machines ./mm5.mpp
>>> ------------
>>>
>>>
>>> The contents of the host file "machines" are simply:
>>>
>>> ------------
>>>
>>> head
>>> ------------
>>>
>>>
>>>
>>>
>>> I get the following error when I execute the command above:
>>>
>>> ------------
>>>
>>> [ib_vbuf.c 257] Cannot register vbuf region
>>> Internal Error: invalid error code ffffffff (Ring Index out of range) in MPID_nem_ib_init:419
>>> Fatal error in MPI_Init: Internal MPI error!, error stack:
>>> MPIR_Init_thread(458):
>>> MPID_Init(274).......: channel initialization failed
>>> MPIDI_CH3_Init(38)...:
>>> MPID_nem_init(234)...:
>>> MPID_nem_ib_init(419): Failed to allocate memory
>>> [ib_vbuf.c 257] Cannot register vbuf region
>>> Internal Error: invalid error code ffffffff (Ring Index out of range) in MPID_nem_ib_init:419
>>> Fatal error in MPI_Init: Internal MPI error!, error stack:
>>> MPIR_Init_thread(458):
>>> MPID_Init(274).......: channel initialization failed
>>> MPIDI_CH3_Init(38)...:
>>> MPID_nem_init(234)...:
>>> MPID_nem_ib_init(419): Failed to allocate memory
>>>
>>>
>>> =====================================================================================
>>> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
>>> =   EXIT CODE: 256
>>> =   CLEANING UP REMAINING PROCESSES
>>> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>>> =====================================================================================
>>> ------------
>>>
>>>
>>>
>>>
>>>
>>> I'm running the job on an Oracle Linux 6.0 operating system:
>>> ------------
>>>
>>> [sbarve at head bin]# uname -a
>>> Linux head 2.6.32-100.28.11.el6.x86_64 #1 SMP Wed Apr 13 12:42:21 EDT
>>>2011
>>> x86_64 x86_64 x86_64 GNU/Linux
>>>
>>> ------------
>>>
>>>
>>>
>>>
>>>
>>> My MVAPICH2 configuration is as follows:
>>> ------------
>>>
>>> [sbarve at head bin]# ./mpich2version
>>> MPICH2 Version:         1.7a
>>> MPICH2 Release date:    Tue Apr 19 12:51:14 EDT 2011
>>> MPICH2 Device:          ch3:nemesis
>>> MPICH2 configure:       --enable-echo --enable-error-messages=all
>>> --enable-error-checking=all --enable-g=all
>>>--enable-check-compiler-flags
>>> --enable-f77 --enable-fc --enable-cxx --enable-rsh --enable-romio
>>> --enable-rdma-cm --with-device=ch3:nemesis:ib --with-pm=hydra:mpirun
>>> --with-pmi=simple --enable-smpcoll --enable-mpe
>>>--enable-threads=default
>>> --enable-base-cache --with-mpe --with-dapl-include=/usr/include
>>> --with-dapl-lib=/usr/lib64 --with-ib-include=/usr/include
>>> --with-ib-libpath=/usr/lib64 --prefix=/work/sbarve/mvapich2/intel
>>> MPICH2 CC:      icc -O3 -xSSSE3 -ip -no-prec-div   -g
>>> MPICH2 CXX:     icpc -O3 -xSSSE3 -ip -no-prec-div  -g
>>> MPICH2 F77:     ifort -O3 -xSSSE3 -ip -no-prec-div  -g
>>> MPICH2 FC:      ifort -O3 -xSSSE3 -ip -no-prec-div  -g
>>> ------------
>>>
>>>
>>>
>>> Intel Compiler build: Version 12.0 Build 20110309
>>>
>>>
>>>
>>> Here is the information about my QLogic QLE7340 Infiniband HCA:
>>> ------------
>>>
>>> [sbarve at head bin]# ibv_devinfo
>>> hca_id: qib0
>>>        transport:                      InfiniBand (0)
>>>        fw_ver:                         0.0.0
>>>        node_guid:                      0011:7500:0078:a556
>>>        sys_image_guid:                 0011:7500:0078:a556
>>>        vendor_id:                      0x1175
>>>        vendor_part_id:                 29474
>>>        hw_ver:                         0x2
>>>        board_id:                       InfiniPath_QLE7340
>>>        phys_port_cnt:                  1
>>>                port:   1
>>>                        state:                  PORT_ACTIVE (4)
>>>                        max_mtu:                4096 (5)
>>>                        active_mtu:             2048 (4)
>>>                        sm_lid:                 1
>>>                        port_lid:               1
>>>                        port_lmc:               0x00
>>>                        link_layer:             IB
>>> ------------
>>>
>>>
>>>
>>>
>>>
>>> I have set the stack size to unlimited:
>>> ------------
>>>
>>> [sbarve at head bin]# ulimit -s
>>> unlimited
>>> ------------
>>>
>>>
>>>
>>> I saw in a related thread that I should set the 'max memory size' to be
>>> unlimited as well, but the OS would not allow me to do it as a non-root
>>> user.
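>>>
>>> (A quick check of both the soft and the hard limits -- assuming bash;
>>>a non-root user can lower the soft limits but cannot raise them past
>>>the hard limits:)
>>>
>>> ------------
>>> # current soft limits for max memory size and max locked memory
>>> ulimit -m
>>> ulimit -l
>>> # corresponding hard limits
>>> ulimit -H -m
>>> ulimit -H -l
>>> ------------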
>>>
>>>
>>>
>>>
>>> When I try to run the job with the "mpirun_rsh -ssh" command, I get
>>>almost
>>> the same error:
>>> ------------
>>>
>>> [ib_vbuf.c 257] Cannot register vbuf region
>>> Internal Error: invalid error code ffffffff (Ring Index out of range) in MPID_nem_ib_init:419
>>> Fatal error in MPI_Init: Internal MPI error!, error stack:
>>> MPIR_Init_thread(458):
>>> MPID_Init(274).......: channel initialization failed
>>> MPIDI_CH3_Init(38)...:
>>> MPID_nem_init(234)...:
>>> MPID_nem_ib_init(419): Failed to allocate memory
>>> MPI process (rank: 6) terminated unexpectedly on head
>>> Exit code -5 signaled from head
>>> forrtl: error (69): process interrupted (SIGINT)
>>> forrtl: error (69): process interrupted (SIGINT)
>>> forrtl: error (69): process interrupted (SIGINT)
>>> forrtl: error (69): process interrupted (SIGINT)
>>> forrtl: error (69): process interrupted (SIGINT)
>>> forrtl: error (69): process interrupted (SIGINT)
>>> forrtl: error (69): process interrupted (SIGINT)
>>> forrtl: error (69): process interrupted (SIGINT)
>>> forrtl: error (69): process interrupted (SIGINT)
>>> forrtl: error (69): process interrupted (SIGINT)
>>> forrtl: error (69): process interrupted (SIGINT)
>>> forrtl: error (69): process interrupted (SIGINT)
>>> forrtl: error (69): process interrupted (SIGINT)
>>> forrtl: error (69): process interrupted (SIGINT)
>>> Image              PC                Routine  Line     Source
>>> libpthread.so.0    000000396A20C163  Unknown  Unknown  Unknown
>>> libipathverbs-rdm  00002B5D14B9717F  Unknown  Unknown  Unknown
>>> mm5.mpp            00000000005F29CA  Unknown  Unknown  Unknown
>>> mm5.mpp            00000000005F2E65  Unknown  Unknown  Unknown
>>> mm5.mpp            00000000005E576C  Unknown  Unknown  Unknown
>>> mm5.mpp            00000000005DC5C2  Unknown  Unknown  Unknown
>>> mm5.mpp            0000000000601607  Unknown  Unknown  Unknown
>>> mm5.mpp            00000000005AE8AD  Unknown  Unknown  Unknown
>>> mm5.mpp            000000000055F963  Unknown  Unknown  Unknown
>>> mm5.mpp            000000000055E902  Unknown  Unknown  Unknown
>>> mm5.mpp            000000000050F38D  Unknown  Unknown  Unknown
>>> mm5.mpp            000000000050BE14  Unknown  Unknown  Unknown
>>> mm5.mpp            00000000004E8DA1  Unknown  Unknown  Unknown
>>> mm5.mpp            0000000000457644  Unknown  Unknown  Unknown
>>> mm5.mpp            0000000000405EEC  Unknown  Unknown  Unknown
>>> libc.so.6          000000396961EC5D  Unknown  Unknown  Unknown
>>> mm5.mpp            0000000000405DE9  Unknown  Unknown  Unknown
>>> forrtl: error (69): process interrupted (SIGINT)
>>> head: Connection refused
>>>
>>> ------------
>>>
>>>
>>>
>>> The 'connection refused' cannot be due to SSH, since I have
>>>password-less
>>> key-based authentication set up for the server.
>>>
>>>
>>> Should I be using the "ch3:nemesis:ib" device for compiling MVAPICH2? I
>>> have tried using the "ch3:psm" device, but that threw up different
>>>errors.
>>> Should I be using a different version of MVAPICH2? Are there special
>>> compile flags I should be using? Currently, I'm only linking in the
>>> "-lfmpich -lmpich" libraries.
>>>
>>>
>>> Thanks,
>>> Saurabh
>>>
>>> ====================================
>>>
>>> Saurabh Barve
>>> sbarve at nps.edu
>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>> --
>> Sayantan Sur
>>
>> Research Scientist
>> Department of Computer Science
>> http://www.cse.ohio-state.edu/~surs
>>
>>
>>
>
>
>
>-- 
>Jonathan Perkins
>http://www.cse.ohio-state.edu/~perkinjo



