[mvapich-discuss] MVAPICH2 error [Channel Initialization failed]
Barve, Saurabh FORNATL, IN, Contractor, DCS
sbarve at nps.edu
Thu May 5 13:36:31 EDT 2011
Looks like you are right.
I tried running the two benchmark programs and I got segmentation fault
errors. For the purpose of the tests, I had MVAPICH2 compiled with the
"ch3:psm" device.
First I tried:
------------
mpirun_rsh -np 2 head head ./osu_latency
------------
And then:
------------
mpirun_rsh -np 2 head head ./osu_bw
------------
For both those commands, I got the following errors:
------------
[mpi_rank_0] Caught error: Segmentation fault
[mpi_rank_1] Caught error: Segmentation fault
------------
There are many of these errors; they keep scrolling down the screen.
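One way to pin down where the crash happens (a sketch, assuming core
dumps are enabled on the node and gdb is installed) is to collect a core
file from a failing run and print its backtrace:
------------
# allow core files in this shell, then re-run the failing benchmark
ulimit -c unlimited
mpirun_rsh -np 2 head head ./osu_latency
# load the resulting core file and print the stack; the core file name
# and location depend on the kernel.core_pattern setting
gdb ./osu_latency core.<pid> -ex bt -ex quit
------------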
In "/var/log/secure", I see the following messages:
------------
head sshd: Accepted publickey for user sbarve from 192.168.1.101 port
55226 ssh2
head sshd: pam_unix(sshd:session): session opened for user sbarve by (uid=0)
head sshd: Received disconnect from 192.168.1.101: 11: disconnected by user
------------
The IP address 192.168.1.101 is the address for my Ethernet interface. I
have an IPoIB address set up for my IB interface: 10.0.1.101. That address
doesn't show up in the logs.
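The ssh traffic going over Ethernet should be harmless in itself:
mpirun_rsh only uses ssh to start the processes, and with "ch3:psm" the
MPI traffic goes directly through the HCA via PSM rather than over
IPoIB. To route the ssh step over IPoIB anyway, I could list a hostname
that resolves to the IPoIB address (a sketch; "head-ib" is a
hypothetical name mapped to 10.0.1.101 in /etc/hosts):
------------
# hypothetical /etc/hosts entry:
#   10.0.1.101   head-ib
mpirun_rsh -np 2 head-ib head-ib ./osu_latency
------------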
-Saurabh
====================================
Saurabh Barve
Digital Consulting Services (DCS)
Supporting the Department of Meteorology
Naval Postgraduate School
831-656-3396
sbarve at nps.edu
On 5/5/11 6:04 AM, "Jonathan Perkins" <perkinjo at cse.ohio-state.edu> wrote:
>Hi, according to your initial email I think it's possible that you're
>facing issues with the amount of lockable memory available to the
>library. Have you tried to run anything simple such as the osu
>benchmarks?
>
>These should be installed under
>/work/sbarve/mvapich2/intel/libexec/osu-micro-benchmarks/.
>
>Please see
>http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.7_alpha.html#x1-630007
>for more information on running the osu benchmarks.
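>
>A minimal sketch of running the latency test between two local
>processes (assuming the install prefix above):
>------------
>cd /work/sbarve/mvapich2/intel/libexec/osu-micro-benchmarks
>mpirun_rsh -np 2 head head ./osu_latency
>------------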
>
>If these fail you may have a system setup issue. For more information
>on setting the max lockable memory point your SysAdmin to the
>following link.
>http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.7_alpha.html#x1-1030009.3.3
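>
>For reference, the lockable-memory limit is usually raised in
>/etc/security/limits.conf (a sketch; the exact values are site-specific
>and editing the file requires root):
>------------
># /etc/security/limits.conf: let users lock enough memory for IB buffers
>*    soft    memlock    unlimited
>*    hard    memlock    unlimited
>------------
>A fresh login session is needed afterwards; "ulimit -l" should then
>report the new value.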
>
>On Thu, May 5, 2011 at 2:09 AM, Barve, Saurabh FORNATL, IN,
>Contractor, DCS <sbarve at nps.edu> wrote:
>> Following your suggestion, I compiled MVAPICH2 for the "ch3:psm"
>>device. I kept the rest of the compilation options unchanged; I simply
>>replaced "ch3:nemesis:ib" with "ch3:psm".
>>
>> When I have MVAPICH2 compiled with the "ch3:psm" device, the MM5 run
>>starts. The "rsl.out.*" and "rsl.error.*" files are written out. However,
>>after the initial conditions are printed out by the model, no other
>>output gets written out. At the step where the processing of data
>>starts, there are (a) no more updates to the "rsl.out.*" files, and (b)
>>no MM5 output files are written out. I've observed this for as long as
>>30-35 minutes, after which I kill the job.
>>
>> I've tried using both 'mpiexec' and 'mpirun_rsh' to start the job. In
>>both cases, the output of "top" shows multiple instances of the MM5
>>binary in the Running (R) state, while (a) the 'mpiexec' and
>>'mpi_hydra_proxy' processes for "mpiexec", and (b) the 'mpirun_rsh' and
>>'mpispawn' processes for "mpirun_rsh", are shown in the Sleeping (S)
>>state.
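>>
>> To see where the stalled ranks are spending their time, a sketch
>> (assuming gstack from the gdb package is available):
>> ------------
>> # print a stack trace for every running MM5 rank
>> for pid in $(pgrep mm5.mpp); do gstack $pid; done
>> ------------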
>>
>> Thanks,
>> Saurabh
>> =========================================
>> Saurabh Barve
>> sbarve at nps.edu
>>
>> ________________________________________
>> From: sayantan.sur at gmail.com [sayantan.sur at gmail.com] on behalf of
>>Sayantan Sur [surs at cse.ohio-state.edu]
>> Sent: Wednesday, May 04, 2011 8:59 PM
>> To: Barve, Saurabh FORNATL, IN, Contractor, DCS
>> Cc: mvapich-discuss at cse.ohio-state.edu
>> Subject: Re: [mvapich-discuss] MVAPICH2 error [Channel Initialization
>>failed]
>>
>> Hi Saurabh,
>>
>> It looks like you are trying to use QLogic adapters. Could you please
>> use the ch3:psm interface? You mention that you had some errors with
>> that. What were they?
>>
>> Please refer to our user guide to learn about using the CH3 PSM
>>interface.
>>
>>
>>http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.7_alpha.html#x1-160004.7
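>>
>> A minimal sketch of such a rebuild (keeping the same install prefix;
>> the remaining configure options can stay as they are):
>> ------------
>> ./configure --with-device=ch3:psm --prefix=/work/sbarve/mvapich2/intel
>> make && make install
>> ------------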
>>
>> Thanks.
>>
>> On Wed, May 4, 2011 at 9:47 PM, Barve, Saurabh FORNATL, IN,
>> Contractor, DCS <sbarve at nps.edu> wrote:
>>> Hello,
>>>
>>> I'm trying to run MM5 in parallel using MVAPICH2 on a Linux system
>>> with an InfiniBand network. As a trial, I'm running the job on a
>>> single local node. I use the following command to run the job:
>>>
>>> ------------
>>> mpiexec -np 16 -hostfile machines ./mm5.mpp
>>> ------------
>>>
>>> The contents of the host file "machines" are simply:
>>>
>>> ------------
>>> head
>>> ------------
>>>
>>> I get the following error when I execute the command above:
>>>
>>> ------------
>>>
>>> [ib_vbuf.c 257] Cannot register vbuf region
>>> Internal Error: invalid error code ffffffff (Ring Index out of range)
>>> in MPID_nem_ib_init:419
>>> Fatal error in MPI_Init: Internal MPI error!, error stack:
>>> MPIR_Init_thread(458):
>>> MPID_Init(274).......: channel initialization failed
>>> MPIDI_CH3_Init(38)...:
>>> MPID_nem_init(234)...:
>>> MPID_nem_ib_init(419): Failed to allocate memory
>>> [ib_vbuf.c 257] Cannot register vbuf region
>>> Internal Error: invalid error code ffffffff (Ring Index out of range)
>>> in MPID_nem_ib_init:419
>>> Fatal error in MPI_Init: Internal MPI error!, error stack:
>>> MPIR_Init_thread(458):
>>> MPID_Init(274).......: channel initialization failed
>>> MPIDI_CH3_Init(38)...:
>>> MPID_nem_init(234)...:
>>> MPID_nem_ib_init(419): Failed to allocate memory
>>>
>>>
>>> ===================================================================================
>>> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
>>> =   EXIT CODE: 256
>>> =   CLEANING UP REMAINING PROCESSES
>>> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>>> ===================================================================================
>>> ------------
>>>
>>>
>>>
>>>
>>>
>>> I'm running the job on an Oracle Linux 6.0 operating system:
>>> ------------
>>>
>>> [sbarve at head bin]# uname -a
>>> Linux head 2.6.32-100.28.11.el6.x86_64 #1 SMP Wed Apr 13 12:42:21 EDT
>>>2011
>>> x86_64 x86_64 x86_64 GNU/Linux
>>>
>>> ------------
>>>
>>>
>>>
>>>
>>>
>>> My MVAPICH2 configuration is as follows:
>>> ------------
>>>
>>> [sbarve at head bin]# ./mpich2version
>>> MPICH2 Version: 1.7a
>>> MPICH2 Release date: Tue Apr 19 12:51:14 EDT 2011
>>> MPICH2 Device: ch3:nemesis
>>> MPICH2 configure: --enable-echo --enable-error-messages=all
>>> --enable-error-checking=all --enable-g=all
>>>--enable-check-compiler-flags
>>> --enable-f77 --enable-fc --enable-cxx --enable-rsh --enable-romio
>>> --enable-rdma-cm --with-device=ch3:nemesis:ib --with-pm=hydra:mpirun
>>> --with-pmi=simple --enable-smpcoll --enable-mpe
>>>--enable-threads=default
>>> --enable-base-cache --with-mpe --with-dapl-include=/usr/include
>>> --with-dapl-lib=/usr/lib64 --with-ib-include=/usr/include
>>> --with-ib-libpath=/usr/lib64 --prefix=/work/sbarve/mvapich2/intel
>>> MPICH2 CC: icc -O3 -xSSSE3 -ip -no-prec-div -g
>>> MPICH2 CXX: icpc -O3 -xSSSE3 -ip -no-prec-div -g
>>> MPICH2 F77: ifort -O3 -xSSSE3 -ip -no-prec-div -g
>>> MPICH2 FC: ifort -O3 -xSSSE3 -ip -no-prec-div -g
>>> ------------
>>>
>>>
>>>
>>> Intel Compiler build: Version 12.0 Build 20110309
>>>
>>>
>>>
>>> Here is the information about my QLogic QLE7340 InfiniBand HCA:
>>> ------------
>>>
>>> [sbarve at head bin]# ibv_devinfo
>>> hca_id: qib0
>>> transport: InfiniBand (0)
>>> fw_ver: 0.0.0
>>> node_guid: 0011:7500:0078:a556
>>> sys_image_guid: 0011:7500:0078:a556
>>> vendor_id: 0x1175
>>> vendor_part_id: 29474
>>> hw_ver: 0x2
>>> board_id: InfiniPath_QLE7340
>>> phys_port_cnt: 1
>>> port: 1
>>> state: PORT_ACTIVE (4)
>>> max_mtu: 4096 (5)
>>> active_mtu: 2048 (4)
>>> sm_lid: 1
>>> port_lid: 1
>>> port_lmc: 0x00
>>> link_layer: IB
>>> ------------
>>>
>>>
>>>
>>>
>>>
>>> I have set the stack size to unlimited:
>>> ------------
>>>
>>> [sbarve at head bin]# ulimit -s
>>> unlimited
>>> ------------
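>>>
>>> Since the failure is in registering (locking) memory, I suspect the
>>> limit that matters is 'max locked memory' rather than the stack size;
>>> a quick check:
>>> ------------
>>> ulimit -l   # should print "unlimited" or a large value (in kB)
>>> ------------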
>>>
>>>
>>>
>>> I saw in a related thread that I should set the 'max memory size' to be
>>> unlimited as well, but the OS would not allow me to do it as a non-root
>>> user.
>>>
>>>
>>>
>>>
>>> When I try to run the job with the "mpirun_rsh -ssh" command, I get
>>> almost the same error:
>>> ------------
>>>
>>> [ib_vbuf.c 257] Cannot register vbuf region
>>> Internal Error: invalid error code ffffffff (Ring Index out of range)
>>> in MPID_nem_ib_init:419
>>> Fatal error in MPI_Init: Internal MPI error!, error stack:
>>> MPIR_Init_thread(458):
>>> MPID_Init(274).......: channel initialization failed
>>> MPIDI_CH3_Init(38)...:
>>> MPID_nem_init(234)...:
>>> MPID_nem_ib_init(419): Failed to allocate memory
>>> MPI process (rank: 6) terminated unexpectedly on head
>>> Exit code -5 signaled from head
>>> forrtl: error (69): process interrupted (SIGINT)
>>> forrtl: error (69): process interrupted (SIGINT)
>>> forrtl: error (69): process interrupted (SIGINT)
>>> forrtl: error (69): process interrupted (SIGINT)
>>> forrtl: error (69): process interrupted (SIGINT)
>>> forrtl: error (69): process interrupted (SIGINT)
>>> forrtl: error (69): process interrupted (SIGINT)
>>> forrtl: error (69): process interrupted (SIGINT)
>>> forrtl: error (69): process interrupted (SIGINT)
>>> forrtl: error (69): process interrupted (SIGINT)
>>> forrtl: error (69): process interrupted (SIGINT)
>>> forrtl: error (69): process interrupted (SIGINT)
>>> forrtl: error (69): process interrupted (SIGINT)
>>> forrtl: error (69): process interrupted (SIGINT)
>>> Image              PC                Routine   Line      Source
>>> libpthread.so.0    000000396A20C163  Unknown   Unknown   Unknown
>>> libipathverbs-rdm  00002B5D14B9717F  Unknown   Unknown   Unknown
>>> mm5.mpp            00000000005F29CA  Unknown   Unknown   Unknown
>>> mm5.mpp            00000000005F2E65  Unknown   Unknown   Unknown
>>> mm5.mpp            00000000005E576C  Unknown   Unknown   Unknown
>>> mm5.mpp            00000000005DC5C2  Unknown   Unknown   Unknown
>>> mm5.mpp            0000000000601607  Unknown   Unknown   Unknown
>>> mm5.mpp            00000000005AE8AD  Unknown   Unknown   Unknown
>>> mm5.mpp            000000000055F963  Unknown   Unknown   Unknown
>>> mm5.mpp            000000000055E902  Unknown   Unknown   Unknown
>>> mm5.mpp            000000000050F38D  Unknown   Unknown   Unknown
>>> mm5.mpp            000000000050BE14  Unknown   Unknown   Unknown
>>> mm5.mpp            00000000004E8DA1  Unknown   Unknown   Unknown
>>> mm5.mpp            0000000000457644  Unknown   Unknown   Unknown
>>> mm5.mpp            0000000000405EEC  Unknown   Unknown   Unknown
>>> libc.so.6          000000396961EC5D  Unknown   Unknown   Unknown
>>> mm5.mpp            0000000000405DE9  Unknown   Unknown   Unknown
>>> forrtl: error (69): process interrupted (SIGINT)
>>> head: Connection refused
>>>
>>> ------------
>>>
>>>
>>>
>>> The 'connection refused' cannot be due to SSH, since I have
>>>password-less
>>> key-based authentication set up for the server.
>>>
>>>
>>> Should I be using the "ch3:nemesis:ib" device for compiling MVAPICH2?
>>> I have tried the "ch3:psm" device, but that produced different errors.
>>> Should I be using a different version of MVAPICH2? Are there special
>>> compile flags I should be using? Currently, I'm only linking in the
>>> "-lfmpich -lmpich" libraries.
>>>
>>>
>>> Thanks,
>>> Saurabh
>>>
>>> ====================================
>>>
>>> Saurabh Barve
>>> sbarve at nps.edu
>>>
>>>
>>>
>>> _______________________________________________
>>> mvapich-discuss mailing list
>>> mvapich-discuss at cse.ohio-state.edu
>>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>>
>>>
>>
>>
>>
>> --
>> Sayantan Sur
>>
>> Research Scientist
>> Department of Computer Science
>> http://www.cse.ohio-state.edu/~surs
>>
>>
>
>
>
>--
>Jonathan Perkins
>http://www.cse.ohio-state.edu/~perkinjo