[mvapich-discuss] MPI INIT error

Devendar Bureddy bureddy at cse.ohio-state.edu
Sun Apr 7 10:28:02 EDT 2013


Hi Hoot

Good to know that things are running fine after reloading the modules.
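
(For the archives: on a RHEL 6.x box using the distro IB stack, "cycling the rdma
module" usually comes down to restarting the rdma init service and checking that the
HCA port comes back up; the service names below are an assumption and may differ on
other setups.)

  # restart the distro RDMA stack, which reloads the IB kernel modules (assumed RHEL 6 init scripts)
  service rdma restart
  # restart the subnet manager only if this node is the one running it
  service opensm restart
  # confirm the port returns to "State: Active"
  ibstat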

-Devendar

On Sun, Apr 7, 2013 at 9:28 AM, Hoot Thompson <hoot at ptpnow.com> wrote:

> The problem seems to have been fixed by cycling the rdma module.
>
> Hoot
>
> From: Devendar Bureddy <bureddy at cse.ohio-state.edu>
> Date: Saturday, April 6, 2013 2:56 PM
> To: "Thompson, John H. (GSFC-606.2)[Computer Sciences Corporation]" <
> hoot at ptpnow.com>
> Cc: Karl Schulz <karl at tacc.utexas.edu>, MVAPICH-Core <
> mvapich-core at cse.ohio-state.edu>
> Subject: Re: [mvapich-discuss] MPI INIT error
>
> Hi Hoot
>
> Same-host-to-same-host communication goes through the shared-memory channel
> and does not initialize any IB communication.
>
> Were you able to run verb level tests ( ib_send_lat and ib_send_bw)
> between two nodes?
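>
> (For reference, a typical way to run these perftest verb-level tests is to start
> the server side on one node and point the client at it; the prompts below just
> mirror the ones used elsewhere in this thread:)
>
>   [jhthomps@rh64-1-ib ~]$ ib_send_bw             # server: waits for a connection
>   [jhthomps@rh64-3-ib ~]$ ib_send_bw rh64-1-ib   # client: connects to the server
>   [jhthomps@rh64-1-ib ~]$ ib_send_lat            # same pattern for the latency test
>   [jhthomps@rh64-3-ib ~]$ ib_send_lat rh64-1-ib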
>
> Can you also try with MV2_USE_RING_STARTUP=0 to see if that makes any
> difference?
>
>  /usr/local/other/utilities/mvapich2/bin/mpirun -n 2 -hosts
> rh64-1-ib,rh64-3-ib -env MV2_USE_RING_STARTUP 0
> /usr/local/other/utilities/mvapich2/libexec/osu-micro-benchmarks/osu_bw
>
> -Devendar
>
>
>
> On Sat, Apr 6, 2013 at 12:46 PM, Hoot Thompson <hoot at ptpnow.com> wrote:
>
>>  [jhthomps@rh64-1-ib ~]$ ibstat
>> CA 'mlx4_0'
>>     CA type: MT26428
>>     Number of ports: 1
>>     Firmware version: 2.10.2000
>>     Hardware version: b0
>>     Node GUID: 0x0002c903000b1d2e
>>     System image GUID: 0x0002c903000b1d31
>>     Port 1:
>>         State: Active
>>         Physical state: LinkUp
>>         Rate: 40
>>         Base lid: 2
>>         LMC: 0
>>         SM lid: 2
>>         Capability mask: 0x0251086a
>>         Port GUID: 0x0002c903000b1d2f
>>         Link layer: InfiniBand
>>
>>
>> [root@rh64-3-ib jhthomps]# ibstat
>> CA 'mlx4_0'
>>     CA type: MT26428
>>     Number of ports: 1
>>     Firmware version: 2.10.2000
>>     Hardware version: b0
>>     Node GUID: 0x0002c903000b1792
>>     System image GUID: 0x0002c903000b1795
>>     Port 1:
>>         State: Active
>>         Physical state: LinkUp
>>         Rate: 40
>>         Base lid: 6
>>         LMC: 0
>>         SM lid: 2
>>         Capability mask: 0x02510868
>>         Port GUID: 0x0002c903000b1793
>>         Link layer: InfiniBand
>>
>>
>> Note that it runs when I do this (same host to same host) ...
>> [jhthomps@rh64-1-ib ~]$ /usr/local/other/utilities/mvapich2/bin/mpirun
>> -n 2 -hosts rh64-1-ib,rh64-1-ib
>> /usr/local/other/utilities/mvapich2/libexec/osu-micro-benchmarks/osu_bw
>> # OSU MPI Bandwidth Test v3.6
>> # Size      Bandwidth (MB/s)
>> 1                       6.08
>> 2                      12.15
>> 4                      24.31
>> 8                      49.85
>> 16                     98.46
>> 32                    200.00
>> 64                    394.21
>> 128                   784.65
>> 256                  1318.23
>> 512                  2436.43
>> 1024                 3690.13
>> 2048                 5944.59
>> 4096                 8100.73
>> 8192                 9854.90
>> 16384               10082.64
>> 32768               10982.76
>> 65536               11123.73
>> 131072              11093.92
>> 262144               9602.59
>> 524288               8995.22
>> 1048576              8842.56
>> 2097152              8649.25
>> 4194304              8874.54
>>
>> [jhthomps@rh64-1-ib ~]$ /usr/local/other/utilities/mvapich2/bin/mpirun
>> -n 2 -hosts rh64-3-ib,rh64-3-ib
>> /usr/local/other/utilities/mvapich2/libexec/osu-micro-benchmarks/osu_bw
>> # OSU MPI Bandwidth Test v3.6
>> # Size      Bandwidth (MB/s)
>> 1                       6.34
>> 2                      12.55
>> 4                      25.27
>> 8                      49.95
>> 16                    100.30
>> 32                    200.79
>> 64                    399.62
>> 128                   792.25
>> 256                  1465.55
>> 512                  2590.26
>> 1024                 4088.62
>> 2048                 6026.70
>> 4096                 8197.36
>> 8192                 9914.89
>> 16384                9089.69
>> 32768               10699.54
>> 65536               11232.76
>> 131072              11266.03
>> 262144               9600.43
>> 524288               8916.69
>> 1048576              8989.08
>> 2097152              9033.31
>> 4194304              8875.78
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On 04/06/2013 11:25 AM, Karl Schulz wrote:
>>
>> That output seems to indicate it can't initialize the HCA. Does
>> ibv_devinfo show the IB cards in an active state on the hosts you are
>> testing? If the ports are not active, one possibility is that no subnet
>> manager is running.
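>>
>> (A quick way to check, assuming the standard libibverbs utilities are installed:
>> run ibv_devinfo on each host and confirm the port reports state: PORT_ACTIVE (4)
>> and a non-zero sm_lid, e.g.:)
>>
>>   ibv_devinfo | grep -E "state|sm_lid"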
>>
>> On Apr 6, 2013, at 10:18 AM, Hoot Thompson <hoot at ptpnow.com> wrote:
>>
>> Does this help?
>>
>> [jhthomps@rh64-1-ib ~]$ /usr/local/other/utilities/mvapich2/bin/mpirun
>> -n 2 -hosts rh64-1-ib,rh64-3-ib
>> /usr/local/other/utilities/mvapich2/libexec/osu-micro-benchmarks/osu_bw
>> [cli_0]: aborting job:
>> Fatal error in MPI_Init:
>> Other MPI error, error stack:
>> MPIR_Init_thread(408).......:
>> MPID_Init(308)..............: channel initialization failed
>> MPIDI_CH3_Init(283).........:
>> MPIDI_CH3I_RDMA_init(171)...:
>> rdma_setup_startup_ring(389): cannot open hca device
>>
>>
>>
>> =====================================================================================
>> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
>> =   EXIT CODE: 256
>> =   CLEANING UP REMAINING PROCESSES
>> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>>
>> =====================================================================================
>> [cli_1]: aborting job:
>> Fatal error in MPI_Init:
>> Other MPI error, error stack:
>> MPIR_Init_thread(408).......:
>> MPID_Init(308)..............: channel initialization failed
>> MPIDI_CH3_Init(283).........:
>> MPIDI_CH3I_RDMA_init(171)...:
>> rdma_setup_startup_ring(389): cannot open hca device
>>
>>
>>
>> =====================================================================================
>> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
>> =   EXIT CODE: 256
>> =   CLEANING UP REMAINING PROCESSES
>> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>>
>> =====================================================================================
>>
>>
>>
>>
>> On 04/06/2013 10:18 AM, Devendar Bureddy wrote:
>>
>> Hi Hoot
>>
>> Can you configure MVAPICH2 with the additional flags:
>>  "--enable-fast=none --enable-fast=dbg" to see if it shows better error
>> info than "Other MPI error"?
>>
>> Can you also give it a try with mpirun_rsh?
>>
>> syntax: ./mpirun_rsh -n 2 rh64-1-ib rh64-3-ib ./osu_bw
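>>
>> (With the install paths used earlier in this thread, that would look something
>> like the following; mpirun_rsh lives in the same bin directory as mpirun:)
>>
>>   /usr/local/other/utilities/mvapich2/bin/mpirun_rsh -n 2 rh64-1-ib rh64-3-ib \
>>       /usr/local/other/utilities/mvapich2/libexec/osu-micro-benchmarks/osu_bw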
>>
>> -Devendar
>>
>>
>> On Sat, Apr 6, 2013 at 10:00 AM, Hoot Thompson <hoot at ptpnow.com> wrote:
>>
>>> I've been down this path before and I believe I've taken care of my
>>> usual oversights. Here's the background: it's a RHEL 6.4 setup using the
>>> distro IB modules (not an OFED download). I'm trying to run the OSU
>>> micro-benchmarks and I'm getting the following (debug output attached):
>>>
>>>
>>> =====================================================================================
>>> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
>>> =   EXIT CODE: 256
>>> =   CLEANING UP REMAINING PROCESSES
>>> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>>>
>>> =====================================================================================
>>> [proxy:0:1@rh64-3-ib] got pmi command (from 4): init
>>> pmi_version=1 pmi_subversion=1
>>> [proxy:0:1@rh64-3-ib] PMI response: cmd=response_to_init pmi_version=1
>>> pmi_subversion=1 rc=0
>>> [proxy:0:1@rh64-3-ib] got pmi command (from 4): get_maxes
>>>
>>> [proxy:0:1@rh64-3-ib] PMI response: cmd=maxes kvsname_max=256
>>> keylen_max=64 vallen_max=1024
>>> [proxy:0:1@rh64-3-ib] got pmi command (from 4): get_appnum
>>>
>>> [proxy:0:1@rh64-3-ib] PMI response: cmd=appnum appnum=0
>>> [proxy:0:1@rh64-3-ib] got pmi command (from 4): get_my_kvsname
>>>
>>> [proxy:0:1@rh64-3-ib] PMI response: cmd=my_kvsname kvsname=kvs_4129_0
>>> [proxy:0:1@rh64-3-ib] got pmi command (from 4): get_my_kvsname
>>>
>>> [proxy:0:1@rh64-3-ib] PMI response: cmd=my_kvsname kvsname=kvs_4129_0
>>> [proxy:0:1@rh64-3-ib] got pmi command (from 4): get
>>> kvsname=kvs_4129_0 key=PMI_process_mapping
>>> [proxy:0:1@rh64-3-ib] PMI response: cmd=get_result rc=0 msg=success
>>> value=(vector,(0,2,1))
>>> [cli_1]: aborting job:
>>> Fatal error in MPI_Init:
>>> Other MPI error
>>>
>>>
>>>
>>>
>>>
>>> =====================================================================================
>>> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
>>> =   EXIT CODE: 256
>>> =   CLEANING UP REMAINING PROCESSES
>>> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>>>
>>> =====================================================================================
>>>
>>>
>>> Here's the output of ulimit -l on both ends (configured in limits.conf):
>>> [jhthomps@rh64-1-ib ~]$ ulimit -l
>>> unlimited
>>> [root@rh64-3-ib jhthomps]# ulimit -l
>>> unlimited
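>>>
>>> (For reference, the memlock limit is usually raised for IB with entries like the
>>> following in /etc/security/limits.conf on both nodes; the "*" wildcard is just an
>>> example and can be scoped to specific users or groups:)
>>>
>>>     * soft memlock unlimited
>>>     * hard memlock unlimited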
>>>
>>> Firewalls are down and I think the /etc/hosts files are right.
>>>
>>> Suggestions?
>>>
>>> Thanks,
>>>
>>> Hoot
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>
>>
>>  --
>> Devendar
>>
>>
>>
>>
>>
>>
>
>
> --
> Devendar
>



-- 
Devendar