[mvapich-discuss] MPI INIT error
Devendar Bureddy
bureddy at cse.ohio-state.edu
Sun Apr 7 10:28:02 EDT 2013
Hi Hoot
Good to know that things are running fine after reloading the modules.
-Devendar
On Sun, Apr 7, 2013 at 9:28 AM, Hoot Thompson <hoot at ptpnow.com> wrote:
> The problem seems to have been fixed by cycling the rdma module.
>
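> For reference, on RHEL 6 with the distro IB packages, cycling the stack
> amounts to restarting the rdma service (a sketch; exact service names
> can vary by setup):

```shell
# Restart the RDMA service, which unloads and reloads the InfiniBand
# kernel modules, re-initializing the HCA.
service rdma restart
# Confirm the port came back up (expect State: Active, Physical state: LinkUp).
ibstat | grep -E 'State|Physical state'
```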
> Hoot
>
> From: Devendar Bureddy <bureddy at cse.ohio-state.edu>
> Date: Saturday, April 6, 2013 2:56 PM
> To: "Thompson, John H. (GSFC-606.2)[Computer Sciences Corporation]" <
> hoot at ptpnow.com>
> Cc: Karl Schulz <karl at tacc.utexas.edu>, MVAPICH-Core <
> mvapich-core at cse.ohio-state.edu>
> Subject: Re: [mvapich-discuss] MPI INIT error
>
> Hi Hoot
>
> Same-host communication goes through the shared-memory channel and does
> not initialize any IB communication.
>
> Were you able to run the verb-level tests (ib_send_lat and ib_send_bw)
> between two nodes?
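> A minimal sketch of the verb-level test (both tools ship with the
> perftest package; ib_send_lat is run the same way):

```shell
# On rh64-3-ib, start the server side; it waits for one connection:
ib_send_bw
# On rh64-1-ib, run the client side against it; it prints a bandwidth table:
ib_send_bw rh64-3-ib
```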
>
> Can you also try with MV2_USE_RING_STARTUP=0 to see if that makes any
> difference?
>
> /usr/local/other/utilities/mvapich2/bin/mpirun -n 2 -hosts
> rh64-1-ib,rh64-3-ib -env MV2_USE_RING_STARTUP 0
> /usr/local/other/utilities/mvapich2/libexec/osu-micro-benchmarks/osu_bw
>
> -Devendar
>
>
>
> On Sat, Apr 6, 2013 at 12:46 PM, Hoot Thompson <hoot at ptpnow.com> wrote:
>
>> [jhthomps at rh64-1-ib ~]$ ibstat
>> CA 'mlx4_0'
>> CA type: MT26428
>> Number of ports: 1
>> Firmware version: 2.10.2000
>> Hardware version: b0
>> Node GUID: 0x0002c903000b1d2e
>> System image GUID: 0x0002c903000b1d31
>> Port 1:
>> State: Active
>> Physical state: LinkUp
>> Rate: 40
>> Base lid: 2
>> LMC: 0
>> SM lid: 2
>> Capability mask: 0x0251086a
>> Port GUID: 0x0002c903000b1d2f
>> Link layer: InfiniBand
>>
>>
>> [root at rh64-3-ib jhthomps]# ibstat
>> CA 'mlx4_0'
>> CA type: MT26428
>> Number of ports: 1
>> Firmware version: 2.10.2000
>> Hardware version: b0
>> Node GUID: 0x0002c903000b1792
>> System image GUID: 0x0002c903000b1795
>> Port 1:
>> State: Active
>> Physical state: LinkUp
>> Rate: 40
>> Base lid: 6
>> LMC: 0
>> SM lid: 2
>> Capability mask: 0x02510868
>> Port GUID: 0x0002c903000b1793
>> Link layer: InfiniBand
>>
>>
>> Note that it runs when I do this (same host to same host) ...
>> [jhthomps at rh64-1-ib ~]$ /usr/local/other/utilities/mvapich2/bin/mpirun
>> -n 2 -hosts rh64-1-ib,rh64-1-ib
>> /usr/local/other/utilities/mvapich2/libexec/osu-micro-benchmarks/osu_bw
>> # OSU MPI Bandwidth Test v3.6
>> # Size Bandwidth (MB/s)
>> 1 6.08
>> 2 12.15
>> 4 24.31
>> 8 49.85
>> 16 98.46
>> 32 200.00
>> 64 394.21
>> 128 784.65
>> 256 1318.23
>> 512 2436.43
>> 1024 3690.13
>> 2048 5944.59
>> 4096 8100.73
>> 8192 9854.90
>> 16384 10082.64
>> 32768 10982.76
>> 65536 11123.73
>> 131072 11093.92
>> 262144 9602.59
>> 524288 8995.22
>> 1048576 8842.56
>> 2097152 8649.25
>> 4194304 8874.54
>>
>> [jhthomps at rh64-1-ib ~]$ /usr/local/other/utilities/mvapich2/bin/mpirun
>> -n 2 -hosts rh64-3-ib,rh64-3-ib
>> /usr/local/other/utilities/mvapich2/libexec/osu-micro-benchmarks/osu_bw
>> # OSU MPI Bandwidth Test v3.6
>> # Size Bandwidth (MB/s)
>> 1 6.34
>> 2 12.55
>> 4 25.27
>> 8 49.95
>> 16 100.30
>> 32 200.79
>> 64 399.62
>> 128 792.25
>> 256 1465.55
>> 512 2590.26
>> 1024 4088.62
>> 2048 6026.70
>> 4096 8197.36
>> 8192 9914.89
>> 16384 9089.69
>> 32768 10699.54
>> 65536 11232.76
>> 131072 11266.03
>> 262144 9600.43
>> 524288 8916.69
>> 1048576 8989.08
>> 2097152 9033.31
>> 4194304 8875.78
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On 04/06/2013 11:25 AM, Karl Schulz wrote:
>>
>> That output seems to indicate that MPI can't initialize the HCA. Does
>> ibv_devinfo show the IB cards on the hosts you are testing in an active
>> state? If the ports are not active, one possibility is that no subnet
>> manager is running.
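>> A quick way to check (a sketch; the grep pattern assumes the standard
>> ibv_devinfo output format):

```shell
# List each HCA port's logical and physical state; a healthy port
# reports state PORT_ACTIVE (4) and phys_state LINK_UP (5).
ibv_devinfo | grep -E 'state|phys_state'
# If the state is PORT_DOWN or PORT_INIT, check the cabling and make
# sure a subnet manager (e.g. opensm) is running somewhere on the fabric.
```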
>>
>> On Apr 6, 2013, at Apr 6, 10:18 AM, Hoot Thompson <hoot at ptpnow.com>
>> wrote:
>>
>> This help?
>>
>> [jhthomps at rh64-1-ib ~]$ /usr/local/other/utilities/mvapich2/bin/mpirun
>> -n 2 -hosts rh64-1-ib,rh64-3-ib
>> /usr/local/other/utilities/mvapich2/libexec/osu-micro-benchmarks/osu_bw
>> [cli_0]: aborting job:
>> Fatal error in MPI_Init:
>> Other MPI error, error stack:
>> MPIR_Init_thread(408).......:
>> MPID_Init(308)..............: channel initialization failed
>> MPIDI_CH3_Init(283).........:
>> MPIDI_CH3I_RDMA_init(171)...:
>> rdma_setup_startup_ring(389): cannot open hca device
>>
>>
>>
>> =====================================================================================
>> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
>> = EXIT CODE: 256
>> = CLEANING UP REMAINING PROCESSES
>> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>>
>> =====================================================================================
>> [cli_1]: aborting job:
>> Fatal error in MPI_Init:
>> Other MPI error, error stack:
>> MPIR_Init_thread(408).......:
>> MPID_Init(308)..............: channel initialization failed
>> MPIDI_CH3_Init(283).........:
>> MPIDI_CH3I_RDMA_init(171)...:
>> rdma_setup_startup_ring(389): cannot open hca device
>>
>>
>>
>> =====================================================================================
>> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
>> = EXIT CODE: 256
>> = CLEANING UP REMAINING PROCESSES
>> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>>
>> =====================================================================================
>>
>>
>>
>>
>> On 04/06/2013 10:18 AM, Devendar Bureddy wrote:
>>
>> Hi Hoot
>>
>> Can you configure MVAPICH2 with the additional flags:
>> "--enable-fast=none --enable-fast=dbg" to see if it shows better error
>> info than "Other MPI error"?
>>
>> Can you also give it a try with mpirun_rsh?
>>
>> syntax: ./mpirun_rsh -n 2 rh64-1-ib rh64-3-ib ./osu_bw
>>
>> -Devendar
>>
>>
>> On Sat, Apr 6, 2013 at 10:00 AM, Hoot Thompson <hoot at ptpnow.com> wrote:
>>
>>> I've been down this path before and I believe I've taken care of my
>>> usual oversights. Here's the background, it's a RHEL6.4 setup using the
>>> distro IB modules (not an OFED download). I'm trying to run the micro
>>> benchmarks and I'm getting (debug output attached) ....
>>>
>>>
>>> =====================================================================================
>>> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
>>> = EXIT CODE: 256
>>> = CLEANING UP REMAINING PROCESSES
>>> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>>>
>>> =====================================================================================
>>> [proxy:0:1 at rh64-3-ib] got pmi command (from 4): init
>>> pmi_version=1 pmi_subversion=1
>>> [proxy:0:1 at rh64-3-ib] PMI response: cmd=response_to_init pmi_version=1
>>> pmi_subversion=1 rc=0
>>> [proxy:0:1 at rh64-3-ib] got pmi command (from 4): get_maxes
>>>
>>> [proxy:0:1 at rh64-3-ib] PMI response: cmd=maxes kvsname_max=256
>>> keylen_max=64 vallen_max=1024
>>> [proxy:0:1 at rh64-3-ib] got pmi command (from 4): get_appnum
>>>
>>> [proxy:0:1 at rh64-3-ib] PMI response: cmd=appnum appnum=0
>>> [proxy:0:1 at rh64-3-ib] got pmi command (from 4): get_my_kvsname
>>>
>>> [proxy:0:1 at rh64-3-ib] PMI response: cmd=my_kvsname kvsname=kvs_4129_0
>>> [proxy:0:1 at rh64-3-ib] got pmi command (from 4): get_my_kvsname
>>>
>>> [proxy:0:1 at rh64-3-ib] PMI response: cmd=my_kvsname kvsname=kvs_4129_0
>>> [proxy:0:1 at rh64-3-ib] got pmi command (from 4): get
>>> kvsname=kvs_4129_0 key=PMI_process_mapping
>>> [proxy:0:1 at rh64-3-ib] PMI response: cmd=get_result rc=0 msg=success
>>> value=(vector,(0,2,1))
>>> [cli_1]: aborting job:
>>> Fatal error in MPI_Init:
>>> Other MPI error
>>>
>>>
>>>
>>>
>>>
>>> =====================================================================================
>>> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
>>> = EXIT CODE: 256
>>> = CLEANING UP REMAINING PROCESSES
>>> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>>>
>>> =====================================================================================
>>>
>>>
>>> Here's the output of ulimit on both ends (configured in limits.conf)
>>> [jhthomps at rh64-1-ib ~]$ ulimit -l
>>> unlimited
>>> [root at rh64-3-ib jhthomps]# ulimit -l
>>> unlimited
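>>> For reference, getting "unlimited" typically comes from entries like
>>> these in /etc/security/limits.conf (one common form; the '*' wildcard
>>> covers all users):

```
#<domain>  <type>  <item>    <value>
*          soft    memlock   unlimited
*          hard    memlock   unlimited
```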
>>>
>>> Firewalls are down and I think the /etc/hosts files are right.
>>>
>>> Suggestions?
>>>
>>> Thanks,
>>>
>>> Hoot
>>>
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> mvapich-discuss mailing list
>>> mvapich-discuss at cse.ohio-state.edu
>>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>>
>>>
>>
>>
>> --
>> Devendar
>>
>>
>>
>>
>>
>>
>
>
> --
> Devendar
>
--
Devendar