[mvapich-discuss] MPI INIT error
Hoot Thompson
hoot at ptpnow.com
Sat Apr 6 12:46:30 EDT 2013
[jhthomps at rh64-1-ib ~]$ ibstat
CA 'mlx4_0'
CA type: MT26428
Number of ports: 1
Firmware version: 2.10.2000
Hardware version: b0
Node GUID: 0x0002c903000b1d2e
System image GUID: 0x0002c903000b1d31
Port 1:
State: Active
Physical state: LinkUp
Rate: 40
Base lid: 2
LMC: 0
SM lid: 2
Capability mask: 0x0251086a
Port GUID: 0x0002c903000b1d2f
Link layer: InfiniBand
[root at rh64-3-ib jhthomps]# ibstat
CA 'mlx4_0'
CA type: MT26428
Number of ports: 1
Firmware version: 2.10.2000
Hardware version: b0
Node GUID: 0x0002c903000b1792
System image GUID: 0x0002c903000b1795
Port 1:
State: Active
Physical state: LinkUp
Rate: 40
Base lid: 6
LMC: 0
SM lid: 2
Capability mask: 0x02510868
Port GUID: 0x0002c903000b1793
Link layer: InfiniBand
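Both HCAs above report State: Active and Physical state: LinkUp, which is what MPI needs. As a repeatable way to verify that on each node, here is a small sketch (the check_ports helper name is mine, not something from this thread) that parses ibstat output:

```shell
# check_ports: reads `ibstat` output on stdin and exits 0 only if every
# listed port is "State: Active" and "Physical state: LinkUp".
check_ports() {
  awk '
    /State:/ && !/Physical/ { if ($2 != "Active") bad=1 }   # e.g. "State: Active"
    /Physical state:/       { if ($3 != "LinkUp") bad=1 }   # e.g. "Physical state: LinkUp"
    END { exit bad }                                        # unset bad => exit 0
  '
}
# usage on each host: ibstat | check_ports && echo OK || echo "port not up"
```

Running this on every node (rather than eyeballing the dumps) makes it easy to catch a single down port in a larger cluster.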
Note that it runs when I do this (same host to same host) ...
[jhthomps at rh64-1-ib ~]$ /usr/local/other/utilities/mvapich2/bin/mpirun
-n 2 -hosts rh64-1-ib,rh64-1-ib
/usr/local/other/utilities/mvapich2/libexec/osu-micro-benchmarks/osu_bw
# OSU MPI Bandwidth Test v3.6
# Size Bandwidth (MB/s)
1 6.08
2 12.15
4 24.31
8 49.85
16 98.46
32 200.00
64 394.21
128 784.65
256 1318.23
512 2436.43
1024 3690.13
2048 5944.59
4096 8100.73
8192 9854.90
16384 10082.64
32768 10982.76
65536 11123.73
131072 11093.92
262144 9602.59
524288 8995.22
1048576 8842.56
2097152 8649.25
4194304 8874.54
[jhthomps at rh64-1-ib ~]$ /usr/local/other/utilities/mvapich2/bin/mpirun
-n 2 -hosts rh64-3-ib,rh64-3-ib
/usr/local/other/utilities/mvapich2/libexec/osu-micro-benchmarks/osu_bw
# OSU MPI Bandwidth Test v3.6
# Size Bandwidth (MB/s)
1 6.34
2 12.55
4 25.27
8 49.95
16 100.30
32 200.79
64 399.62
128 792.25
256 1465.55
512 2590.26
1024 4088.62
2048 6026.70
4096 8197.36
8192 9914.89
16384 9089.69
32768 10699.54
65536 11232.76
131072 11266.03
262144 9600.43
524288 8916.69
1048576 8989.08
2097152 9033.31
4194304 8875.78
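Since same-host runs succeed but cross-host runs fail in MPI_Init, two more checks may be worth running. These are hedged suggestions of mine, not output from this thread; the ssh comparison in the comments assumes the hostnames used above:

```shell
# limits.conf is applied by pam_limits, so an interactive login can report
# "unlimited" while the non-interactive ssh session that mpirun uses to
# launch remote ranks may still get the small default memlock limit.
# Compare the two on each host, e.g.:
#   ssh rh64-3-ib 'ulimit -l'   # what remotely launched ranks actually see
ulimit -l                       # the local, interactive value
# Also confirm the HCA device nodes exist and are readable by the job's user:
ls -l /dev/infiniband/ 2>/dev/null || echo "no /dev/infiniband on this node"
```

If the non-interactive value differs from the interactive one, adding `session required pam_limits.so` to the sshd PAM stack (or setting the limit in the sshd configuration) is the usual fix.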
On 04/06/2013 11:25 AM, Karl Schulz wrote:
> That output seems to indicate it can't initialize the HCA. Does
> ibv_devinfo show the IB cards in an active state on the hosts you
> are testing? One possibility is that there is no subnet manager
> running if the ports are not active.
>
> On Apr 6, 2013, at 10:18 AM, Hoot Thompson <hoot at ptpnow.com
> <mailto:hoot at ptpnow.com>> wrote:
>
>> This help?
>>
>> [jhthomps at rh64-1-ib ~]$
>> /usr/local/other/utilities/mvapich2/bin/mpirun -n 2 -hosts
>> rh64-1-ib,rh64-3-ib
>> /usr/local/other/utilities/mvapich2/libexec/osu-micro-benchmarks/osu_bw
>> [cli_0]: aborting job:
>> Fatal error in MPI_Init:
>> Other MPI error, error stack:
>> MPIR_Init_thread(408).......:
>> MPID_Init(308)..............: channel initialization failed
>> MPIDI_CH3_Init(283).........:
>> MPIDI_CH3I_RDMA_init(171)...:
>> rdma_setup_startup_ring(389): cannot open hca device
>>
>>
>> =====================================================================================
>> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
>> = EXIT CODE: 256
>> = CLEANING UP REMAINING PROCESSES
>> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>> =====================================================================================
>> [cli_1]: aborting job:
>> Fatal error in MPI_Init:
>> Other MPI error, error stack:
>> MPIR_Init_thread(408).......:
>> MPID_Init(308)..............: channel initialization failed
>> MPIDI_CH3_Init(283).........:
>> MPIDI_CH3I_RDMA_init(171)...:
>> rdma_setup_startup_ring(389): cannot open hca device
>>
>>
>> =====================================================================================
>> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
>> = EXIT CODE: 256
>> = CLEANING UP REMAINING PROCESSES
>> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>> =====================================================================================
>>
>>
>>
>>
>> On 04/06/2013 10:18 AM, Devendar Bureddy wrote:
>>> Hi Hoot
>>>
>>> Can you configure MVAPICH2 with the additional flags:
>>> "--enable-fast=none --enable-g=dbg" to see if it shows better
>>> error info than "Other MPI error"?
>>>
>>> Can you also give it a try with mpirun_rsh?
>>>
>>> syntax: ./mpirun_rsh -n 2 rh64-1-ib rh64-3-ib ./osu_bw
>>>
>>> -Devendar
>>>
>>>
>>> On Sat, Apr 6, 2013 at 10:00 AM, Hoot Thompson <hoot at ptpnow.com
>>> <mailto:hoot at ptpnow.com>> wrote:
>>>
>>> I've been down this path before and I believe I've taken care of
>>> my usual oversights. Here's the background: it's a RHEL6.4 setup
>>> using the distro IB modules (not an OFED download). I'm trying to
>>> run the micro benchmarks and I'm getting (debug output attached) ...
>>>
>>> =====================================================================================
>>> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
>>> = EXIT CODE: 256
>>> = CLEANING UP REMAINING PROCESSES
>>> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>>> =====================================================================================
>>> [proxy:0:1 at rh64-3-ib] got pmi command (from 4): init
>>> pmi_version=1 pmi_subversion=1
>>> [proxy:0:1 at rh64-3-ib] PMI response: cmd=response_to_init
>>> pmi_version=1 pmi_subversion=1 rc=0
>>> [proxy:0:1 at rh64-3-ib] got pmi command (from 4): get_maxes
>>>
>>> [proxy:0:1 at rh64-3-ib] PMI response: cmd=maxes kvsname_max=256
>>> keylen_max=64 vallen_max=1024
>>> [proxy:0:1 at rh64-3-ib] got pmi command (from 4): get_appnum
>>>
>>> [proxy:0:1 at rh64-3-ib] PMI response: cmd=appnum appnum=0
>>> [proxy:0:1 at rh64-3-ib] got pmi command (from 4): get_my_kvsname
>>>
>>> [proxy:0:1 at rh64-3-ib] PMI response: cmd=my_kvsname
>>> kvsname=kvs_4129_0
>>> [proxy:0:1 at rh64-3-ib] got pmi command (from 4): get_my_kvsname
>>>
>>> [proxy:0:1 at rh64-3-ib] PMI response: cmd=my_kvsname
>>> kvsname=kvs_4129_0
>>> [proxy:0:1 at rh64-3-ib] got pmi command (from 4): get
>>> kvsname=kvs_4129_0 key=PMI_process_mapping
>>> [proxy:0:1 at rh64-3-ib] PMI response: cmd=get_result rc=0
>>> msg=success value=(vector,(0,2,1))
>>> [cli_1]: aborting job:
>>> Fatal error in MPI_Init:
>>> Other MPI error
>>>
>>>
>>>
>>>
>>> =====================================================================================
>>> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
>>> = EXIT CODE: 256
>>> = CLEANING UP REMAINING PROCESSES
>>> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>>> =====================================================================================
>>>
>>>
>>> Here's the output of ulimit on both ends (configured in limits.conf):
>>> [jhthomps at rh64-1-ib ~]$ ulimit -l
>>> unlimited
>>> [root at rh64-3-ib jhthomps]# ulimit -l
>>> unlimited
>>>
>>> Firewalls are down and I think the /etc/hosts files are right.
>>>
>>> Suggestions?
>>>
>>> Thanks,
>>>
>>> Hoot
>>>
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> mvapich-discuss mailing list
>>> mvapich-discuss at cse.ohio-state.edu
>>> <mailto:mvapich-discuss at cse.ohio-state.edu>
>>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>>
>>>
>>>
>>>
>>> --
>>> Devendar
>>
>