[mvapich-discuss] MPI INIT error

Hoot Thompson hoot at ptpnow.com
Sat Apr 6 12:46:30 EDT 2013


[jhthomps at rh64-1-ib ~]$ ibstat
CA 'mlx4_0'
     CA type: MT26428
     Number of ports: 1
     Firmware version: 2.10.2000
     Hardware version: b0
     Node GUID: 0x0002c903000b1d2e
     System image GUID: 0x0002c903000b1d31
     Port 1:
         State: Active
         Physical state: LinkUp
         Rate: 40
         Base lid: 2
         LMC: 0
         SM lid: 2
         Capability mask: 0x0251086a
         Port GUID: 0x0002c903000b1d2f
         Link layer: InfiniBand


[root at rh64-3-ib jhthomps]# ibstat
CA 'mlx4_0'
     CA type: MT26428
     Number of ports: 1
     Firmware version: 2.10.2000
     Hardware version: b0
     Node GUID: 0x0002c903000b1792
     System image GUID: 0x0002c903000b1795
     Port 1:
         State: Active
         Physical state: LinkUp
         Rate: 40
         Base lid: 6
         LMC: 0
         SM lid: 2
         Capability mask: 0x02510868
         Port GUID: 0x0002c903000b1793
         Link layer: InfiniBand


Note that it runs when I do this (same host to same host) ...
[jhthomps at rh64-1-ib ~]$ /usr/local/other/utilities/mvapich2/bin/mpirun 
-n 2 -hosts rh64-1-ib,rh64-1-ib 
/usr/local/other/utilities/mvapich2/libexec/osu-micro-benchmarks/osu_bw
# OSU MPI Bandwidth Test v3.6
# Size      Bandwidth (MB/s)
1                       6.08
2                      12.15
4                      24.31
8                      49.85
16                     98.46
32                    200.00
64                    394.21
128                   784.65
256                  1318.23
512                  2436.43
1024                 3690.13
2048                 5944.59
4096                 8100.73
8192                 9854.90
16384               10082.64
32768               10982.76
65536               11123.73
131072              11093.92
262144               9602.59
524288               8995.22
1048576              8842.56
2097152              8649.25
4194304              8874.54

[jhthomps at rh64-1-ib ~]$ /usr/local/other/utilities/mvapich2/bin/mpirun 
-n 2 -hosts rh64-3-ib,rh64-3-ib 
/usr/local/other/utilities/mvapich2/libexec/osu-micro-benchmarks/osu_bw
# OSU MPI Bandwidth Test v3.6
# Size      Bandwidth (MB/s)
1                       6.34
2                      12.55
4                      25.27
8                      49.95
16                    100.30
32                    200.79
64                    399.62
128                   792.25
256                  1465.55
512                  2590.26
1024                 4088.62
2048                 6026.70
4096                 8197.36
8192                 9914.89
16384                9089.69
32768               10699.54
65536               11232.76
131072              11266.03
262144               9600.43
524288               8916.69
1048576              8989.08
2097152              9033.31
4194304              8875.78








On 04/06/2013 11:25 AM, Karl Schulz wrote:
> That output seems to indicate it can't initialize the HCA.  Does 
> ibv_devinfo show the IB cards on the hosts you are testing in an 
> active state?  One possibility is that no subnet manager is running, 
> in which case the ports will not be active.
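
A quick way to run the checks described above (assuming the distro's libibverbs-utils and infiniband-diags packages are installed; the opensm check only applies on whichever node is meant to host the subnet manager):

    ibv_devinfo | grep -i state    # each port should report PORT_ACTIVE (4)
    sminfo                         # should print the LID/GUID of the responding subnet manager
    service opensm status          # only if opensm runs locally on one of these nodes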
>
> On Apr 6, 2013, at 10:18 AM, Hoot Thompson <hoot at ptpnow.com> wrote:
>
>> This help?
>>
>> [jhthomps at rh64-1-ib ~]$ 
>> /usr/local/other/utilities/mvapich2/bin/mpirun -n 2 -hosts 
>> rh64-1-ib,rh64-3-ib 
>> /usr/local/other/utilities/mvapich2/libexec/osu-micro-benchmarks/osu_bw
>> [cli_0]: aborting job:
>> Fatal error in MPI_Init:
>> Other MPI error, error stack:
>> MPIR_Init_thread(408).......:
>> MPID_Init(308)..............: channel initialization failed
>> MPIDI_CH3_Init(283).........:
>> MPIDI_CH3I_RDMA_init(171)...:
>> rdma_setup_startup_ring(389): cannot open hca device
>>
>>
>> =====================================================================================
>> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
>> =   EXIT CODE: 256
>> =   CLEANING UP REMAINING PROCESSES
>> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>> =====================================================================================
>> [cli_1]: aborting job:
>> Fatal error in MPI_Init:
>> Other MPI error, error stack:
>> MPIR_Init_thread(408).......:
>> MPID_Init(308)..............: channel initialization failed
>> MPIDI_CH3_Init(283).........:
>> MPIDI_CH3I_RDMA_init(171)...:
>> rdma_setup_startup_ring(389): cannot open hca device
>>
>>
>> =====================================================================================
>> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
>> =   EXIT CODE: 256
>> =   CLEANING UP REMAINING PROCESSES
>> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>> =====================================================================================
>>
>>
>>
>>
>> On 04/06/2013 10:18 AM, Devendar Bureddy wrote:
>>> Hi Hoot
>>>
>>> Can you configure MVAPICH2 with the additional flags: 
>>>  "--enable-fast=none --enable-fast=dbg" to see if it shows better 
>>> error info than "Other MPI error"?
>>>
>>> Can you also give it a try with mpirun_rsh?
>>>
>>> syntax:    ./mpirun_rsh -n 2  rh64-1-ib rh64-3-ib ./osu_bw
>>>
>>> -Devendar
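
For reference, a rebuild-and-retest sequence along the lines suggested above might look like this (the source directory name is an assumption, the install prefix is taken from the paths earlier in the thread, and -j4 is arbitrary):

    cd mvapich2-1.9        # assumed: wherever the MVAPICH2 source tree was unpacked
    ./configure --prefix=/usr/local/other/utilities/mvapich2 \
        --enable-g=dbg --enable-fast=none    # debug symbols; disable 'fast' shortcuts so error checking stays on
    make -j4 && make install
    # retry the failing two-node case with mpirun_rsh instead of the Hydra mpirun:
    /usr/local/other/utilities/mvapich2/bin/mpirun_rsh -n 2 rh64-1-ib rh64-3-ib \
        /usr/local/other/utilities/mvapich2/libexec/osu-micro-benchmarks/osu_bw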
>>>
>>>
>>> On Sat, Apr 6, 2013 at 10:00 AM, Hoot Thompson <hoot at ptpnow.com> wrote:
>>>
>>>     I've been down this path before and I believe I've taken care of
>>>     my usual oversights. Here's the background: it's a RHEL 6.4 setup
>>>     using the distro IB modules (not an OFED download). I'm trying
>>>     to run the OSU micro-benchmarks and I'm getting (debug output
>>>     attached) ...
>>>
>>>     =====================================================================================
>>>     =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
>>>     =   EXIT CODE: 256
>>>     =   CLEANING UP REMAINING PROCESSES
>>>     =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>>>     =====================================================================================
>>>     [proxy:0:1 at rh64-3-ib] got pmi command (from 4): init
>>>     pmi_version=1 pmi_subversion=1
>>>     [proxy:0:1 at rh64-3-ib] PMI response: cmd=response_to_init
>>>     pmi_version=1 pmi_subversion=1 rc=0
>>>     [proxy:0:1 at rh64-3-ib] got pmi command (from 4): get_maxes
>>>
>>>     [proxy:0:1 at rh64-3-ib] PMI response: cmd=maxes kvsname_max=256
>>>     keylen_max=64 vallen_max=1024
>>>     [proxy:0:1 at rh64-3-ib] got pmi command (from 4): get_appnum
>>>
>>>     [proxy:0:1 at rh64-3-ib] PMI response: cmd=appnum appnum=0
>>>     [proxy:0:1 at rh64-3-ib] got pmi command (from 4): get_my_kvsname
>>>
>>>     [proxy:0:1 at rh64-3-ib] PMI response: cmd=my_kvsname
>>>     kvsname=kvs_4129_0
>>>     [proxy:0:1 at rh64-3-ib] got pmi command (from 4): get_my_kvsname
>>>
>>>     [proxy:0:1 at rh64-3-ib] PMI response: cmd=my_kvsname
>>>     kvsname=kvs_4129_0
>>>     [proxy:0:1 at rh64-3-ib] got pmi command (from 4): get
>>>     kvsname=kvs_4129_0 key=PMI_process_mapping
>>>     [proxy:0:1 at rh64-3-ib] PMI response: cmd=get_result rc=0
>>>     msg=success value=(vector,(0,2,1))
>>>     [cli_1]: aborting job:
>>>     Fatal error in MPI_Init:
>>>     Other MPI error
>>>
>>>
>>>
>>>
>>>     =====================================================================================
>>>     =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
>>>     =   EXIT CODE: 256
>>>     =   CLEANING UP REMAINING PROCESSES
>>>     =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>>>     =====================================================================================
>>>
>>>
>>>     Here's the output of ulimit on both ends (configured in limits.conf):
>>>     [jhthomps at rh64-1-ib ~]$  ulimit -l
>>>     unlimited
>>>     [root at rh64-3-ib jhthomps]# ulimit -l
>>>     unlimited
>>>
>>>     Firewalls are disabled, and I believe the /etc/hosts files are correct.
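
For comparison, the memlock setting referred to above is usually carried by entries like these in /etc/security/limits.conf on both nodes (the wildcard is an assumption; it could equally be scoped to a specific user or group):

    *   soft   memlock   unlimited
    *   hard   memlock   unlimited

These only take effect for new login sessions, so a shell or daemon started before the change may still see the old limit.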
>>>
>>>     Suggestions?
>>>
>>>     Thanks,
>>>
>>>     Hoot
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> -- 
>>> Devendar
>>
>
