[mvapich-discuss] MPI INIT error

Karl Schulz karl at tacc.utexas.edu
Sat Apr 6 11:25:16 EDT 2013


That output seems to indicate it can't initialize the HCA.  Does ibv_devinfo show the IB cards in an active state on the hosts you are testing on?  If the ports are not active, one possibility is that there is no subnet manager running.
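
A quick way to check, assuming the standard libibverbs/infiniband-diags utilities are installed, is something like:

    ibv_devinfo | grep -E 'hca_id|port:|state'    # each port should report state: PORT_ACTIVE
    ibstat                                        # also shows the link state and the SM lid

If the ports show PORT_DOWN or PORT_INIT, starting a subnet manager (e.g. opensm) on one node of the fabric should bring them active.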

On Apr 6, 2013, at 10:18 AM, Hoot Thompson <hoot at ptpnow.com> wrote:

Does this help?

[jhthomps at rh64-1-ib ~]$ /usr/local/other/utilities/mvapich2/bin/mpirun -n 2 -hosts rh64-1-ib,rh64-3-ib /usr/local/other/utilities/mvapich2/libexec/osu-micro-benchmarks/osu_bw
[cli_0]: aborting job:
Fatal error in MPI_Init:
Other MPI error, error stack:
MPIR_Init_thread(408).......:
MPID_Init(308)..............: channel initialization failed
MPIDI_CH3_Init(283).........:
MPIDI_CH3I_RDMA_init(171)...:
rdma_setup_startup_ring(389): cannot open hca device


=====================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 256
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
=====================================================================================
[cli_1]: aborting job:
Fatal error in MPI_Init:
Other MPI error, error stack:
MPIR_Init_thread(408).......:
MPID_Init(308)..............: channel initialization failed
MPIDI_CH3_Init(283).........:
MPIDI_CH3I_RDMA_init(171)...:
rdma_setup_startup_ring(389): cannot open hca device


=====================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 256
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
=====================================================================================




On 04/06/2013 10:18 AM, Devendar Bureddy wrote:
Hi Hoot

Can you configure MVAPICH2 with the additional flags "--enable-fast=none --enable-g=dbg" to see if it shows better error info than "Other MPI error"?
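
Something along these lines should do it (the install prefix here is just an example; adjust it to your setup):

    ./configure --prefix=/usr/local/other/utilities/mvapich2-dbg --enable-fast=none --enable-g=dbg
    make && make install

and then re-run the benchmark against that build.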

Can you also give it a try with mpirun_rsh?

syntax:    ./mpirun_rsh -n 2  rh64-1-ib rh64-3-ib ./osu_bw

-Devendar


On Sat, Apr 6, 2013 at 10:00 AM, Hoot Thompson <hoot at ptpnow.com> wrote:
I've been down this path before and I believe I've taken care of my usual oversights. Here's the background: it's a RHEL 6.4 setup using the distro IB modules (not an OFED download). I'm trying to run the micro-benchmarks and I'm getting the following (debug output attached) ...

=====================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 256
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
=====================================================================================
[proxy:0:1 at rh64-3-ib] got pmi command (from 4): init
pmi_version=1 pmi_subversion=1
[proxy:0:1 at rh64-3-ib] PMI response: cmd=response_to_init pmi_version=1 pmi_subversion=1 rc=0
[proxy:0:1 at rh64-3-ib] got pmi command (from 4): get_maxes

[proxy:0:1 at rh64-3-ib] PMI response: cmd=maxes kvsname_max=256 keylen_max=64 vallen_max=1024
[proxy:0:1 at rh64-3-ib] got pmi command (from 4): get_appnum

[proxy:0:1 at rh64-3-ib] PMI response: cmd=appnum appnum=0
[proxy:0:1 at rh64-3-ib] got pmi command (from 4): get_my_kvsname

[proxy:0:1 at rh64-3-ib] PMI response: cmd=my_kvsname kvsname=kvs_4129_0
[proxy:0:1 at rh64-3-ib] got pmi command (from 4): get_my_kvsname

[proxy:0:1 at rh64-3-ib] PMI response: cmd=my_kvsname kvsname=kvs_4129_0
[proxy:0:1 at rh64-3-ib] got pmi command (from 4): get
kvsname=kvs_4129_0 key=PMI_process_mapping
[proxy:0:1 at rh64-3-ib] PMI response: cmd=get_result rc=0 msg=success value=(vector,(0,2,1))
[cli_1]: aborting job:
Fatal error in MPI_Init:
Other MPI error




=====================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 256
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
=====================================================================================


Here's the output of ulimit -l on both ends (the memlock limit is configured in limits.conf):
[jhthomps at rh64-1-ib ~]$  ulimit -l
unlimited
[root at rh64-3-ib jhthomps]# ulimit -l
unlimited
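
(The relevant limits.conf entries are along these lines on both nodes; the wildcard user field is just how I happen to have it set:)

    *    soft    memlock    unlimited
    *    hard    memlock    unlimited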

Firewalls are down and I think the /etc/hosts files are right.
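
Since this is the distro IB stack rather than an OFED install, the only other sanity checks I can think of are along these lines (the module names assume a Mellanox mlx4 HCA):

    lsmod | grep -E 'ib_uverbs|ib_umad|mlx4_ib'   # verbs, management, and HCA driver modules
    service rdma status                           # RHEL 6 service that brings up the IB stack at boot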

Suggestions?

Thanks,

Hoot





_______________________________________________
mvapich-discuss mailing list
mvapich-discuss at cse.ohio-state.edu
http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss




--
Devendar

_______________________________________________
mvapich-discuss mailing list
mvapich-discuss at cse.ohio-state.edu
http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
