[mvapich-discuss] Running MPI jobs on multiple nodes

Jonathan Perkins perkinjo at cse.ohio-state.edu
Sat Jun 21 10:39:08 EDT 2014


Hello Wan Ji.  Do you have an HCA on each machine (192.168.1.1 and
192.168.1.2)?  The error message indicates that each process encountered an
error opening the HCA.

If you do not have an HCA on each machine, then you will need to rebuild
MVAPICH2 using one of the TCP/IP interfaces.  In this scenario please see
http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.0-userguide.html#x1-170004.9
for more information.  Unfortunately, you will not be able to use our CUDA
optimizations with either of the TCP/IP interfaces.
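
For reference, a TCP/IP rebuild would look roughly like the sketch below. The
install prefix and the -j value are placeholders, and I am assuming you are
running this from the top of the unpacked MVAPICH2 source tree; please check
the user guide section linked above for the options that match your version.

```shell
# Run from the top of the MVAPICH2 source tree.
# /opt/mvapich2-tcp is a placeholder install prefix, not a required path.

# TCP/IP-Nemesis interface:
./configure --with-device=ch3:nemesis --prefix=/opt/mvapich2-tcp

# Alternative, the plain TCP/IP interface:
#   ./configure --with-device=ch3:sock --prefix=/opt/mvapich2-tcp

make -j4
make install
```

After installing, rebuild your application against the new installation and
drop MV2_USE_CUDA=1 from the launch command, since the CUDA optimizations are
not available in these builds.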

If you do have an HCA on each machine, then perhaps they are not in the
correct state.  You will need to run ``ibstat'' on each machine and make
sure that the "State" is "Active".  If it is not, you may need to ask your
system administrator to bring up the InfiniBand network.
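
As a rough sketch of that check, something like the following could be run
from the node where you launch the job.  The function names are only for
illustration, and I am assuming that ibstat is installed on both machines and
prints one "State:" line per port.  The check passes if at least one port on
the host reports "Active".

```shell
#!/bin/sh
# Reads ibstat output on stdin and succeeds if at least one port
# reports "State: Active". The "Physical state:" lines are ignored
# because they do not start with "State:".
has_active_port() {
    awk '
        /^[ \t]*State:/ && $2 == "Active" { found = 1 }
        END { exit found ? 0 : 1 }
    '
}

# Runs ibstat over ssh on each host given as an argument and reports
# whether an Active port was found. An empty or failed ibstat run is
# reported the same way as a port that is not Active.
check_hosts() {
    for host in "$@"; do
        if ssh "$host" ibstat 2>/dev/null | has_active_port; then
            echo "$host: found an Active port"
        else
            echo "$host: no Active port (HCA missing, port down, or ibstat failed)"
        fi
    done
}
```

With the two addresses from your hosts file, this would be called as
check_hosts 192.168.1.1 192.168.1.2.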

Please let us know if any of this information helps or if there is a
different issue than what I described above.


On Sat, Jun 21, 2014 at 5:36 AM, Ji Wan <wanjime at gmail.com> wrote:

> Hello,
>
> I am currently trying to run MPI jobs on multiple nodes but encountered
> the following errors:
>
> [cli_0]: [cli_1]: aborting job:
> Fatal error in PMPI_Init_thread:
> Other MPI error, error stack:
> MPIR_Init_thread(483).......:
> MPID_Init(367)..............: channel initialization failed
> MPIDI_CH3_Init(362).........:
> MPIDI_CH3I_RDMA_init(170)...:
> rdma_setup_startup_ring(389): cannot open hca device
>
> aborting job:
> Fatal error in PMPI_Init_thread:
> Other MPI error, error stack:
> MPIR_Init_thread(483).......:
> MPID_Init(367)..............: channel initialization failed
> MPIDI_CH3_Init(362).........:
> MPIDI_CH3I_RDMA_init(170)...:
> rdma_setup_startup_ring(389): cannot open hca device
>
> [cli_2]: aborting job:
> Fatal error in PMPI_Init_thread:
> Other MPI error, error stack:
> MPIR_Init_thread(483).......:
> MPID_Init(367)..............: channel initialization failed
> MPIDI_CH3_Init(362).........:
> MPIDI_CH3I_RDMA_init(170)...:
> rdma_setup_startup_ring(389): cannot open hca device
>
> [cli_3]: aborting job:
> Fatal error in PMPI_Init_thread:
> Other MPI error, error stack:
> MPIR_Init_thread(483).......:
> MPID_Init(367)..............: channel initialization failed
> MPIDI_CH3_Init(362).........:
> MPIDI_CH3I_RDMA_init(170)...:
> rdma_setup_startup_ring(389): cannot open hca device
>
> This is the command I used to start the MPI job:
>
> MV2_ENABLE_AFFINITY=0 MV2_USE_CUDA=1 GLOG_logtostderr=1 mpirun_rsh -ssh
> -hostfile hosts -n 4 ./a.out xxx
>
> and this is the *hosts* file:
>
> 192.168.1.1:2
> 192.168.1.2:2
>
> The job was started on node 192.168.1.1, and I can connect to 192.168.1.2
> via ssh without password.
>
> Can anyone help me? Thanks!
>
>
>
> --
> Best regards,
> Wan Ji


-- 
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo