[mvapich-discuss] Running MPI jobs on multiple nodes

Ji Wan wanjime at gmail.com
Sat Jun 21 10:43:03 EDT 2014


Hello Jonathan,

Thanks for your reply!

I do not have an HCA on each machine, and this is my configuration for
building mvapich2:

./configure \
  LDFLAGS='-lstdc++ -L/usr/local/cuda/lib64' \
  CPPFLAGS='-I/usr/local/cuda/include' \
  --disable-f77 --disable-fc  \
  --enable-g=dbg --disable-fast \
  --enable-cuda --with-cuda=/usr/local/cuda \
  --enable-threads=multiple

I have tried adding the --with-device=ch3:nemesis option before, but in that
case MPI did not work with CUDA correctly.

Do you have any suggestions for making MPI work with both CUDA and
TCP/IP?





--
Best regards,
Wan Ji


On Sat, Jun 21, 2014 at 10:39 PM, Jonathan Perkins <
perkinjo at cse.ohio-state.edu> wrote:

> Hello Wan Ji.  Do you have an HCA on each machine (192.168.1.1 and
> 192.168.1.2)?  The error message indicates that each process encountered an
> error opening the HCA.
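>
> A quick way to check, assuming the libibverbs utilities are installed, is
> something like:
>
>   ibv_devices     # lists the HCAs visible to the verbs layer
>   ibv_devinfo     # shows the state of each device and port
>
> If neither command reports a device, there is no usable HCA on that node.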
>
> If you do not have an HCA on each machine, then you will need to rebuild
> MVAPICH2 using one of the TCP/IP interfaces.  In this scenario please see
> http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.0-userguide.html#x1-170004.9
> for more information.  Unfortunately, you will not be able to use our CUDA
> optimizations with either of the TCP/IP interfaces.
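>
> For example, a plain TCP/IP rebuild could be configured roughly as follows
> (a sketch; the CUDA flags are left out since, as noted above, the CUDA
> optimizations are not available over TCP/IP):
>
>   ./configure --with-device=ch3:nemesis \
>     --disable-f77 --disable-fc \
>     --enable-g=dbg --disable-fast \
>     --enable-threads=multiple
>   make && make install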
>
> If you do have an HCA on each machine then perhaps they are not in the
> correct state.  You will need to check "ibstat" and make sure that the
> "State" is "Active".  If it is not, you may need to consult your System
> Administrator to bring up the InfiniBand network to a running state.
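>
> For example (assuming the standard InfiniBand diagnostic tools are
> installed):
>
>   ibstat | grep -i state    # the port in use should show "State: Active"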
>
> Please let us know if any of this information helps or if there is a
> different issue than what I described above.
>
>
> On Sat, Jun 21, 2014 at 5:36 AM, Ji Wan <wanjime at gmail.com> wrote:
>
>> Hello,
>>
>> I am currently trying to run MPI jobs on multiple nodes but encountered
>> the following errors:
>>
>> [cli_0]: [cli_1]: aborting job:
>> Fatal error in PMPI_Init_thread:
>> Other MPI error, error stack:
>> MPIR_Init_thread(483).......:
>> MPID_Init(367)..............: channel initialization failed
>> MPIDI_CH3_Init(362).........:
>> MPIDI_CH3I_RDMA_init(170)...:
>> rdma_setup_startup_ring(389): cannot open hca device
>>
>> aborting job:
>> Fatal error in PMPI_Init_thread:
>> Other MPI error, error stack:
>> MPIR_Init_thread(483).......:
>> MPID_Init(367)..............: channel initialization failed
>> MPIDI_CH3_Init(362).........:
>> MPIDI_CH3I_RDMA_init(170)...:
>> rdma_setup_startup_ring(389): cannot open hca device
>>
>> [cli_2]: aborting job:
>> Fatal error in PMPI_Init_thread:
>> Other MPI error, error stack:
>> MPIR_Init_thread(483).......:
>> MPID_Init(367)..............: channel initialization failed
>> MPIDI_CH3_Init(362).........:
>> MPIDI_CH3I_RDMA_init(170)...:
>> rdma_setup_startup_ring(389): cannot open hca device
>>
>> [cli_3]: aborting job:
>> Fatal error in PMPI_Init_thread:
>> Other MPI error, error stack:
>> MPIR_Init_thread(483).......:
>> MPID_Init(367)..............: channel initialization failed
>> MPIDI_CH3_Init(362).........:
>> MPIDI_CH3I_RDMA_init(170)...:
>> rdma_setup_startup_ring(389): cannot open hca device
>>
>> This is the command I used to start the MPI job:
>>
>> MV2_ENABLE_AFFINITY=0 MV2_USE_CUDA=1 GLOG_logtostderr=1 mpirun_rsh -ssh
>> -hostfile hosts -n 4 ./a.out xxx
>>
>> and this is the *hosts* file:
>>
>> 192.168.1.1:2
>> 192.168.1.2:2
>>
>> The job was started on node 192.168.1.1, and I can connect to 192.168.1.2
>> via ssh without a password.
>>
>> Can anyone help me? Thanks!
>>
>>
>>
>> --
>> Best regards,
>> Wan Ji
>>
>> _______________________________________________
>> mvapich-discuss mailing list
>> mvapich-discuss at cse.ohio-state.edu
>> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>
>>
>
>
> --
> Jonathan Perkins
> http://www.cse.ohio-state.edu/~perkinjo
>

