[mvapich-discuss] Running MPI jobs on multiple nodes

Jonathan Perkins perkinjo at cse.ohio-state.edu
Sat Jun 21 10:50:55 EDT 2014


I'm sorry, I do not have a good suggestion for you at this time.  At this
point it appears that you'll need to use the Nemesis interface without
CUDA-aware MPI.
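
If it helps, a rebuild along those lines might look roughly like the
following.  This is only a sketch based on your earlier configure line, with
the CUDA-related flags dropped and the Nemesis channel selected; adjust the
remaining options and your install prefix as needed:

./configure \
  --disable-f77 --disable-fc \
  --enable-g=dbg --disable-fast \
  --enable-threads=multiple \
  --with-device=ch3:nemesis
make && make install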


On Sat, Jun 21, 2014 at 10:43 AM, Ji Wan <wanjime at gmail.com> wrote:

> Hello Jonathan,
>
> Thanks for your reply!
>
> I do not have an HCA on either machine, and this is my configure command
> for building MVAPICH2:
>
> ./configure \
>   LDFLAGS='-lstdc++ -L/usr/local/cuda/lib64' \
>   CPPFLAGS='-I/usr/local/cuda/include' \
>   --disable-f77 --disable-fc  \
>   --enable-g=dbg --disable-fast \
>   --enable-cuda --with-cuda=/usr/local/cuda \
>   --enable-threads=multiple
>
> I have tried adding the --with-device=ch3:nemesis option before, but in
> that case MPI does not work with CUDA correctly.
>
> Do you have any suggestions for making MPI work with both CUDA and
> TCP/IP?
>
>
>
>
>
> --Best regards, Wan Ji
>
>
> On Sat, Jun 21, 2014 at 10:39 PM, Jonathan Perkins <
> perkinjo at cse.ohio-state.edu> wrote:
>
>> Hello Wan Ji.  Do you have an HCA on each machine (192.168.1.1 and
>> 192.168.1.2)?  The error message indicates that each process encountered an
>> error opening the HCA.
>>
>> If you do not have an HCA on each machine, then you will need to rebuild
>> MVAPICH2 using one of the TCP/IP interfaces.  In this scenario please see
>> http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.0-userguide.html#x1-170004.9
>> for more information.  Unfortunately, you will not be able to use our CUDA
>> optimizations with either of the TCP/IP interfaces.
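>>
>> For reference, and only as a sketch (the user guide above has the exact
>> configure lines), the TCP/IP builds are selected at configure time with
>> something like:
>>
>>   ./configure --with-device=ch3:nemesis ...   (TCP/IP-Nemesis)
>>   ./configure --with-device=ch3:sock ...      (TCP/IP-CH3)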
>>
>> If you do have an HCA on each machine, then perhaps they are not in the
>> correct state.  You will need to check ``ibstat'' and make sure that the
>> "State" is "Active".  In the event that it is not, you may need to consult
>> your system administrator to bring the InfiniBand network up to a running
>> state.
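>>
>> For example, something along these lines is a quick check (a sketch only;
>> the exact output format depends on your OFED stack):
>>
>>   ibstat | grep -i state
>>
>> Each port you intend to use should report "State: Active" and "Physical
>> state: LinkUp"; anything else (e.g. "Down" or "Initializing") points to a
>> fabric or subnet manager problem.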
>>
>> Please let us know if any of this information helps or if there is a
>> different issue than what I described above.
>>
>>
>> On Sat, Jun 21, 2014 at 5:36 AM, Ji Wan <wanjime at gmail.com> wrote:
>>
>>> Hello,
>>>
>>> I am currently trying to run MPI jobs on multiple nodes but encountered
>>> the following errors:
>>>
>>> [cli_0]: [cli_1]: aborting job:
>>> Fatal error in PMPI_Init_thread:
>>> Other MPI error, error stack:
>>> MPIR_Init_thread(483).......:
>>> MPID_Init(367)..............: channel initialization failed
>>> MPIDI_CH3_Init(362).........:
>>> MPIDI_CH3I_RDMA_init(170)...:
>>> rdma_setup_startup_ring(389): cannot open hca device
>>>
>>> aborting job:
>>> Fatal error in PMPI_Init_thread:
>>> Other MPI error, error stack:
>>> MPIR_Init_thread(483).......:
>>> MPID_Init(367)..............: channel initialization failed
>>> MPIDI_CH3_Init(362).........:
>>> MPIDI_CH3I_RDMA_init(170)...:
>>> rdma_setup_startup_ring(389): cannot open hca device
>>>
>>> [cli_2]: aborting job:
>>> Fatal error in PMPI_Init_thread:
>>> Other MPI error, error stack:
>>> MPIR_Init_thread(483).......:
>>> MPID_Init(367)..............: channel initialization failed
>>> MPIDI_CH3_Init(362).........:
>>> MPIDI_CH3I_RDMA_init(170)...:
>>> rdma_setup_startup_ring(389): cannot open hca device
>>>
>>> [cli_3]: aborting job:
>>> Fatal error in PMPI_Init_thread:
>>> Other MPI error, error stack:
>>> MPIR_Init_thread(483).......:
>>> MPID_Init(367)..............: channel initialization failed
>>> MPIDI_CH3_Init(362).........:
>>> MPIDI_CH3I_RDMA_init(170)...:
>>> rdma_setup_startup_ring(389): cannot open hca device
>>>
>>> This is the command I used to start the MPI job:
>>>
>>> MV2_ENABLE_AFFINITY=0 MV2_USE_CUDA=1 GLOG_logtostderr=1 mpirun_rsh -ssh
>>> -hostfile hosts -n 4 ./a.out xxx
>>>
>>> and this is the *hosts* file:
>>>
>>> 192.168.1.1:2
>>> 192.168.1.2:2
>>>
>>> The job was started on node 192.168.1.1, and I can connect to
>>> 192.168.1.2 via ssh without a password.
>>>
>>> Can anyone help me? Thanks!
>>>
>>>
>>>
>>> --Best regards, Wan Ji
>>>
>>>
>>
>>
>> --
>> Jonathan Perkins
>> http://www.cse.ohio-state.edu/~perkinjo
>>
>
>


-- 
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo