[mvapich-discuss] Running MPI jobs on multiple nodes

Jonathan Perkins perkinjo at cse.ohio-state.edu
Sat Jun 21 11:07:08 EDT 2014


Yes.


On Sat, Jun 21, 2014 at 10:54 AM, Ji Wan <wanjime at gmail.com> wrote:

> Does this mean that I have the following two choices?
>
> 1. Use CUDA-aware MPI with an InfiniBand device
> 2. If using TCP/IP, copy the data to host memory before calling MPI_Send
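A minimal sketch of option 2 (not from the original thread; the buffer name, element type, and count are made up): with a TCP/IP-only build the GPU data is first staged into a host buffer, and that host pointer is what gets passed to MPI_Send.

    /* Hypothetical staging helper: copy device data to the host, then use a
     * plain MPI_Send on the host buffer (no CUDA-aware MPI support needed). */
    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdlib.h>

    void send_device_buffer(const float *d_buf, int count, int dest, MPI_Comm comm)
    {
        float *h_buf = (float *) malloc(count * sizeof(float));

        /* Stage the GPU data into host memory first... */
        cudaMemcpy(h_buf, d_buf, count * sizeof(float), cudaMemcpyDeviceToHost);

        /* ...then send the host pointer over the TCP/IP (Nemesis) channel. */
        MPI_Send(h_buf, count, MPI_FLOAT, dest, 0, comm);

        free(h_buf);
    }

The receiving rank would mirror this: MPI_Recv into a host buffer, then cudaMemcpy the data back to device memory.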
>
>
>
>
> --
> Best regards,
> Wan Ji
>
>
> On Sat, Jun 21, 2014 at 10:50 PM, Jonathan Perkins <
> perkinjo at cse.ohio-state.edu> wrote:
>
>> I'm sorry, I do not have a good suggestion for you at this time.  At this
>> point it appears that you'll need to use the Nemesis interface without
>> CUDA-aware MPI.
>>
>>
>> On Sat, Jun 21, 2014 at 10:43 AM, Ji Wan <wanjime at gmail.com> wrote:
>>
>>> Hello Jonathan,
>>>
>>> Thanks for your reply!
>>>
>>> I do not have an HCA on either machine, and this is my configuration for
>>> building MVAPICH2:
>>>
>>> ./configure \
>>>   LDFLAGS='-lstdc++ -L/usr/local/cuda/lib64' \
>>>   CPPFLAGS='-I/usr/local/cuda/include' \
>>>   --disable-f77 --disable-fc  \
>>>   --enable-g=dbg --disable-fast \
>>>   --enable-cuda --with-cuda=/usr/local/cuda \
>>>   --enable-threads=multiple
>>>
>>> I have tried adding the --with-device=ch3:nemesis option before, but in that
>>> case MPI did not work with CUDA correctly.
>>>
>>> Do you have any suggestions for making MPI work with both CUDA and
>>> TCP/IP?
>>>
>>>
>>>
>>>
>>>
>>> --
>>> Best regards,
>>> Wan Ji
>>>
>>>
>>> On Sat, Jun 21, 2014 at 10:39 PM, Jonathan Perkins <
>>> perkinjo at cse.ohio-state.edu> wrote:
>>>
>>>> Hello Wan Ji.  Do you have an HCA on each machine (192.168.1.1 and
>>>> 192.168.1.2)?  The error message indicates that each process encountered an
>>>> error opening the HCA.
>>>>
>>>> If you do not have an HCA on each machine, then you will need to
>>>> rebuild MVAPICH2 using one of the TCP/IP interfaces.  In this scenario
>>>> please see
>>>> http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.0-userguide.html#x1-170004.9
>>>> for more information.  Unfortunately, you will not be able to use our CUDA
>>>> optimizations with either of the TCP/IP interfaces.
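For contrast, a minimal sketch of what those CUDA optimizations allow when an HCA is available (hypothetical names; assumes a CUDA-aware build run with MV2_USE_CUDA=1): the device pointer is handed to MPI directly, with no host staging copy.

    /* Hypothetical direct-send helper: with CUDA-aware MVAPICH2 over InfiniBand,
     * MPI_Send can take the GPU pointer itself and the library moves the data. */
    #include <mpi.h>

    void send_device_buffer_direct(const float *d_buf, int count, int dest, MPI_Comm comm)
    {
        /* d_buf is a device pointer; no cudaMemcpy to host memory is required. */
        MPI_Send(d_buf, count, MPI_FLOAT, dest, 0, comm);
    }

Over a TCP/IP (Nemesis) channel this direct form is not available, so the data has to be staged through host memory first.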
>>>>
>>>> If you do have an HCA on each machine, then perhaps they are not in the
>>>> correct state.  You will need to check ``ibstat'' and make sure that the
>>>> "State" is "Active".  If it is not, you may need to consult your system
>>>> administrator to bring the InfiniBand network up to a running state.
>>>>
>>>> Please let us know if any of this information helps or if there is a
>>>> different issue than what I described above.
>>>>
>>>>
>>>> On Sat, Jun 21, 2014 at 5:36 AM, Ji Wan <wanjime at gmail.com> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I am currently trying to run MPI jobs on multiple nodes but
>>>>> encountered the following errors:
>>>>>
>>>>> [cli_0]: [cli_1]: aborting job:
>>>>> Fatal error in PMPI_Init_thread:
>>>>> Other MPI error, error stack:
>>>>> MPIR_Init_thread(483).......:
>>>>> MPID_Init(367)..............: channel initialization failed
>>>>> MPIDI_CH3_Init(362).........:
>>>>> MPIDI_CH3I_RDMA_init(170)...:
>>>>> rdma_setup_startup_ring(389): cannot open hca device
>>>>>
>>>>> aborting job:
>>>>> Fatal error in PMPI_Init_thread:
>>>>> Other MPI error, error stack:
>>>>> MPIR_Init_thread(483).......:
>>>>> MPID_Init(367)..............: channel initialization failed
>>>>> MPIDI_CH3_Init(362).........:
>>>>> MPIDI_CH3I_RDMA_init(170)...:
>>>>> rdma_setup_startup_ring(389): cannot open hca device
>>>>>
>>>>> [cli_2]: aborting job:
>>>>> Fatal error in PMPI_Init_thread:
>>>>> Other MPI error, error stack:
>>>>> MPIR_Init_thread(483).......:
>>>>> MPID_Init(367)..............: channel initialization failed
>>>>> MPIDI_CH3_Init(362).........:
>>>>> MPIDI_CH3I_RDMA_init(170)...:
>>>>> rdma_setup_startup_ring(389): cannot open hca device
>>>>>
>>>>> [cli_3]: aborting job:
>>>>> Fatal error in PMPI_Init_thread:
>>>>> Other MPI error, error stack:
>>>>> MPIR_Init_thread(483).......:
>>>>> MPID_Init(367)..............: channel initialization failed
>>>>> MPIDI_CH3_Init(362).........:
>>>>> MPIDI_CH3I_RDMA_init(170)...:
>>>>> rdma_setup_startup_ring(389): cannot open hca device
>>>>>
>>>>> This is the command I used to start the MPI job:
>>>>>
>>>>> MV2_ENABLE_AFFINITY=0 MV2_USE_CUDA=1 GLOG_logtostderr=1 mpirun_rsh
>>>>> -ssh -hostfile hosts -n 4 ./a.out xxx
>>>>>
>>>>> and this is the *hosts* file:
>>>>>
>>>>> 192.168.1.1:2
>>>>> 192.168.1.2:2
>>>>>
>>>>> The job was started on node 192.168.1.1, and I can connect to
>>>>> 192.168.1.2 via ssh without a password.
>>>>>
>>>>> Can anyone help me? Thanks!
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Best regards,
>>>>> Wan Ji
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Jonathan Perkins
>>>> http://www.cse.ohio-state.edu/~perkinjo
>>>>
>>>
>>>
>>
>>
>> --
>> Jonathan Perkins
>> http://www.cse.ohio-state.edu/~perkinjo
>>
>
>


-- 
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo

