[mvapich-discuss] Running MPI jobs on multiple nodes

Ji Wan wanjime at gmail.com
Sat Jun 21 10:54:21 EDT 2014


Does this mean that I have the following two choices?

1. Using CUDA-aware MPI with an InfiniBand device
2. If using TCP/IP, copying the data to host memory before calling MPI_Send
   (as in the sketch below)
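
If I understand option 2 correctly, the sending side would look roughly like
the following sketch (the buffer names, the float type, and the omitted error
checks are only for illustration, not from any real code of mine):

/* Stage GPU data through host memory before MPI_Send, for a build
 * without CUDA-aware MPI. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdlib.h>

static void send_from_device(const float *d_buf, int count, int dest,
                             MPI_Comm comm)
{
    float *h_buf = (float *)malloc(count * sizeof(float));
    /* Copy device data to the host first; a non-CUDA-aware build cannot
     * accept device pointers in MPI calls. */
    cudaMemcpy(h_buf, d_buf, count * sizeof(float), cudaMemcpyDeviceToHost);
    MPI_Send(h_buf, count, MPI_FLOAT, dest, 0 /* tag */, comm);
    free(h_buf);
}

The receiving side would be the mirror image: MPI_Recv into a host buffer,
then cudaMemcpy with cudaMemcpyHostToDevice.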




--
Best regards,
Wan Ji


On Sat, Jun 21, 2014 at 10:50 PM, Jonathan Perkins <
perkinjo at cse.ohio-state.edu> wrote:

> I'm sorry, I do not have a good suggestion for you at this time.  At this
> point it appears that you'll need to use the Nemesis interface without
> CUDA-aware MPI.
>
>
> On Sat, Jun 21, 2014 at 10:43 AM, Ji Wan <wanjime at gmail.com> wrote:
>
>> Hello Jonathan,
>>
>> Thanks for your reply!
>>
>> I do not have an HCA on each machine, and this is my configuration for
>> building mvapich2:
>>
>> ./configure \
>>   LDFLAGS='-lstdc++ -L/usr/local/cuda/lib64' \
>>   CPPFLAGS='-I/usr/local/cuda/include' \
>>   --disable-f77 --disable-fc  \
>>   --enable-g=dbg --disable-fast \
>>   --enable-cuda --with-cuda=/usr/local/cuda \
>>   --enable-threads=multiple
>>
>> I have tried adding the --with-device=ch3:nemesis option before, but in
>> that case MPI could not work with CUDA correctly.
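>>
>> For reference, that attempt was essentially the configure line above with
>> the device option added (paths as in my setup):
>>
>> ./configure \
>>   LDFLAGS='-lstdc++ -L/usr/local/cuda/lib64' \
>>   CPPFLAGS='-I/usr/local/cuda/include' \
>>   --with-device=ch3:nemesis \
>>   --disable-f77 --disable-fc  \
>>   --enable-g=dbg --disable-fast \
>>   --enable-cuda --with-cuda=/usr/local/cuda \
>>   --enable-threads=multiple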
>>
>> Do you have any suggestion for making MPI work with both CUDA and
>> TCP/IP?
>>
>>
>>
>>
>>
>> --
>> Best regards,
>> Wan Ji
>>
>>
>> On Sat, Jun 21, 2014 at 10:39 PM, Jonathan Perkins <
>> perkinjo at cse.ohio-state.edu> wrote:
>>
>>> Hello Wan Ji.  Do you have an HCA on each machine (192.168.1.1 and
>>> 192.168.1.2)?  The error message indicates that each process encountered an
>>> error opening the HCA.
>>>
>>> If you do not have an HCA on each machine, then you will need to rebuild
>>> MVAPICH2 using one of the TCP/IP interfaces.  In this scenario please see
>>> http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.0-userguide.html#x1-170004.9
>>> for more information.  Unfortunately, you will not be able to use our CUDA
>>> optimizations with either of the TCP/IP interfaces.
>>>
>>> If you do have an HCA on each machine then perhaps they are not in the
>>> correct state.  You will need to check ``ibstat'' and make sure that the
>>> "State" is "Active".  In the event that it is not you may need to consult
>>> your System Admistrator to bring up the Infiniband network to a running
>>> state.
>>>
>>> Please let us know if any of this information helps or if there is a
>>> different issue than what I described above.
>>>
>>>
>>> On Sat, Jun 21, 2014 at 5:36 AM, Ji Wan <wanjime at gmail.com> wrote:
>>>
>>>> Hello,
>>>>
>>>> I am currently trying to run MPI jobs on multiple nodes but encountered
>>>> the following errors:
>>>>
>>>> [cli_0]: [cli_1]: aborting job:
>>>> Fatal error in PMPI_Init_thread:
>>>> Other MPI error, error stack:
>>>> MPIR_Init_thread(483).......:
>>>> MPID_Init(367)..............: channel initialization failed
>>>> MPIDI_CH3_Init(362).........:
>>>> MPIDI_CH3I_RDMA_init(170)...:
>>>> rdma_setup_startup_ring(389): cannot open hca device
>>>>
>>>> aborting job:
>>>> Fatal error in PMPI_Init_thread:
>>>> Other MPI error, error stack:
>>>> MPIR_Init_thread(483).......:
>>>> MPID_Init(367)..............: channel initialization failed
>>>> MPIDI_CH3_Init(362).........:
>>>> MPIDI_CH3I_RDMA_init(170)...:
>>>> rdma_setup_startup_ring(389): cannot open hca device
>>>>
>>>> [cli_2]: aborting job:
>>>> Fatal error in PMPI_Init_thread:
>>>> Other MPI error, error stack:
>>>> MPIR_Init_thread(483).......:
>>>> MPID_Init(367)..............: channel initialization failed
>>>> MPIDI_CH3_Init(362).........:
>>>> MPIDI_CH3I_RDMA_init(170)...:
>>>> rdma_setup_startup_ring(389): cannot open hca device
>>>>
>>>> [cli_3]: aborting job:
>>>> Fatal error in PMPI_Init_thread:
>>>> Other MPI error, error stack:
>>>> MPIR_Init_thread(483).......:
>>>> MPID_Init(367)..............: channel initialization failed
>>>> MPIDI_CH3_Init(362).........:
>>>> MPIDI_CH3I_RDMA_init(170)...:
>>>> rdma_setup_startup_ring(389): cannot open hca device
>>>>
>>>> This is the command I used to start the MPI job:
>>>>
>>>> MV2_ENABLE_AFFINITY=0 MV2_USE_CUDA=1 GLOG_logtostderr=1 mpirun_rsh -ssh
>>>> -hostfile hosts -n 4 ./a.out xxx
>>>>
>>>> and this is the *hosts* file:
>>>>
>>>> 192.168.1.1:2
>>>> 192.168.1.2:2
>>>>
>>>> The job was started on node 192.168.1.1, and I can connect to
>>>> 192.168.1.2 via ssh without a password.
>>>>
>>>> Can anyone help me? Thanks!
>>>>
>>>>
>>>>
>>>> --
>>>> Best regards,
>>>> Wan Ji
>>>>
>>>>
>>>
>>>
>>> --
>>> Jonathan Perkins
>>> http://www.cse.ohio-state.edu/~perkinjo
>>>
>>
>>
>
>
> --
> Jonathan Perkins
> http://www.cse.ohio-state.edu/~perkinjo
>