[EXTERNAL] Re: [mvapich-discuss] MVA1.9a --enable-cuda with ch3:socks compile errors

Christian Trott crtrott at sandia.gov
Thu Oct 25 16:30:26 EDT 2012


Hi all

that actually helped, so thanks for this. Now I have run into another 
problem: while some simple tests (including osu_bw.c and some of my own 
code) work, my actual code fails at MPI_Init with this message:

[perseus.sandia.gov:mpi_rank_0][cuda_stage_free] cudaMemcpy failed with 
11 at 1564

I found a post that describes how to work around this by setting 
MV2_CUDA_USE_NAIVE=0 (which, by the way, is not documented in the 
user guide). I assume this falls back to some basic communication 
scheme, which is not what I eventually want. But OK - then I hit yet 
another problem: MPI_Init always attaches my processes to GPU 0, no 
matter whether I set the device to something else via cudaSetDevice 
before calling MPI_Init. Even adding a cudaThreadSynchronize before 
MPI_Init doesn't change that. For example, when running this way with 
two processes and telling them to use GPU 0 and GPU 1 respectively, 
nvidia-smi reports three contexts: two on GPU 0 and one on GPU 1. So I 
assume rank 1 gets two contexts, one chosen by me and one chosen by 
MPI (which defaults to GPU 0). But when MPI and I choose different 
GPUs, the code crashes. It does so even when running just one process 
(in which case no actual communication using GPU buffers takes place); 
a stripped-down sketch of what I am doing follows below the crash output.

terminate called after throwing an instance of 'std::runtime_error'
   what():  cudaDeviceSynchronize() error: unspecified launch failure
Traceback functionality not available

[perseus.sandia.gov:mpi_rank_1][error_sighandler] Caught error: Aborted 
(signal 6)

=====================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 134
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
=====================================================================================
APPLICATION TERMINATED WITH THE EXIT STRING: Aborted (signal 6)
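
For reference, here is a stripped-down sketch of the device-selection 
pattern I am describing (GPU chosen via cudaSetDevice before MPI_Init). 
The MV2_COMM_WORLD_LOCAL_RANK lookup is just a stand-in I use here to 
pick a per-process device; my real code chooses the device differently, 
but the ordering relative to MPI_Init is the same:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    /* Pick a per-process GPU; in this sketch simply from the local rank
     * the launcher exports (assumption: MV2_COMM_WORLD_LOCAL_RANK). */
    const char *lrank = getenv("MV2_COMM_WORLD_LOCAL_RANK");
    int dev = lrank ? atoi(lrank) : 0;

    /* Bind this process to its GPU *before* MPI_Init. */
    if (cudaSetDevice(dev) != cudaSuccess) {
        fprintf(stderr, "cudaSetDevice(%d) failed\n", dev);
        return 1;
    }
    cudaDeviceSynchronize();  /* same idea as the cudaThreadSynchronize I tried */

    MPI_Init(&argc, &argv);

    /* ... kernels and (eventually) MPI calls on GPU buffers go here ... */

    MPI_Finalize();
    return 0;
}

Run with two processes, this is what produces the three contexts 
mentioned above; I also export MV2_CUDA_USE_NAIVE=0 in the environment 
for the workaround (assuming the launcher forwards it).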


Any ideas what's going on? I can provide the whole code if you are
interested.
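
In case it matters, the build I am testing now was configured roughly 
along these lines, i.e. the default OFA-IB-CH3 device plus the two 
flags Sreeram suggested below (prefix and compiler settings omitted, 
and the CUDA path is just a placeholder for wherever 5.0 is installed):

./configure --enable-cuda --with-cuda=/path/to/cuda-5.0 \
            --disable-rdmacm --disable-mcast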

Thanks
Christian

On 10/24/2012 09:06 AM, sreeram potluri wrote:
> You should be able to use MVAPICH2 within a node as long as libibverbs is
> available. I am assuming you are using 1.9a for this test too.
>
> Can you try using these options when you configure: --disable-rdmacm
> --disable-mcast
>
> If that does not work, can you give us more details on the issue you are
> facing?
>
> The designs for internode GPU communication in MVAPICH2 take advantage of
> features offered by InfiniBand. There are no plans to move these to
> CH3:sock at this point.
>
> Sreeram Potluri
>
>> On Wed, Oct 24, 2012 at 10:50 AM, Christian Trott <crtrott at sandia.gov> wrote:
>
>> Thanks
>>
>> that's what I thought. Do you know if I can compile for that interface on
>> my local workstation, which does not have InfiniBand? And if yes, do you
>> have a link to a list of the packages I need to install (just adding
>> libibverbs via yum didn't seem to be sufficient)? Also, is support for
>> GPU-to-GPU transfer planned for the CH3:sock interface?
>> I am currently in the process of deciding whether or not to rely on direct
>> CUDA support within MPI for a number of projects (currently just evaluation,
>> but potentially that would include Trilinos and LAMMPS from Sandia) instead
>> of writing my own data-shuffling code. My current status is that we have
>> support on InfiniBand clusters from both MVAPICH2 and OpenMPI, Cray seems
>> to have something coming out soon for their network, and OpenMPI seems to
>> work on my local machine as well.
>>
>> Cheers
>> Christian
>>
>>
>>
>> On 10/24/2012 08:40 AM, sreeram potluri wrote:
>>
>>> Hi Christian
>>>
>>> GPU support is only available with the InfiniBand Gen2 (OFA-IB-CH3)
>>> Interface.
>>>
>>> Please refer to these sections of our user guide on how to build and run:
>>>
>>> http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.9a.html#x1-140004.5
>>> http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.9a.html#x1-780006.18
>>>
>>> Best
>>> Sreeram
>>>
>>> On Wed, Oct 24, 2012 at 10:19 AM, Christian Trott <crtrott at sandia.gov> wrote:
>>>
>>>> Hi all
>>>>
>>>> is it possible to use the GPU support with the CH3:sock interface? When I
>>>> try to compile the 1.9a release with
>>>>
>>>> ./configure --enable-cuda --with-cuda=/opt/nvidia/cuda/5.0.36/
>>>> --with-device=ch3:sock --prefix=/opt/mpi/mvapich2-1.9/intel-12.1/cuda5036
>>>> CC=/opt/intel/composer_xe_2011_sp1.9.293/bin/intel64/icc
>>>>
>>>>
>>>> I run into these errors:
>>>>
>>>>   CC              ch3_isend.c
>>>> ch3_isend.c(20): error: a value of type "MPIDI_CH3_PktGeneric_t" cannot
>>>> be assigned to an entity of type "void *"
>>>>        sreq->dev.pending_pkt = *(MPIDI_CH3_PktGeneric_t *) hdr;
>>>>
>>>>   CC              ch3_isendv.c
>>>> ch3_isendv.c(28): error: a value of type "MPIDI_CH3_PktGeneric_t" cannot
>>>> be assigned to an entity of type "void *"
>>>>        sreq->dev.pending_pkt = *(MPIDI_CH3_PktGeneric_t *) iov[0].MPID_IOV_BUF;
>>>>
>>>> Thanks for your help
>>>> Christian
>>>>
>>>> _______________________________________________
>>>> mvapich-discuss mailing list
>>>> mvapich-discuss at cse.ohio-state.edu
>>>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>>>
>>>>
>>>>
>>



