[mvapich-discuss] MVAPICH on multi-GPU causes Segmentation fault
khaled hamidouche
hamidouc at cse.ohio-state.edu
Thu May 28 10:08:47 EDT 2015
Hi,
We have not experimented with this approach. Please try it and let us know
whether it works or not.
Thanks
On Wed, May 27, 2015 at 10:49 PM, Yutian Li <lyt at megvii.com> wrote:
> Thanks.
>
> Can I use one GPU as my "proxy GPU", accessing the memory of all other
> GPUs through it via UVA? For example, when sending GPU 1's data out, I
> would first set the device to GPU 0, then pass the pointer to the memory
> on GPU 1 and call MPI_Send. Would that work?
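>
> Concretely, something like this is what I have in mind (just a sketch;
> it assumes cudaDeviceEnablePeerAccess succeeds between the two GPUs and
> that UVA lets GPU 0's context dereference the GPU 1 pointer):
>
>     cudaSetDevice(0);                  // GPU 0 is the "proxy" device
>     cudaDeviceEnablePeerAccess(1, 0);  // let GPU 0's context touch GPU 1
>
>     float* b;
>     cudaSetDevice(1);
>     cudaMalloc(&b, sizeof(float));     // buffer physically on GPU 1
>
>     cudaSetDevice(0);                  // switch back to the proxy context
>     MPI_Send(b, sizeof(float), MPI_CHAR, 1, 0, MPI_COMM_WORLD);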
>
> On Wed, May 27, 2015 at 10:00 PM, khaled hamidouche <
> hamidouc at cse.ohio-state.edu> wrote:
>
>> Hi Yutian,
>>
>> In MVAPICH2, an MPI process cannot hold multiple contexts (GPUs). So in
>> your case you need to use multiple processes, with each process using its
>> own GPU.
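>>
>> For example (a minimal sketch of the one-process-per-GPU pattern; the
>> rank-to-device mapping here is just an assumption about your setup):
>>
>>     int rank, ndev;
>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>     cudaGetDeviceCount(&ndev);
>>     cudaSetDevice(rank % ndev);  // each process binds to one GPU for the whole run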
>>
>> Thanks
>>
>> On Wed, May 27, 2015 at 1:18 AM, Yutian Li <lyt at megvii.com> wrote:
>>
>>> I'm using MVAPICH2 2.1 on a Debian 7 machine. It has multiple Tesla
>>> K40m cards, and I am running into a segmentation fault when I use more
>>> than one GPU.
>>> The code is as follows.
>>>
>>> #include <cstdio>
>>> #include <cstdlib>
>>> #include <ctime>
>>> #include <cuda_runtime.h>
>>> #include <mpi.h>
>>>
>>> int main(int argc, char** argv) {
>>>     MPI_Status status;
>>>     int rank;
>>>     MPI_Init(&argc, &argv);
>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>
>>>     // First transfer: both ranks use GPU 0.
>>>     cudaSetDevice(0);
>>>     if (rank == 0) {
>>>         srand(time(0));
>>>         float* a;
>>>         float num = rand();
>>>         cudaMalloc(&a, sizeof(float));
>>>         cudaMemcpy(a, &num, sizeof(float), cudaMemcpyDefault);
>>>         MPI_Send(a, sizeof(float), MPI_CHAR, 1, 0, MPI_COMM_WORLD);
>>>         printf("sent %f\n", num);
>>>     } else {
>>>         float* a;
>>>         float num;
>>>         cudaMalloc(&a, sizeof(float));
>>>         MPI_Recv(a, sizeof(float), MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
>>>         cudaMemcpy(&num, a, sizeof(float), cudaMemcpyDefault);
>>>         printf("received %f\n", num);
>>>     }
>>>
>>>     // Second transfer: switch both ranks to GPU 1.
>>>     cudaSetDevice(1);
>>>     if (rank == 0) {
>>>         float* a;
>>>         float num = rand();
>>>         cudaMalloc(&a, sizeof(float));
>>>         cudaMemcpy(a, &num, sizeof(float), cudaMemcpyDefault);
>>>         MPI_Send(a, sizeof(float), MPI_CHAR, 1, 0, MPI_COMM_WORLD);
>>>         printf("sent %f\n", num);
>>>     } else {
>>>         float* a;
>>>         float num;
>>>         cudaMalloc(&a, sizeof(float));
>>>         MPI_Recv(a, sizeof(float), MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
>>>         cudaMemcpy(&num, a, sizeof(float), cudaMemcpyDefault);
>>>         printf("received %f\n", num);
>>>     }
>>>
>>>     MPI_Finalize();
>>>     return 0;
>>> }
>>>
>>> In short, I first set the device to GPU 0 and send something. Then I
>>> set the device to GPU 1 and send something.
>>>
>>> The output is as follows.
>>>
>>> sent 1778786688.000000
>>> received 1778786688.000000
>>> [debian:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11)
>>> [debian:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 7. MPI process died?
>>> [debian:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI process died?
>>> [debian:mpispawn_0][child_handler] MPI process (rank: 0, pid: 30275) terminated with signal 11 -> abort job
>>> [debian:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node debian aborted: Error while reading a PMI socket (4)
>>>
>>> So the first send is OK. But as soon as I set my device to the other
>>> GPU and then do an MPI send, boom! I wonder why this is happening.
>>>
>>> Also, I built MVAPICH with the following command.
>>>
>>> ./configure --enable-cuda --with-cuda=/usr/local/cuda
>>> --with-device=ch3:mrail --enable-rdma-cm
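>>>
>>> For completeness, a launch line for a build like this typically looks
>>> something like the following (the host names are placeholders, and
>>> MV2_USE_CUDA=1 is the runtime flag MVAPICH2 needs to accept GPU buffers):
>>>
>>>     mpirun_rsh -np 2 node1 node2 MV2_USE_CUDA=1 ./bin/minimal.run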
>>>
>>> I have debugging enabled and the stack trace printed below. Hopefully this helps.
>>>
>>> sent 1377447040.000000
>>> received 1377447040.000000
>>> [debian:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11)
>>> [debian:mpi_rank_0][print_backtrace] 0: /home/lyt/local/lib/libmpi.so.12(print_backtrace+0x1c) [0x7fba26a00b3c]
>>> [debian:mpi_rank_0][print_backtrace] 1: /home/lyt/local/lib/libmpi.so.12(error_sighandler+0x59) [0x7fba26a00c39]
>>> [debian:mpi_rank_0][print_backtrace] 2: /lib/x86_64-linux-gnu/libpthread.so.0(+0xf8d0) [0x7fba23ffe8d0]
>>> [debian:mpi_rank_0][print_backtrace] 3: /usr/lib/libcuda.so.1(+0x21bb30) [0x7fba26fa9b30]
>>> [debian:mpi_rank_0][print_backtrace] 4: /usr/lib/libcuda.so.1(+0x1f6695) [0x7fba26f84695]
>>> [debian:mpi_rank_0][print_backtrace] 5: /usr/lib/libcuda.so.1(+0x205586) [0x7fba26f93586]
>>> [debian:mpi_rank_0][print_backtrace] 6: /usr/lib/libcuda.so.1(+0x17ad88) [0x7fba26f08d88]
>>> [debian:mpi_rank_0][print_backtrace] 7: /usr/lib/libcuda.so.1(cuStreamWaitEvent+0x63) [0x7fba26ed72e3]
>>> [debian:mpi_rank_0][print_backtrace] 8: /usr/local/cuda/lib64/libcudart.so.6.5(+0xa023) [0x7fba27cff023]
>>> [debian:mpi_rank_0][print_backtrace] 9: /usr/local/cuda/lib64/libcudart.so.6.5(cudaStreamWaitEvent+0x1ce) [0x7fba27d2cf3e]
>>> [debian:mpi_rank_0][print_backtrace] 10: /home/lyt/local/lib/libmpi.so.12(MPIDI_CH3_CUDAIPC_Rendezvous_push+0x17f) [0x7fba269f25bf]
>>> [debian:mpi_rank_0][print_backtrace] 11: /home/lyt/local/lib/libmpi.so.12(MPIDI_CH3_Rendezvous_push+0xe3) [0x7fba269a0233]
>>> [debian:mpi_rank_0][print_backtrace] 12: /home/lyt/local/lib/libmpi.so.12(MPIDI_CH3I_MRAILI_Process_rndv+0xa4) [0x7fba269a0334]
>>> [debian:mpi_rank_0][print_backtrace] 13: /home/lyt/local/lib/libmpi.so.12(MPIDI_CH3I_Progress+0x19a) [0x7fba2699aeaa]
>>> [debian:mpi_rank_0][print_backtrace] 14: /home/lyt/local/lib/libmpi.so.12(MPI_Send+0x6ef) [0x7fba268d118f]
>>> [debian:mpi_rank_0][print_backtrace] 15: ./bin/minimal.run() [0x400c15]
>>> [debian:mpi_rank_0][print_backtrace] 16: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7fba23c67b45]
>>> [debian:mpi_rank_0][print_backtrace] 17: ./bin/minimal.run() [0x400c5c]
>>> [debian:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 6. MPI process died?
>>> [debian:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI process died?
>>> [debian:mpispawn_0][child_handler] MPI process (rank: 0, pid: 355) terminated with signal 11 -> abort job
>>> [debian:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node debian8 aborted: Error while reading a PMI socket (4)
>>>
>>> PS: I also posted the same question on Stack Overflow (
>>> http://stackoverflow.com/questions/30455846/mvapich-on-multi-gpu-causes-segmentation-fault).
>>> Refer to that link if you want syntax highlighting.
>>>
>>> Thanks!
>>>
>>>