[mvapich-discuss] MVAPICH2-GDR 2.0b : Crashing when sending 2 GB of data using MPI_Send.

sreeram potluri potluri.2 at osu.edu
Fri Feb 7 09:34:27 EST 2014


Hi Amirul,

Thanks for reporting this.

This is a known limitation in the current version of the MVAPICH2-GDR library:
it does not support transfers of 2 GB or larger. This will be fixed in a
future release.

Can you tell us a little more about the application use case that involves
2 GB data transfers?
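
Until that fix is available, one possible workaround is to split the transfer
into pieces that stay below the 2 GB limit. The sketch below is illustrative
and not specific to MVAPICH2-GDR; the chunk size and the helper names
send_in_chunks/recv_in_chunks are my own, not part of the library.

#include <mpi.h>
#include <stddef.h>

/* 128M unsigned ints = 512 MB per message, safely below the 2 GB limit. */
#define CHUNK_ELEMS (128UL * 1024 * 1024)

static void send_in_chunks(unsigned int *buf, size_t total_elems,
                           int dest, int tag, MPI_Comm comm)
{
    size_t offset = 0;
    while (offset < total_elems) {
        size_t n = total_elems - offset;
        if (n > CHUNK_ELEMS)
            n = CHUNK_ELEMS;
        MPI_Send(buf + offset, (int)n, MPI_UNSIGNED, dest, tag, comm);
        offset += n;
    }
}

static void recv_in_chunks(unsigned int *buf, size_t total_elems,
                           int src, int tag, MPI_Comm comm)
{
    size_t offset = 0;
    while (offset < total_elems) {
        size_t n = total_elems - offset;
        if (n > CHUNK_ELEMS)
            n = CHUNK_ELEMS;
        MPI_Recv(buf + offset, (int)n, MPI_UNSIGNED, src, tag, comm,
                 MPI_STATUS_IGNORE);
        offset += n;
    }
}

Both ranks must use the same chunk size; because the pieces are sent in order
on the same communicator and tag, MPI's message-ordering guarantee keeps them
matched on the receiving side.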

Regards
Sreeram Potluri




On Fri, Feb 7, 2014 at 3:16 AM, Mohamad Amirul Abdullah <
amirul.abdullah at mimos.my> wrote:

>  Hi,
>
> We are currently trying to set up a GPU cluster (we have 2 machines connected
> point-to-point with InfiniBand) and have installed MVAPICH2-GDR. We
> successfully tested the GPUDirect feature with small data (1 - 200 elements),
> but we hit a crash when we increase the array size to 2 GB of integers
> (536870912 elements). The symptom is that as soon as the master machine
> executes the MPI_Send, it shuts down without any warning or output, a total
> shutdown, as if somebody had pressed the power button.
>
> I have done some analysis of the problem, and it does not seem to be a
> GPUDirect issue, because the same thing happens when I MPI_Send 2 GB of data
> from CPU RAM to CPU RAM. However, it can send 1 GB of data without a problem.
> (RAM size is not an issue, because the two machines have 16 GB and 8 GB
> respectively.)
>
> I wonder whether the issue can be reproduced at your site. Even if it can't
> be, I don't know where to start debugging the problem; I am not sure whether
> it is a hardware problem, a driver problem, or an MPI library problem.
>
> Here are my machine settings (the same on both machines). The code is also
> shown below.
>
> *OS*
> CentOS 6.4, kernel version 2.6.32-358.el6.x86_64
>
> *InfiniBand*
> Model No.: CX354A
> Name: Mellanox ConnectX-3 FDR InfiniBand + 40GigE (MT27500 Family)
> Driver (1): MLNX_OFED_LINUX-2.1-1.0.0-rhel6.4-x86_64
>        (2): nvidia_peer_memory-1.0-0
>
> *GPU*
> Card: NVIDIA K20c
> Driver version: 331.20
> CUDA version: 5.5
>
> *MPI*
> MVAPICH2-GDR-2.0b
>
> *The Code*
>
> #include <mpi.h>
> #include <stdio.h>
> #include <stdlib.h>
>
> int main(int argc, char **argv)
> {
>   int myrank;
>   unsigned int *h_dev;             /* host buffer, despite the name */
>   unsigned int SIZE = 536870912;   /* 2 GB of unsigned int */
>
>   /* Initialize MPI */
>   MPI_Init(&argc, &argv);
>   printf("inited!\n");
>
>   /* Find out my identity in the default communicator */
>   MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
>   printf("check rank: %i\n", myrank);
>
>   h_dev = (unsigned int *)malloc(sizeof(unsigned int) * SIZE);
>
>   if (myrank == 0)
>   {
>     for (unsigned int i = 0; i < SIZE; i++)
>       h_dev[i] = 1;
>     printf("proc 0: h_dev[0]: %u\n", h_dev[0]);
>   }
>
>   if (myrank == 1)
>   {
>     for (unsigned int i = 0; i < SIZE; i++)
>       h_dev[i] = 2;
>     printf("proc 1: h_dev[0]: %u\n", h_dev[0]);
>   }
>
>   MPI_Barrier(MPI_COMM_WORLD);
>
>   if (myrank == 0)
>   {
>     printf("send before %u\n", h_dev[0]);
>     MPI_Send(h_dev, SIZE, MPI_UNSIGNED, 1, 0, MPI_COMM_WORLD);
>     printf("Finish send master\n");
>   }
>   else if (myrank == 1)
>   {
>     MPI_Recv(h_dev, SIZE, MPI_UNSIGNED, 0, 0, MPI_COMM_WORLD,
>              MPI_STATUS_IGNORE);
>     printf("Receive\n");
>     printf("after %u\n", h_dev[0]);
>   }
>
>   MPI_Barrier(MPI_COMM_WORLD);
>   free(h_dev);
>
>   /* Shut down MPI */
>   MPI_Finalize();
>   return 0;
> }
>
>
>
> Regards,
> -Amirul-
>