[mvapich-discuss] MVAPICH2-GDR 2.0b : Crashing when sending 2 GB of data using MPI_Send.

Mohamad Amirul Abdullah amirul.abdullah at mimos.my
Fri Feb 7 03:16:50 EST 2014


Hi,
We are currently setting up a GPU cluster (two machines connected point-to-point over InfiniBand) and have installed MVAPICH2-GDR. We successfully tested the GPUDirect feature with small arrays (1 to 200 elements), but we hit a crash when we increase the array size to 2 GB of integers (536870912 elements). The symptom is that as soon as the master machine executes the MPI_Send, it shuts down without any warning or message: a total shutdown, as if somebody had pressed the power button.
From some analysis, this does not appear to be a GPUDirect issue, because the same thing happens when I MPI_Send 2 GB of data from CPU RAM to CPU RAM. Sending 1 GB works without problems. (RAM size should not be the issue; the two machines have 16 GB and 8 GB respectively.)
I wonder whether the issue can be reproduced at your site. Even if it cannot, I am not sure where to start debugging: it could be a hardware problem, a driver problem, or a problem in the MPI library itself.
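If it helps narrow things down, a minimal probe along the lines below (the buffer sizes and the doubling schedule are arbitrary choices for illustration, not from my original test) could show at which message size the shutdown first appears; the last size printed before the machine goes down marks the threshold.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
  int rank;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  /* Exchange host-to-host messages of 64 MB, 128 MB, ... up to 2 GB. */
  for (size_t bytes = (size_t)64 << 20; bytes <= (size_t)2 << 30; bytes *= 2) {
    size_t count = bytes / sizeof(unsigned int);
    unsigned int *buf = (unsigned int *)malloc(bytes);
    if (buf == NULL) {
      fprintf(stderr, "rank %i: malloc(%zu) failed\n", rank, bytes);
      break;
    }
    memset(buf, 0, bytes);

    if (rank == 0)
      MPI_Send(buf, (int)count, MPI_UNSIGNED, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
      MPI_Recv(buf, (int)count, MPI_UNSIGNED, 0, 0, MPI_COMM_WORLD,
               MPI_STATUS_IGNORE);

    if (rank == 0)
      printf("%zu MB message completed\n", bytes >> 20);
    fflush(stdout);   /* make sure the line is out before a possible crash */

    free(buf);
    MPI_Barrier(MPI_COMM_WORLD);
  }

  MPI_Finalize();
  return 0;
}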
Here are the settings on both machines; the test code is shown below.
OS
CentOS 6.4, kernel 2.6.32-358.el6.x86_64

InfiniBand
Model No: CX354A
Name: Mellanox ConnectX-3 FDR InfiniBand + 40GigE (MT27500 family)
Drivers: (1) MLNX_OFED_LINUX-2.1-1.0.0-rhel6.4-x86_64
         (2) nvidia_peer_memory-1.0-0

GPU
Card: NVIDIA K20c
Driver version: 331.20
CUDA version: 5.5

MPI
MVAPICH2-GDR-2.0b
The Code
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
  int myrank;

  unsigned int* h_dev;            /* host buffer */
  unsigned int SIZE = 536870912;  /* 536870912 unsigned ints = 2 GiB */

  /* Initialize MPI */
  MPI_Init(&argc, &argv);
  printf("inited!");

 /* Find out my identity in the default communicator */
  MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
  printf("check rank :%i\n",myrank);

  h_dev = (unsigned int*)malloc(sizeof(unsigned int)*SIZE);
  if (h_dev == NULL) {
    fprintf(stderr, "rank %i: malloc of 2 GiB failed\n", myrank);
    MPI_Abort(MPI_COMM_WORLD, 1);
  }

  if(myrank==0)
   {
    for(unsigned int i=0;i<SIZE;i++)
       h_dev[i]=1;
     printf("proc 0: h_dev[0]: %u\n",h_dev[0]);
   }

   if(myrank==1)
   {
    for(unsigned int i=0;i<SIZE;i++)
       h_dev[i]=2;
     printf("proc 1: h_dev[0] : %u\n",h_dev[0]);
   }

   MPI_Barrier(MPI_COMM_WORLD);

   if(myrank==0)
   {
     printf("send before %u\n",h_dev[0]);
     MPI_Send(h_dev, SIZE, MPI_UNSIGNED , 1,0,MPI_COMM_WORLD);
     printf("Finish send master\n");
  }
  else if (myrank == 1)
  {
     MPI_Recv(h_dev, SIZE, MPI_UNSIGNED, 0, 0, MPI_COMM_WORLD,MPI_STATUS_IGNORE);
     printf("Receive\n");
     printf("after %i\n",h_dev[0]);
  }

  MPI_Barrier(MPI_COMM_WORLD);
  free(h_dev);

/* Shut down MPI */
  MPI_Finalize();
  return 0;
}
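
Since 1 GB transfers go through, one workaround I may try is to split the buffer into smaller pieces and send them back-to-back. A rough sketch is below; the 256 MB chunk size and the helper names are arbitrary choices for illustration, and this only sidesteps the symptom rather than explaining the shutdown.

#include <mpi.h>
#include <stddef.h>

/* 64 Mi unsigned ints = 256 MB per message; an arbitrary value for this sketch. */
#define CHUNK_ELEMS ((size_t)64 * 1024 * 1024)

static void send_chunked(unsigned int *buf, size_t total, int dest, MPI_Comm comm)
{
  for (size_t off = 0; off < total; off += CHUNK_ELEMS) {
    size_t n = (total - off < CHUNK_ELEMS) ? (total - off) : CHUNK_ELEMS;
    MPI_Send(buf + off, (int)n, MPI_UNSIGNED, dest, 0, comm);
  }
}

static void recv_chunked(unsigned int *buf, size_t total, int src, MPI_Comm comm)
{
  for (size_t off = 0; off < total; off += CHUNK_ELEMS) {
    size_t n = (total - off < CHUNK_ELEMS) ? (total - off) : CHUNK_ELEMS;
    MPI_Recv(buf + off, (int)n, MPI_UNSIGNED, src, 0, comm, MPI_STATUS_IGNORE);
  }
}

In the program above, the MPI_Send/MPI_Recv pair would then become send_chunked(h_dev, SIZE, 1, MPI_COMM_WORLD) and recv_chunked(h_dev, SIZE, 0, MPI_COMM_WORLD).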

Regards,
-Amirul-


