[mvapich-discuss] MVAPICH2-GDR 2.0b : Crashing when sending 2 GB of data using MPI_Send.
Mohamad Amirul Abdullah
amirul.abdullah at mimos.my
Fri Feb 7 03:16:50 EST 2014
Hi,
We are currently setting up a GPU cluster (two machines connected point-to-point with InfiniBand) and have installed MVAPICH2-GDR. We successfully tested the GPUDirect feature with small arrays (1 to 200 elements), but we hit a crash when we increase the array size to 2 GB of integers (536870912 elements). The symptom: whenever the master machine executes the MPI_Send, it suddenly SHUTS DOWN without any warning or message. It is a TOTAL SHUTDOWN, as if somebody had pressed the power button.
I have done some analysis, and it does not seem to be a GPUDirect issue, because the same thing happens when I MPI_Send 2 GB of data from CPU RAM to CPU RAM. However, 1 GB of data sends without any problem. (RAM size is not the issue, since the two machines have 16 GB and 8 GB respectively.)
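As a quick experiment while debugging, splitting the transfer into sub-1 GB pieces should confirm whether the total per-call message size alone triggers the crash. A minimal sketch is below; the CHUNK size, the helper names, and the loop structure are my own choices for this test, not anything required or recommended by MVAPICH:

```c
#include <mpi.h>
#include <stddef.h>

/* Arbitrary experiment value: 256M unsigned ints = 1 GB per MPI call,
   which is the size the original single-call test handled fine. */
#define CHUNK 268435456u

/* Send `total` unsigned ints in pieces of at most CHUNK elements each. */
static void send_chunked(const unsigned int *buf, size_t total, int dest)
{
    for (size_t off = 0; off < total; off += CHUNK) {
        size_t n = total - off < CHUNK ? total - off : CHUNK;
        MPI_Send((void *)(buf + off), (int)n, MPI_UNSIGNED,
                 dest, 0, MPI_COMM_WORLD);
    }
}

/* Matching receive loop: same chunking, same tag. */
static void recv_chunked(unsigned int *buf, size_t total, int src)
{
    for (size_t off = 0; off < total; off += CHUNK) {
        size_t n = total - off < CHUNK ? total - off : CHUNK;
        MPI_Recv(buf + off, (int)n, MPI_UNSIGNED,
                 src, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
}
```

If the chunked version completes while the single 2 GB MPI_Send still powers the node off, that would point at the large single transfer specifically; a hard power-off with no kernel message usually suggests hardware (PSU, thermal, or firmware) rather than the MPI library, though I cannot be sure from here.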
I wonder whether the issue can be reproduced at your site. Even if it can't be reproduced, I don't know where to start debugging; I am not sure whether it is a hardware problem, a driver problem, or an MPI library problem.
Here are the settings, which are the same on both machines. The code is also shown below.
OS
CentOS 6.4, kernel 2.6.32-358.el6.x86_64
InfiniBand
Model No.: CX354A
Name: Mellanox ConnectX-3 FDR InfiniBand + 40GigE (MT27500 family)
Driver (1): MLNX_OFED_LINUX-2.1-1.0.0-rhel6.4-x86_64
(2): nvidia_peer_memory-1.0-0
GPU
Card: NVIDIA K20c
Driver version: 331.20
CUDA version: 5.5
MPI
MVAPICH2-GDR 2.0b
The Code
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int myrank;
    unsigned int *h_dev;
    unsigned int SIZE = 536870912;  /* 536870912 x 4 bytes = 2 GB */

    /* Initialize MPI */
    MPI_Init(&argc, &argv);
    printf("inited!\n");

    /* Find out my identity in the default communicator */
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    printf("check rank: %i\n", myrank);

    h_dev = (unsigned int *)malloc(sizeof(unsigned int) * SIZE);
    if (h_dev == NULL) {
        fprintf(stderr, "malloc failed\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    if (myrank == 0) {
        for (unsigned int i = 0; i < SIZE; i++)
            h_dev[i] = 1;
        printf("proc 0: h_dev[0]: %u\n", h_dev[0]);
    }
    if (myrank == 1) {
        for (unsigned int i = 0; i < SIZE; i++)
            h_dev[i] = 2;
        printf("proc 1: h_dev[0]: %u\n", h_dev[0]);
    }

    MPI_Barrier(MPI_COMM_WORLD);

    if (myrank == 0) {
        printf("send before %u\n", h_dev[0]);
        MPI_Send(h_dev, SIZE, MPI_UNSIGNED, 1, 0, MPI_COMM_WORLD);
        printf("Finish send master\n");
    } else if (myrank == 1) {
        MPI_Recv(h_dev, SIZE, MPI_UNSIGNED, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("Receive\n");
        printf("after %u\n", h_dev[0]);
    }

    MPI_Barrier(MPI_COMM_WORLD);
    free(h_dev);

    /* Shut down MPI */
    MPI_Finalize();
    return 0;
}
Regards,
-Amirul-