[mvapich-discuss] Crash in MPICH2
Choudhury, Durga
Durga.Choudhury at drs-ss.com
Tue Sep 12 11:12:06 EDT 2006
Hi all
I am trying to parallelize an image manipulation program to run on a
cluster of SMP machines. Each node has 4 CPUs, and at present I have
two nodes.
Since the target hardware is expensive, I am prototyping on two Red Hat
Linux PCs. Each of the four CPUs is simulated by launching 4 pthreads
on each node, and MPICH2 (over TCP/IP/Ethernet) is used to talk across
nodes. I assume the MPI thread model I am using is MPI_THREAD_FUNNELED
(since that is the default and I did not explicitly request anything
else). I think I have written my code to be thread-safe.
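(For reference only, a minimal sketch of how I understand the thread level
could be requested and checked explicitly with MPI_Init_thread, rather than
relying on the default; this is just an illustration, not what my code
currently does:)

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int provided;

    /* Ask for FUNNELED: only the thread that called MPI_Init_thread
       (the parent thread) will make MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    if (provided < MPI_THREAD_FUNNELED)
        fprintf(stderr, "MPI provides only thread level %d\n", provided);

    /* ... launch the 4 worker pthreads and do the per-node work here ... */

    MPI_Finalize();
    return 0;
}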
After the slave node is done calculating its part, it sends its half
of the image over to the mpd master. Here is the code fragment:
if (isParentThread())
{
    if (my_rank)
    {
        /* Slave ranks: send our half of both images to rank 0. */
        int MPI_tag = 0;
        MPI_Send((void *)(&rl2.width),  1, MPI_INT, 0, MPI_tag++, MPI_COMM_WORLD);
        MPI_Send((void *)(&rl2.height), 1, MPI_INT, 0, MPI_tag++, MPI_COMM_WORLD);
        MPI_Send((void *)(rl2.file), rl2.width * rl2.height, MPI_FLOAT, 0,
                 MPI_tag++, MPI_COMM_WORLD);
        MPI_Send((void *)(&cmpl2.width),  1, MPI_INT, 0, MPI_tag++, MPI_COMM_WORLD);
        MPI_Send((void *)(&cmpl2.height), 1, MPI_INT, 0, MPI_tag++, MPI_COMM_WORLD);
        /* This is the send that crashes (see the error output below). */
        MPI_Send((void *)(cmpl2.file), cmpl2.width * cmpl2.height, MPI_COMPLEX, 0,
                 MPI_tag++, MPI_COMM_WORLD);
    }
    else
    {
        /* Master (rank 0): receive each slave's half and insert it
           into the full image. */
        int ranks;
        for (ranks = 1; ranks < NMPI_COMM; ranks++)
        {
            int MPI_tag = 0;
            MPI_Status status;
            int width, height;

            MPI_Recv((void *)&width,  1, MPI_INT, ranks, MPI_tag++,
                     MPI_COMM_WORLD, &status);
            MPI_Recv((void *)&height, 1, MPI_INT, ranks, MPI_tag++,
                     MPI_COMM_WORLD, &status);
            void *rl2tmp = malloc(width * height * sizeof(float));
            MPI_Recv(rl2tmp, width * height, MPI_FLOAT, ranks, MPI_tag++,
                     MPI_COMM_WORLD, &status);
            finsert(rl2.file, rl2.width, rl2.height, rl2tmp, width,
                    height, 0, rl2.height * ranks / NMPI_COMM);
            free(rl2tmp);

            MPI_Recv((void *)&width,  1, MPI_INT, ranks, MPI_tag++,
                     MPI_COMM_WORLD, &status);
            MPI_Recv((void *)&height, 1, MPI_INT, ranks, MPI_tag++,
                     MPI_COMM_WORLD, &status);
            void *cmpl2tmp = malloc(width * height * sizeof(__complex__ float));
            printf("cmpl2tmp = %p\n", cmpl2tmp);
            fflush(stdout);
            MPI_Recv(cmpl2tmp, width * height, MPI_COMPLEX, ranks, MPI_tag++,
                     MPI_COMM_WORLD, &status);
            cinsert(cmpl2.file, cmpl2.width, cmpl2.height, rl2tmp,
                    width, height, 0, cmpl2.height * ranks / NMPI_COMM);
            free(cmpl2tmp);
        }
    }
    printf("BIG MPI BARRIER STARTING ON %s\n", proc_name);
    MPI_Barrier(MPI_COMM_WORLD);
    printf("BIG MPI BARRIER ENDING ON %s\n", proc_name);
}
The line I have marked in the code above (the final MPI_Send, which sends
cmpl2.file with the MPI_COMPLEX datatype) is the one causing a crash, as follows:
aborting job:
Fatal error in MPI_Send: Other MPI error, error stack:
MPI_Send(177): MPI_Send(buf=0xb0451008, count=2162000, MPI_COMPLEX,
dest=0, tag=5, MPI_COMM_WORLD) failed
MPIDI_CH3_Progress_wait(209): an error occurred while handling an event
returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(489):
connection_recv_fail(1836):
MPIDU_Socki_handle_read(658): connection failure
(set=0,sock=1,errno=104:Connection reset by peer)
rank 0 in job 17 Durga_32937 caused collective abort of all ranks
exit status of rank 0: killed by signal 11
Notice that the previous call to MPI_Send(), with a similarly sized data
block, was successful. The only difference between the call that succeeds
and the one that crashes is the datatype: MPI_FLOAT vs. MPI_COMPLEX. I am
mapping a declaration of
__complex__ float foo;
to MPI_COMPLEX.
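(For illustration only, a minimal sketch of what I mean by this mapping.
The contiguous-pair-of-MPI_FLOATs type at the end is just shown for
comparison; it is not what my code above does:)

#include <mpi.h>

int main(int argc, char *argv[])
{
    /* The declaration I am mapping to MPI_COMPLEX (GCC's complex
       extension, i.e. a pair of floats): */
    __complex__ float foo = 0;

    MPI_Init(&argc, &argv);

    /* What my code above effectively does for a single element:
       MPI_Send(&foo, 1, MPI_COMPLEX, 0, 0, MPI_COMM_WORLD); */

    /* For comparison, the same value described explicitly as a
       contiguous pair of MPI_FLOATs: */
    MPI_Datatype complex_float_type;
    MPI_Type_contiguous(2, MPI_FLOAT, &complex_float_type);
    MPI_Type_commit(&complex_float_type);
    /* MPI_Send(&foo, 1, complex_float_type, 0, 0, MPI_COMM_WORLD); */
    MPI_Type_free(&complex_float_type);

    MPI_Finalize();
    return 0;
}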
So finally, my question obviously is: what am I doing wrong?
Thanks for reading through this rather lengthy mail, and thank you in
advance for any help.
Durga