[mvapich-discuss] Crash in MPICH2

Choudhury, Durga Durga.Choudhury at drs-ss.com
Tue Sep 12 11:12:06 EDT 2006


Hi all,

I am trying to parallelize a certain image manipulation program to run on a cluster of SMP machines. Each node has 4 CPUs, and at present I have two nodes.

Since the target hardware is expensive, I am trying this on two Red Hat Linux PCs. Each of the four cores is simulated by launching 4 pthreads on each node, and MPICH2 (over TCP/IP/Ethernet) is used to talk across nodes. I assume the MPI thread model I am using is MPI_THREAD_FUNNELED (since that is the default and I did not explicitly specify anything else). I think I have written my code to be thread-safe.
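
In case my thread-level assumption matters, here is a minimal, untested sketch (not what my program currently does) of how the funneled level could be requested explicitly with MPI_Init_thread, so the library reports what it actually provides:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided = MPI_THREAD_SINGLE;

    /* Ask for MPI_THREAD_FUNNELED explicitly: only the thread that
       initialized MPI (the parent thread) then makes MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    if (provided < MPI_THREAD_FUNNELED)
        fprintf(stderr, "MPI library only provides thread level %d\n", provided);

    /* ... launch the 4 worker pthreads here, keeping all MPI calls
       in this parent thread ... */

    MPI_Finalize();
    return 0;
}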

 

After the slave node is done calculating its part, it sends its half of the image over to the mpd master. Here is the code fragment:

if (isParentThread())
{
    if (my_rank)
    {
        /* Slave ranks: ship this node's portion of the image to rank 0. */
        int MPI_tag = 0;

        MPI_Send((void *)(&rl2.width), 1, MPI_INT, 0, MPI_tag++, MPI_COMM_WORLD);
        MPI_Send((void *)(&rl2.height), 1, MPI_INT, 0, MPI_tag++, MPI_COMM_WORLD);
        MPI_Send((void *)(rl2.file), rl2.width * rl2.height, MPI_FLOAT, 0, MPI_tag++, MPI_COMM_WORLD);

        MPI_Send((void *)(&cmpl2.width), 1, MPI_INT, 0, MPI_tag++, MPI_COMM_WORLD);
        MPI_Send((void *)(&cmpl2.height), 1, MPI_INT, 0, MPI_tag++, MPI_COMM_WORLD);
        MPI_Send((void *)(cmpl2.file), cmpl2.width * cmpl2.height, MPI_COMPLEX, 0, MPI_tag++, MPI_COMM_WORLD);
    }
    else
    {
        /* Rank 0 (the mpd master): collect each slave's strip and merge it in. */
        int ranks;

        for (ranks = 1; ranks < NMPI_COMM; ranks++)
        {
            int MPI_tag = 0;
            MPI_Status status;
            int width, height;

            MPI_Recv((void *)&width, 1, MPI_INT, ranks, MPI_tag++, MPI_COMM_WORLD, &status);
            MPI_Recv((void *)&height, 1, MPI_INT, ranks, MPI_tag++, MPI_COMM_WORLD, &status);

            void *rl2tmp = malloc(width * height * sizeof(float));
            MPI_Recv(rl2tmp, width * height, MPI_FLOAT, ranks, MPI_tag++, MPI_COMM_WORLD, &status);
            finsert(rl2.file, rl2.width, rl2.height, rl2tmp, width, height, 0, rl2.height * ranks / NMPI_COMM);
            free(rl2tmp);

            MPI_Recv((void *)&width, 1, MPI_INT, ranks, MPI_tag++, MPI_COMM_WORLD, &status);
            MPI_Recv((void *)&height, 1, MPI_INT, ranks, MPI_tag++, MPI_COMM_WORLD, &status);

            void *cmpl2tmp = malloc(width * height * sizeof(__complex__ float));
            printf("cmpl2tmp = 0x%x\n", cmpl2tmp);
            fflush(stdout);
            MPI_Recv(cmpl2tmp, width * height, MPI_COMPLEX, ranks, MPI_tag++, MPI_COMM_WORLD, &status);
            cinsert(cmpl2.file, cmpl2.width, cmpl2.height, rl2tmp, width, height, 0, cmpl2.height * ranks / NMPI_COMM);
            free(cmpl2tmp);
        }
    }

    printf("BIG MPI BARRIER STARTING ON %s\n", proc_name);
    MPI_Barrier(MPI_COMM_WORLD);
    printf("BIG MPI BARRIER ENDING ON %s\n", proc_name);
}

The line causing the crash is the last MPI_Send above (the one sending cmpl2.file as MPI_COMPLEX, tag 5). The failure looks like this:

aborting job:
Fatal error in MPI_Send: Other MPI error, error stack:
MPI_Send(177): MPI_Send(buf=0xb0451008, count=2162000, MPI_COMPLEX, dest=0, tag=5, MPI_COMM_WORLD) failed
MPIDI_CH3_Progress_wait(209): an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(489):
connection_recv_fail(1836):
MPIDU_Socki_handle_read(658): connection failure (set=0,sock=1,errno=104:Connection reset by peer)
rank 0 in job 17  Durga_32937   caused collective abort of all ranks
  exit status of rank 0: killed by signal 11

Notice that the previous call to MPI_Send(), with a similarly sized data block, was successful. The only difference between the call that succeeds and the one that crashes is the data type: MPI_FLOAT vs. MPI_COMPLEX. I am mapping a declaration of

__complex__ float foo;

to MPI_COMPLEX.
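
If that mapping turns out to be the problem, one alternative I could try (an untested sketch; as I understand it, MPI_COMPLEX is the Fortran COMPLEX datatype) is to describe the __complex__ float buffer to MPI as {real, imaginary} pairs of MPI_FLOAT through a derived datatype. The helper name send_complex_block() below is hypothetical, just to illustrate the idea:

#include <mpi.h>

/* Untested sketch: send a GCC __complex__ float buffer as {re, im}
   pairs of MPI_FLOAT instead of using MPI_COMPLEX. */
static void send_complex_block(__complex__ float *buf, int count, int dest, int tag)
{
    MPI_Datatype cplx_as_floats;

    MPI_Type_contiguous(2, MPI_FLOAT, &cplx_as_floats);   /* one {re, im} pair */
    MPI_Type_commit(&cplx_as_floats);

    MPI_Send((void *)buf, count, cplx_as_floats, dest, tag, MPI_COMM_WORLD);

    MPI_Type_free(&cplx_as_floats);
}

The matching MPI_Recv on rank 0 would use the same derived type (or, equivalently, count * 2 elements of MPI_FLOAT).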

 

So finally, my question obviously is: what am I doing wrong?

 

Thanks for reading through this rather lengthy mail, and thanks in advance for any help.

 

Durga

 
