[mvapich-discuss] Crash in MPICH2

Sayantan Sur surs at cse.ohio-state.edu
Tue Sep 12 12:40:07 EDT 2006


Hello Durga,

Since your inquiry concerns MPICH2, and specifically its 
TCP/IP/Ethernet device, I think the MPICH2 group will be able to 
diagnose your problem faster. You could post your inquiry to 
mpich-discuss. The following link has instructions on how to get 
on that list.

http://www-unix.mcs.anl.gov/mpi/mpich2/maillist.htm

As an aside, you could try running your program with MPICH2-1.0.4p1, 
which is their latest release. MPI_THREAD_MULTIPLE is the default 
there, so you are no longer restricted to guarding MPI calls with 
isParentThread() conditions.

Thanks,
Sayantan.

Choudhury, Durga wrote:

> Hi all
>
> I am trying to parallelize a certain image manipulation program to run 
> on a cluster of SMP machines. Each node has 4 CPUs and at present I 
> have two nodes.
>
> Since the target hardware is expensive, I am trying this on two Red 
> Hat Linux PCs. Each of the four cores is simulated by launching 4 
> pthreads on each node, and MPICH2 (over TCP/IP/Ethernet) is used to 
> talk across nodes. I assume the MPI thread model I am using is 
> MPI_THREAD_FUNNELED (since that is the default and I did not 
> explicitly specify anything else). I think I have written my code to 
> be thread-safe.
>
> After the slave node is done calculating its part, it sends its half 
> of the image over to the mpd master. Here is the code fragment:
>
> if (isParentThread())
> {
>     if (my_rank)
>     {
>         int MPI_tag = 0;
>
>         MPI_Send((void *)(&rl2.width),  1, MPI_INT, 0, MPI_tag++, MPI_COMM_WORLD);
>         MPI_Send((void *)(&rl2.height), 1, MPI_INT, 0, MPI_tag++, MPI_COMM_WORLD);
>         MPI_Send((void *)(rl2.file), rl2.width*rl2.height, MPI_FLOAT, 0,
>                  MPI_tag++, MPI_COMM_WORLD);
>         MPI_Send((void *)(&cmpl2.width),  1, MPI_INT, 0, MPI_tag++, MPI_COMM_WORLD);
>         MPI_Send((void *)(&cmpl2.height), 1, MPI_INT, 0, MPI_tag++, MPI_COMM_WORLD);
>         *MPI_Send((void *)(cmpl2.file), cmpl2.width*cmpl2.height, MPI_COMPLEX,
>                  0, MPI_tag++, MPI_COMM_WORLD);*
>     }
>     else
>     {
>         int ranks;
>
>         for (ranks = 1; ranks < NMPI_COMM; ranks++)
>         {
>             int MPI_tag = 0;
>             MPI_Status status;
>             int width, height;
>
>             MPI_Recv((void *)&width,  1, MPI_INT, ranks, MPI_tag++, MPI_COMM_WORLD, &status);
>             MPI_Recv((void *)&height, 1, MPI_INT, ranks, MPI_tag++, MPI_COMM_WORLD, &status);
>             void *rl2tmp = malloc(width * height * sizeof(float));
>             MPI_Recv(rl2tmp, width * height, MPI_FLOAT, ranks, MPI_tag++,
>                      MPI_COMM_WORLD, &status);
>             finsert(rl2.file, rl2.width, rl2.height, rl2tmp, width, height, 0,
>                     rl2.height * ranks/NMPI_COMM);
>             free(rl2tmp);
>
>             MPI_Recv((void *)&width,  1, MPI_INT, ranks, MPI_tag++, MPI_COMM_WORLD, &status);
>             MPI_Recv((void *)&height, 1, MPI_INT, ranks, MPI_tag++, MPI_COMM_WORLD, &status);
>             void *cmpl2tmp = malloc(width * height * sizeof(__complex__ float));
>             printf("cmpl2tmp = 0x%x\n", cmpl2tmp);
>             fflush(stdout);
>             MPI_Recv(cmpl2tmp, width * height, MPI_COMPLEX, ranks, MPI_tag++,
>                      MPI_COMM_WORLD, &status);
>             cinsert(cmpl2.file, cmpl2.width, cmpl2.height, rl2tmp, width, height,
>                     0, cmpl2.height * ranks/NMPI_COMM);
>             free(cmpl2tmp);
>         }
>     }
>
>     printf("BIG MPI BARRIER STARTING ON %s\n", proc_name);
>     MPI_Barrier(MPI_COMM_WORLD);
>     printf("BIG MPI BARRIER ENDING ON %s\n", proc_name);
> }
>
> The line I have highlighted is the one causing a crash as follows:
>
> aborting job:
> Fatal error in MPI_Send: Other MPI error, error stack:
> *MPI_Send(177): MPI_Send(buf=0xb0451008, count=2162000, MPI_COMPLEX,
>     dest=0, tag=5, MPI_COMM_WORLD) failed*
> MPIDI_CH3_Progress_wait(209): an error occurred while handling an
>     event returned by MPIDU_Sock_Wait()
> MPIDI_CH3I_Progress_handle_sock_event(489):
> connection_recv_fail(1836):
> MPIDU_Socki_handle_read(658): connection failure
>     (set=0,sock=1,errno=104:Connection reset by peer)
> rank 0 in job 17 Durga_32937 caused collective abort of all ranks
> exit status of rank 0: killed by signal 11
>
> Notice that the previous call to MPI_Send(), with a similarly sized 
> data block, was successful. The only difference between the call that 
> succeeds and the one that crashes is the data type: MPI_FLOAT 
> vs. MPI_COMPLEX. I am mapping a declaration of
>
> __complex__ float foo;
>
> to MPI_COMPLEX.
>
> So finally, my question obviously is: what am I doing wrong?
>
> Thanks for reading through this rather lengthy mail and I thank you in 
> advance for any help.
>
> Durga
>


-- 
http://www.cse.ohio-state.edu/~surs


