[mvapich-discuss] message truncated

Justin luitjens at cs.utah.edu
Fri Nov 21 10:39:51 EST 2008


One technique I have used in the past to track down bugs of this nature 
is the MPI_Errhandler functionality.

Try placing this in your code after MPI_Init:

 MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN);  /* return error codes instead of aborting */
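
For context, here is a minimal sketch of where that call sits in a program; 
the surrounding code is illustrative, not taken from your application:

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Ask MPI to return error codes instead of aborting the whole job,
       so a failing MPI_Recv can be caught and examined. */
    MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    /* ... application code, including the guarded MPI_Recv below ... */

    MPI_Finalize();
    return 0;
}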


Then wrap each of your MPI_Recv calls in an if and add some debugging output:

if (MPI_Recv(...) != MPI_SUCCESS)    /* requires MPI_ERRORS_RETURN, set above */
{
	char hostname[100];
	gethostname(hostname, sizeof(hostname));    /* from <unistd.h> */
	cout << "MPI_Recv returned an error on " << hostname << ":" << getpid() << endl;
	cout << "Waiting for a debugger\n";
	while (1);    /* spin so the process stays alive and gdb can attach */
}


From there you should be able to ssh into the compute node doing the 
processing (the hostname printed above) and attach gdb to the process 
(the pid printed above).  Make sure you have compiled with -g.  Then 
look at the arguments passed to MPI_Recv and see if something doesn't 
look right.
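
If attaching a debugger is inconvenient, another option is to log the 
arguments just before the receive.  This is only a sketch; the checked_recv 
wrapper below is illustrative, not something from your code:

#include <mpi.h>
#include <unistd.h>
#include <iostream>
using namespace std;

/* Illustrative wrapper: print the receive arguments on the receiving host
   before calling MPI_Recv, so a garbage count shows up in the output. */
int checked_recv(void *buf, int count, MPI_Datatype type,
                 int src, int tag, MPI_Comm comm, MPI_Status *status)
{
    char hostname[100];
    gethostname(hostname, sizeof(hostname));
    cout << hostname << ":" << getpid()
         << " MPI_Recv count=" << count
         << " src=" << src << " tag=" << tag << endl;
    return MPI_Recv(buf, count, type, src, tag, comm, status);
}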

Good Luck,
Justin

nilesh awate wrote:
>
> Hi Justin,
>
> We are running Pallas over MPI (DAPL interconnect); I got the same 
> error while running Pallas over a TCP/IP (Ethernet) network.
>
> Fatal error in MPI_Recv:
> Message truncated, error stack:
> MPI_Recv(186)..........................: MPI_Recv(buf=0x7fff23cdd22c, 
> count=976479459, MPI_INT, src=2, tag=1000,MPI_COMM_WORLD, 
> status=0x7fff23cdd210) failed
> MPIDI_CH3U_Post_data_receive_found(163): Message from rank 2 and tag 
> 1000 truncated; 4 bytes received but buffersize is -389049460
>
> I am running it on a 5-node AMD cluster with this configuration: 1 GHz 
> Dual-Core AMD Opteron Processor 1216.
>
> I don't know how MPI_Recv got such a huge count... when Pallas sends 
> at most 4194304 bytes.
>
> Is this some garbage value it is receiving?
>
> Waiting for your reply,
>
> Nilesh
>
> ------------------------------------------------------------------------
> *From:* Justin <luitjens at cs.utah.edu>
> *To:* nilesh awate <nilesh_awate at yahoo.com>
> *Cc:* Dhabaleswar Panda <panda at cse.ohio-state.edu>; MVAPICH2 
> <mvapich-discuss at cse.ohio-state.edu>
> *Sent:* Thursday, 20 November, 2008 9:27:42 PM
> *Subject:* Re: [mvapich-discuss] message truncated
>
> The message means MPI received a message larger than the buffer size 
> you specified.  In this case the buffer length is '-514665432', so any 
> message would be bigger than it.  What I find odd is the parameters you 
> are passing to MPI_Recv.  You are specifying a count of '945075466'; are 
> you really expecting a message that is gigabytes in size?  It might be 
> that the count, or the byte size derived from it, is being converted to 
> a signed int, causing it to wrap to a negative number (a small sketch of 
> this wrap-around appears at the end of this post).  Check the size you 
> are specifying for the buffer.  It is odd that you have it specified to 
> be gigabytes in size when you are only receiving a few bytes.
> nilesh awate wrote:
> >
> > Thanks for the suggestion (use mvapich2-1.2), sir,
> >
> > I have tried it, but we are still facing the same problem:
> >
> > Fatal error in MPI_Recv:
> > Message truncated, error stack:
> > MPI_Recv(186).......................: MPI_Recv(buf=0x7fff1faf6008, 
> count=945075466, MPI_INT, src=2, tag=1000, MPI_COMM_WORLD, 
> status=0x7fff1faf5fe0) failed
> > MPIDI_CH3U_Request_unpack_uebuf(590): Message truncated; 4 bytes 
> received but buffer size is -514665432
> > rank 0 in job 4  test01_52519  caused collective abort of all ranks
> > exit status of rank 0: killed by signal 9
> >
> > Is there any suggestion?
> >
> > What does this error mean?
> >
> > Is this a result of data corruption / packet loss, or something else?
> >
> > Waiting for your reply,
> > Nilesh Awate
> >
> >
> >
> > ------------------------------------------------------------------------
> > *From:* Dhabaleswar Panda <panda at cse.ohio-state.edu 
> <mailto:panda at cse.ohio-state.edu>>
> > *To:* nilesh awate <nilesh_awate at yahoo.com 
> <mailto:nilesh_awate at yahoo.com>>
> > *Cc:* MVAPICH2 <mvapich-discuss at cse.ohio-state.edu 
> <mailto:mvapich-discuss at cse.ohio-state.edu>>
> > *Sent:* Wednesday, 19 November, 2008 9:27:36 PM
> > *Subject:* Re: [mvapich-discuss] message truncated
> >
> > MVAPICH2 1.2 was released around two weeks ago. Can you try the latest
> > version?
> >
> > DK
> >
> > On Wed, 19 Nov 2008, nilesh awate wrote:
> >
> > > Hi all,
> > I am using mvapich2-1.0.3 with DAPL interconnect (it is a 
> proprietary NIC & DAPL library).
> > I got the following error while running Pallas on a 5-node 
> (AMD dual-core) cluster.
> >
> > Fatal error in MPI_Recv:
> > Message truncated, error stack:
> > MPI_Recv(186)..........................: 
> MPI_Recv(buf=0x7fff24744cec, count=952788905, MPI_INT, src=2, 
> tag=1000,MPI_COMM_WORLD, status=0x7fff24744cd0) failed
> > MPIDI_CH3U_Post_data_receive_found(243): Message from rank 2 and tag 
> 1000 truncated; 4 bytes received but buffersize is -483811676
> > rank 0 in job 2  test01_40634  caused collective abort of all ranks
> >  exit status of rank 0: killed by signal 9
> >
> >
> > Can you suggest where we should look to resolve the above error?
> > What can we interpret from the above message?
> >
> > Waiting for your reply,
> > thanks,
> > Nilesh
> >
> >
> >
> >
> > ------------------------------------------------------------------------
> >
> > _______________________________________________
> > mvapich-discuss mailing list
> > mvapich-discuss at cse.ohio-state.edu 
> <mailto:mvapich-discuss at cse.ohio-state.edu>
> > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> > 
>
>
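
For what it's worth, the numbers in the error stack are consistent with the 
wrap-around suggested in the quoted reply above: 945075466 MPI_INTs is 
3780301864 bytes, which does not fit in a 32-bit signed int.  A small 
standalone sketch (not from the original code):

#include <iostream>

int main()
{
    // Figures taken from the error message in this thread:
    // count = 945075466 MPI_INTs, i.e. 945075466 * 4 bytes.
    long long bytes = 945075466LL * 4;    // 3780301864 bytes, > INT_MAX

    // On a typical platform with 32-bit int, converting that byte count to
    // a signed int wraps it to the negative value reported in the error
    // stack ("buffer size is -514665432").
    int wrapped = static_cast<int>(bytes);

    std::cout << bytes << " bytes as a 32-bit int: " << wrapped << std::endl;
    return 0;
}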


