[mvapich-discuss] message truncated

Dhabaleswar Panda panda at cse.ohio-state.edu
Wed Nov 26 16:55:02 EST 2008


Which version of Pallas are you running? As you may know, the Pallas
benchmarks are outdated; they have been replaced by the Intel MPI Benchmarks
(IMB), the latest version of which is 3.1. Can you try your tests with IMB 3.1?
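
For reference, a typical invocation of the MPI-1 part of the IMB suite for
the SendRecv benchmark discussed below looks like this (binary name as
produced by the IMB build; your mpiexec syntax may differ):

    mpiexec -n 4 ./IMB-MPI1 Sendrecv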

Thanks,

DK

On Tue, 25 Nov 2008, nilesh awate wrote:

Hi all,

I want to give the full details of this issue, as all of my trials so far have failed.

I am using RHEL5 on dual-core AMD Opteron machines, with mvapich2-1.2 (dapl interconnect, with and without RDMA_FAST_PATH) on a Mellanox network.

I am running Pallas (with check enabled) on the above setup.

I got the following error:

Fatal error in MPI_Recv:
Message truncated, error stack:
MPI_Recv(186)..........................: MPI_Recv(buf=0x7fff3072accc, count=896311571, MPI_INT, src=2, tag=1000, MPI_COMM_WORLD, status=0x7fff3072acb0) failed
MPIDI_CH3U_Post_data_receive_found(243): Message from rank 2 and tag 1000 truncated; 4 bytes received but buffer size is -709721012
rank 0 in job 5  test01_44984   caused collective abort of all ranks
  exit status of rank 0: killed by signal 9

The above error occurs in the SendRecv benchmark most of the time.

I ran the same thing with gen2 and it worked fine, but with the dapl interconnect it fails.

Waiting for your reply,
Nilesh

 Nilesh Awate
C-DAC R&D

________________________________
From: Justin <luitjens at cs.utah.edu>
To: nilesh awate <nilesh_awate at yahoo.com>
Cc: MVAPICH2 <mvapich-discuss at cse.ohio-state.edu>
Sent: Friday, 21 November, 2008 9:09:51 PM
Subject: Re: [mvapich-discuss] message truncated

One thing I have used in the past to track down bugs of this nature is the MPI_Errhandler functionality.
Try placing this in your code after MPI_Init:

MPI_Errhandler_set(MPI_COMM_WORLD,MPI_ERRORS_RETURN);
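
(A side note: MPI_Errhandler_set was deprecated in MPI-2 in favor of
MPI_Comm_set_errhandler, which takes the same arguments; either should work
with an MPI-2 era library such as mvapich2-1.2.)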


Then wrap each of your MPI_Recv calls in an if and add some debugging output:

// needs <unistd.h> (gethostname, getpid) and <iostream> (cout)
if(MPI_Recv(...) != MPI_SUCCESS)   // keep your original MPI_Recv arguments here
{
    char hostname[100];
    gethostname(hostname, sizeof(hostname));
    cout << "MPI_Recv returned error on " << hostname << ":" << getpid() << endl;
    cout << "Waiting for a debugger\n";
    while(1);   // spin so the process stays alive for a debugger to attach
}


From there you should be able to ssh into the node doing the processing (the hostname printed above) and attach gdb to the process (the pid printed above).  Make sure you have compiled with -g.  Then look at the parameters to MPI_Recv and see if anything looks wrong.
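
For example (hostname and pid here are hypothetical placeholders; substitute
the values from the debug output):

    $ ssh node05
    $ gdb -p 12345
    (gdb) bt             # the process will be spinning in the while(1) loop
    (gdb) info locals    # inspect the variables that were passed to MPI_Recv
    (gdb) print count    # assuming 'count' is the name of your count variable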

Good Luck,
Justin

nilesh awate wrote:
>
> Hi Justin,
>
> We are running Pallas over MPI (dapl interconnect); I got the same error while running Pallas over a TCP/IP (Ethernet) network.
>
> Fatal error in MPI_Recv:
> Message truncated, error stack:
> MPI_Recv(186)..........................: MPI_Recv(buf=0x7fff23cdd22c, count=976479459, MPI_INT, src=2, tag=1000, MPI_COMM_WORLD, status=0x7fff23cdd210) failed
> MPIDI_CH3U_Post_data_receive_found(163): Message from rank 2 and tag 1000 truncated; 4 bytes received but buffer size is -389049460
>
> I am running it on a 5-node AMD cluster (1 GHz dual-core AMD Opteron Processor 1216).
>
> I don't know how MPI_Recv got such a huge count, when Pallas sends at most 4194304 bytes.
>
> Is this some garbage value it receives?
>
> Waiting for your reply,
>
> Nilesh
>
> ------------------------------------------------------------------------
> *From:* Justin <luitjens at cs.utah.edu>
> *To:* nilesh awate <nilesh_awate at yahoo.com>
> *Cc:* Dhabaleswar Panda <panda at cse.ohio-state.edu>; MVAPICH2 <mvapich-discuss at cse.ohio-state.edu>
> *Sent:* Thursday, 20 November, 2008 9:27:42 PM
> *Subject:* Re: [mvapich-discuss] message truncated
>
> The message means MPI received a message larger than the buffer you specified.  In this case the buffer length is '-514665432', so any message at all would be bigger than it.  What I find odd is the parameters you are passing to MPI_Recv: a count of '945075466'.  Are you really posting a receive that is gigabytes in size?  It may be that the byte count is being computed as a signed 32-bit int, causing it to wrap to a negative number.  Check the size you are specifying for the buffer; it is odd that it is specified as gigabytes when you are only receiving 4 bytes.
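>
> A quick check with the numbers from your error stack supports the wraparound theory: the byte count for the receive is count * 4 (sizeof int), and in 32-bit arithmetic 945075466 * 4 = 3780301864, which wraps to exactly the -514665432 reported as the buffer size.  A minimal sketch of the effect (not MVAPICH2's actual code):
>
>     #include <stdio.h>
>
>     int main(void)
>     {
>         unsigned int count = 945075466u; /* count from the error stack above   */
>         unsigned int bytes = count * 4u; /* 3780301864, which exceeds 2^31 - 1 */
>         printf("%d\n", (int)bytes);      /* prints -514665432 on a two's-complement machine */
>         return 0;
>     }
>
> The other traces in this thread match too (e.g. 952788905 * 4 wraps to -483811676), so the count itself is garbage, and the negative buffer size is just that garbage multiplied by four and wrapped.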
> nilesh awate wrote:
> >
> > Thanks for the suggestion to use mvapich2-1.2, sir,
> >
> > I have tried it, but we are still facing the same problem:
> >
> > Fatal error in MPI_Recv:
> > Message truncated, error stack:
> > MPI_Recv(186)........................: MPI_Recv(buf=0x7fff1faf6008, count=945075466, MPI_INT, src=2, tag=1000, MPI_COMM_WORLD, status=0x7fff1faf5fe0) failed
> > MPIDI_CH3U_Request_unpack_uebuf(590): Message truncated; 4 bytes received but buffer size is -514665432
> > rank 0 in job 4  test01_52519  caused collective abort of all ranks
> > exit status of rank 0: killed by signal 9
> >
> > Is there any suggestion?
> >
> > What does this error mean?
> >
> > Is this the result of data corruption / packet loss, or something else?
> >
> > Waiting for your reply,
> > Nilesh Awate
> >
> >
> >
> > ------------------------------------------------------------------------
> > *From:* Dhabaleswar Panda <panda at cse.ohio-state.edu <mailto:panda at cse.ohio-state.edu>>
> > *To:* nilesh awate <nilesh_awate at yahoo.com <mailto:nilesh_awate at yahoo.com>>
> > *Cc:* MVAPICH2 <mvapich-discuss at cse.ohio-state.edu <mailto:mvapich-discuss at cse.ohio-state.edu>>
> > *Sent:* Wednesday, 19 November, 2008 9:27:36 PM
> > *Subject:* Re: [mvapich-discuss] message truncated
> >
> > MVAPICH2 1.2 was released about two weeks ago. Can you try the latest
> > version?
> >
> > DK
> >
> > On Wed, 19 Nov 2008, nilesh awate wrote:
> >
> > Hi all,
> > I am using mvapich2-1.0.3 with the dapl interconnect (it's a proprietary NIC & dapl library).
> > I got the following error while running Pallas on a 5-node (AMD dual-core) cluster.
> >
> > Fatal error in MPI_Recv:
> > Message truncated, error stack:
> > MPI_Recv(186)..........................: MPI_Recv(buf=0x7fff24744cec, count=952788905, MPI_INT, src=2, tag=1000, MPI_COMM_WORLD, status=0x7fff24744cd0) failed
> > MPIDI_CH3U_Post_data_receive_found(243): Message from rank 2 and tag 1000 truncated; 4 bytes received but buffer size is -483811676
> > rank 0 in job 2  test01_40634  caused collective abort of all ranks
> >  exit status of rank 0: killed by signal 9
> >
> >
> > Can you suggest where we should look to resolve the above error?
> > What can we interpret from the above message?
> >
> > Waiting for your reply,
> > Thanks,
> > Nilesh