[mvapich-discuss] messege truncated
Matthew Koop
koop at cse.ohio-state.edu
Tue Dec 2 14:57:19 EST 2008
Nilesh,
Using the RDMA fast path requires that your network adapter place data into the
destination buffer such that the last byte is written last.
If this guarantee does not hold, there will be corruption. Mellanox
provides this guarantee for their InfiniBand adapters -- if your hardware
does not (or you are not sure), the RDMA fast path should be turned
off for your system.
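If you want to rule it out quickly, you can try disabling it at run time. A rough sketch, assuming the MPD-based mpiexec shipped with MVAPICH2 1.2 (please confirm the exact parameter name for your interface in the MVAPICH2 user guide; some older builds control this through the RDMA_FAST_PATH compile-time flag instead):

  mpiexec -n 2 -env MV2_USE_RDMA_FAST_PATH 0 ./IMB-MPI1 SendRecv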
Thanks,
Matt
On Mon, 1 Dec 2008, nilesh awate wrote:
>
Thanks for the suggestion, and sorry for the late reply.
Initially we were using Pallas V2.2. Now, with IMB 3.2 over our proprietary
network (NIC & DAPL), we tried mvapich2-1.2 without RDMA_FAST_PATH
(send-recv path only), and it worked fine for a long-duration run. With the
RDMA path, however, it fails; the error file is attached.
As a cross-check we ran the same thing over a Mellanox network (DAPL), and it works fine there.
What can we deduce from the above error?
________________________________
From: Dhabaleswar Panda <panda at cse.ohio-state.edu>
To: nilesh awate <nilesh_awate at yahoo.com>
Cc: MVAPICH2 <mvapich-discuss at cse.ohio-state.edu>; pmb at pallas.com
Sent: Thursday, 27 November, 2008 3:25:02 AM
Subject: Re: [mvapich-discuss] messege truncated
Which version of Pallas are you running? As you may know, the Pallas
benchmarks are outdated. They have been replaced by the Intel MPI Benchmarks
(IMB); the latest version is 3.1. Can you try your tests with IMB 3.1?
Thanks,
DK
On Tue, 25 Nov 2008, nilesh awate wrote:
>
Hi all,
Let me add some detail to this discussion, since my trials are failing even on standard hardware.
I am using RHEL5 on dual-core AMD Opteron nodes, with mvapich2-1.2 (DAPL interconnect; with and without RDMA_FAST_PATH) over a Mellanox network.
I am running Pallas (with check) on the above setup and got the following error:
Fatal error in MPI_Recv:
Message truncated, error stack:
MPI_Recv(186)...........................: MPI_Recv(buf=0x7fff3072accc, count=896311571, MPI_INT, src=2, tag=1000, MPI_COMM_WORLD, status=0x7fff3072acb0) failed
MPIDI_CH3U_Post_data_receive_found(243): Message from rank 2 and tag 1000 truncated; 4 bytes received but buffer size is -709721012
rank 0 in job 5 test01_44984 caused collective abort of all ranks
exit status of rank 0: killed by signal 9
The above error occurs in the SendRecv benchmark most of the time.
I ran the same thing with gen2 and it worked fine . . .
but with the DAPL interconnect it fails.
Waiting for a reply,
Nilesh
Nilesh Awate
C-DAC R&D
________________________________
From: Justin <luitjens at cs.utah.edu>
To: nilesh awate <nilesh_awate at yahoo.com>
Cc: MVAPICH2 <mvapich-discuss at cse.ohio-state.edu>
Sent: Friday, 21 November, 2008 9:09:51 PM
Subject: Re: [mvapich-discuss] messege truncated
One thing that I have used to track down bugs of this nature in the past is to use the MPI_Errhandler functionality.
Try placing this in your code after MPI_Init:
MPI_Errhandler_set(MPI_COMM_WORLD,MPI_ERRORS_RETURN);
Then wrap each MPI_Recv call in an if and add some debugging output:
if (MPI_Recv(...) != MPI_SUCCESS)   // keep your original MPI_Recv arguments here
{
    char hostname[100];
    gethostname(hostname, 100);     // from <unistd.h>
    cout << "MPI Recv returned error on " << hostname << ":" << getpid() << endl;
    cout << "Waiting for a debugger\n";
    while (1);                      // spin so you can attach a debugger to this process
}
Then from there you should be able to ssh into the back-end node doing the processing (given by the hostname above) and attach gdb to the process (given by the pid above). Make sure you have compiled with -g. Then look at the parameters to MPI_Recv and see if anything looks wrong.
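For example (a sketch, with <hostname> and <pid> taken from the debugging output above): ssh to <hostname>, run 'gdb -p <pid>' (or 'gdb /path/to/your/binary <pid>'), use 'bt' to get a backtrace, select the frame that calls MPI_Recv, and 'print' the count and buffer arguments to see whether they are already corrupted before the call is made.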
Good Luck,
Justin
nilesh awate wrote:
>
> Hi Justine,
>
> We are running Pallas over MPI (DAPL interconnect); I got the same error while running Pallas over a TCP/IP (Ethernet) network.
>
> Fatal error in MPI_Recv:
> Message truncated, error stack:
> MPI_Recv(186)...........................: MPI_Recv(buf=0x7fff23cdd22c, count=976479459, MPI_INT, src=2, tag=1000,MPI_COMM_WORLD, status=0x7fff23cdd210) failed
> MPIDI_CH3U_Post_data_receive_found(163): Message from rank 2 and tag 1000 truncated; 4 bytes received but buffersize is -389049460
>
> I am running it on a 5-node AMD cluster (1 GHz Dual-Core AMD Opteron Processor 1216).
>
> I don't know how MPI_Recv got such a huge count . . . when Pallas sends at most 4194304 bytes.
>
> Is this some garbage value it receives?
>
> Waiting for a reply,
>
> Nilesh
>
> ------------------------------------------------------------------------
> *From:* Justin <luitjens at cs.utah.edu>
> *To:* nilesh awate <nilesh_awate at yahoo.com>
> *Cc:* Dhabaleswar Panda <panda at cse.ohio-state.edu>; MVAPICH2 <mvapich-discuss at cse.ohio-state.edu>
> *Sent:* Thursday, 20 November, 2008 9:27:42 PM
> *Subject:* Re: [mvapich-discuss] messege truncated
>
> The message means MPI received a message larger than the buffer size you specified. Namely, in this case the buffer length is '-514665432', so any message at all would be bigger than it. What I find odd is the parameters you are passing to MPI_Recv: you are specifying a count of '945075466'. Are you really receiving a message that is gigabytes in size? It may be that the byte count is being converted to a signed int, causing it to wrap to a negative number. Check the size you are specifying for the buffer; it is odd to request a multi-gigabyte buffer when you are only receiving 4 bytes.
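> A quick way to check the wrap-around (a sketch, assuming the usual 32-bit int and two's-complement behaviour): 945075466 elements * 4 bytes per int = 3780301864 bytes, which wraps to -514665432 -- exactly the "buffer size" reported in your error stack. For example:
>
>   #include <cstdint>
>   #include <iostream>
>
>   int main()
>   {
>       int32_t count = 945075466;           // count from the MPI_Recv error
>       int64_t bytes = int64_t(count) * 4;  // 3780301864 bytes requested
>       int32_t wrapped = int32_t(bytes);    // wraps to -514665432 on two's-complement systems
>       std::cout << wrapped << std::endl;   // matches the reported buffer size
>       return 0;
>   }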
> nilesh awate wrote:
> >
> > Thanks for the suggestion (use mvapich2-1.2), sir.
> >
> > I have tried the same, but we are still facing the same problem:
> >
> > Fatal error in MPI_Recv:
> > Message truncated, error stack:
> > MPI_Recv(186).........................: MPI_Recv(buf=0x7fff1faf6008, count=945075466, MPI_INT, src=2, tag=1000, MPI_COMM_WORLD, status=0x7fff1faf5fe0) failed
> > MPIDI_CH3U_Request_unpack_uebuf(590): Message truncated; 4 bytes received but buffer size is -514665432
> > rank 0 in job 4 test01_52519 caused collective abort of all ranks
> > exit status of rank 0: killed by signal 9
> >
> > Is there any suggestion?
> >
> > What does this error mean?
> >
> > Is this the result of data corruption or a missing packet, or something else?
> >
> > Waiting for a reply,
> > Nilesh Awate
> >
> >
> >
> > ------------------------------------------------------------------------
> > *From:* Dhabaleswar Panda <panda at cse.ohio-state.edu <mailto:panda at cse.ohio-state.edu>>
> > *To:* nilesh awate <nilesh_awate at yahoo.com <mailto:nilesh_awate at yahoo.com>>
> > *Cc:* MVAPICH2 <mvapich-discuss at cse.ohio-state.edu <mailto:mvapich-discuss at cse.ohio-state.edu>>
> > *Sent:* Wednesday, 19 November, 2008 9:27:36 PM
> > *Subject:* Re: [mvapich-discuss] messege truncated
> >
> > MVAPICH2 1.2 was released about two weeks ago. Can you try the latest
> > version?
> >
> > DK
> >
> > On Wed, 19 Nov 2008, nilesh awate wrote:
> >
> > Hi all,
> > I am using mvapich2-1.0.3 with the DAPL interconnect (a proprietary NIC & DAPL library).
> > I got the following error while running Pallas over a 5-node (AMD dual-core) cluster.
> >
> > Fatal error in MPI_Recv:
> > Message truncated, error stack:
> > MPI_Recv(186)..........................: MPI_Recv(buf=0x7fff24744cec, count=952788905, MPI_INT, src=2, tag=1000,MPI_COMM_WORLD, status=0x7fff24744cd0) failed
> > MPIDI_CH3U_Post_data_receive_found(243): Message from rank 2 and tag 1000 truncated; 4 bytes received but buffersize is -483811676
> > rank 0 in job 2 test01_40634 caused collective abort of all ranks
> > exit status of rank 0: killed by signal 9
> >
> >
> > Could you suggest where we should look to resolve the above error?
> > What can we interpret from the above message?
> >
> > Waiting for a reply,
> > Thanks,
> > Nilesh
> >
> >
> >
> >
> >
> > _______________________________________________
> > mvapich-discuss mailing list
> > mvapich-discuss at cse.ohio-state.edu <mailto:mvapich-discuss at cse.ohio-state.edu>
> > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> >
>