[mvapich-discuss] Help problem MPI_Bcast fails on np=8 with 8MB buffer

Matthew Koop koop at cse.ohio-state.edu
Sun Aug 24 17:20:32 EDT 2008


I'm glad the latest version is working for you now. The buffer limit is a
well-known issue with MPI: the count argument is an 'int', so you cannot
increase the number of elements beyond what an int can hold. You should be
able to use other (larger, derived) datatypes to allow a larger total buffer,
though.
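For example, here is a minimal, untested sketch of that approach (the sizes
and names below are placeholders for illustration, not taken from your
program): wrap a fixed-size block of doubles in a derived datatype built with
MPI_Type_contiguous, so the count passed to MPI_Bcast stays small even though
the total payload goes past 2GB.

  #include <mpi.h>
  #include <stdio.h>
  #include <stdlib.h>

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);

      /* Hypothetical sizes: 512 chunks of 1M doubles each = 4GB total,
       * i.e. past the 2GB mark. */
      const int chunk_elems = 1024 * 1024;   /* doubles per chunk   */
      const int num_chunks  = 512;           /* chunks to broadcast */
      double *buf = malloc((size_t)chunk_elems * num_chunks * sizeof(double));
      if (buf == NULL) {
          fprintf(stderr, "allocation failed\n");
          MPI_Abort(MPI_COMM_WORLD, 1);
      }

      /* One element of chunk_type covers chunk_elems doubles, so the
       * count handed to MPI_Bcast stays well inside the int range. */
      MPI_Datatype chunk_type;
      MPI_Type_contiguous(chunk_elems, MPI_DOUBLE, &chunk_type);
      MPI_Type_commit(&chunk_type);

      MPI_Bcast(buf, num_chunks, chunk_type, 0, MPI_COMM_WORLD);

      MPI_Type_free(&chunk_type);
      free(buf);
      MPI_Finalize();
      return 0;
  }

Whether this gets you past 2GB in practice will depend on how the internals
handle the total byte count, so treat it as something to experiment with
rather than a guaranteed workaround.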

Matt

On Fri, 22 Aug 2008 Terrence.LIAO at total.com wrote:

> Hi, DK,
>
> Yes, you are right.  Using the new version from Aug 21, MPI_Bcast no
> longer core dumps and can Bcast up to the 2GB buffer limit.
> I do have another question: how can I extend the MPI buffer beyond the
> 2GB limit?
>
> Thank you very much.
>
> -- Terrence
> --------------------------------------------------------
> Terrence Liao, Ph.D.
> Research Computer Scientist
> TOTAL E&P RESEARCH & TECHNOLOGY USA, LLC
> 1201 Louisiana, Suite 1800, Houston, TX 77002
> Tel: 713.647.3498  Fax: 713.647.3638
> Email: terrence.liao at total.com
>
>
>
>
>
>
> From: Dhabaleswar Panda <panda at cse.ohio-state.edu>
> Date: 08/21/2008 09:01 PM
> To: Terrence.LIAO at total.com
> Cc: mvapich-discuss at cse.ohio-state.edu
> Subject: Re: [mvapich-discuss] Help problem MPI_Bcast fails on np=8 with 8MB buffer
>
> Hi Terrence,
>
> Thanks for reporting this problem. After the MVAPICH 1.0 release, we had a
> bug-fix release, 1.0.1, on 05/30/08.  Since then, more fixes have gone into
> the 1.0 branch based on the feedback we have received from users.
>
> Here are some check-ins which we believe might be related to the failure
> symptom you have described.
>
> ----------------------------------------------
> r2179 | mamidala | 2008-03-04 18:40:24 -0500 (Tue, 04 Mar 2008) | 3 lines
> checking in a fix for BLACS seg. fault problem. Problem occurs when
> application holds onto MPI communicators not freeing immediately
> ----------------------------------------------
> r2783 | kumarra | 2008-06-24 23:11:04 -0400 (Tue, 24 Jun 2008) | 1 line
> shared memory bcast buffer overflow. Reported by David Kewley at Dell.
> ---------------------------------------------
> r2805 | kumarra | 2008-06-30 13:28:54 -0400 (Mon, 30 Jun 2008) | 1 line
> Do not try to use shmem broadcast if shmem_bcast shared memory
> initialization fails
> ---------------------------------------------
>
> Can you try the MVAPICH 1.0.1 release, the 1.0 bugfix branch, or the trunk
> and let us know whether the problem persists? If it does, we will look into
> this issue further.
>
> You can get these latest versions through svn checkout or through
> tarballs.
>
> FYI, daily tarballs of the 1.0 bugfix branch are available here:
> http://mvapich.cse.ohio-state.edu/nightly/mvapich/branches/1.0/
>
> Similarly, daily tarballs of the trunk are available here:
> http://mvapich.cse.ohio-state.edu/nightly/mvapich/trunk/
>
> Thanks,
>
> DK
>
> On Thu, 21 Aug 2008 Terrence.LIAO at total.com wrote:
>
> > Dear mvapich,
> >
> > I got a core dump from MPI_Bcast(buffer, n, MPI_DOUBLE, ...) when n is
> > 1024*1024, i.e. an 8MB buffer, with np=8 on 8 compute nodes.  I have NO
> > problem when using np=7.  I am using the mvapich-1.0 Feb 28 2008 download
> > on an AMD cluster (quad-core, dual-socket, 16GB memory, 4xDDR IB), and
> > mvapich is built with the pgi 7.1 compiler.  Below is the gdb output.
> > Any suggestions on how to fix this problem?  Thank you very much.
> > -- Terrence
> >
> >
> > Program received signal SIGSEGV, Segmentation fault.
> > [Switching to Thread 182894245856 (LWP 18383)]
> > 0x00000036d80723e3 in memcpy () from /lib64/tls/libc.so.6
> > (gdb) where
> > #0  0x00000036d80723e3 in memcpy () from /lib64/tls/libc.so.6
> > #1  0x0000000000449c09 in MPID_VIA_self_start (buf=0x2a96546010,
> > len=8388608, src_lrank=0, tag=2,
> >     context_id=0, shandle=0x57a1e8) at viasend.c:276
> > #2  0x000000000044c205 in MPID_IsendContig (comm_ptr=0x5a2060,
> > buf=0x2a96546010, len=8388608,
> >     src_lrank=0, tag=2, context_id=0, dest_grank=0,
> > msgrep=MPID_MSGREP_RECEIVER, request=0x57a1e8,
> >     error_code=0x7fbfffe66c) at mpid_send.c:84
> > #3  0x0000000000435cfd in MPID_IsendDatatype (comm_ptr=0x5a2060,
> > buf=0x2a96546010, count=1048576,
> >     dtype_ptr=0x56ac60, src_lrank=0, tag=2, context_id=0, dest_grank=0,
> > request=0x57a1e8,
> >     error_code=0x7fbfffe66c) at mpid_hsend.c:129
> > #4  0x0000000000443215 in PMPI_Isend (buf=0x2a96546010, count=1048576,
> > datatype=11, dest=0, tag=2,
> >     comm=91, request=0x7fbfffe710) at isend.c:97
> > #5  0x0000000000444710 in PMPI_Sendrecv (sendbuf=0x2a96546010,
> > sendcount=1048576, sendtype=11,
> >     dest=0, sendtag=2, recvbuf=0x2a96d4bc00, recvcount=1048576,
> > recvtype=11, source=0, recvtag=2,
> >     comm=91, status=0x7fbfffe820) at sendrecv.c:95
> > #6  0x000000000041c355 in intra_shmem_Bcast_Large (buffer=0x2a96546010,
> > count=1048576,
> >     datatype=0x56ac60, nbytes=8388608, root=0, comm=0x5a2060) at
> > intra_fns_new.c:1704
> > #7  0x000000000041b6b4 in intra_Bcast_Large (buffer=0x2a96546010,
> > count=1048576, datatype=0x56ac60,
> >     nbytes=8388608, root=0, comm=0x5a2060) at intra_fns_new.c:1309
> > #8  0x000000000041b157 in intra_newBcast (buffer=0x2a96546010,
> > count=1048576, datatype=0x56ac60,
> >     root=0, comm=0x5a2060) at intra_fns_new.c:1117
> > #9  0x0000000000412008 in PMPI_Bcast (buffer=0x2a96546010, count=1048576,
> > datatype=11, root=0,
> >     comm=91) at bcast.c:122
> > #10 0x00000000004042de in main (argc=2, argv=0x7fbfffee98) at
> > large-mpi_bcast_test.c:159
> > (gdb)
> >
> >
> >
> >
> >
>
>


