[mvapich-discuss] Help problem MPI_Bcast fails on np=8 with 8MB buffer

Terrence.LIAO at total.com Terrence.LIAO at total.com
Fri Aug 22 11:03:29 EDT 2008


Hi, DK,

Yes, you are right.  With the new version from Aug 21, MPI_Bcast no 
longer core dumps and can broadcast up to the 2GB buffer limit. 
I do have another question: how can I extend the MPI buffer beyond the 
2GB limit?
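
One workaround I can think of is to split a large broadcast into pieces so 
that each MPI_Bcast call stays well under the 2GB limit (presumably the 
limit comes from the internal byte length being a C int). Something like 
the untested sketch below; the helper name and chunk size are my own, not 
anything from MVAPICH:

    #include <mpi.h>

    #define CHUNK_DOUBLES (64 * 1024 * 1024)  /* 64M doubles = 512MB per call */

    /* Broadcast n doubles in chunks that each fit comfortably in an int count. */
    static int Bcast_large(double *buf, long long n, int root, MPI_Comm comm)
    {
        long long done = 0;
        while (done < n) {
            long long left = n - done;
            int count = (left > CHUNK_DOUBLES) ? CHUNK_DOUBLES : (int)left;
            int err = MPI_Bcast(buf + done, count, MPI_DOUBLE, root, comm);
            if (err != MPI_SUCCESS)
                return err;
            done += count;
        }
        return MPI_SUCCESS;
    }

Is there a better or supported way to do a single broadcast larger than 2GB?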

Thank you very much.

-- Terrence
--------------------------------------------------------
Terrence Liao, Ph.D.
Research Computer Scientist
TOTAL E&P RESEARCH & TECHNOLOGY USA, LLC
1201 Louisiana, Suite 1800, Houston, TX 77002 
Tel: 713.647.3498  Fax: 713.647.3638
Email: terrence.liao at total.com

Dhabaleswar Panda <panda at cse.ohio-state.edu> 
08/21/2008 09:01 PM

To: Terrence.LIAO at total.com
Cc: mvapich-discuss at cse.ohio-state.edu
Subject: Re: [mvapich-discuss] Help problem MPI_Bcast fails on np=8 with 8MB buffer

Hi Terrence,

Thanks for reporting this problem. After the MVAPICH 1.0 release, we had a
bug-fix release, 1.0.1, on 05/30/08.  Since then, some more fixes have gone
into the 1.0 branch based on the feedback we have received from users.

Here are some check-ins which we believe might be related to the failure
symptom you have described.

----------------------------------------------
r2179 | mamidala | 2008-03-04 18:40:24 -0500 (Tue, 04 Mar 2008) | 3 lines
checking in a fix for BLACS seg. fault problem. Problem occurs when
application holds onto MPI communicators not freeing immediately
----------------------------------------------
r2783 | kumarra | 2008-06-24 23:11:04 -0400 (Tue, 24 Jun 2008) | 1 line
shared memory bcast buffer overflow. Reported by David Kewley at Dell.
---------------------------------------------
r2805 | kumarra | 2008-06-30 13:28:54 -0400 (Mon, 30 Jun 2008) | 1 line
Do not try to use shmem broadcast if shmem_bcast shared memory
initialization fails
---------------------------------------------

Can you try the MVAPICH 1.0.1 release, the 1.0 bugfix branch, or the trunk
and let us know whether the problem persists? If it does, we will take a
further look at this issue.

You can get these latest versions through an svn checkout or from the
tarballs.

FYI, daily tarballs of the 1.0 bugfix branch are available here:
http://mvapich.cse.ohio-state.edu/nightly/mvapich/branches/1.0/

Similarly, daily tarballs of the trunk are available here:
http://mvapich.cse.ohio-state.edu/nightly/mvapich/trunk/

Thanks,

DK

On Thu, 21 Aug 2008 Terrence.LIAO at total.com wrote:

> Dear mvapich,
>
> I got a core dump from MPI_Bcast(buffer, n, MPI_DOUBLE, ...) when n is
> 1024*1024, i.e. an 8MB buffer, with np=8 on 8 compute nodes.  I have NO
> problem when using np=7.  I am using the mvapich-1.0 Feb 28 2008 download
> on an AMD cluster (quad-core, dual-socket nodes with 16GB memory and
> 4xDDR IB); mvapich is built with the pgi 7.1 compiler.  Below is the gdb
> output.  Any suggestion on what I should do to fix this problem?  Thank
> you very much.  -- Terrence
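>
> In essence the test boils down to this pattern (a simplified sketch
> reconstructed from the backtrace below, not the actual
> large-mpi_bcast_test.c source):
>
>     #include <mpi.h>
>     #include <stdlib.h>
>
>     int main(int argc, char **argv)
>     {
>         int n = 1024 * 1024;                 /* 1M doubles = 8MB */
>         double *buffer;
>
>         MPI_Init(&argc, &argv);
>         buffer = malloc(n * sizeof(double)); /* same size on every rank */
>         MPI_Bcast(buffer, n, MPI_DOUBLE, 0, MPI_COMM_WORLD);
>         free(buffer);
>         MPI_Finalize();
>         return 0;
>     }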
>
>
> Program received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 182894245856 (LWP 18383)]
> 0x00000036d80723e3 in memcpy () from /lib64/tls/libc.so.6
> (gdb) where
> #0  0x00000036d80723e3 in memcpy () from /lib64/tls/libc.so.6
> #1  0x0000000000449c09 in MPID_VIA_self_start (buf=0x2a96546010,
> len=8388608, src_lrank=0, tag=2,
>     context_id=0, shandle=0x57a1e8) at viasend.c:276
> #2  0x000000000044c205 in MPID_IsendContig (comm_ptr=0x5a2060,
> buf=0x2a96546010, len=8388608,
>     src_lrank=0, tag=2, context_id=0, dest_grank=0,
> msgrep=MPID_MSGREP_RECEIVER, request=0x57a1e8,
>     error_code=0x7fbfffe66c) at mpid_send.c:84
> #3  0x0000000000435cfd in MPID_IsendDatatype (comm_ptr=0x5a2060,
> buf=0x2a96546010, count=1048576,
>     dtype_ptr=0x56ac60, src_lrank=0, tag=2, context_id=0, dest_grank=0,
> request=0x57a1e8,
>     error_code=0x7fbfffe66c) at mpid_hsend.c:129
> #4  0x0000000000443215 in PMPI_Isend (buf=0x2a96546010, count=1048576,
> datatype=11, dest=0, tag=2,
>     comm=91, request=0x7fbfffe710) at isend.c:97
> #5  0x0000000000444710 in PMPI_Sendrecv (sendbuf=0x2a96546010,
> sendcount=1048576, sendtype=11,
>     dest=0, sendtag=2, recvbuf=0x2a96d4bc00, recvcount=1048576,
> recvtype=11, source=0, recvtag=2,
>     comm=91, status=0x7fbfffe820) at sendrecv.c:95
> #6  0x000000000041c355 in intra_shmem_Bcast_Large (buffer=0x2a96546010,
> count=1048576,
>     datatype=0x56ac60, nbytes=8388608, root=0, comm=0x5a2060) at
> intra_fns_new.c:1704
> #7  0x000000000041b6b4 in intra_Bcast_Large (buffer=0x2a96546010,
> count=1048576, datatype=0x56ac60,
>     nbytes=8388608, root=0, comm=0x5a2060) at intra_fns_new.c:1309
> #8  0x000000000041b157 in intra_newBcast (buffer=0x2a96546010,
> count=1048576, datatype=0x56ac60,
>     root=0, comm=0x5a2060) at intra_fns_new.c:1117
> #9  0x0000000000412008 in PMPI_Bcast (buffer=0x2a96546010, count=1048576,
> datatype=11, root=0,
>     comm=91) at bcast.c:122
> #10 0x00000000004042de in main (argc=2, argv=0x7fbfffee98) at
> large-mpi_bcast_test.c:159
> (gdb)
>
