[mvapich-discuss] Help problem MPI_Bcast fails on np=8 with 8MB buffer
Terrence.LIAO at total.com
Fri Aug 22 11:03:29 EDT 2008
Hi, DK,
Yes, you are right. With the new version (Aug 21), MPI_Bcast no longer
core dumps and can broadcast up to the 2GB buffer limit.
I do have another question: how can I extend the MPI buffer beyond the
2GB limit?
Thank you very much.
-- Terrence
--------------------------------------------------------
Terrence Liao, Ph.D.
Research Computer Scientist
TOTAL E&P RESEARCH & TECHNOLOGY USA, LLC
1201 Louisiana, Suite 1800, Houston, TX 77002
Tel: 713.647.3498 Fax: 713.647.3638
Email: terrence.liao at total.com
Dhabaleswar Panda <panda at cse.ohio-state.edu>
08/21/2008 09:01 PM
To: Terrence.LIAO at total.com
Cc: mvapich-discuss at cse.ohio-state.edu
Subject: Re: [mvapich-discuss] Help problem MPI_Bcast fails on np=8 with 8MB buffer
Hi Terrence,
Thanks for reporting this problem. After the MVAPICH 1.0 release, we had a
bug-fix release, 1.0.1, on 05/30/08. Since then, more fixes have gone into
the 1.0 branch based on feedback we have received from users.
Here are some check-ins which we believe might be related to the failure
symptom you have described.
----------------------------------------------
r2179 | mamidala | 2008-03-04 18:40:24 -0500 (Tue, 04 Mar 2008) | 3 lines
checking in a fix for BLACS seg. fault problem. Problem occurs when
application holds onto MPI communicators not freeing immediately
----------------------------------------------
r2783 | kumarra | 2008-06-24 23:11:04 -0400 (Tue, 24 Jun 2008) | 1 line
shared memory bcast buffer overflow. Reported by David Kewley at Dell.
---------------------------------------------
r2805 | kumarra | 2008-06-30 13:28:54 -0400 (Mon, 30 Jun 2008) | 1 line
Do not try to use shmem broadcast if shmem_bcast shared memory
initialization fails
---------------------------------------------
Can you try the MVAPICH 1.0.1 release, the 1.0 bugfix branch, or the trunk
and let us know whether the problem persists? If it does, we will look
into the issue further.
You can get these latest versions through svn checkout or through
tarballs.
FYI, daily tarballs of the 1.0 bugfix branch are available here:
http://mvapich.cse.ohio-state.edu/nightly/mvapich/branches/1.0/
Similarly, daily tarballs of the trunk are available here:
http://mvapich.cse.ohio-state.edu/nightly/mvapich/trunk/
Thanks,
DK
On Thu, 21 Aug 2008 Terrence.LIAO at total.com wrote:
> Dear mvapich,
>
> I got a core dump from MPI_Bcast(buffer, n, MPI_DOUBLE, ...) when n is
> 1024*1024, i.e. an 8MB buffer, on np=8 across 8 compute nodes. I have NO
> problem when using np=7. I am using the mvapich-1.0 Feb 28 2008 download
> on an AMD cluster (quad-core dual-socket nodes, 16GB memory, 4xDDR IB).
> mvapich is built with the pgi 7.1 compiler. Below is the gdb output.
> Any suggestion on what I should do to fix this problem?
> Thank you very much. -- Terrence
>
>
> Program received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 182894245856 (LWP 18383)]
> 0x00000036d80723e3 in memcpy () from /lib64/tls/libc.so.6
> (gdb) where
> #0 0x00000036d80723e3 in memcpy () from /lib64/tls/libc.so.6
> #1 0x0000000000449c09 in MPID_VIA_self_start (buf=0x2a96546010,
> len=8388608, src_lrank=0, tag=2,
> context_id=0, shandle=0x57a1e8) at viasend.c:276
> #2 0x000000000044c205 in MPID_IsendContig (comm_ptr=0x5a2060,
> buf=0x2a96546010, len=8388608,
> src_lrank=0, tag=2, context_id=0, dest_grank=0,
> msgrep=MPID_MSGREP_RECEIVER, request=0x57a1e8,
> error_code=0x7fbfffe66c) at mpid_send.c:84
> #3 0x0000000000435cfd in MPID_IsendDatatype (comm_ptr=0x5a2060,
> buf=0x2a96546010, count=1048576,
> dtype_ptr=0x56ac60, src_lrank=0, tag=2, context_id=0, dest_grank=0,
> request=0x57a1e8,
> error_code=0x7fbfffe66c) at mpid_hsend.c:129
> #4 0x0000000000443215 in PMPI_Isend (buf=0x2a96546010, count=1048576,
> datatype=11, dest=0, tag=2,
> comm=91, request=0x7fbfffe710) at isend.c:97
> #5 0x0000000000444710 in PMPI_Sendrecv (sendbuf=0x2a96546010,
> sendcount=1048576, sendtype=11,
> dest=0, sendtag=2, recvbuf=0x2a96d4bc00, recvcount=1048576,
> recvtype=11, source=0, recvtag=2,
> comm=91, status=0x7fbfffe820) at sendrecv.c:95
> #6 0x000000000041c355 in intra_shmem_Bcast_Large (buffer=0x2a96546010,
> count=1048576,
> datatype=0x56ac60, nbytes=8388608, root=0, comm=0x5a2060) at
> intra_fns_new.c:1704
> #7 0x000000000041b6b4 in intra_Bcast_Large (buffer=0x2a96546010,
> count=1048576, datatype=0x56ac60,
> nbytes=8388608, root=0, comm=0x5a2060) at intra_fns_new.c:1309
> #8 0x000000000041b157 in intra_newBcast (buffer=0x2a96546010,
> count=1048576, datatype=0x56ac60,
> root=0, comm=0x5a2060) at intra_fns_new.c:1117
> #9 0x0000000000412008 in PMPI_Bcast (buffer=0x2a96546010, count=1048576,
> datatype=11, root=0,
> comm=91) at bcast.c:122
> #10 0x00000000004042de in main (argc=2, argv=0x7fbfffee98) at
> large-mpi_bcast_test.c:159
> (gdb)