[mvapich-discuss] SIGSEGV in F90: An MPI bug?

David Stuebe dstuebe at umassd.edu
Thu Jan 31 13:32:23 EST 2008


Hi Jeff, Brian

Maybe I don't fully understand all the issues involved, but I did read
through several websites that discuss the dangers of passing temporary
arrays to non-blocking MPI calls. Is MPI_BCAST non-blocking? I had assumed
it was a blocking call anyway.
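
For what it's worth, my mental model is that a blocking call is safe even
if the compiler makes a contiguous temporary for the buffer (a sketch,
with made-up names):

  integer :: a(1000), ierr
  ! copy-in happens before MPI_BCAST is invoked and copy-out after it
  ! returns, so the strided section still receives the broadcast data
  call MPI_BCAST(a(1:999:2), 500, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)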

Again, my concern is that the MPI call returns the data on all processors
as (perhaps naively) expected; it is later in the program that the
allocation performed on entry to a different subroutine, for an
explicit-shape array, causes a SIGSEGV. There is further evidence that it
is an MPI issue: the problem is memory-size dependent, and it only occurs
when run across more than one node under MVAPICH2 1.0. It did not occur
with MPICH2 when I tested on our cluster that does not have InfiniBand.
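
The failing pattern, boiled down from the demo code below:

  ptest => data(1:m,1:m)   ! non-contiguous section of data(n,m), n = m+10
  call MPI_BCAST(ptest, m*m, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)
  ! the broadcast returns and the data is correct on every rank, but...
  call Dummy2(m,n)         ! entry to a routine that declares an
                           ! explicit-shape automatic array: SIGSEGV here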

Have you had a chance to experiment with the demo code that I sent? I
think the behavior warrants a little further investigation.

Thanks

David

On Jan 31, 2008 1:06 PM, Jeff Squyres <jsquyres at cisco.com> wrote:

> Brian is completely correct - if the F90 compiler chooses to make
> temporary buffers in order to pass array subsections to non-blocking
> MPI functions, there's little that an MPI implementation can do.
> Simply put: MPI requires that when you use non-blocking
> communications, the buffer must be available until you call some
> flavor of MPI_TEST or MPI_WAIT to complete the communication.
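>
> As an illustration (a sketch with made-up names), a call on an array
> subsection such as:
>
>    call MPI_ISEND(a(1:n:2), cnt, MPI_INTEGER, 1, tag, comm, req, ierr)
>
> may effectively be compiled as:
>
>    temp = a(1:n:2)        ! compiler copy-in to a contiguous temporary
>    call MPI_ISEND(temp, cnt, MPI_INTEGER, 1, tag, comm, req, ierr)
>    ! temp is released here, while the send may still be reading it
>    call MPI_WAIT(req, MPI_STATUS_IGNORE, ierr)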
>
> I don't know of any way for an MPI implementation to know whether it
> has been handed a temporary buffer (e.g., one that a compiler silently
> created to pass an array subsection).  Do you know if there is a way?
>
>
>
> On Jan 31, 2008, at 12:36 PM, Brian Curtis wrote:
>
> > David,
> >
> > The MPI-2 documentation goes into great detail on issues with
> > Fortran-90 bindings
> > (http://www.mpi-forum.org/docs/mpi-20-html/node236.htm#Node236).
> > The condition you are seeing should be raised with Intel.
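> >
> > One safe approach is to manage the contiguous buffer yourself, so that
> > its lifetime is explicit (a sketch reusing the variable names from your
> > demo code):
> >
> >   integer, allocatable :: tmp(:,:)
> >   allocate(tmp(m,m))
> >   if (myid == 0) tmp = data(1:m,1:m)   ! explicit copy-in on the root
> >   call MPI_BCAST(tmp, m*m, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)
> >   data(1:m,1:m) = tmp                  ! explicit copy-out on every rank
> >   deallocate(tmp)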
> >
> >
> > Brian
> >
> >
> > On Jan 31, 2008, at 11:59 AM, David Stuebe wrote:
> >
> >>
> >> Hi again Brian
> >>
> >> I just ran my test code on our cluster using ifort 10.1.011 and
> >> MVAPICH2 1.0.1, but the behavior is still the same.
> >>
> >> Have you had a chance to try it on any of your test machines?
> >>
> >> David
> >>
> >>
> >>
> >>
> >> On Jan 25, 2008 12:31 PM, Brian Curtis <curtisbr at cse.ohio-state.edu> wrote:
> >> David,
> >>
> >> I did some research on this issue, and it looks like you have filed
> >> the bug with Intel.  Please let us know what you find out.
> >>
> >>
> >> Brian
> >>
> >> David Stuebe wrote:
> >> > Hi Brian
> >> >
> >> > I downloaded the public release. It seems silly, but I am not sure
> >> > how to get a revision number from the source... there does not seem
> >> > to be a '-version' option that gives more info, although I did not
> >> > look too hard.
> >> >
> >> > I have not tried MVAPICH2 1.0.1, but once I have Intel ifort 10 on
> >> > the cluster I will try 1.0.1 and see if it goes away.
> >> >
> >> > In the meantime, please let me know if you can recreate the problem.
> >> >
> >> > David
> >> >
> >> > PS - Just to make sure you understand my issue: I think it is a bad
> >> > idea to try to pass a non-contiguous F90 memory pointer, and I should
> >> > not do that... but the way that it breaks has caused me headaches for
> >> > weeks now. If it reliably caused a SIGSEGV on entering MPI_BCAST, that
> >> > would be great! As it is, it is really hard to trace the problem.
> >> >
> >> >
> >> >
> >> >
> >> > On Jan 23, 2008 3:23 PM, Brian Curtis <curtisbr at cse.ohio-state.edu> wrote:
> >> >
> >> >
> >> >> David,
> >> >>
> >> >> Sorry to hear you are experiencing problems with the MVAPICH2
> >> >> Fortran 90 interface.  I will be investigating this issue, but need
> >> >> some additional information about your setup.  What is the exact
> >> >> version of MVAPICH2 1.0 you are utilizing (daily tarball or
> >> >> release)?  Have you tried MVAPICH2 1.0.1?
> >> >>
> >> >> Brian
> >> >>
> >> >> David Stuebe wrote:
> >> >>
> >> >>> Hello MVAPICH
> >> >>> I have found a strange bug in MVAPICH2 using IFORT. The behavior is
> >> >>> very strange indeed - it seems to be related to how ifort deals with
> >> >>> passing pointers to the MVAPICH Fortran 90 interface.
> >> >>> The MPI call returns successfully, but later calls to a dummy
> >> >>> subroutine cause a SIGSEGV.
> >> >>>
> >> >>>  Please look at the following code:
> >> >>>
> >> >>>
> >> >>>
> >> >>> !====================================================================
> >> >>> !====================================================================
> >> >>> !====================================================================
> >> >>
> >> >>> ! TEST CODE FOR A POSSIBLE BUG IN MVAPICH2 COMPILED WITH IFORT
> >> >>> ! WRITTEN BY: DAVID STUEBE
> >> >>> ! DATE: JAN 23, 2008
> >> >>> !
> >> >>> ! COMPILE WITH: mpif90 -xP mpi_prog.f90 -o xtest
> >> >>> !
> >> >>> ! KNOWN BEHAVIOR:
> >> >>> ! PASSING A NON-CONTIGUOUS POINTER TO MPI_BCAST CAUSES FAILURE OF
> >> >>> ! SUBROUTINES USING MULTI-DIMENSIONAL EXPLICIT-SHAPE ARRAYS WITHOUT
> >> >>> ! AN INTERFACE - EVEN THOUGH THE MPI_BCAST COMPLETES SUCCESSFULLY,
> >> >>> ! RETURNING VALID DATA.
> >> >>> !
> >> >>> ! COMMENTS:
> >> >>> ! I REALIZE PASSING NON-CONTIGUOUS POINTERS IS DANGEROUS - SHAME
> >> >>> ! ON ME FOR MAKING THAT MISTAKE. HOWEVER, IT SHOULD EITHER WORK
> >> >>> ! OR NOT.
> >> >>> ! RETURNING SUCCESSFULLY BUT CAUSING INTERFACE ERRORS LATER IS
> >> >>> ! EXTREMELY DIFFICULT TO DEBUG!
> >> >>> !
> >> >>> ! CONDITIONS FOR OCCURRENCE:
> >> >>> !    COMPILER MUST OPTIMIZE USING 'VECTORIZATION'
> >> >>> !    ARRAY MUST BE 'LARGE' - SYSTEM DEPENDENT?
> >> >>> !    MUST BE RUN ON MORE THAN ONE NODE TO CAUSE CRASH...
> >> >>> !    i.e. running inside one SMP box does not crash.
> >> >>> !
> >> >>> !    RUNNING UNDER MPD, ALL PROCESSES SIGSEGV
> >> >>> !    RUNNING UNDER MPIEXEC 0.82 FOR PBS,
> >> >>> !       ONLY SOME PROCESSES SIGSEGV?
> >> >>> !
> >> >>> ! ENVIRONMENTAL INFO:
> >> >>> ! NODES: DELL 1850 3.0GHZ, 2GB RAM, INFINIBAND PCI-EX 4X
> >> >>> ! SYSTEM: ROCKS 4.2
> >> >>> ! gcc version 3.4.6 20060404 (Red Hat 3.4.6-3)
> >> >>> !
> >> >>> ! IFORT/ICC:
> >> >>> !   Intel(R) Fortran Compiler for Intel(R) EM64T-based applications,
> >> >>> !   Version 9.1 Build 20061101 Package ID: l_fc_c_9.1.040
> >> >>> !
> >> >>> ! MVAPICH2: mpif90 for mvapich2-1.0
> >> >>> ! ./configure --prefix=/usr/local/share/mvapich2/1.0
> >> >>> !   --with-device=osu_ch3:mrail --with-rdma=vapi --with-pm=mpd
> >> >>> !   --enable-f90 --enable-cxx --disable-romio --without-mpe
> >> >>> !
> >> >>>
> >> >>>
> >> >>> !====================================================================
> >> >>> !====================================================================
> >> >>> !====================================================================
> >> >>
> >> >>> Module vars
> >> >>>   USE MPI
> >> >>>   implicit none
> >> >>>
> >> >>>
> >> >>>   integer :: n,m,MYID,NPROCS
> >> >>>   integer :: ipt
> >> >>>
> >> >>>   integer, allocatable, target :: data(:,:)
> >> >>>
> >> >>>   contains
> >> >>>
> >> >>>     subroutine alloc_vars
> >> >>>       implicit none
> >> >>>
> >> >>>       integer Status
> >> >>>
> >> >>>       allocate(data(n,m),stat=status)
> >> >>>       if (status /=0) then
> >> >>>          write(ipt,*) "allocation error"
> >> >>>          stop
> >> >>>       end if
> >> >>>
> >> >>>       data = 0
> >> >>>
> >> >>>     end subroutine alloc_vars
> >> >>>
> >> >>>    SUBROUTINE INIT_MPI_ENV(ID,NP)
> >> >>>
> >> >>>
> >> >>> !===================================================================|
> >> >>> !  INITIALIZE MPI ENVIRONMENT                                       |
> >> >>> !===================================================================|
> >> >>
> >> >>>      INTEGER, INTENT(OUT) :: ID,NP
> >> >>>      INTEGER IERR
> >> >>>
> >> >>>      IERR=0
> >> >>>
> >> >>>      CALL MPI_INIT(IERR)
> >> >>>      IF(IERR/=0) WRITE(*,*) "BAD MPI_INIT", ID
> >> >>>      CALL MPI_COMM_RANK(MPI_COMM_WORLD,ID,IERR)
> >> >>>      IF(IERR/=0) WRITE(*,*) "BAD MPI_COMM_RANK", ID
> >> >>>      CALL MPI_COMM_SIZE(MPI_COMM_WORLD,NP,IERR)
> >> >>>      IF(IERR/=0) WRITE(*,*) "BAD MPI_COMM_SIZE", ID
> >> >>>
> >> >>>    END SUBROUTINE INIT_MPI_ENV
> >> >>>
> >> >>>
> >> >>>
> >> >>>
> >> >>> !===================================================================|
> >> >>
> >> >>>   SUBROUTINE PSHUTDOWN
> >> >>>
> >> >>>
> >> >>>
> >> >>> !===================================================================|
> >> >>
> >> >>>     INTEGER IERR
> >> >>>
> >> >>>     IERR=0
> >> >>>     CALL MPI_FINALIZE(IERR)
> >> >>>     if(ierr /=0) write(ipt,*) "BAD MPI_FINALIZE", MYID
> >> >>>     close(IPT)
> >> >>>     STOP
> >> >>>
> >> >>>   END SUBROUTINE PSHUTDOWN
> >> >>>
> >> >>>
> >> >>>   SUBROUTINE CONTIGUOUS_WORKS
> >> >>>     IMPLICIT NONE
> >> >>>     INTEGER, pointer :: ptest(:,:)
> >> >>>     INTEGER :: IERR, I,J
> >> >>>
> >> >>>
> >> >>>     write(ipt,*) "START CONTIGUOUS:"
> >> >>>     n=2000 ! Set size here...
> >> >>>     m=n+10
> >> >>>
> >> >>>     call alloc_vars
> >> >>>     write(ipt,*) "ALLOCATED DATA"
> >> >>>     ptest => data(1:N,1:N)
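> >> >>>     ! note: data(1:N,1:N) is contiguous - it takes whole columns
> >> >>>     ! (all n rows) of the first N columns of data(n,m)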
> >> >>>
> >> >>>     IF (MYID == 0) ptest=6
> >> >>>     write(ipt,*) "Made POINTER"
> >> >>>
> >> >>>     call MPI_BCAST(ptest,N*N,MPI_INTEGER,0,MPI_COMM_WORLD,IERR)
> >> >>>     IF(IERR /= 0) WRITE(IPT,*) "BAD BCAST", MYID
> >> >>>
> >> >>>     write(ipt,*) "BROADCAST Data; a value:",data(1,6)
> >> >>>
> >> >>>     DO I = 1,N
> >> >>>        DO J = 1,N
> >> >>>           if(data(I,J) /= 6) &
> >> >>>                & write(ipt,*) "INCORRECT VALUE!", I,J,data(I,J)
> >> >>>        END DO
> >> >>>
> >> >>>        DO J = N+1,M
> >> >>>           if(data(I,J) /= 0) &
> >> >>>                & write(ipt,*) "INCORRECT VALUE!", I,J,data(I,J)
> >> >>>        END DO
> >> >>>
> >> >>>     END DO
> >> >>>
> >> >>>     ! CALL THREE DIFFERENT EXAMPLES OF SUBROUTINES WITHOUT AN
> >> >>>     ! INTERFACE THAT USE AN EXPLICIT-SHAPE ARRAY
> >> >>>     write(ipt,*) "CALLING DUMMY1"
> >> >>>     CALL DUMMY1
> >> >>>
> >> >>>     write(ipt,*) "CALLING DUMMY2"
> >> >>>     call Dummy2(m,n)
> >> >>>
> >> >>>     write(ipt,*) "CALLING DUMMY3"
> >> >>>     call Dummy3
> >> >>>     write(ipt,*) "FINISHED!"
> >> >>>
> >> >>>   END SUBROUTINE CONTIGUOUS_WORKS
> >> >>>
> >> >>>   SUBROUTINE NON_CONTIGUOUS_FAILS
> >> >>>     IMPLICIT NONE
> >> >>>     INTEGER, pointer :: ptest(:,:)
> >> >>>     INTEGER :: IERR, I,J
> >> >>>
> >> >>>
> >> >>>     write(ipt,*) "START NON_CONTIGUOUS:"
> >> >>>
> >> >>>     m=200 ! Set size here - crash is size dependent!
> >> >>>     n=m+10
> >> >>>
> >> >>>     call alloc_vars
> >> >>>     write(ipt,*) "ALLOCATED DATA"
> >> >>>     ptest => data(1:M,1:M)
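> >> >>>     ! note: data(1:M,1:M) is NOT contiguous - each column of the
> >> >>>     ! section skips the last 10 of the n=m+10 rows of data(n,m)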
> >> >>>
> >> >>> !===================================================
> >> >>> ! IF YOU CALL DUMMY2 HERE TOO, THEN EVERYTHING PASSES  ???
> >> >>> !===================================================
> >> >>> !    CALL DUMMY1 ! THIS ONE HAS NO EFFECT
> >> >>> !    CALL DUMMY2 ! THIS ONE 'FIXES' THE BUG
> >> >>>
> >> >>>     IF (MYID == 0) ptest=6
> >> >>>     write(ipt,*) "Made POINTER"
> >> >>>
> >> >>>     call MPI_BCAST(ptest,M*M,MPI_INTEGER,0,MPI_COMM_WORLD,IERR)
> >> >>>     IF(IERR /= 0) WRITE(IPT,*) "BAD BCAST"
> >> >>>
> >> >>>     write(ipt,*) "BROADCAST Data; a value:",data(1,6)
> >> >>>
> >> >>>     DO I = 1,M
> >> >>>        DO J = 1,M
> >> >>>           if(data(J,I) /= 6) &
> >> >>>                & write(ipt,*) "INCORRECT VALUE!",I,J,DATA(I,J)
> >> >>>        END DO
> >> >>>
> >> >>>        DO J = M+1,N
> >> >>>           if(data(J,I) /= 0) &
> >> >>>                & write(ipt,*) "INCORRECT VALUE!",I,J,DATA(I,J)
> >> >>>        END DO
> >> >>>     END DO
> >> >>>
> >> >>>     ! CALL THREE DIFFERENT EXAMPLES OF SUBROUTINES WITHOUT AN
> >> >>>     ! INTERFACE THAT USE AN EXPLICIT-SHAPE ARRAY
> >> >>>     write(ipt,*) "CALLING DUMMY1"
> >> >>>     CALL DUMMY1
> >> >>>
> >> >>>     write(ipt,*) "CALLING DUMMY2"
> >> >>>     call Dummy2(m,n) ! SHOULD CRASH HERE!
> >> >>>
> >> >>>     write(ipt,*) "CALLING DUMMY3"
> >> >>>     call Dummy3
> >> >>>     write(ipt,*) "FINISHED!"
> >> >>>
> >> >>>   END SUBROUTINE NON_CONTIGUOUS_FAILS
> >> >>>
> >> >>>
> >> >>>   End Module vars
> >> >>>
> >> >>>
> >> >>> Program main
> >> >>>   USE vars
> >> >>>   implicit none
> >> >>>
> >> >>>
> >> >>>   CALL INIT_MPI_ENV(MYID,NPROCS)
> >> >>>
> >> >>>   ipt=myid+10
> >> >>>   OPEN(ipt)
> >> >>>
> >> >>>
> >> >>>   write(ipt,*) "Start memory test!"
> >> >>>
> >> >>>   CALL NON_CONTIGUOUS_FAILS
> >> >>>
> >> >>> !  CALL CONTIGUOUS_WORKS
> >> >>>
> >> >>>   write(ipt,*) "End memory test!"
> >> >>>
> >> >>>   CALL PSHUTDOWN
> >> >>>
> >> >>> END Program main
> >> >>>
> >> >>>
> >> >>>
> >> >>> ! THREE DUMMY SUBROUTINES WITH EXPLICIT-SHAPE ARRAYS
> >> >>> ! DUMMY1 DECLARES A VECTOR  - THIS ONE NEVER CAUSES FAILURE
> >> >>> ! DUMMY2 DECLARES AN ARRAY  - THIS ONE CAUSES FAILURE
> >> >>>
> >> >>> SUBROUTINE DUMMY1
> >> >>>   USE vars
> >> >>>   implicit none
> >> >>>   real, dimension(m) :: my_data
> >> >>>
> >> >>>   write(ipt,*) "m,n",m,n
> >> >>>
> >> >>>   write(ipt,*) "DUMMY 1", size(my_data)
> >> >>>
> >> >>> END SUBROUTINE DUMMY1
> >> >>>
> >> >>>
> >> >>> SUBROUTINE DUMMY2(i,j)
> >> >>>   USE vars
> >> >>>   implicit none
> >> >>>   INTEGER, INTENT(IN) ::i,j
> >> >>>
> >> >>>
> >> >>>   real, dimension(i,j) :: my_data
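> >> >>>   ! my_data is an automatic array: its storage is claimed on entry
> >> >>>   ! to DUMMY2, which is where the SIGSEGV shows up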
> >> >>>
> >> >>>   write(ipt,*) "start: DUMMY 2", size(my_data)
> >> >>>
> >> >>>
> >> >>> END SUBROUTINE DUMMY2
> >> >>>
> >> >>> SUBROUTINE DUMMY3
> >> >>>   USE vars
> >> >>>   implicit none
> >> >>>
> >> >>>
> >> >>>   real, dimension(m,n) :: my_data
> >> >>>
> >> >>>
> >> >>>   write(ipt,*) "start: DUMMY 3", size(my_data)
> >> >>>
> >> >>>
> >> >>> END SUBROUTINE DUMMY3
> >> >>>
> >> >>>
> >> >>>
> >> >
> >> >
> >>
> >
>
>
> --
> Jeff Squyres
> Cisco Systems
>
>