[mvapich-discuss] SIGSEGV in F90: An MPI bug?

Brian Curtis curtisbr at cse.ohio-state.edu
Thu Jan 31 14:28:04 EST 2008


David,

Have you compiled and tested the application with other F90 compilers?


Brian

On Jan 31, 2008, at 1:32 PM, David Stuebe wrote:

>
> Hi Jeff, Brian
>
> Maybe I don't fully understand all the issues involved, but I did
> read through several web sites that discuss the dangers of passing
> temporary arrays to non-blocking MPI calls. Is MPI_BCAST
> non-blocking? I assumed it was a blocking call.
>
> Again, my concern is that the MPI call returns the data on all
> processors as (perhaps naively) expected; it is later in the program
> that the allocation of an explicit-shape array on entry to a
> different subroutine causes a sigsegv. There is further evidence that
> it is an MPI issue: the problem is memory-size dependent, and it only
> occurs when run using more than one node, using MVAPICH2 (and MPICH2
> when I tested that on our cluster, which does not have InfiniBand).
>
> Have you had a chance to experiment with the demo code that I sent?
> I think the behavior warrants a little further investigation.
>
> Thanks
>
> David
>
> On Jan 31, 2008 1:06 PM, Jeff Squyres <jsquyres at cisco.com> wrote:
> Brian is completely correct - if the F90 compiler chooses to make
> temporary buffers in order to pass array subsections to non-blocking
> MPI functions, there's little that an MPI implementation can do.
> Simply put: MPI requires that when you use non-blocking
> communications, the buffer must be available until you call some
> flavor of MPI_TEST or MPI_WAIT to complete the communication.
>
> I don't know of any way for an MPI implementation to know whether it
> has been handed a temporary buffer (e.g., one that a compiler silently
> created to pass an array subsection).  Do you know if there is a way?
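>
> As a minimal sketch of that failure mode (the subroutine and variable
> names here are hypothetical, not taken from the test code in this
> thread): the compiler may build a contiguous copy-in temporary for a
> strided section, hand the temporary to MPI_ISEND, and free it as soon
> as the call returns - before MPI_WAIT has completed the transfer.
>
>   SUBROUTINE UNSAFE_ISEND_SKETCH(A, N, DEST)
>     USE MPI
>     IMPLICIT NONE
>     INTEGER, INTENT(IN) :: N, DEST
>     INTEGER, INTENT(IN) :: A(N,N)
>     INTEGER :: REQ, IERR, STAT(MPI_STATUS_SIZE)
>
>     ! A(1,1:N) is a row of a column-major array, so it is not
>     ! contiguous; the compiler may silently pass a temporary copy.
>     CALL MPI_ISEND(A(1,1:N), N, MPI_INTEGER, DEST, 0, &
>                    MPI_COMM_WORLD, REQ, IERR)
>
>     ! The hidden temporary may already be deallocated here, so MPI
>     ! can end up reading freed memory while the send is in flight.
>     CALL MPI_WAIT(REQ, STAT, IERR)
>   END SUBROUTINE UNSAFE_ISEND_SKETCH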
>
>
>
> On Jan 31, 2008, at 12:36 PM, Brian Curtis wrote:
>
> > David,
> >
> > The MPI-2 documentation goes into great detail on issues with
> > Fortran-90 bindings
> > (http://www.mpi-forum.org/docs/mpi-20-html/node236.htm#Node236).
> > The conditions you are seeing should be directed to Intel.
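> >
> > One way to sidestep the copying problem that document describes is
> > to describe the non-contiguous section to MPI with a derived
> > datatype and pass the contiguous parent array, so no compiler
> > temporary is needed. A rough sketch against the NON_CONTIGUOUS_FAILS
> > case from the test code, where data is allocated as data(N,M) and
> > the section is data(1:M,1:M) (secttype is a hypothetical local
> > variable):
> >
> >   INTEGER :: secttype, IERR
> >   ! M blocks (columns) of M integers each, with a stride of N
> >   ! integers between block starts in the parent array data(N,M).
> >   CALL MPI_TYPE_VECTOR(M, M, N, MPI_INTEGER, secttype, IERR)
> >   CALL MPI_TYPE_COMMIT(secttype, IERR)
> >   CALL MPI_BCAST(data, 1, secttype, 0, MPI_COMM_WORLD, IERR)
> >   CALL MPI_TYPE_FREE(secttype, IERR)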
> >
> >
> > Brian
> >
> >
> > On Jan 31, 2008, at 11:59 AM, David Stuebe wrote:
> >
> >>
> >> Hi again Brian
> >>
> >> I just ran my test code on our cluster using ifort 10.1.011 and
> >> MVAPICH2 1.0.1, but the behavior is still the same.
> >>
> >> Have you had a chance to try it on any of your test machines?
> >>
> >> David
> >>
> >>
> >>
> >>
> >> On Jan 25, 2008 12:31 PM, Brian Curtis <curtisbr at cse.ohio-state.edu> wrote:
> >> David,
> >>
> >> I did some research on this issue and it looks like you have posted
> >> the bug with Intel.  Please let us know what you find out.
> >>
> >>
> >> Brian
> >>
> >> David Stuebe wrote:
> >> > Hi Brian
> >> >
> >> > I downloaded the public release; it seems silly, but I am not sure
> >> > how to get a rev number from the source... there does not seem to
> >> > be a '-version' option that gives more info, although I did not
> >> > look too hard.
> >> >
> >> > I have not tried MVAPICH2 1.0.1, but once I have Intel ifort 10 on
> >> > the cluster I will try 1.0.1 and see if the problem goes away.
> >> >
> >> > In the meantime, please let me know whether you can recreate the
> >> > problem.
> >> >
> >> > David
> >> >
> >> > PS - Just want to make sure you understand my issue: I think it is
> >> > a bad idea to try to pass a non-contiguous F90 memory pointer, and
> >> > I should not do that... but the way that it breaks has caused me
> >> > headaches for weeks now. If it reliably caused a sigsegv on
> >> > entering MPI_BCAST that would be great! As it is, it is really hard
> >> > to trace the problem.
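> >> >
> >> > One safe pattern, sketched here with the names from the test code
> >> > (tmp is a hypothetical buffer, not something the test program
> >> > declares), is to stage the section through an explicitly allocated
> >> > contiguous buffer, so there is never a hidden compiler temporary:
> >> >
> >> >   INTEGER, ALLOCATABLE :: tmp(:,:)
> >> >   ALLOCATE(tmp(M,M))
> >> >   IF (MYID == 0) tmp = data(1:M,1:M)  ! explicit copy-in on the root
> >> >   CALL MPI_BCAST(tmp, M*M, MPI_INTEGER, 0, MPI_COMM_WORLD, IERR)
> >> >   data(1:M,1:M) = tmp                 ! explicit copy-out on all ranks
> >> >   DEALLOCATE(tmp)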
> >> >
> >> >
> >> >
> >> >
> >> > On Jan 23, 2008 3:23 PM, Brian Curtis <curtisbr at cse.ohio-state.edu> wrote:
> >> >
> >> >
> >> >> David,
> >> >>
> >> >> Sorry to hear you are experiencing problems with the MVAPICH2
> >> >> Fortran 90 interface.  I will be investigating this issue, but
> >> >> need some additional information about your setup.  What is the
> >> >> exact version of MVAPICH2 1.0 you are utilizing (daily tarball or
> >> >> release)?  Have you tried MVAPICH2 1.0.1?
> >> >>
> >> >> Brian
> >> >>
> >> >> David Stuebe wrote:
> >> >>
> >> >>> Hello MVAPICH
> >> >>> I have found a strange bug in MVAPICH2 using IFORT. The behavior
> >> >>> is very strange indeed - it seems to be related to how ifort
> >> >>> deals with passing pointers to the MVAPICH Fortran 90 interface.
> >> >>> The MPI call returns successfully, but later calls to a dummy
> >> >>> subroutine cause a sigsegv.
> >> >>>
> >> >>>  Please look at the following code:
> >> >>>
> >> >>>
> >> >>>
> >> >>> !=====================================================================
> >> >>> !=====================================================================
> >> >>> !=====================================================================
> >> >>> ! TEST CODE FOR A POSSIBLE BUG IN MVAPICH2 COMPILED WITH IFORT
> >> >>> ! WRITTEN BY: DAVID STUEBE
> >> >>> ! DATE: JAN 23, 2008
> >> >>> !
> >> >>> ! COMPILE WITH: mpif90 -xP mpi_prog.f90 -o xtest
> >> >>> !
> >> >>> ! KNOWN BEHAVIOR:
> >> >>> ! PASSING A NON-CONTIGUOUS POINTER TO MPI_BCAST CAUSES FAILURE OF
> >> >>> ! SUBROUTINES USING MULTI-DIMENSIONAL EXPLICIT-SHAPE ARRAYS WITHOUT
> >> >>> ! AN INTERFACE - EVEN THOUGH THE MPI_BCAST COMPLETES SUCCESSFULLY,
> >> >>> ! RETURNING VALID DATA.
> >> >>> !
> >> >>> ! COMMENTS:
> >> >>> ! I REALIZE PASSING NON-CONTIGUOUS POINTERS IS DANGEROUS - SHAME ON
> >> >>> ! ME FOR MAKING THAT MISTAKE. HOWEVER, IT SHOULD EITHER WORK OR NOT.
> >> >>> ! RETURNING SUCCESSFULLY BUT CAUSING INTERFACE ERRORS LATER IS
> >> >>> ! EXTREMELY DIFFICULT TO DEBUG!
> >> >>> !
> >> >>> ! CONDITIONS FOR OCCURRENCE:
> >> >>> !    COMPILER MUST OPTIMIZE USING 'VECTORIZATION'
> >> >>> !    ARRAY MUST BE 'LARGE' - SYSTEM DEPENDENT?
> >> >>> !    MUST BE RUN ON MORE THAN ONE NODE TO CAUSE A CRASH...
> >> >>> !    i.e. Running inside one SMP box does not crash.
> >> >>> !
> >> >>> !    RUNNING UNDER MPD, ALL PROCESSES SIGSEGV
> >> >>> !    RUNNING UNDER MPIEXEC 0.82 FOR PBS,
> >> >>> !       ONLY SOME PROCESSES SIGSEGV?
> >> >>> !
> >> >>> ! ENVIRONMENTAL INFO:
> >> >>> ! NODES: DELL 1850 3.0GHZ, 2GB RAM, INFINIBAND PCI-EX 4X
> >> >>> ! SYSTEM: ROCKS 4.2
> >> >>> ! gcc version 3.4.6 20060404 (Red Hat 3.4.6-3)
> >> >>> !
> >> >>> ! IFORT/ICC:
> >> >>> !   Intel(R) Fortran Compiler for Intel(R) EM64T-based applications,
> >> >>> !   Version 9.1 Build 20061101 Package ID: l_fc_c_9.1.040
> >> >>> !
> >> >>> ! MVAPICH2: mpif90 for mvapich2-1.0
> >> >>> ! ./configure --prefix=/usr/local/share/mvapich2/1.0
> >> >>> !    --with-device=osu_ch3:mrail --with-rdma=vapi --with-pm=mpd
> >> >>> !    --enable-f90 --enable-cxx --disable-romio --without-mpe
> >> >>> !
> >> >>>
> >> >>>
> >> >>> !=====================================================================
> >> >>> !=====================================================================
> >> >>> !=====================================================================
> >> >>> Module vars
> >> >>>   USE MPI
> >> >>>   implicit none
> >> >>>
> >> >>>
> >> >>>   integer :: n,m,MYID,NPROCS
> >> >>>   integer :: ipt
> >> >>>
> >> >>>   integer, allocatable, target :: data(:,:)
> >> >>>
> >> >>>   contains
> >> >>>
> >> >>>     subroutine alloc_vars
> >> >>>       implicit none
> >> >>>
> >> >>>       integer Status
> >> >>>
> >> >>>       allocate(data(n,m),stat=status)
> >> >>>       if (status /=0) then
> >> >>>          write(ipt,*) "allocation error"
> >> >>>          stop
> >> >>>       end if
> >> >>>
> >> >>>       data = 0
> >> >>>
> >> >>>     end subroutine alloc_vars
> >> >>>
> >> >>>    SUBROUTINE INIT_MPI_ENV(ID,NP)
> >> >>>
> >> >>>
> >> >>> !===================================================================|
> >> >>> !  INITIALIZE MPI ENVIRONMENT                                       |
> >> >>> !===================================================================|
> >> >>>      INTEGER, INTENT(OUT) :: ID,NP
> >> >>>      INTEGER IERR
> >> >>>
> >> >>>      IERR=0
> >> >>>
> >> >>>      CALL MPI_INIT(IERR)
> >> >>>      IF(IERR/=0) WRITE(*,*) "BAD MPI_INIT", ID
> >> >>>      CALL MPI_COMM_RANK(MPI_COMM_WORLD,ID,IERR)
> >> >>>      IF(IERR/=0) WRITE(*,*) "BAD MPI_COMM_RANK", ID
> >> >>>      CALL MPI_COMM_SIZE(MPI_COMM_WORLD,NP,IERR)
> >> >>>      IF(IERR/=0) WRITE(*,*) "BAD MPI_COMM_SIZE", ID
> >> >>>
> >> >>>    END SUBROUTINE INIT_MPI_ENV
> >> >>>
> >> >>>
> >> >>>
> >> >>>
> >> >>> !===================================================================|
> >> >>>   SUBROUTINE PSHUTDOWN
> >> >>>
> >> >>>
> >> >>>
> >> >>> !===================================================================|
> >> >>>     INTEGER IERR
> >> >>>
> >> >>>     IERR=0
> >> >>>     CALL MPI_FINALIZE(IERR)
> >> >>>     if(ierr /=0) write(ipt,*) "BAD MPI_FINALIZE", MYID
> >> >>>     close(IPT)
> >> >>>     STOP
> >> >>>
> >> >>>   END SUBROUTINE PSHUTDOWN
> >> >>>
> >> >>>
> >> >>>   SUBROUTINE CONTIGUOUS_WORKS
> >> >>>     IMPLICIT NONE
> >> >>>     INTEGER, pointer :: ptest(:,:)
> >> >>>     INTEGER :: IERR, I,J
> >> >>>
> >> >>>
> >> >>>     write(ipt,*) "START CONTIGUOUS:"
> >> >>>     n=2000 ! Set size here...
> >> >>>     m=n+10
> >> >>>
> >> >>>     call alloc_vars
> >> >>>     write(ipt,*) "ALLOCATED DATA"
> >> >>>     ptest => data(1:N,1:N)
> >> >>>
> >> >>>     IF (MYID == 0) ptest=6
> >> >>>     write(ipt,*) "Made POINTER"
> >> >>>
> >> >>>     call MPI_BCAST(ptest,N*N,MPI_INTEGER,0,MPI_COMM_WORLD,IERR)
> >> >>>     IF(IERR /= 0) WRITE(IPT,*) "BAD BCAST", MYID
> >> >>>
> >> >>>     write(ipt,*) "BROADCAST Data; a value:",data(1,6)
> >> >>>
> >> >>>     DO I = 1,N
> >> >>>        DO J = 1,N
> >> >>>           if(data(I,J) /= 6) &
> >> >>>                & write(ipt,*) "INCORRECT VALUE!", I,J,data(I,J)
> >> >>>        END DO
> >> >>>
> >> >>>        DO J = N+1,M
> >> >>>           if(data(I,J) /= 0) &
> >> >>>                & write(ipt,*) "INCORRECT VALUE!", I,J,data(I,J)
> >> >>>        END DO
> >> >>>
> >> >>>     END DO
> >> >>>
> >> >>>     ! CALL THREE DIFFERENT EXAMPLES OF SUBROUTINES WITHOUT AN
> >> >>>     ! INTERFACE THAT USE AN EXPLICIT-SHAPE ARRAY
> >> >>>     write(ipt,*) "CALLING DUMMY1"
> >> >>>     CALL DUMMY1
> >> >>>
> >> >>>     write(ipt,*) "CALLING DUMMY2"
> >> >>>     call Dummy2(m,n)
> >> >>>
> >> >>>     write(ipt,*) "CALLING DUMMY3"
> >> >>>     call Dummy3
> >> >>>     write(ipt,*) "FINISHED!"
> >> >>>
> >> >>>   END SUBROUTINE CONTIGUOUS_WORKS
> >> >>>
> >> >>>   SUBROUTINE NON_CONTIGUOUS_FAILS
> >> >>>     IMPLICIT NONE
> >> >>>     INTEGER, pointer :: ptest(:,:)
> >> >>>     INTEGER :: IERR, I,J
> >> >>>
> >> >>>
> >> >>>     write(ipt,*) "START NON_CONTIGUOUS:"
> >> >>>
> >> >>>     m=200 ! Set size here - crash is size dependent!
> >> >>>     n=m+10
> >> >>>
> >> >>>     call alloc_vars
> >> >>>     write(ipt,*) "ALLOCATED DATA"
> >> >>>     ptest => data(1:M,1:M)
> >> >>>
> >> >>> !===================================================
> >> >>> ! IF YOU CALL DUMMY2 HERE TOO, THEN EVERYTHING PASSES  ???
> >> >>> !===================================================
> >> >>> !    CALL DUMMY1 ! THIS ONE HAS NO EFFECT
> >> >>> !    CALL DUMMY2 ! THIS ONE 'FIXES' THE BUG
> >> >>>
> >> >>>     IF (MYID == 0) ptest=6
> >> >>>     write(ipt,*) "Made POINTER"
> >> >>>
> >> >>>     call MPI_BCAST(ptest,M*M,MPI_INTEGER,0,MPI_COMM_WORLD,IERR)
> >> >>>     IF(IERR /= 0) WRITE(IPT,*) "BAD BCAST"
> >> >>>
> >> >>>     write(ipt,*) "BROADCAST Data; a value:",data(1,6)
> >> >>>
> >> >>>     DO I = 1,M
> >> >>>        DO J = 1,M
> >> >>>           if(data(J,I) /= 6) &
> >> >>>                & write(ipt,*) "INCORRECT VALUE!",I,J,DATA(J,I)
> >> >>>        END DO
> >> >>>
> >> >>>        DO J = M+1,N
> >> >>>           if(data(J,I) /= 0) &
> >> >>>                & write(ipt,*) "INCORRECT VALUE!",I,J,DATA(J,I)
> >> >>>        END DO
> >> >>>     END DO
> >> >>>
> >> >>>     ! CALL THREE DIFFERENT EXAMPLES OF SUBROUTINES WITHOUT AN
> >> >>>     ! INTERFACE THAT USE AN EXPLICIT-SHAPE ARRAY
> >> >>>     write(ipt,*) "CALLING DUMMY1"
> >> >>>     CALL DUMMY1
> >> >>>
> >> >>>     write(ipt,*) "CALLING DUMMY2"
> >> >>>     call Dummy2(m,n) ! SHOULD CRASH HERE!
> >> >>>
> >> >>>     write(ipt,*) "CALLING DUMMY3"
> >> >>>     call Dummy3
> >> >>>     write(ipt,*) "FINISHED!"
> >> >>>
> >> >>>   END SUBROUTINE NON_CONTIGUOUS_FAILS
> >> >>>
> >> >>>
> >> >>>   End Module vars
> >> >>>
> >> >>>
> >> >>> Program main
> >> >>>   USE vars
> >> >>>   implicit none
> >> >>>
> >> >>>
> >> >>>   CALL INIT_MPI_ENV(MYID,NPROCS)
> >> >>>
> >> >>>   ipt=myid+10
> >> >>>   OPEN(ipt)
> >> >>>
> >> >>>
> >> >>>   write(ipt,*) "Start memory test!"
> >> >>>
> >> >>>   CALL NON_CONTIGUOUS_FAILS
> >> >>>
> >> >>> !  CALL CONTIGUOUS_WORKS
> >> >>>
> >> >>>   write(ipt,*) "End memory test!"
> >> >>>
> >> >>>   CALL PSHUTDOWN
> >> >>>
> >> >>> END Program main
> >> >>>
> >> >>>
> >> >>>
> >> >>> ! THREE DUMMY SUBROUTINES WITH EXPLICIT-SHAPE ARRAYS
> >> >>> ! DUMMY1 DECLARES A VECTOR - THIS ONE NEVER CAUSES FAILURE
> >> >>> ! DUMMY2 DECLARES AN ARRAY - THIS ONE CAUSES FAILURE
> >> >>> ! DUMMY3 DECLARES AN ARRAY FROM THE MODULE VARIABLES m,n
> >> >>>
> >> >>> SUBROUTINE DUMMY1
> >> >>>   USE vars
> >> >>>   implicit none
> >> >>>   real, dimension(m) :: my_data
> >> >>>
> >> >>>   write(ipt,*) "m,n",m,n
> >> >>>
> >> >>>   write(ipt,*) "DUMMY 1", size(my_data)
> >> >>>
> >> >>> END SUBROUTINE DUMMY1
> >> >>>
> >> >>>
> >> >>> SUBROUTINE DUMMY2(i,j)
> >> >>>   USE vars
> >> >>>   implicit none
> >> >>>   INTEGER, INTENT(IN) ::i,j
> >> >>>
> >> >>>
> >> >>>   real, dimension(i,j) :: my_data
> >> >>>
> >> >>>   write(ipt,*) "start: DUMMY 2", size(my_data)
> >> >>>
> >> >>>
> >> >>> END SUBROUTINE DUMMY2
> >> >>>
> >> >>> SUBROUTINE DUMMY3
> >> >>>   USE vars
> >> >>>   implicit none
> >> >>>
> >> >>>
> >> >>>   real, dimension(m,n) :: my_data
> >> >>>
> >> >>>
> >> >>>   write(ipt,*) "start: DUMMY 3", size(my_data)
> >> >>>
> >> >>>
> >> >>> END SUBROUTINE DUMMY3
> >> >>>
> >> >>>
> >> >>>
> >> >>>
> >> >
> >> >
> >>
> >
>
>
> --
> Jeff Squyres
> Cisco Systems
>
>
