[mvapich-discuss] SIGSEV in F90: An MPI bug?

Jeff Squyres jsquyres at cisco.com
Thu Jan 31 13:06:46 EST 2008


Brian is completely correct - if the F90 compiler chooses to make  
temporary buffers in order to pass array subsections to non-blocking  
MPI functions, there's little that an MPI implementation can do.   
Simply put: MPI requires that when you use non-blocking  
communications, the buffer must be available until you call some  
flavor of MPI_TEST or MPI_WAIT to complete the communication.
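
To make that concrete, here is a minimal sketch of the unsafe pattern (an
illustration only, not code from this thread; the subroutine name, tag, and
array sizes are made up):

  ! Hypothetical sketch: the compiler may copy the non-contiguous section
  ! a(1,:) into a hidden temporary, hand that temporary to MPI_ISEND, and
  ! free it as soon as MPI_ISEND returns - before MPI_WAIT completes the
  ! transfer.
  subroutine isend_section(a, dest, comm)
    use mpi
    implicit none
    integer, intent(in) :: a(100,100), dest, comm
    integer :: req, ierr, istat(MPI_STATUS_SIZE)

    call MPI_ISEND(a(1,:), 100, MPI_INTEGER, dest, 99, comm, req, ierr)
    ! If a temporary was made, it may already be gone by this point.
    call MPI_WAIT(req, istat, ierr)
  end subroutine isend_section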

I don't know of any way for an MPI implementation to know whether it  
has been handed a temporary buffer (e.g., one that a compiler silently  
created to pass an array subsection).  Do you know if there is a way?
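
The usual way to stay safe with non-contiguous data is to keep the actual
buffer argument contiguous and let a derived datatype describe the stride.
A sketch, again with made-up names and sizes rather than anything from
David's test case:

  ! Hypothetical sketch: broadcast a strided row without relying on a
  ! compiler temporary.  The actual argument is the whole (contiguous)
  ! array; the derived datatype selects the non-contiguous elements.
  subroutine bcast_row(a, comm)
    use mpi
    implicit none
    integer, intent(inout) :: a(100,100)
    integer, intent(in)    :: comm
    integer :: rowtype, ierr

    ! a(1,1:100): 100 blocks of 1 integer, separated by a stride of 100.
    call MPI_TYPE_VECTOR(100, 1, 100, MPI_INTEGER, rowtype, ierr)
    call MPI_TYPE_COMMIT(rowtype, ierr)

    call MPI_BCAST(a, 1, rowtype, 0, comm, ierr)

    call MPI_TYPE_FREE(rowtype, ierr)
  end subroutine bcast_row

Because the whole array is passed by reference, there is nothing for the
compiler to copy, so the same pattern should also be safe with the
non-blocking calls.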



On Jan 31, 2008, at 12:36 PM, Brian Curtis wrote:

> David,
>
> The MPI-2 documentation goes into great detail on issues with
> Fortran-90 bindings (http://www.mpi-forum.org/docs/mpi-20-html/node236.htm#Node236).
> The conditions you are seeing should be directed to Intel.
>
>
> Brian
>
>
> On Jan 31, 2008, at 11:59 AM, David Stuebe wrote:
>
>>
>> Hi again Brian
>>
>> I just ran my test code on our cluster using ifort 10.1.011 and  
>> MVAPICH 1.0.1, but the behavior is still the same.
>>
>> Have you had a chance to try it on any of your test machines?
>>
>> David
>>
>>
>>
>>
>> On Jan 25, 2008 12:31 PM, Brian Curtis <curtisbr at cse.ohio-state.edu> wrote:
>> David,
>>
>> I did some research on this issue and it looks like you have posted the
>> bug with Intel.  Please let us know what you find out.
>>
>>
>> Brian
>>
>> David Stuebe wrote:
>> > Hi Brian
>> >
>> > I downloaded the public release. It seems silly, but I am not sure how
>> > to get a rev number from the source... there does not seem to be a
>> > '-version' option that gives more info, although I did not look too hard.
>> >
>> > I have not tried MVAPICH 1.0.1, but once I have Intel ifort 10 on the
>> > cluster I will try 1.0.1 and see if it goes away.
>> >
>> > In the meantime, please let me know if you can recreate the problem.
>> >
>> > David
>> >
>> > PS - Just want to make sure you understand my issue: I think it is a bad
>> > idea to try to pass a non-contiguous F90 memory pointer, and I should not
>> > do that... but the way that it breaks has caused me headaches for weeks
>> > now. If it reliably caused a SIGSEGV on entering MPI_BCAST that would be
>> > great! As it is, it is really hard to trace the problem.
>> >
>> >
>> >
>> >
>> > On Jan 23, 2008 3:23 PM, Brian Curtis <curtisbr at cse.ohio-state.edu> wrote:
>> >
>> >
>> >> David,
>> >>
>> >> Sorry to hear you are experiencing problems with the MVAPICH2 Fortran 90
>> >> interface.  I will be investigating this issue, but need some additional
>> >> information about your setup.  What is the exact version of MVAPICH2 1.0
>> >> you are utilizing (daily tarball or release)?  Have you tried MVAPICH2
>> >> 1.0.1?
>> >>
>> >> Brian
>> >>
>> >> David Stuebe wrote:
>> >>
>> >>> Hello MVAPICH
>> >>> I have found a strange bug in MVAPICH2 using IFORT. The behavior is
>> >>> very strange indeed - it seems to be related to how ifort deals with
>> >>> passing pointers to the MVAPICH FORTRAN 90 INTERFACE.
>> >>> The MPI call returns successfully, but later calls to a dummy
>> >>> subroutine cause a SIGSEGV.
>> >>>
>> >>>  Please look at the following code:
>> >>>
>> >>>
>> >>>
>> >>> !==============================================================================
>> >>> !==============================================================================
>> >>> !==============================================================================
>> >>> ! TEST CODE FOR POSSIBLE BUG IN MVAPICH2 COMPILED WITH IFORT
>> >>> ! WRITTEN BY: DAVID STUEBE
>> >>> ! DATE: JAN 23, 2008
>> >>> !
>> >>> ! COMPILE WITH: mpif90 -xP mpi_prog.f90 -o xtest
>> >>> !
>> >>> ! KNOWN BEHAVIOR:
>> >>> ! PASSING A NON-CONTIGUOUS POINTER TO MPI_BCAST CAUSES FAILURE OF
>> >>> ! SUBROUTINES USING MULTI-DIMENSIONAL EXPLICIT SHAPE ARRAYS WITHOUT AN INTERFACE -
>> >>> ! EVEN THOUGH THE MPI_BCAST COMPLETES SUCCESSFULLY, RETURNING VALID DATA.
>> >>> !
>> >>> ! COMMENTS:
>> >>> ! I REALIZE PASSING NON-CONTIGUOUS POINTERS IS DANGEROUS - SHAME ON
>> >>> ! ME FOR MAKING THAT MISTAKE. HOWEVER, IT SHOULD EITHER WORK OR NOT.
>> >>> ! RETURNING SUCCESSFULLY BUT CAUSING INTERFACE ERRORS LATER IS
>> >>> ! EXTREMELY DIFFICULT TO DEBUG!
>> >>> !
>> >>> ! CONDITIONS FOR OCCURRENCE:
>> >>> !    COMPILER MUST OPTIMIZE USING 'VECTORIZATION'
>> >>> !    ARRAY MUST BE 'LARGE' - SYSTEM DEPENDENT?
>> >>> !    MUST BE RUN ON MORE THAN ONE NODE TO CAUSE CRASH...
>> >>> !    ie  Running inside one SMP box does not crash.
>> >>> !
>> >>> !    RUNNING UNDER MPD, ALL PROCESSES SIGSEGV
>> >>> !    RUNNING UNDER MPIEXEC 0.82 FOR PBS,
>> >>> !       ONLY SOME PROCESSES SIGSEGV?
>> >>> !
>> >>> ! ENVIRONMENTAL INFO:
>> >>> ! NODES: DELL 1850 3.0GHZ, 2GB RAM, INFINIBAND PCI-EX 4X
>> >>> ! SYSTEM: ROCKS 4.2
>> >>> ! gcc version 3.4.6 20060404 (Red Hat 3.4.6-3)
>> >>> !
>> >>> ! IFORT/ICC:
>> >>> !   Intel(R) Fortran Compiler for Intel(R) EM64T-based applications,
>> >>> !   Version 9.1 Build 20061101 Package ID: l_fc_c_9.1.040
>> >>> !
>> >>> ! MVAPICH2: mpif90 for mvapich2-1.0
>> >>> ! ./configure --prefix=/usr/local/share/mvapich2/1.0
>> >>> !   --with-device=osu_ch3:mrail --with-rdma=vapi --with-pm=mpd --enable-f90
>> >>> !   --enable-cxx --disable-romio --without-mpe
>> >>> !
>> >>>
>> >>>
>> >>> !==============================================================================
>> >>> !==============================================================================
>> >>> !==============================================================================
>> >>> Module vars
>> >>>   USE MPI
>> >>>   implicit none
>> >>>
>> >>>
>> >>>   integer :: n,m,MYID,NPROCS
>> >>>   integer :: ipt
>> >>>
>> >>>   integer, allocatable, target :: data(:,:)
>> >>>
>> >>>   contains
>> >>>
>> >>>     subroutine alloc_vars
>> >>>       implicit none
>> >>>
>> >>>       integer Status
>> >>>
>> >>>       allocate(data(n,m),stat=status)
>> >>>       if (status /=0) then
>> >>>          write(ipt,*) "allocation error"
>> >>>          stop
>> >>>       end if
>> >>>
>> >>>       data = 0
>> >>>
>> >>>     end subroutine alloc_vars
>> >>>
>> >>>    SUBROUTINE INIT_MPI_ENV(ID,NP)
>> >>>
>> >>>
>> >>> !============================================================================|
>> >>> !  INITIALIZE MPI ENVIRONMENT                                                |
>> >>> !============================================================================|
>> >>>      INTEGER, INTENT(OUT) :: ID,NP
>> >>>      INTEGER IERR
>> >>>
>> >>>      IERR=0
>> >>>
>> >>>      CALL MPI_INIT(IERR)
>> >>>      IF(IERR/=0) WRITE(*,*) "BAD MPI_INIT", ID
>> >>>      CALL MPI_COMM_RANK(MPI_COMM_WORLD,ID,IERR)
>> >>>      IF(IERR/=0) WRITE(*,*) "BAD MPI_COMM_RANK", ID
>> >>>      CALL MPI_COMM_SIZE(MPI_COMM_WORLD,NP,IERR)
>> >>>      IF(IERR/=0) WRITE(*,*) "BAD MPI_COMM_SIZE", ID
>> >>>
>> >>>    END SUBROUTINE INIT_MPI_ENV
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> !============================================================================|
>> >>>   SUBROUTINE PSHUTDOWN
>> >>>
>> >>>
>> >>>
>> >>> !============================================================================|
>> >>>     INTEGER IERR
>> >>>
>> >>>     IERR=0
>> >>>     CALL MPI_FINALIZE(IERR)
>> >>>     if(ierr /=0) write(ipt,*) "BAD MPI_FINALIZE", MYID
>> >>>     close(IPT)
>> >>>     STOP
>> >>>
>> >>>   END SUBROUTINE PSHUTDOWN
>> >>>
>> >>>
>> >>>   SUBROUTINE CONTIGUOUS_WORKS
>> >>>     IMPLICIT NONE
>> >>>     INTEGER, pointer :: ptest(:,:)
>> >>>     INTEGER :: IERR, I,J
>> >>>
>> >>>
>> >>>     write(ipt,*) "START CONTIGUOUS:"
>> >>>     n=2000 ! Set size here...
>> >>>     m=n+10
>> >>>
>> >>>     call alloc_vars
>> >>>     write(ipt,*) "ALLOCATED DATA"
>> >>>     ptest => data(1:N,1:N)
>> >>>
>> >>>     IF (MYID == 0) ptest=6
>> >>>     write(ipt,*) "Made POINTER"
>> >>>
>> >>>     call MPI_BCAST(ptest,N*N,MPI_INTEGER,0,MPI_COMM_WORLD,IERR)
>> >>>     IF(IERR /= 0) WRITE(IPT,*) "BAD BCAST", MYID
>> >>>
>> >>>     write(ipt,*) "BROADCAST Data; a value:",data(1,6)
>> >>>
>> >>>     DO I = 1,N
>> >>>        DO J = 1,N
>> >>>           if(data(I,J) /= 6) &
>> >>>                & write(ipt,*) "INCORRECT VALUE!", I,J,data(I,J)
>> >>>        END DO
>> >>>
>> >>>        DO J = N+1,M
>> >>>           if(data(I,J) /= 0) &
>> >>>                & write(ipt,*) "INCORRECT VALUE!", I,J,data(I,J)
>> >>>        END DO
>> >>>
>> >>>     END DO
>> >>>
>> >>>     ! CALL THREE DIFFERENT EXAMPLES OF SUBROUTINES W/OUT AN INTERFACE
>> >>>     ! THAT USE AN EXPLICIT SHAPE ARRAY
>> >>>     write(ipt,*) "CALLING DUMMY1"
>> >>>     CALL DUMMY1
>> >>>
>> >>>     write(ipt,*) "CALLING DUMMY2"
>> >>>     call Dummy2(m,n)
>> >>>
>> >>>     write(ipt,*) "CALLING DUMMY3"
>> >>>     call Dummy3
>> >>>     write(ipt,*) "FINISHED!"
>> >>>
>> >>>   END SUBROUTINE CONTIGUOUS_WORKS
>> >>>
>> >>>   SUBROUTINE NON_CONTIGUOUS_FAILS
>> >>>     IMPLICIT NONE
>> >>>     INTEGER, pointer :: ptest(:,:)
>> >>>     INTEGER :: IERR, I,J
>> >>>
>> >>>
>> >>>     write(ipt,*) "START NON_CONTIGUOUS:"
>> >>>
>> >>>     m=200 ! Set size here - crash is size dependent!
>> >>>     n=m+10
>> >>>
>> >>>     call alloc_vars
>> >>>     write(ipt,*) "ALLOCATED DATA"
>> >>>     ptest => data(1:M,1:M)
>> >>>
>> >>> !===================================================
>> >>> ! IF YOU CALL DUMMY2 HERE TOO, THEN EVERYTHING PASSES  ???
>> >>> !===================================================
>> >>> !    CALL DUMMY1 ! THIS ONE HAS NO EFFECT
>> >>> !    CALL DUMMY2 ! THIS ONE 'FIXES' THE BUG
>> >>>
>> >>>     IF (MYID == 0) ptest=6
>> >>>     write(ipt,*) "Made POINTER"
>> >>>
>> >>>     call MPI_BCAST(ptest,M*M,MPI_INTEGER,0,MPI_COMM_WORLD,IERR)
>> >>>     IF(IERR /= 0) WRITE(IPT,*) "BAD BCAST"
>> >>>
>> >>>     write(ipt,*) "BROADCAST Data; a value:",data(1,6)
>> >>>
>> >>>     DO I = 1,M
>> >>>        DO J = 1,M
>> >>>           if(data(J,I) /= 6) &
>> >>>                & write(ipt,*) "INCORRECT VALUE!",I,J,DATA(I,J)
>> >>>        END DO
>> >>>
>> >>>        DO J = M+1,N
>> >>>           if(data(J,I) /= 0) &
>> >>>                & write(ipt,*) "INCORRECT VALUE!",I,J,DATA(I,J)
>> >>>        END DO
>> >>>     END DO
>> >>>
>> >>>     ! CALL THREE DIFFERENT EXAMPLES OF SUBROUTINES W/OUT AN INTERFACE
>> >>>     ! THAT USE AN EXPLICIT SHAPE ARRAY
>> >>>     write(ipt,*) "CALLING DUMMY1"
>> >>>     CALL DUMMY1
>> >>>
>> >>>     write(ipt,*) "CALLING DUMMY2"
>> >>>     call Dummy2(m,n) ! SHOULD CRASH HERE!
>> >>>
>> >>>     write(ipt,*) "CALLING DUMMY3"
>> >>>     call Dummy3
>> >>>     write(ipt,*) "FINISHED!"
>> >>>
>> >>>   END SUBROUTINE NON_CONTIGUOUS_FAILS
>> >>>
>> >>>
>> >>>   End Module vars
>> >>>
>> >>>
>> >>> Program main
>> >>>   USE vars
>> >>>   implicit none
>> >>>
>> >>>
>> >>>   CALL INIT_MPI_ENV(MYID,NPROCS)
>> >>>
>> >>>   ipt=myid+10
>> >>>   OPEN(ipt)
>> >>>
>> >>>
>> >>>   write(ipt,*) "Start memory test!"
>> >>>
>> >>>   CALL NON_CONTIGUOUS_FAILS
>> >>>
>> >>> !  CALL CONTIGUOUS_WORKS
>> >>>
>> >>>   write(ipt,*) "End memory test!"
>> >>>
>> >>>   CALL PSHUTDOWN
>> >>>
>> >>> END Program main
>> >>>
>> >>>
>> >>>
>> >>> ! THREE DUMMY SUBROUTINES WITH EXPLICIT SHAPE ARRAYS
>> >>> ! DUMMY1 DECLARES A VECTOR  - THIS ONE NEVER CAUSES FAILURE
>> >>> ! DUMMY2 DECLARES AN ARRAY  - THIS ONE CAUSES FAILURE
>> >>> ! DUMMY3 DECLARES AN ARRAY USING THE MODULE VARIABLES m AND n
>> >>>
>> >>> SUBROUTINE DUMMY1
>> >>>   USE vars
>> >>>   implicit none
>> >>>   real, dimension(m) :: my_data
>> >>>
>> >>>   write(ipt,*) "m,n",m,n
>> >>>
>> >>>   write(ipt,*) "DUMMY 1", size(my_data)
>> >>>
>> >>> END SUBROUTINE DUMMY1
>> >>>
>> >>>
>> >>> SUBROUTINE DUMMY2(i,j)
>> >>>   USE vars
>> >>>   implicit none
>> >>>   INTEGER, INTENT(IN) ::i,j
>> >>>
>> >>>
>> >>>   real, dimension(i,j) :: my_data
>> >>>
>> >>>   write(ipt,*) "start: DUMMY 2", size(my_data)
>> >>>
>> >>>
>> >>> END SUBROUTINE DUMMY2
>> >>>
>> >>> SUBROUTINE DUMMY3
>> >>>   USE vars
>> >>>   implicit none
>> >>>
>> >>>
>> >>>   real, dimension(m,n) :: my_data
>> >>>
>> >>>
>> >>>   write(ipt,*) "start: DUMMY 3", size(my_data)
>> >>>
>> >>>
>> >>> END SUBROUTINE DUMMY3
>> >>>
>> >>>
>> >>>
>> >
>> >
>>
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss


-- 
Jeff Squyres
Cisco Systems



More information about the mvapich-discuss mailing list