[mvapich-discuss] SIGSEV in F90: An MPI bug?
Jeff Squyres
jsquyres at cisco.com
Thu Jan 31 13:06:46 EST 2008
Brian is completely correct - if the F90 compiler chooses to make
temporary buffers in order to pass array subsections to non-blocking
MPI functions, there's little that an MPI implementation can do.
Simply put: MPI requires that when you use non-blocking
communications, the buffer must be available until you call some
flavor of MPI_TEST or MPI_WAIT to complete the communication.
I don't know of any way for an MPI implementation to know whether it
has been handed a temporary buffer (e.g., one that a compiler silently
created to pass an array subsection). Do you know if there is a way?
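[Editor's note: the following sketch is not from the original thread; it is a minimal, hypothetical illustration of the hazard Jeff describes, assuming the standard `use mpi` Fortran bindings. The program name and buffer names are made up.]

```fortran
! Hypothetical sketch of the temporary-buffer hazard. buf(1:50,1:50) is a
! non-contiguous array subsection: an F90 compiler is permitted to pass
! MPI_ISEND a hidden contiguous copy and free that copy as soon as
! MPI_ISEND returns -- while the non-blocking send is still in flight.
! Run with at least 2 ranks, e.g. mpirun -np 2 ./a.out
program subsection_hazard
  use mpi
  implicit none
  integer :: ierr, rank, req
  integer :: stat(MPI_STATUS_SIZE)
  integer :: buf(100,100), recvbuf(50,50)
  integer, allocatable :: sendbuf(:,:)

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
  buf = rank

  if (rank == 0) then
     ! DANGEROUS: the subsection may be copied to a compiler temporary
     ! that does not survive until MPI_WAIT completes the communication.
     call MPI_ISEND(buf(1:50,1:50), 2500, MPI_INTEGER, 1, 99, &
                    MPI_COMM_WORLD, req, ierr)
     call MPI_WAIT(req, stat, ierr)

     ! SAFE: pack the subsection into a contiguous buffer that provably
     ! stays allocated until the matching MPI_WAIT returns.
     allocate(sendbuf(50,50))
     sendbuf = buf(1:50,1:50)
     call MPI_ISEND(sendbuf, 2500, MPI_INTEGER, 1, 98, &
                    MPI_COMM_WORLD, req, ierr)
     call MPI_WAIT(req, stat, ierr)
     deallocate(sendbuf)
  else if (rank == 1) then
     call MPI_RECV(recvbuf, 2500, MPI_INTEGER, 0, 99, &
                   MPI_COMM_WORLD, stat, ierr)
     call MPI_RECV(recvbuf, 2500, MPI_INTEGER, 0, 98, &
                   MPI_COMM_WORLD, stat, ierr)
  end if

  call MPI_FINALIZE(ierr)
end program subsection_hazard
```

With a blocking call such as MPI_BCAST the copy-in/copy-out temporary outlives the call, which is why a broadcast can appear to succeed even when a hidden temporary was created; for non-blocking calls there is no such guarantee.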
On Jan 31, 2008, at 12:36 PM, Brian Curtis wrote:
> David,
>
> The MPI-2 documentation goes into great detail on issues with the
> Fortran-90 bindings (http://www.mpi-forum.org/docs/mpi-20-html/node236.htm#Node236).
> The issue you are seeing should be reported to Intel.
>
>
> Brian
>
>
> On Jan 31, 2008, at 11:59 AM, David Stuebe wrote:
>
>>
>> Hi again Brian
>>
>> I just ran my test code on our cluster using ifort 10.1.011 and
>> MVAPICH2 1.0.1, but the behavior is still the same.
>>
>> Have you had a chance to try it on any of your test machines?
>>
>> David
>>
>>
>>
>>
>> On Jan 25, 2008 12:31 PM, Brian Curtis <curtisbr at cse.ohio-state.edu> wrote:
>> David,
>>
>> I did some research on this issue and it looks like you have posted the
>> bug with Intel. Please let us know what you find out.
>>
>>
>> Brian
>>
>> David Stuebe wrote:
>> > Hi Brian
>> >
>> > I downloaded the public release. It seems silly, but I am not sure how
>> > to get a rev number from the source... there does not seem to be a
>> > '-version' option that gives more info, although I did not look too hard.
>> >
>> > I have not tried MVAPICH2 1.0.1, but once I have Intel ifort 10 on the
>> > cluster I will try 1.0.1 and see if the problem goes away.
>> >
>> > In the meantime, please let me know if you can recreate the problem.
>> >
>> > David
>> >
>> > PS - Just want to make sure you understand my issue: I think it is a
>> > bad idea to try to pass a non-contiguous F90 memory pointer, and I
>> > should not do that... but the way that it breaks has caused me
>> > headaches for weeks now. If it reliably caused a SIGSEGV on entering
>> > MPI_BCAST that would be great! As it is, it is really hard to trace
>> > the problem.
>> >
>> >
>> >
>> >
>> > On Jan 23, 2008 3:23 PM, Brian Curtis <curtisbr at cse.ohio-state.edu> wrote:
>> >
>> >
>> >> David,
>> >>
>> >> Sorry to hear you are experiencing problems with the MVAPICH2
>> >> Fortran 90 interface. I will be investigating this issue, but need
>> >> some additional information about your setup. What is the exact
>> >> version of MVAPICH2 1.0 you are utilizing (daily tarball or release)?
>> >> Have you tried MVAPICH2 1.0.1?
>> >>
>> >> Brian
>> >>
>> >> David Stuebe wrote:
>> >>
>> >>> Hello MVAPICH
>> >>> I have found a strange bug in MVAPICH2 using IFORT. The behavior is
>> >>> very strange indeed - it seems to be related to how ifort deals with
>> >>> passing pointers to the MVAPICH FORTRAN 90 INTERFACE.
>> >>> The MPI call returns successfully, but later calls to a dummy
>> >>> subroutine cause a SIGSEGV.
>> >>>
>> >>> Please look at the following code:
>> >>>
>> >>>
>> >>>
>> >>> !=====================================================================
>> >>> !=====================================================================
>> >>> !=====================================================================
>> >>> ! TEST CODE FOR POSSIBLE BUG IN MVAPICH2 COMPILED WITH IFORT
>> >>> ! WRITTEN BY: DAVID STUEBE
>> >>> ! DATE: JAN 23, 2008
>> >>> !
>> >>> ! COMPILE WITH: mpif90 -xP mpi_prog.f90 -o xtest
>> >>> !
>> >>> ! KNOWN BEHAVIOR:
>> >>> ! PASSING A NON-CONTIGUOUS POINTER TO MPI_BCAST CAUSES FAILURE OF
>> >>> ! SUBROUTINES USING MULTI-DIMENSIONAL EXPLICIT SHAPE ARRAYS WITHOUT
>> >>> ! AN INTERFACE - EVEN THOUGH THE MPI_BCAST COMPLETES SUCCESSFULLY,
>> >>> ! RETURNING VALID DATA.
>> >>> !
>> >>> ! COMMENTS:
>> >>> ! I REALIZE PASSING NON-CONTIGUOUS POINTERS IS DANGEROUS - SHAME ON
>> >>> ! ME FOR MAKING THAT MISTAKE. HOWEVER, IT SHOULD EITHER WORK OR NOT.
>> >>> ! RETURNING SUCCESSFULLY BUT CAUSING INTERFACE ERRORS LATER IS
>> >>> ! EXTREMELY DIFFICULT TO DEBUG!
>> >>> !
>> >>> ! CONDITIONS FOR OCCURRENCE:
>> >>> ! COMPILER MUST OPTIMIZE USING 'VECTORIZATION'
>> >>> ! ARRAY MUST BE 'LARGE' - SYSTEM DEPENDENT?
>> >>> ! MUST BE RUN ON MORE THAN ONE NODE TO CAUSE CRASH...
>> >>> ! ie Running inside one SMP box does not crash.
>> >>> !
>> >>> ! RUNNING UNDER MPD, ALL PROCESSES SIGSEGV
>> >>> ! RUNNING UNDER MPIEXEC 0.82 FOR PBS,
>> >>> ! ONLY SOME PROCESSES SIGSEGV?
>> >>> !
>> >>> ! ENVIRONMENTAL INFO:
>> >>> ! NODES: DELL 1850 3.0GHZ, 2GB RAM, INFINIBAND PCI-EX 4X
>> >>> ! SYSTEM: ROCKS 4.2
>> >>> ! gcc version 3.4.6 20060404 (Red Hat 3.4.6-3)
>> >>> !
>> >>> ! IFORT/ICC:
>> >>> ! Intel(R) Fortran Compiler for Intel(R) EM64T-based applications,
>> >>> ! Version 9.1 Build 20061101 Package ID: l_fc_c_9.1.040
>> >>> !
>> >>> ! MVAPICH2: mpif90 for mvapich2-1.0
>> >>> ! ./configure --prefix=/usr/local/share/mvapich2/1.0
>> >>> !   --with-device=osu_ch3:mrail --with-rdma=vapi --with-pm=mpd
>> >>> !   --enable-f90 --enable-cxx --disable-romio --without-mpe
>> >>> !
>> >>>
>> >>>
>> >>> !=====================================================================
>> >>> !=====================================================================
>> >>> !=====================================================================
>> >>> Module vars
>> >>> USE MPI
>> >>> implicit none
>> >>>
>> >>>
>> >>> integer :: n,m,MYID,NPROCS
>> >>> integer :: ipt
>> >>>
>> >>> integer, allocatable, target :: data(:,:)
>> >>>
>> >>> contains
>> >>>
>> >>> subroutine alloc_vars
>> >>> implicit none
>> >>>
>> >>> integer Status
>> >>>
>> >>> allocate(data(n,m),stat=status)
>> >>> if (status /=0) then
>> >>> write(ipt,*) "allocation error"
>> >>> stop
>> >>> end if
>> >>>
>> >>> data = 0
>> >>>
>> >>> end subroutine alloc_vars
>> >>>
>> >>> SUBROUTINE INIT_MPI_ENV(ID,NP)
>> >>>
>> >>>
>> >>> !===================================================================|
>> >>> ! INITIALIZE MPI ENVIRONMENT                                        |
>> >>> !===================================================================|
>> >>> INTEGER, INTENT(OUT) :: ID,NP
>> >>> INTEGER IERR
>> >>>
>> >>> IERR=0
>> >>>
>> >>> CALL MPI_INIT(IERR)
>> >>> IF(IERR/=0) WRITE(*,*) "BAD MPI_INIT", ID
>> >>> CALL MPI_COMM_RANK(MPI_COMM_WORLD,ID,IERR)
>> >>> IF(IERR/=0) WRITE(*,*) "BAD MPI_COMM_RANK", ID
>> >>> CALL MPI_COMM_SIZE(MPI_COMM_WORLD,NP,IERR)
>> >>> IF(IERR/=0) WRITE(*,*) "BAD MPI_COMM_SIZE", ID
>> >>>
>> >>> END SUBROUTINE INIT_MPI_ENV
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> !===================================================================|
>> >>> SUBROUTINE PSHUTDOWN
>> >>>
>> >>>
>> >>>
>> >>> !===================================================================|
>> >>> INTEGER IERR
>> >>>
>> >>> IERR=0
>> >>> CALL MPI_FINALIZE(IERR)
>> >>> if(ierr /=0) write(ipt,*) "BAD MPI_FINALIZE", MYID
>> >>> close(IPT)
>> >>> STOP
>> >>>
>> >>> END SUBROUTINE PSHUTDOWN
>> >>>
>> >>>
>> >>> SUBROUTINE CONTIGUOUS_WORKS
>> >>> IMPLICIT NONE
>> >>> INTEGER, pointer :: ptest(:,:)
>> >>> INTEGER :: IERR, I,J
>> >>>
>> >>>
>> >>> write(ipt,*) "START CONTIGUOUS:"
>> >>> n=2000 ! Set size here...
>> >>> m=n+10
>> >>>
>> >>> call alloc_vars
>> >>> write(ipt,*) "ALLOCATED DATA"
>> >>> ptest => data(1:N,1:N)
>> >>>
>> >>> IF (MYID == 0) ptest=6
>> >>> write(ipt,*) "Made POINTER"
>> >>>
>> >>> call MPI_BCAST(ptest,N*N,MPI_INTEGER,0,MPI_COMM_WORLD,IERR)
>> >>> IF(IERR /= 0) WRITE(IPT,*) "BAD BCAST", MYID
>> >>>
>> >>> write(ipt,*) "BROADCAST Data; a value:",data(1,6)
>> >>>
>> >>> DO I = 1,N
>> >>> DO J = 1,N
>> >>> if(data(I,J) /= 6) &
>> >>> & write(ipt,*) "INCORRECT VALUE!", I,J,data(I,J)
>> >>> END DO
>> >>>
>> >>> DO J = N+1,M
>> >>> if(data(I,J) /= 0) &
>> >>> & write(ipt,*) "INCORRECT VALUE!", I,J,data(I,J)
>> >>> END DO
>> >>>
>> >>> END DO
>> >>>
>> >>> ! CALL THREE DIFFERENT EXAMPLES OF SUBROUTINES WITHOUT AN INTERFACE
>> >>> ! THAT USE AN EXPLICIT SHAPE ARRAY
>> >>> write(ipt,*) "CALLING DUMMY1"
>> >>> CALL DUMMY1
>> >>>
>> >>> write(ipt,*) "CALLING DUMMY2"
>> >>> call Dummy2(m,n)
>> >>>
>> >>> write(ipt,*) "CALLING DUMMY3"
>> >>> call Dummy3
>> >>> write(ipt,*) "FINISHED!"
>> >>>
>> >>> END SUBROUTINE CONTIGUOUS_WORKS
>> >>>
>> >>> SUBROUTINE NON_CONTIGUOUS_FAILS
>> >>> IMPLICIT NONE
>> >>> INTEGER, pointer :: ptest(:,:)
>> >>> INTEGER :: IERR, I,J
>> >>>
>> >>>
>> >>> write(ipt,*) "START NON_CONTIGUOUS:"
>> >>>
>> >>> m=200 ! Set size here - crash is size dependent!
>> >>> n=m+10
>> >>>
>> >>> call alloc_vars
>> >>> write(ipt,*) "ALLOCATED DATA"
>> >>> ptest => data(1:M,1:M)
>> >>>
>> >>> !===================================================
>> >>> ! IF YOU CALL DUMMY2 HERE TOO, THEN EVERYTHING PASSES ???
>> >>> !===================================================
>> >>> ! CALL DUMMY1 ! THIS ONE HAS NO EFFECT
>> >>> ! CALL DUMMY2 ! THIS ONE 'FIXES' THE BUG
>> >>>
>> >>> IF (MYID == 0) ptest=6
>> >>> write(ipt,*) "Made POINTER"
>> >>>
>> >>> call MPI_BCAST(ptest,M*M,MPI_INTEGER,0,MPI_COMM_WORLD,IERR)
>> >>> IF(IERR /= 0) WRITE(IPT,*) "BAD BCAST"
>> >>>
>> >>> write(ipt,*) "BROADCAST Data; a value:",data(1,6)
>> >>>
>> >>> DO I = 1,M
>> >>> DO J = 1,M
>> >>> if(data(J,I) /= 6) &
>> >>> & write(ipt,*) "INCORRECT VALUE!",I,J,DATA(I,J)
>> >>> END DO
>> >>>
>> >>> DO J = M+1,N
>> >>> if(data(J,I) /= 0) &
>> >>> & write(ipt,*) "INCORRECT VALUE!",I,J,DATA(I,J)
>> >>> END DO
>> >>> END DO
>> >>>
>> >>> ! CALL THREE DIFFERENT EXAMPLES OF SUBROUTINES WITHOUT AN INTERFACE
>> >>> ! THAT USE AN EXPLICIT SHAPE ARRAY
>> >>> write(ipt,*) "CALLING DUMMY1"
>> >>> CALL DUMMY1
>> >>>
>> >>> write(ipt,*) "CALLING DUMMY2"
>> >>> call Dummy2(m,n) ! SHOULD CRASH HERE!
>> >>>
>> >>> write(ipt,*) "CALLING DUMMY3"
>> >>> call Dummy3
>> >>> write(ipt,*) "FINISHED!"
>> >>>
>> >>> END SUBROUTINE NON_CONTIGUOUS_FAILS
>> >>>
>> >>>
>> >>> End Module vars
>> >>>
>> >>>
>> >>> Program main
>> >>> USE vars
>> >>> implicit none
>> >>>
>> >>>
>> >>> CALL INIT_MPI_ENV(MYID,NPROCS)
>> >>>
>> >>> ipt=myid+10
>> >>> OPEN(ipt)
>> >>>
>> >>>
>> >>> write(ipt,*) "Start memory test!"
>> >>>
>> >>> CALL NON_CONTIGUOUS_FAILS
>> >>>
>> >>> ! CALL CONTIGUOUS_WORKS
>> >>>
>> >>> write(ipt,*) "End memory test!"
>> >>>
>> >>> CALL PSHUTDOWN
>> >>>
>> >>> END Program main
>> >>>
>> >>>
>> >>>
>> >>> ! THREE DUMMY SUBROUTINES WITH EXPLICIT SHAPE ARRAYS
>> >>> ! DUMMY1 DECLARES A VECTOR - THIS ONE NEVER CAUSES FAILURE
>> >>> ! DUMMY2 DECLARES AN ARRAY - THIS ONE CAUSES FAILURE
>> >>>
>> >>> SUBROUTINE DUMMY1
>> >>> USE vars
>> >>> implicit none
>> >>> real, dimension(m) :: my_data
>> >>>
>> >>> write(ipt,*) "m,n",m,n
>> >>>
>> >>> write(ipt,*) "DUMMY 1", size(my_data)
>> >>>
>> >>> END SUBROUTINE DUMMY1
>> >>>
>> >>>
>> >>> SUBROUTINE DUMMY2(i,j)
>> >>> USE vars
>> >>> implicit none
>> >>> INTEGER, INTENT(IN) ::i,j
>> >>>
>> >>>
>> >>> real, dimension(i,j) :: my_data
>> >>>
>> >>> write(ipt,*) "start: DUMMY 2", size(my_data)
>> >>>
>> >>>
>> >>> END SUBROUTINE DUMMY2
>> >>>
>> >>> SUBROUTINE DUMMY3
>> >>> USE vars
>> >>> implicit none
>> >>>
>> >>>
>> >>> real, dimension(m,n) :: my_data
>> >>>
>> >>>
>> >>> write(ipt,*) "start: DUMMY 3", size(my_data)
>> >>>
>> >>>
>> >>> END SUBROUTINE DUMMY3
>> >>>
>> >>>
>> >>>
>> ------------------------------------------------------------------------
>> >>>
>> >>> _______________________________________________
>> >>> mvapich-discuss mailing list
>> >>> mvapich-discuss at cse.ohio-state.edu
>> >>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>> >>>
>> >>>
>> >
>> >
>>
>
--
Jeff Squyres
Cisco Systems