[mvapich-discuss] possible allgather bug in mvapich 2.0 ?

Evren Yurtesen IB eyurtese at abo.fi
Fri Aug 15 04:39:44 EDT 2014


Hello Hari,

Thank you for the quick response. Please let me know if you need any more 
information which may aid in tracking down the problem.

Thanks,
Evren

On Thu, 14 Aug 2014, Hari Subramoni wrote:

> Hello Evren,
> Thank you for letting us know about the issue. We are glad to know that you were able to work around this issue by setting the appropriate runtime variable.
> 
> We will track down the issue and fix it for an upcoming release.
> 
> Best Regards,
> Hari.
> 
> 
> On Tue, Aug 12, 2014 at 9:04 AM, Evren Yurtesen IB <eyurtese at abo.fi> wrote:
>       Hi,
> 
>
>       I am using the latest MVAPICH2, compiled with GCC 4.9, on a cluster with InfiniBand:
> 
>
>       -bash-4.1$ mpichversion
>       MVAPICH2 Version:       2.0
>       MVAPICH2 Release date:  Fri Jun 20 20:00:00 EDT 2014
>       MVAPICH2 Device:        ch3:mrail
>       MVAPICH2 configure:     --prefix=/export/modules/apps/mvapich2/2.0/gnu --enable-fast=nochkmsg,notiming,ndebug --enable-shared --enable-mpe --with-pmi=slurm
>       --with-pm=no --with-valgrind=/export/modules/tools/valgrind/3.8.1/include/valgrind
>       MVAPICH2 CC:    gcc -Ofast -march=native -mtune=native   -DNDEBUG -DNVALGRIND
>       MVAPICH2 CXX:   g++ -Ofast -march=native -mtune=native  -DNDEBUG -DNVALGRIND
>       MVAPICH2 F77:   gfortran -Ofast -march=native -mtune=native
>       MVAPICH2 FC:    gfortran -Ofast -march=native -mtune=native
>       -bash-4.1$
> 
>
>       I am using the Elmer program from CSC (open source) http://www.csc.fi/english/pages/elmer/sources
>       and it is crashing in an allreduce operation. I ran it under Valgrind; the result is below.
> 
>
>       ==5561== Invalid read of size 4
>       ==5561==    at 0x4FC9243: MPIR_Allreduce_index_tuned_intra_MV2 (in /export/modules/apps/mvapich2/2.0/gnu/lib/libmpich.so.12.0.0)
>       ==5561==    by 0x4F71C85: MPIR_Allreduce_impl (in /export/modules/apps/mvapich2/2.0/gnu/lib/libmpich.so.12.0.0)
>       ==5561==    by 0x4F722B6: PMPI_Allreduce (in /export/modules/apps/mvapich2/2.0/gnu/lib/libmpich.so.12.0.0)
>       ==5561==    by 0x4ECD1F7: MPI_ALLREDUCE (in /export/modules/apps/mvapich2/2.0/gnu/lib/libmpich.so.12.0.0)
>       ==5561==    by 0x56CE69C: __sparitercomm_MOD_sparglobalnumbering (SParIterComm.f90:2093)
>       ==5561==    by 0x571AC9F: __parallelutils_MOD_parallelglobalnumbering (ParallelUtils.f90:723)
>       ==5561==    by 0x56B40E7: __meshutils_MOD_splitmeshequal (MeshUtils.f90:9751)
>       ==5561==    by 0x558E19E: __modeldescription_MOD_loadmodel (ModelDescription.f90:2205)
>       ==5561==    by 0x57A8FF9: elmersolver_ (ElmerSolver.f90:406)
>       ==5561==    by 0x400F2A: MAIN__ (Solver.f90:244)
>       ==5561==    by 0x4010CE: main (Solver.f90:220)
>       ==5561==  Address 0xfffffdb4040493d0 is not stack'd, malloc'd or (recently) free'd
> 
>
>       I compiled Elmer with OpenMP, MPICH 3.1.2, and MVAPICH2 1.8, and it seems to work perfectly fine. It also works for some process counts: for example, 4 and 24
>       processes work, but 12 does not. It is strange, because the place where it crashes should not be affected by this at all (see below). But with
>       MVAPICH2 2.0 it consistently crashes when executed with 12 processes.
> 
>
>       Below is a copy/paste of the relevant parts of SParIterComm.f90; line 2093 is the first ALLREDUCE:
>
>       ...
>       ...
>            INTEGER, POINTER :: oldnodes(:), oldnodes2(:), newnodes(:), &
>                 newnodes2(:), parentnodes(:,:), tosend(:), toreceive(:), &
>                 list1(:), list2(:), commonlist(:)
>       ...
>       ...
>       !    Allocate space for local tables:
>       !    --------------------------------
>            ALLOCATE( newnodes( ParEnv % PEs ), &
>                 newnodes2( ParEnv % PEs ), &
>                 oldnodes( ParEnv % PEs ), &
>                 oldnodes2( ParEnv % PEs ), &
>                 tosend( ParEnv % PEs ), &
>                 toreceive( ParEnv % PEs ), &
>                 parentnodes( Mesh % Nodes % NumberOfNodes,2 ) )
>       ...
>       ...
>            !
>            ! Count the current situation in what comes to nodes owned by us:
>            !----------------------------------------------------------------
>            oldnodes = 0
>            newnodes = 0
>            j = ParEnv % MyPE
>            DO i = 1, Mesh % Nodes % NumberOfNodes
>               k = Mesh % ParallelInfo % NeighbourList(i) % Neighbours(1)
>               IF( k /= j ) CYCLE
>               IF( Mesh % ParallelInfo % GlobalDOFs(i)  > 0 ) THEN
>                  oldnodes(j+1) = oldnodes(j+1)+1
>               ELSE
>                  newnodes(j+1) = newnodes(j+1)+1
>               END IF
>            END DO
>            !
>            ! Distribute the knowledge about ownerships (here, assuming all active):
>            !----------------------------------------------------------------------
>            oldnodes2 = 0
>            CALL MPI_ALLREDUCE( oldnodes, oldnodes2, ParEnv % PEs, &   ! <- fix
>                 MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD, ierr )
>
>            newnodes2 = 0
>            CALL MPI_ALLREDUCE( newnodes, newnodes2, ParEnv % PEs, &   ! <- fix
>                 MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD, ierr )
>
>       ...
>       ...
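The two ALLREDUCE calls above perform an elementwise MPI_SUM across all ranks, so afterwards every rank holds the global per-PE node counts. A minimal sketch of that semantics in plain Python (no MPI; the per-rank buffers and counts below are illustrative, not taken from Elmer):

```python
# Simulate MPI_Allreduce(sendbuf, recvbuf, n, MPI_INTEGER, MPI_SUM, ...)
# over integer arrays: each "rank" contributes a vector of length n_pes,
# and every rank receives the identical elementwise sum.
def allreduce_sum(buffers):
    """buffers: one list of ints per rank, all of equal length."""
    n = len(buffers[0])
    assert all(len(b) == n for b in buffers)
    total = [sum(b[i] for b in buffers) for i in range(n)]
    return [list(total) for _ in buffers]  # each rank gets its own copy

# Rank r only increments slot r locally (cf. oldnodes(j+1) in the Fortran
# above), so before the reduce each buffer has a single nonzero entry.
oldnodes = [[5, 0, 0], [0, 7, 0], [0, 0, 3]]  # 3 PEs, example counts
oldnodes2 = allreduce_sum(oldnodes)
# every rank now sees [5, 7, 3]
```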
> 
>
>       As a workaround, I set the environment variables MV2_USE_INDEXED_TUNING and
>       MV2_USE_INDEXED_ALLREDUCE_TUNING to 0, and now things seem to be working.
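For reference, the workaround amounts to exporting the two variables before launching the job; a sketch (the variable names are from the mail, everything else is illustrative):

```shell
# Disable MVAPICH2's indexed tuning tables for collectives, falling back
# to the non-indexed algorithm selection (workaround from the mail).
export MV2_USE_INDEXED_TUNING=0
export MV2_USE_INDEXED_ALLREDUCE_TUNING=0
echo "MV2_USE_INDEXED_TUNING=$MV2_USE_INDEXED_TUNING"
echo "MV2_USE_INDEXED_ALLREDUCE_TUNING=$MV2_USE_INDEXED_ALLREDUCE_TUNING"
```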
>
>       I am not sure what the culprit in the MPIR_Allreduce_index_tuned_intra_MV2 function may be.
>       I tried to make a simple test case but failed to reproduce the crash (the test case seems to function fine).
>
>       Can you give any hints/tips about how I can help track down the cause of the problem? What other information do you need?
>
>       Thanks,
>       Evren
>       _______________________________________________
>       mvapich-discuss mailing list
>       mvapich-discuss at cse.ohio-state.edu
>       http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> 
> 
> 
>

