[mvapich-discuss] possible allreduce bug in mvapich2 2.0?

Evren Yurtesen IB eyurtese at abo.fi
Tue Aug 12 09:04:22 EDT 2014


Hi,


I am using the latest MVAPICH2 (2.0), compiled with GCC 4.9, on a cluster with InfiniBand:


-bash-4.1$ mpichversion
MVAPICH2 Version:     	2.0
MVAPICH2 Release date:	Fri Jun 20 20:00:00 EDT 2014
MVAPICH2 Device:      	ch3:mrail
MVAPICH2 configure:   	--prefix=/export/modules/apps/mvapich2/2.0/gnu --enable-fast=nochkmsg,notiming,ndebug --enable-shared --enable-mpe --with-pmi=slurm --with-pm=no --with-valgrind=/export/modules/tools/valgrind/3.8.1/include/valgrind
MVAPICH2 CC:  	gcc -Ofast -march=native -mtune=native   -DNDEBUG -DNVALGRIND
MVAPICH2 CXX: 	g++ -Ofast -march=native -mtune=native  -DNDEBUG -DNVALGRIND
MVAPICH2 F77: 	gfortran -Ofast -march=native -mtune=native
MVAPICH2 FC:  	gfortran -Ofast -march=native -mtune=native
-bash-4.1$


I am using the open-source Elmer program from CSC (http://www.csc.fi/english/pages/elmer/sources),
and it crashes in an allreduce operation. I ran it under Valgrind; the output is below.


==5561== Invalid read of size 4
==5561==    at 0x4FC9243: MPIR_Allreduce_index_tuned_intra_MV2 (in /export/modules/apps/mvapich2/2.0/gnu/lib/libmpich.so.12.0.0)
==5561==    by 0x4F71C85: MPIR_Allreduce_impl (in /export/modules/apps/mvapich2/2.0/gnu/lib/libmpich.so.12.0.0)
==5561==    by 0x4F722B6: PMPI_Allreduce (in /export/modules/apps/mvapich2/2.0/gnu/lib/libmpich.so.12.0.0)
==5561==    by 0x4ECD1F7: MPI_ALLREDUCE (in /export/modules/apps/mvapich2/2.0/gnu/lib/libmpich.so.12.0.0)
==5561==    by 0x56CE69C: __sparitercomm_MOD_sparglobalnumbering (SParIterComm.f90:2093)
==5561==    by 0x571AC9F: __parallelutils_MOD_parallelglobalnumbering (ParallelUtils.f90:723)
==5561==    by 0x56B40E7: __meshutils_MOD_splitmeshequal (MeshUtils.f90:9751)
==5561==    by 0x558E19E: __modeldescription_MOD_loadmodel (ModelDescription.f90:2205)
==5561==    by 0x57A8FF9: elmersolver_ (ElmerSolver.f90:406)
==5561==    by 0x400F2A: MAIN__ (Solver.f90:244)
==5561==    by 0x4010CE: main (Solver.f90:220)
==5561==  Address 0xfffffdb4040493d0 is not stack'd, malloc'd or (recently) free'd


I also compiled Elmer against Open MPI, MPICH 3.1.2, and MVAPICH2 1.8, and with 
those it seems to work perfectly fine. With MVAPICH2 2.0 it works for some 
process counts but not others: for example, 4 and 24 processes run fine, but 
12 consistently crashes. That is strange, because the code at the crash site 
should not be affected by the process count at all (see below).
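
For completeness, I launch through SLURM (the library is built with 
--with-pmi=slurm, see above), roughly like this; the solver invocation is 
just illustrative and case.sif is a placeholder:

-bash-4.1$ srun -n 12 ElmerSolver case.sif   # crashes with mvapich2 2.0
-bash-4.1$ srun -n 4 ElmerSolver case.sif    # runs fine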


Below is a copy/paste of the relevant parts of SParIterComm.f90; line 2093 is 
the first ALLREDUCE. Note that both the send and receive buffers are allocated 
with size ParEnv % PEs and zeroed before the calls, so the code should be 
valid for any number of processes.

...
...
      INTEGER, POINTER :: oldnodes(:), oldnodes2(:), newnodes(:), &
           newnodes2(:), parentnodes(:,:), tosend(:), toreceive(:), &
           list1(:), list2(:), commonlist(:)
...
...
!    Allocate space for local tables:
!    --------------------------------
      ALLOCATE( newnodes( ParEnv % PEs ), &
           newnodes2( ParEnv % PEs ), &
           oldnodes( ParEnv % PEs ), &
           oldnodes2( ParEnv % PEs ), &
           tosend( ParEnv % PEs ), &
           toreceive( ParEnv % PEs ), &
           parentnodes( Mesh % Nodes % NumberOfNodes,2 ) )
...
...
      !
      ! Count the current situation in what comes to nodes owned by us:
      !----------------------------------------------------------------
      oldnodes = 0
      newnodes = 0
      j = ParEnv % MyPE
      DO i = 1, Mesh % Nodes % NumberOfNodes
         k = Mesh % ParallelInfo % NeighbourList(i) % Neighbours(1)
         IF( k /= j ) CYCLE
         IF( Mesh % ParallelInfo % GlobalDOFs(i) > 0 ) THEN
            oldnodes(j+1) = oldnodes(j+1)+1
         ELSE
            newnodes(j+1) = newnodes(j+1)+1
         END IF
      END DO
      !
      ! Distribute the knowledge about ownerships (here, assuming all active):
      !----------------------------------------------------------------------
      oldnodes2 = 0
      CALL MPI_ALLREDUCE( oldnodes, oldnodes2, ParEnv % PEs, &   ! <- fix
           MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD, ierr )

      newnodes2 = 0
      CALL MPI_ALLREDUCE( newnodes, newnodes2, ParEnv % PEs, &   ! <- fix
           MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD, ierr )

...
...


As a workaround, I set the environment variables MV2_USE_INDEXED_TUNING and 
MV2_USE_INDEXED_ALLREDUCE_TUNING to 0, and now things seem to be working.
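
I export them in the shell before launching, for example:

-bash-4.1$ export MV2_USE_INDEXED_TUNING=0
-bash-4.1$ export MV2_USE_INDEXED_ALLREDUCE_TUNING=0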

I am not sure what the culprit may be in the MPIR_Allreduce_index_tuned_intra_MV2 
function. I tried to reproduce the problem with a simple test case but failed 
(the test case seems to run fine).
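
For reference, my test case looked roughly like the sketch below (the names 
and structure here are my own reconstruction, not the exact file): it mirrors 
the per-PE counting and the ALLREDUCE from SParIterComm.f90, yet it completes 
cleanly even with 12 processes.

      PROGRAM allreduce_test
        ! Minimal sketch of the failing pattern: one INTEGER slot per PE,
        ! each rank fills its own slot, then the array is summed globally.
        USE mpi
        IMPLICIT NONE
        INTEGER :: ierr, mype, npes
        INTEGER, POINTER :: oldnodes(:), oldnodes2(:)

        CALL MPI_INIT( ierr )
        CALL MPI_COMM_RANK( MPI_COMM_WORLD, mype, ierr )
        CALL MPI_COMM_SIZE( MPI_COMM_WORLD, npes, ierr )

        ALLOCATE( oldnodes( npes ), oldnodes2( npes ) )
        oldnodes = 0
        oldnodes( mype+1 ) = 1      ! stand-in for the per-rank node count

        oldnodes2 = 0
        CALL MPI_ALLREDUCE( oldnodes, oldnodes2, npes, &
             MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD, ierr )

        IF ( mype == 0 ) PRINT *, 'sum over all PEs:', SUM( oldnodes2 )

        DEALLOCATE( oldnodes, oldnodes2 )
        CALL MPI_FINALIZE( ierr )
      END PROGRAM allreduce_test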

Can you give any hints/tips on how I can help track down the cause of the 
problem? What other information do you need?

Thanks,
Evren

