[mvapich-discuss] possible allgather bug in mvapich 2.0 ?
Evren Yurtesen IB
eyurtese at abo.fi
Tue Aug 12 09:04:22 EDT 2014
Hi,
I am using the latest mvapich2, compiled with gcc 4.9, on a cluster with InfiniBand:
-bash-4.1$ mpichversion
MVAPICH2 Version: 2.0
MVAPICH2 Release date: Fri Jun 20 20:00:00 EDT 2014
MVAPICH2 Device: ch3:mrail
MVAPICH2 configure: --prefix=/export/modules/apps/mvapich2/2.0/gnu --enable-fast=nochkmsg,notiming,ndebug --enable-shared --enable-mpe --with-pmi=slurm --with-pm=no --with-valgrind=/export/modules/tools/valgrind/3.8.1/include/valgrind
MVAPICH2 CC: gcc -Ofast -march=native -mtune=native -DNDEBUG -DNVALGRIND
MVAPICH2 CXX: g++ -Ofast -march=native -mtune=native -DNDEBUG -DNVALGRIND
MVAPICH2 F77: gfortran -Ofast -march=native -mtune=native
MVAPICH2 FC: gfortran -Ofast -march=native -mtune=native
-bash-4.1$
I am running Elmer, the open-source program from CSC (http://www.csc.fi/english/pages/elmer/sources),
and it crashes in an allreduce operation. I ran it under valgrind; the result is below.
==5561== Invalid read of size 4
==5561== at 0x4FC9243: MPIR_Allreduce_index_tuned_intra_MV2 (in /export/modules/apps/mvapich2/2.0/gnu/lib/libmpich.so.12.0.0)
==5561== by 0x4F71C85: MPIR_Allreduce_impl (in /export/modules/apps/mvapich2/2.0/gnu/lib/libmpich.so.12.0.0)
==5561== by 0x4F722B6: PMPI_Allreduce (in /export/modules/apps/mvapich2/2.0/gnu/lib/libmpich.so.12.0.0)
==5561== by 0x4ECD1F7: MPI_ALLREDUCE (in /export/modules/apps/mvapich2/2.0/gnu/lib/libmpich.so.12.0.0)
==5561== by 0x56CE69C: __sparitercomm_MOD_sparglobalnumbering (SParIterComm.f90:2093)
==5561== by 0x571AC9F: __parallelutils_MOD_parallelglobalnumbering (ParallelUtils.f90:723)
==5561== by 0x56B40E7: __meshutils_MOD_splitmeshequal (MeshUtils.f90:9751)
==5561== by 0x558E19E: __modeldescription_MOD_loadmodel (ModelDescription.f90:2205)
==5561== by 0x57A8FF9: elmersolver_ (ElmerSolver.f90:406)
==5561== by 0x400F2A: MAIN__ (Solver.f90:244)
==5561== by 0x4010CE: main (Solver.f90:220)
==5561== Address 0xfffffdb4040493d0 is not stack'd, malloc'd or (recently) free'd
Elmer compiled with openmp, mpich 3.1.2, or mvapich2 1.8 seems to work
perfectly fine. With mvapich2 2.0 it also works for some process counts:
4 and 24 processes seem to work, but 12 processes crash consistently.
This is strange, because the code where it crashes should not be
affected by the process count at all (see below).
Below are the relevant parts of SParIterComm.f90; line 2093 is the first
MPI_ALLREDUCE call.
...
...
INTEGER, POINTER :: oldnodes(:), oldnodes2(:), newnodes(:), &
     newnodes2(:), parentnodes(:,:), tosend(:), toreceive(:), &
     list1(:), list2(:), commonlist(:)
...
...
! Allocate space for local tables:
! --------------------------------
ALLOCATE( newnodes( ParEnv % PEs ), &
          newnodes2( ParEnv % PEs ), &
          oldnodes( ParEnv % PEs ), &
          oldnodes2( ParEnv % PEs ), &
          tosend( ParEnv % PEs ), &
          toreceive( ParEnv % PEs ), &
          parentnodes( Mesh % Nodes % NumberOfNodes,2 ) )
...
...
!
! Count the current situation in what comes to nodes owned by us:
!----------------------------------------------------------------
oldnodes = 0
newnodes = 0
j = ParEnv % MyPE
DO i = 1, Mesh % Nodes % NumberOfNodes
   k = Mesh % ParallelInfo % NeighbourList(i) % Neighbours(1)
   IF( k /= j ) CYCLE
   IF( Mesh % ParallelInfo % GlobalDOFs(i) > 0 ) THEN
      oldnodes(j+1) = oldnodes(j+1)+1
   ELSE
      newnodes(j+1) = newnodes(j+1)+1
   END IF
END DO
!
! Distribute the knowledge about ownerships (here, assuming all active):
!----------------------------------------------------------------------
oldnodes2 = 0
CALL MPI_ALLREDUCE( oldnodes, oldnodes2, ParEnv % PEs, & ! <- fix
     MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD, ierr )
newnodes2 = 0
CALL MPI_ALLREDUCE( newnodes, newnodes2, ParEnv % PEs, & ! <- fix
     MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD, ierr )
...
...
As a workaround, I have set the following environment variables to 0, and
things now seem to work:
MV2_USE_INDEXED_TUNING
MV2_USE_INDEXED_ALLREDUCE_TUNING
I am not sure what the culprit in the MPIR_Allreduce_index_tuned_intra_MV2
function may be.
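For anyone who wants to try the same workaround, here is a minimal sketch of how the two variables could be set in the shell before launching the job. Whether both are needed, or only one, is not something I have verified. The commented-out launch line is only a placeholder: the binary name and the use of srun with 12 ranks are assumptions, not my exact command.

```shell
# Disable MVAPICH2's indexed collective tuning for this shell session.
# Both variable names are the ones mentioned above; 0 turns them off.
export MV2_USE_INDEXED_TUNING=0
export MV2_USE_INDEXED_ALLREDUCE_TUNING=0

# Show what the job will inherit from this environment.
env | grep '^MV2_USE_INDEXED'

# Hypothetical launch line (placeholder binary name and rank count):
# srun -n 12 ./ElmerSolver_mpi
```

Using export rather than a plain assignment matters here, since the launcher can only pass exported variables on to the MPI processes.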
I tried to build a simple test case but could not reproduce the crash
(the test case seems to work fine).
Can you give any hints or tips on how I can help track down the cause of
the problem? What other information do you need?
Thanks,
Evren