[mvapich-discuss] possible allgather bug in mvapich 2.0 ?

Jonathan Perkins perkinjo at cse.ohio-state.edu
Tue Sep 2 13:45:36 EDT 2014


Thank you for your note.  We have forwarded it to Akshay, who is
investigating further.  We'll keep you informed of any updates.

On Tue, Sep 02, 2014 at 03:47:05PM +0300, Evren Yurtesen IB wrote:
> Akshay, I now have to set all three of these environment variables to
> be able to use mvapich2:
> 
> MV2_USE_INDEXED_TUNING           0
> MV2_USE_INDEXED_ALLREDUCE_TUNING 0
> MV2_USE_INDEXED_ALLGATHER_TUNING 0
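> 
> For reference, this is roughly how we set them when launching the job
> (a sketch of our Slurm setup; the exact srun flags depend on the batch
> script, but it is 12 ranks on one node as described below):
> 
>   export MV2_USE_INDEXED_TUNING=0
>   export MV2_USE_INDEXED_ALLREDUCE_TUNING=0
>   export MV2_USE_INDEXED_ALLGATHER_TUNING=0
>   srun -N 1 -n 12 ./ElmerSolver_mpi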
> 
> Here is all the info you requested, from my 19 Aug 2014 message below.
> I am re-sending it since I never got a reply confirming whether you
> received it.
> 
> ==17453== Invalid read of size 4
> ==17453==    at 0x500DBAF: MPIR_Allgather_index_tuned_intra_MV2 (allgather_osu.c:1055)
> ==17453==    by 0x500E352: MPIR_Allgather_MV2 (allgather_osu.c:1214)
> ==17453==    by 0x4FCED78: MPIR_Allgather_impl (allgather.c:852)
> ==17453==    by 0x4FCF5AE: PMPI_Allgather (allgather.c:1003)
> ==17453==    by 0x75C776B: HYPRE_IJMatrixCreate (in /export/modules/devel/hypre/2.9.0b/mvapich2/gnu/lib/libHYPRE-2.9.0b.so)
> ==17453==    by 0x58BF674: solvehypre1_ (SolveHypre.c:469)
> ==17453==    by 0x583E194: __sparitersolve_MOD_sparitersolver (SParIterSolver.f90:1843)
> ==17453==    by 0x588143F: __parallelutils_MOD_paralleliter (ParallelUtils.f90:740)
> ==17453==    by 0x57868C7: __solverutils_MOD_solvelinearsystem (SolverUtils.f90:6212)
> ==17453==    by 0x578D47A: __solverutils_MOD_solvesystem (SolverUtils.f90:6557)
> ==17453==    by 0x589A768: __defutils_MOD_defaultsolve (DefUtils.f90:2392)
> ==17453==    by 0xCFCC070: dosolve.5516 (MagnetoDynamics.f90:802)
> ==17453==  Address 0xfffffefc04060eac is not stack'd, malloc'd or (recently) free'd
> 
> 
> On Tue, 19 Aug 2014, Evren Yurtesen IB wrote:
> 
> >Hello,
> >
> >I already tried to reproduce this myself with a custom-built test
> >case (sketched below), which should be doing exactly the same thing,
> >but I could not reproduce it. I am thinking there may be some other
> >factors affecting this issue.
> >
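> >The test case was essentially along these lines (a sketch, not the
> >exact code; it assumes the pattern in HYPRE_IJMatrixCreate of
> >gathering one small value from every rank):
> >
> >  #include <mpi.h>
> >  #include <stdlib.h>
> >
> >  int main(int argc, char **argv)
> >  {
> >      int rank, size;
> >      MPI_Init(&argc, &argv);
> >      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> >      MPI_Comm_size(MPI_COMM_WORLD, &size);
> >
> >      /* every rank contributes one int; run with 12 ranks on a
> >         single node to mimic the failing configuration */
> >      int sendval = rank;
> >      int *recvbuf = malloc(size * sizeof(int));
> >      MPI_Allgather(&sendval, 1, MPI_INT,
> >                    recvbuf, 1, MPI_INT, MPI_COMM_WORLD);
> >
> >      free(recvbuf);
> >      MPI_Finalize();
> >      return 0;
> >  }
> >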
> >1- The number of processes on the node is 12 (it is a 12-core system).
> >If I split the 12 processes across 2 nodes, the problem does not seem
> >to appear. It appears when all 12 processes are on the same node.
> >
> >2- Nodes have dual Intel Xeon X5650 CPUs with Mellanox Technologies
> >MT26438 HCAs
> >
> >3-
> >
> >-bash-4.1$ mpichversion
> >MVAPICH2 Version:     	2.0
> >MVAPICH2 Release date:	Fri Jun 20 20:00:00 EDT 2014
> >MVAPICH2 Device:      	ch3:mrail
> >MVAPICH2 configure:   	--prefix=/export/modules/apps/mvapich2/2.0/gnu
> >--enable-fast=nochkmsg,notiming,ndebug --enable-shared --enable-mpe
> >--with-pmi=slurm --with-pm=no
> >--with-valgrind=/export/modules/tools/valgrind/3.8.1/include/valgrind
> >MVAPICH2 CC:  	gcc -Ofast -march=native -mtune=native   -DNDEBUG -DNVALGRIND
> >MVAPICH2 CXX: 	g++ -Ofast -march=native -mtune=native  -DNDEBUG -DNVALGRIND
> >MVAPICH2 F77: 	gfortran -Ofast -march=native -mtune=native
> >MVAPICH2 FC:  	gfortran -Ofast -march=native -mtune=native
> >-bash-4.1$
> >
> >
> >I went ahead and compiled MVAPICH2 with the --enable-g=all option,
> >then ran it under valgrind again; the result is below. Does this help
> >you narrow it down a little?
> >
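> >For completeness, the run command was essentially (a sketch; we run
> >every rank under valgrind with no extra options):
> >
> >  srun -N 1 -n 12 valgrind ./ElmerSolver_mpi
> >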
> >Here is the configuration info with -g:
> >
> >-bash-4.1$ mpichversion
> >MVAPICH2 Version:     	2.0
> >MVAPICH2 Release date:	Fri Jun 20 20:00:00 EDT 2014
> >MVAPICH2 Device:      	ch3:mrail
> >MVAPICH2 configure:   	--prefix=/export/modules/apps/mvapich2/2.0/gnu
> >--enable-fast=nochkmsg,notiming,ndebug --enable-shared --enable-mpe
> >--with-pmi=slurm --with-pm=no
> >--with-valgrind=/export/modules/tools/valgrind/3.8.1/include/valgrind
> >--enable-g=all
> >MVAPICH2 CC:  	gcc -Ofast -march=native -mtune=native -g   -DNDEBUG
> >-DNVALGRIND -g
> >MVAPICH2 CXX: 	g++ -Ofast -march=native -mtune=native -g  -DNDEBUG
> >-DNVALGRIND -g
> >MVAPICH2 F77: 	gfortran -Ofast -march=native -mtune=native -g  -g
> >MVAPICH2 FC:  	gfortran -Ofast -march=native -mtune=native -g  -g
> >-bash-4.1$
> >
> >Below is what valgrind says:
> >
> >==29664== Invalid read of size 4
> >==29664==    at 0x5015E73: MPIR_Allreduce_index_tuned_intra_MV2
> >(allreduce_osu.c:2358)
> >==29664==    by 0x4FB39C5: MPIR_Allreduce_impl (allreduce.c:788)
> >==29664==    by 0x4FB4162: PMPI_Allreduce (allreduce.c:929)
> >==29664==    by 0x4ED96D7: PMPI_ALLREDUCE (allreducef.c:272)
> >==29664==    by 0x57F3609: __sparitercomm_MOD_sparglobalnumbering
> >(SParIterComm.f90:2093)
> >==29664==    by 0x584AADF: __parallelutils_MOD_parallelglobalnumbering
> >(ParallelUtils.f90:723)
> >==29664==    by 0x57D7EB4: __meshutils_MOD_splitmeshequal
> >(MeshUtils.f90:10947)
> >==29664==    by 0x566FF7B: __modeldescription_MOD_loadmodel
> >(ModelDescription.f90:2205)
> >==29664==    by 0x58E6118: elmersolver_ (ElmerSolver.f90:406)
> >==29664==    by 0x400F2A: MAIN__ (in
> >/export/modules/apps/elmer/6825/mvapich2/gnu/bin/ElmerSolver_mpi)
> >==29664==    by 0x4010CE: main (in
> >/export/modules/apps/elmer/6825/mvapich2/gnu/bin/ElmerSolver_mpi)
> >==29664==  Address 0xfffffdb404056228 is not stack'd, malloc'd or
> >(recently) free'd
> >==29664==
> >
> >Thanks,
> >Evren
> >
> >
> >
> >
> >On Tue, 19 Aug 2014, Akshay Venkatesh wrote:
> >
> >>Hi Evren,
> >>
> >>We're trying to reproduce the bug that you reported. Some more
> >>details on how the job was launched and how the library was
> >>configured would help us narrow down the cause. Could you provide
> >>job details such as:
> >>1. The number of MPI processes run on a single node for the
> >>12-process job.
> >>2. The processor type and the network HCA type on which the job was
> >>run.
> >>3. The configuration flags used to build the library.
> >>
> >>Thanks a lot
> >>
> >>--
> >>- Akshay
> >>
> >>
> >
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss

-- 
Jonathan Perkins
