[mvapich-discuss] Possible allgather bug in MVAPICH2 2.0?

Evren Yurtesen IB eyurtese at abo.fi
Tue Sep 2 08:47:05 EDT 2014


Akshay, I now have to set all three of these environment variables to be able
to use MVAPICH2:

MV2_USE_INDEXED_TUNING           0
MV2_USE_INDEXED_ALLREDUCE_TUNING 0
MV2_USE_INDEXED_ALLGATHER_TUNING 0
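
In case it matters how they are applied: setting them just comes down to exporting 
them in the environment before the job is launched, along the lines of the sketch 
below (the srun options and executable name here are only illustrative, not our 
exact job script):

export MV2_USE_INDEXED_TUNING=0
export MV2_USE_INDEXED_ALLREDUCE_TUNING=0
export MV2_USE_INDEXED_ALLGATHER_TUNING=0
srun -N 1 -n 12 ./ElmerSolver_mpi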

Here is all the information you requested, taken from my 19 Aug 2014 message 
below. I am re-sending it since I did not get a response from you on whether 
you received it.

==17453== Invalid read of size 4
==17453==    at 0x500DBAF: MPIR_Allgather_index_tuned_intra_MV2 (allgather_osu.c:1055)
==17453==    by 0x500E352: MPIR_Allgather_MV2 (allgather_osu.c:1214)
==17453==    by 0x4FCED78: MPIR_Allgather_impl (allgather.c:852)
==17453==    by 0x4FCF5AE: PMPI_Allgather (allgather.c:1003)
==17453==    by 0x75C776B: HYPRE_IJMatrixCreate (in /export/modules/devel/hypre/2.9.0b/mvapich2/gnu/lib/libHYPRE-2.9.0b.so)
==17453==    by 0x58BF674: solvehypre1_ (SolveHypre.c:469)
==17453==    by 0x583E194: __sparitersolve_MOD_sparitersolver (SParIterSolver.f90:1843)
==17453==    by 0x588143F: __parallelutils_MOD_paralleliter (ParallelUtils.f90:740)
==17453==    by 0x57868C7: __solverutils_MOD_solvelinearsystem (SolverUtils.f90:6212)
==17453==    by 0x578D47A: __solverutils_MOD_solvesystem (SolverUtils.f90:6557)
==17453==    by 0x589A768: __defutils_MOD_defaultsolve (DefUtils.f90:2392)
==17453==    by 0xCFCC070: dosolve.5516 (MagnetoDynamics.f90:802)
==17453==  Address 0xfffffefc04060eac is not stack'd, malloc'd or (recently) free'd
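
For reference, the allgather in the trace above is triggered from 
HYPRE_IJMatrixCreate. A stripped-down allgather of the same shape would look 
roughly like the sketch below; this is only an illustration on my part, not 
the hypre source or my actual test case, and the 2-int payload per rank is 
an assumption:

/*
 * Minimal sketch of an allgather of the same shape: every rank contributes
 * a (lower, upper) pair, as when hypre gathers the row partitioning while
 * creating an IJ matrix.  Build and run with something like:
 *   mpicc -g allgather_test.c -o allgather_test
 *   srun -N 1 -n 12 ./allgather_test
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* pretend "row range" owned by this rank */
    int sendbuf[2] = { rank * 100, rank * 100 + 99 };
    int *recvbuf = malloc(2 * (size_t)size * sizeof(int));

    /* the collective that valgrind flags inside MPIR_Allgather_MV2 */
    MPI_Allgather(sendbuf, 2, MPI_INT, recvbuf, 2, MPI_INT, MPI_COMM_WORLD);

    if (rank == 0)
        printf("last rank's range: %d..%d\n",
               recvbuf[2 * (size - 1)], recvbuf[2 * (size - 1) + 1]);

    free(recvbuf);
    MPI_Finalize();
    return 0;
}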


On Tue, 19 Aug 2014, Evren Yurtesen IB wrote:

> Hello,
>
> I already tried to reproduce this myself, but I could not reproduce it with a 
> custom-built test case, which should be doing exactly the same thing. I am 
> thinking there may be some other factors affecting this issue.
>
> 1- The number of processes on the node is 12 (it is a 12-core system). If I 
> run the 12 processes across 2 nodes, the problem does not seem to appear. It 
> appears when all 12 processes are on the same node.
>
> 2- The nodes have dual Intel Xeon X5650 CPUs with a Mellanox Technologies MT26438 HCA.
>
> 3- The configure flags, as shown by mpichversion:
>
> -bash-4.1$ mpichversion
> MVAPICH2 Version:     	2.0
> MVAPICH2 Release date:	Fri Jun 20 20:00:00 EDT 2014
> MVAPICH2 Device:      	ch3:mrail
> MVAPICH2 configure:   	--prefix=/export/modules/apps/mvapich2/2.0/gnu 
> --enable-fast=nochkmsg,notiming,ndebug --enable-shared --enable-mpe 
> --with-pmi=slurm --with-pm=no 
> --with-valgrind=/export/modules/tools/valgrind/3.8.1/include/valgrind
> MVAPICH2 CC:  	gcc -Ofast -march=native -mtune=native   -DNDEBUG -DNVALGRIND
> MVAPICH2 CXX: 	g++ -Ofast -march=native -mtune=native  -DNDEBUG -DNVALGRIND
> MVAPICH2 F77: 	gfortran -Ofast -march=native -mtune=native
> MVAPICH2 FC:  	gfortran -Ofast -march=native -mtune=native
> -bash-4.1$
>
>
> I went ahead and compiled MVAPICH2 with the --enable-g=all option, then ran it 
> under valgrind again; the result is below. Does this help you narrow it down 
> a little?
>
> Here is the configuration info with -g
>
> -bash-4.1$ mpichversion
> MVAPICH2 Version:     	2.0
> MVAPICH2 Release date:	Fri Jun 20 20:00:00 EDT 2014
> MVAPICH2 Device:      	ch3:mrail
> MVAPICH2 configure:   	--prefix=/export/modules/apps/mvapich2/2.0/gnu 
> --enable-fast=nochkmsg,notiming,ndebug --enable-shared --enable-mpe 
> --with-pmi=slurm --with-pm=no 
> --with-valgrind=/export/modules/tools/valgrind/3.8.1/include/valgrind 
> --enable-g=all
> MVAPICH2 CC:  	gcc -Ofast -march=native -mtune=native -g   -DNDEBUG -DNVALGRIND -g
> MVAPICH2 CXX: 	g++ -Ofast -march=native -mtune=native -g  -DNDEBUG -DNVALGRIND -g
> MVAPICH2 F77: 	gfortran -Ofast -march=native -mtune=native -g  -g
> MVAPICH2 FC:  	gfortran -Ofast -march=native -mtune=native -g  -g
> -bash-4.1$
>
> Below is what valgrind reports:
>
> ==29664== Invalid read of size 4
> ==29664==    at 0x5015E73: MPIR_Allreduce_index_tuned_intra_MV2 (allreduce_osu.c:2358)
> ==29664==    by 0x4FB39C5: MPIR_Allreduce_impl (allreduce.c:788)
> ==29664==    by 0x4FB4162: PMPI_Allreduce (allreduce.c:929)
> ==29664==    by 0x4ED96D7: PMPI_ALLREDUCE (allreducef.c:272)
> ==29664==    by 0x57F3609: __sparitercomm_MOD_sparglobalnumbering (SParIterComm.f90:2093)
> ==29664==    by 0x584AADF: __parallelutils_MOD_parallelglobalnumbering (ParallelUtils.f90:723)
> ==29664==    by 0x57D7EB4: __meshutils_MOD_splitmeshequal (MeshUtils.f90:10947)
> ==29664==    by 0x566FF7B: __modeldescription_MOD_loadmodel (ModelDescription.f90:2205)
> ==29664==    by 0x58E6118: elmersolver_ (ElmerSolver.f90:406)
> ==29664==    by 0x400F2A: MAIN__ (in /export/modules/apps/elmer/6825/mvapich2/gnu/bin/ElmerSolver_mpi)
> ==29664==    by 0x4010CE: main (in /export/modules/apps/elmer/6825/mvapich2/gnu/bin/ElmerSolver_mpi)
> ==29664==  Address 0xfffffdb404056228 is not stack'd, malloc'd or (recently) free'd
> ==29664==
>
> Thanks,
> Evren
>
>
>
>
> On Tue, 19 Aug 2014, Akshay Venkatesh wrote:
>
>> Hi Evren,
>> 
>> We're trying to reproduce the bug that you reported. Some more details of 
>> the way the job was launched and how the library was configured would help 
>> us narrow down the cause. Could you provide job details such as:
>> 1. The number of MPI processes that are run on a single node for the 
>> 12-process job.
>> 2. The processor type and the network HCA type on which the job was run.
>> 3. The configuration flags used to build the library.
>> 
>> Thanks a lot
>> 
>> --
>> - Akshay
>> 
>> 
>

