[mvapich-discuss] possible allgather bug in mvapich 2.0 ?
Evren Yurtesen IB
eyurtese at abo.fi
Tue Sep 2 08:47:05 EDT 2014
Akshay, I now have to set all three of these environment variables to be able
to use mvapich2:
MV2_USE_INDEXED_TUNING=0
MV2_USE_INDEXED_ALLREDUCE_TUNING=0
MV2_USE_INDEXED_ALLGATHER_TUNING=0
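A minimal sketch of applying the workaround, assuming a bash shell; the variable names are taken verbatim from the message, but the verification step is only an illustrative addition:

```shell
# Disable the indexed tuning tables used by the faulting code paths
# (variable names as reported in this thread).
export MV2_USE_INDEXED_TUNING=0
export MV2_USE_INDEXED_ALLREDUCE_TUNING=0
export MV2_USE_INDEXED_ALLGATHER_TUNING=0

# Confirm the settings are in the environment before launching the MPI
# job as usual (e.g. with srun or mpirun).
env | grep MV2_USE_INDEXED | sort
```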
Here is all the information you requested, from my 19 Aug 2014 message. I am
re-sending it since I did not get a response confirming whether or not you
received it.
==17453== Invalid read of size 4
==17453== at 0x500DBAF: MPIR_Allgather_index_tuned_intra_MV2 (allgather_osu.c:1055)
==17453== by 0x500E352: MPIR_Allgather_MV2 (allgather_osu.c:1214)
==17453== by 0x4FCED78: MPIR_Allgather_impl (allgather.c:852)
==17453== by 0x4FCF5AE: PMPI_Allgather (allgather.c:1003)
==17453== by 0x75C776B: HYPRE_IJMatrixCreate (in /export/modules/devel/hypre/2.9.0b/mvapich2/gnu/lib/libHYPRE-2.9.0b.so)
==17453== by 0x58BF674: solvehypre1_ (SolveHypre.c:469)
==17453== by 0x583E194: __sparitersolve_MOD_sparitersolver (SParIterSolver.f90:1843)
==17453== by 0x588143F: __parallelutils_MOD_paralleliter (ParallelUtils.f90:740)
==17453== by 0x57868C7: __solverutils_MOD_solvelinearsystem (SolverUtils.f90:6212)
==17453== by 0x578D47A: __solverutils_MOD_solvesystem (SolverUtils.f90:6557)
==17453== by 0x589A768: __defutils_MOD_defaultsolve (DefUtils.f90:2392)
==17453== by 0xCFCC070: dosolve.5516 (MagnetoDynamics.f90:802)
==17453== Address 0xfffffefc04060eac is not stack'd, malloc'd or (recently) free'd
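For reference, a run like the one that produced the trace above can be sketched as a single command. This is an assumed invocation, not taken from the report: the binary path is copied from the trace later in this thread, while the srun launcher and its flags are guesses based on the --with-pmi=slurm build and the 12-processes-on-one-node layout that triggers the failure.

```shell
# Assumed sketch: 12 processes on a single node (the failing layout),
# each process running under valgrind. Path from the trace below;
# launcher and flags are assumptions.
srun -N 1 -n 12 valgrind --error-limit=no \
    /export/modules/apps/elmer/6825/mvapich2/gnu/bin/ElmerSolver_mpi
```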
On Tue, 19 Aug 2014, Evren Yurtesen IB wrote:
> Hello,
>
> I already tried to reproduce it myself with a custom-built test case, which
> should be doing exactly the same thing, but I could not reproduce it that
> way. I am thinking there may be some other factors affecting this issue.
>
> 1- The number of processes in the node is 12 (it is a 12-core system). If I
> run 12 processes on 2 nodes, the problem does not seem to appear. It appears
> when all 12 processes are on the same node.
>
> 2- Nodes have dual X5650 with Mellanox Technologies MT26438
>
> 3-
>
> -bash-4.1$ mpichversion
> MVAPICH2 Version: 2.0
> MVAPICH2 Release date: Fri Jun 20 20:00:00 EDT 2014
> MVAPICH2 Device: ch3:mrail
> MVAPICH2 configure: --prefix=/export/modules/apps/mvapich2/2.0/gnu
> --enable-fast=nochkmsg,notiming,ndebug --enable-shared --enable-mpe
> --with-pmi=slurm --with-pm=no
> --with-valgrind=/export/modules/tools/valgrind/3.8.1/include/valgrind
> MVAPICH2 CC: gcc -Ofast -march=native -mtune=native -DNDEBUG -DNVALGRIND
> MVAPICH2 CXX: g++ -Ofast -march=native -mtune=native -DNDEBUG -DNVALGRIND
> MVAPICH2 F77: gfortran -Ofast -march=native -mtune=native
> MVAPICH2 FC: gfortran -Ofast -march=native -mtune=native
> -bash-4.1$
>
>
> I went ahead and compiled MVAPICH2 with the --enable-g=all option, then ran
> it with valgrind again; the result is below. Does this help you narrow it
> down a little?
>
> Here is the configuration info with -g
>
> -bash-4.1$ mpichversion
> MVAPICH2 Version: 2.0
> MVAPICH2 Release date: Fri Jun 20 20:00:00 EDT 2014
> MVAPICH2 Device: ch3:mrail
> MVAPICH2 configure: --prefix=/export/modules/apps/mvapich2/2.0/gnu
> --enable-fast=nochkmsg,notiming,ndebug --enable-shared --enable-mpe
> --with-pmi=slurm --with-pm=no
> --with-valgrind=/export/modules/tools/valgrind/3.8.1/include/valgrind
> --enable-g=all
> MVAPICH2 CC: gcc -Ofast -march=native -mtune=native -g -DNDEBUG -DNVALGRIND -g
> MVAPICH2 CXX: g++ -Ofast -march=native -mtune=native -g -DNDEBUG -DNVALGRIND -g
> MVAPICH2 F77: gfortran -Ofast -march=native -mtune=native -g -g
> MVAPICH2 FC: gfortran -Ofast -march=native -mtune=native -g -g
> -bash-4.1$
>
> Below is what valgrind reports:
>
> ==29664== Invalid read of size 4
> ==29664== at 0x5015E73: MPIR_Allreduce_index_tuned_intra_MV2 (allreduce_osu.c:2358)
> ==29664== by 0x4FB39C5: MPIR_Allreduce_impl (allreduce.c:788)
> ==29664== by 0x4FB4162: PMPI_Allreduce (allreduce.c:929)
> ==29664== by 0x4ED96D7: PMPI_ALLREDUCE (allreducef.c:272)
> ==29664== by 0x57F3609: __sparitercomm_MOD_sparglobalnumbering (SParIterComm.f90:2093)
> ==29664== by 0x584AADF: __parallelutils_MOD_parallelglobalnumbering (ParallelUtils.f90:723)
> ==29664== by 0x57D7EB4: __meshutils_MOD_splitmeshequal (MeshUtils.f90:10947)
> ==29664== by 0x566FF7B: __modeldescription_MOD_loadmodel (ModelDescription.f90:2205)
> ==29664== by 0x58E6118: elmersolver_ (ElmerSolver.f90:406)
> ==29664== by 0x400F2A: MAIN__ (in /export/modules/apps/elmer/6825/mvapich2/gnu/bin/ElmerSolver_mpi)
> ==29664== by 0x4010CE: main (in /export/modules/apps/elmer/6825/mvapich2/gnu/bin/ElmerSolver_mpi)
> ==29664== Address 0xfffffdb404056228 is not stack'd, malloc'd or (recently) free'd
> ==29664==
>
> Thanks,
> Evren
>
>
>
>
> On Tue, 19 Aug 2014, Akshay Venkatesh wrote:
>
>> Hi Evren,
>>
>> We're trying to reproduce the bug that you reported. Some more details of
>> the way the job was launched and how the library was configured would help
>> us narrow down the
>> cause. Could you provide job details such as:
>> 1. The number of MPI processes that are run on a single node for the
>> 12-process job.
>> 2. The processor type and the network hca type on which the job was run
>> 3. The configuration flags used to build the library
>>
>> Thanks a lot
>>
>> --
>> - Akshay
>>
>>
>