[mvapich-discuss] possible allgather bug in mvapich 2.0 ?

Akshay Venkatesh akshay at cse.ohio-state.edu
Wed Sep 3 11:50:02 EDT 2014


Evren,

Apologies for the delay again. I've attached a patch here that should fix
the problem you're seeing. The usual patch command should work if you patch
it against mvapich-2.0:

patch -p0 < index_tuning_fix.patch

Let us know whether the fix avoids the crash you've been seeing. Thanks.
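
If you would rather check that the patch applies cleanly before it touches the tree, something along these lines may help. This is only a sketch: the helper name apply_patch_checked is made up for illustration, it assumes a patch implementation that supports --dry-run (GNU patch does), and the patch file must be given as an absolute path because the helper changes directory first.

```shell
# Hypothetical helper (not part of MVAPICH2): dry-run a patch, then apply it.
# Usage: apply_patch_checked <source-dir> <absolute-path-to-patch-file>
apply_patch_checked() {
    src_dir=$1
    patch_file=$2
    (
        # Work inside the source tree; the subshell keeps the caller's cwd intact.
        cd "$src_dir" || exit 1
        # --dry-run reports what would happen without modifying any file.
        patch -p0 --dry-run < "$patch_file" >/dev/null || exit 1
        # The dry run succeeded, so apply the patch for real.
        patch -p0 < "$patch_file"
    )
}
```

For example: apply_patch_checked /path/to/mvapich2-2.0 /path/to/index_tuning_fix.patch (both paths are placeholders). If the dry run fails, nothing in the tree is modified and the helper returns a non-zero status.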


On Tue, Sep 2, 2014 at 8:47 AM, Evren Yurtesen IB <eyurtese at abo.fi> wrote:

> Akshay, I now have to set all three of these environment variables to be
> able to use mvapich2:
>
> MV2_USE_INDEXED_TUNING           0
> MV2_USE_INDEXED_ALLREDUCE_TUNING 0
> MV2_USE_INDEXED_ALLGATHER_TUNING 0
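
For anyone else needing the same workaround until the patch is applied, the three settings quoted above could go into a job script roughly as follows. This is a sketch, not an official recipe: it assumes a POSIX shell and a launcher that passes the caller's environment on to the MPI processes (Slurm's srun does so by default).

```shell
# Disable the three indexed-tuning settings named in this thread.
export MV2_USE_INDEXED_TUNING=0
export MV2_USE_INDEXED_ALLREDUCE_TUNING=0
export MV2_USE_INDEXED_ALLGATHER_TUNING=0

# Show what the job will inherit before launching it.
env | grep '^MV2_USE_INDEXED'
```

These only switch off the code paths that crash; they do not fix the underlying bug.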
>
> Here is all the info you requested, copied below from my 19 Aug 2014
> message. I am re-sending it since I did not hear back from you about
> whether you received it.
>
> ==17453== Invalid read of size 4
> ==17453==    at 0x500DBAF: MPIR_Allgather_index_tuned_intra_MV2
> (allgather_osu.c:1055)
> ==17453==    by 0x500E352: MPIR_Allgather_MV2 (allgather_osu.c:1214)
> ==17453==    by 0x4FCED78: MPIR_Allgather_impl (allgather.c:852)
> ==17453==    by 0x4FCF5AE: PMPI_Allgather (allgather.c:1003)
> ==17453==    by 0x75C776B: HYPRE_IJMatrixCreate (in
> /export/modules/devel/hypre/2.9.0b/mvapich2/gnu/lib/libHYPRE-2.9.0b.so)
> ==17453==    by 0x58BF674: solvehypre1_ (SolveHypre.c:469)
> ==17453==    by 0x583E194: __sparitersolve_MOD_sparitersolver
> (SParIterSolver.f90:1843)
> ==17453==    by 0x588143F: __parallelutils_MOD_paralleliter
> (ParallelUtils.f90:740)
> ==17453==    by 0x57868C7: __solverutils_MOD_solvelinearsystem
> (SolverUtils.f90:6212)
> ==17453==    by 0x578D47A: __solverutils_MOD_solvesystem
> (SolverUtils.f90:6557)
> ==17453==    by 0x589A768: __defutils_MOD_defaultsolve (DefUtils.f90:2392)
> ==17453==    by 0xCFCC070: dosolve.5516 (MagnetoDynamics.f90:802)
> ==17453==  Address 0xfffffefc04060eac is not stack'd, malloc'd or
> (recently) free'd
>
>
>
> On Tue, 19 Aug 2014, Evren Yurtesen IB wrote:
>
>> Hello,
>>
>> I already tried to reproduce it myself, but could not do so with a
>> custom-built test case that should be doing exactly the same thing. I
>> suspect some other factors are affecting this issue.
>>
>> 1- The number of processes on the node is 12 (it is a 12-core system). If
>> I run the 12 processes across 2 nodes, the problem does not seem to appear.
>> It appears when all 12 processes are on the same node.
>>
>> 2- Nodes have dual X5650 with Mellanox Technologies MT26438
>>
>> 3-
>>
>> -bash-4.1$ mpichversion
>> MVAPICH2 Version:       2.0
>> MVAPICH2 Release date:  Fri Jun 20 20:00:00 EDT 2014
>> MVAPICH2 Device:        ch3:mrail
>> MVAPICH2 configure:     --prefix=/export/modules/apps/mvapich2/2.0/gnu
>> --enable-fast=nochkmsg,notiming,ndebug --enable-shared --enable-mpe
>> --with-pmi=slurm --with-pm=no
>> --with-valgrind=/export/modules/tools/valgrind/3.8.1/include/valgrind
>> MVAPICH2 CC:    gcc -Ofast -march=native -mtune=native   -DNDEBUG
>> -DNVALGRIND
>> MVAPICH2 CXX:   g++ -Ofast -march=native -mtune=native  -DNDEBUG
>> -DNVALGRIND
>> MVAPICH2 F77:   gfortran -Ofast -march=native -mtune=native
>> MVAPICH2 FC:    gfortran -Ofast -march=native -mtune=native
>> -bash-4.1$
>>
>>
>> I went ahead and compiled MVAPICH2 with the --enable-g=all option, then
>> ran it under valgrind again; the result is below. Does this help you
>> narrow it down a little?
>>
>> Here is the configuration info for the build with --enable-g=all:
>>
>> -bash-4.1$ mpichversion
>> MVAPICH2 Version:       2.0
>> MVAPICH2 Release date:  Fri Jun 20 20:00:00 EDT 2014
>> MVAPICH2 Device:        ch3:mrail
>> MVAPICH2 configure:     --prefix=/export/modules/apps/mvapich2/2.0/gnu
>> --enable-fast=nochkmsg,notiming,ndebug --enable-shared --enable-mpe
>> --with-pmi=slurm --with-pm=no
>> --with-valgrind=/export/modules/tools/valgrind/3.8.1/include/valgrind
>> --enable-g=all
>> MVAPICH2 CC:    gcc -Ofast -march=native -mtune=native -g   -DNDEBUG
>> -DNVALGRIND -g
>> MVAPICH2 CXX:   g++ -Ofast -march=native -mtune=native -g  -DNDEBUG
>> -DNVALGRIND -g
>> MVAPICH2 F77:   gfortran -Ofast -march=native -mtune=native -g  -g
>> MVAPICH2 FC:    gfortran -Ofast -march=native -mtune=native -g  -g
>> -bash-4.1$
>>
>> Below is what valgrind says:
>>
>> ==29664== Invalid read of size 4
>> ==29664==    at 0x5015E73: MPIR_Allreduce_index_tuned_intra_MV2
>> (allreduce_osu.c:2358)
>> ==29664==    by 0x4FB39C5: MPIR_Allreduce_impl (allreduce.c:788)
>> ==29664==    by 0x4FB4162: PMPI_Allreduce (allreduce.c:929)
>> ==29664==    by 0x4ED96D7: PMPI_ALLREDUCE (allreducef.c:272)
>> ==29664==    by 0x57F3609: __sparitercomm_MOD_sparglobalnumbering
>> (SParIterComm.f90:2093)
>> ==29664==    by 0x584AADF: __parallelutils_MOD_parallelglobalnumbering
>> (ParallelUtils.f90:723)
>> ==29664==    by 0x57D7EB4: __meshutils_MOD_splitmeshequal
>> (MeshUtils.f90:10947)
>> ==29664==    by 0x566FF7B: __modeldescription_MOD_loadmodel
>> (ModelDescription.f90:2205)
>> ==29664==    by 0x58E6118: elmersolver_ (ElmerSolver.f90:406)
>> ==29664==    by 0x400F2A: MAIN__ (in /export/modules/apps/elmer/6825/mvapich2/gnu/bin/ElmerSolver_mpi)
>> ==29664==    by 0x4010CE: main (in /export/modules/apps/elmer/6825/mvapich2/gnu/bin/ElmerSolver_mpi)
>> ==29664==  Address 0xfffffdb404056228 is not stack'd, malloc'd or
>> (recently) free'd
>> ==29664==
>>
>> Thanks,
>> Evren
>>
>>
>>
>>
>> On Tue, 19 Aug 2014, Akshay Venkatesh wrote:
>>
>>> Hi Evren,
>>>
>>> We're trying to reproduce the bug that you reported. Some more details
>>> of the way the job was launched and how the library was configured would
>>> help us narrow down the cause. Could you provide job details such as:
>>> 1. The number of MPI processes that are run on a single node for the
>>> 12-process job.
>>> 2. The processor type and the network HCA type on which the job was run.
>>> 3. The configuration flags used to build the library.
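
For readers hitting a similar problem, the three items requested above can be gathered with a short script like the one below. It is a sketch under assumptions: collect_job_info is a made-up name, the processor lookup assumes a Linux /proc/cpuinfo, the HCA lookup assumes lspci is installed, and SLURM_NTASKS_PER_NODE is only set by Slurm when --ntasks-per-node was requested. Each lookup falls back to "unknown" rather than failing.

```shell
# Hypothetical helper: print the details requested in the list above.
collect_job_info() {
    echo "== Processes per node =="
    # Only set by Slurm when the job was submitted with --ntasks-per-node.
    echo "${SLURM_NTASKS_PER_NODE:-unknown}"

    echo "== Processor =="
    if [ -r /proc/cpuinfo ]; then
        # First "model name" line; fall back if the field is absent.
        grep -m1 'model name' /proc/cpuinfo || echo "unknown"
    else
        echo "unknown"
    fi

    echo "== HCA =="
    if command -v lspci >/dev/null 2>&1; then
        lspci | grep -i -e mellanox -e infiniband || echo "none listed"
    else
        echo "unknown"
    fi

    echo "== Library configuration =="
    if command -v mpichversion >/dev/null 2>&1; then
        # mpichversion prints the configure line, as shown earlier in this thread.
        mpichversion
    else
        echo "unknown"
    fi
}
```

Running collect_job_info on a compute node inside the job allocation, and pasting its output into the report, covers all three points in one go.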
>>>
>>> Thanks a lot
>>>
>>> --
>>> - Akshay
>>>
>>>
>>>
>>


-- 
- Akshay
-------------- next part --------------
A non-text attachment was scrubbed...
Name: index_tuning_fix.patch
Type: text/x-patch
Size: 6443 bytes
Desc: not available
URL: <http://mailman.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20140903/7a2a8069/attachment-0001.bin>