[mvapich-discuss] NPB Data Traffic Verification Fail
Hoang-Vu Dang
dang.hvu at gmail.com
Wed May 11 20:36:19 EDT 2016
ping! Update on this issue please!
Vu
On Mon, May 2, 2016 at 5:26 PM, Sourav Chakraborty <
chakraborty.52 at buckeyemail.osu.edu> wrote:
> Hi Hoang-Vu,
>
> We were able to reproduce the issue and are investigating it. Right now it
> looks like an issue with the benchmark itself, but we need some more time
> to figure out exactly what's going on.
>
> In the meantime, you can set MV2_ENABLE_AFFINITY=0 since it does not seem
> to happen with affinity disabled. We will let you know once we have a
> proper solution.
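For reference, the suggested workaround can be applied by exporting the variable before launching. A minimal sketch (the `ibrun` invocation and `dt.S.x` path are the ones reported later in this thread; the launch itself is commented out since it needs the cluster):

```shell
# Apply the suggested workaround: disable MVAPICH2's CPU affinity.
export MV2_ENABLE_AFFINITY=0
# ibrun -np 12 ../bin/dt.S.x SH   # actual launch, requires the cluster
echo "MV2_ENABLE_AFFINITY=$MV2_ENABLE_AFFINITY"
```

Equivalently, the variable can be set per-command by prefixing it to the launch line, as shown in the report below.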
>
> Thanks,
> Sourav
>
>
> On Mon, May 2, 2016 at 6:18 PM, Hoang-Vu Dang <dang.hvu at gmail.com> wrote:
>
>> Is there any news?
>>
>> On Fri, Apr 29, 2016 at 5:39 PM, Hoang-Vu Dang <dang.hvu at gmail.com>
>> wrote:
>>
>>> I assume you are able to reproduce it? Is there anything I can do to
>>> work around it?
>>>
>>> On Fri, Apr 29, 2016 at 4:04 PM, Sourav Chakraborty <
>>> chakraborty.52 at buckeyemail.osu.edu> wrote:
>>>
>>>> Hi Hoang-Vu,
>>>>
>>>> Thanks for providing the details. We will take a look and get back to
>>>> you.
>>>>
>>>> Thanks,
>>>> Sourav
>>>>
>>>>
>>>> On Fri, Apr 29, 2016 at 4:59 PM, Hoang-Vu Dang <dang.hvu at gmail.com>
>>>> wrote:
>>>>
>>>>> I can reproduce it for class S too, on SH (quite frequently); I
>>>>> haven't seen it on BH or WH yet.
>>>>>
>>>>> ibrun -np 12 ../bin/dt.S.x SH
>>>>>
>>>>> DT_SH.S Benchmark Completed
>>>>> Class = S
>>>>> Size = 6912
>>>>> Iterations = 12
>>>>> Time in seconds = 0.00
>>>>> Total processes = 12
>>>>> Mop/s total = 56.89
>>>>> Mop/s/process = 4.74
>>>>> Operation type = bytes transmitted
>>>>> Verification = UNSUCCESSFUL
>>>>> Version = 3.3.1
>>>>> Compile date = 28 Apr 2016
>>>>>
>>>>> Compile options:
>>>>> MPICC = mpicc
>>>>> CLINK = $(MPICC)
>>>>> CMPI_LIB = -L/usr/local/lib #-lmpi
>>>>> CMPI_INC = -I/usr/local/include
>>>>> CFLAGS = -O3
>>>>> CLINKFLAGS = -O3
>>>>>
>>>>> Here is some more information: mpiname -a
>>>>>
>>>>> MVAPICH2 2.1 Fri Apr 03 20:00:00 EDT 2015 ch3:mrail
>>>>>
>>>>> Compilation
>>>>> CC: gcc -pipe -g -O3
>>>>> CXX: g++ -pipe -g -O3
>>>>> F77: gfortran -L/opt/ofed/lib64/ -L/lib -L/lib -pipe -g -O3
>>>>> FC: gfortran -g -O3
>>>>>
>>>>> Configuration
>>>>> --prefix=/opt/apps/gcc4_9/mvapich2/2.1
>>>>> --with-ib-libpath=/opt/ofed/lib64/ --with-ib-include=/opt/ofed/include/
>>>>> --enable-cxx --enable-romio --enable-fast=O3 --enable-g=dbg
>>>>> --enable-sharedlibs=gcc --enable-shared --with-ch3-rank-bits=32
>>>>> --with-file-system=lustre --enable-mcast --enable-hybrid
>>>>>
>>>>> ldd dtS
>>>>> linux-vdso.so.1 => (0x00007fff0d0c6000)
>>>>> libmpi.so.12 => /opt/apps/gcc4_9/mvapich2/2.1/lib/libmpi.so.12
>>>>> (0x00002b42c96dc000)
>>>>> libc.so.6 => /lib64/libc.so.6 (0x0000003469400000)
>>>>> libnuma.so.1 => /usr/lib64/libnuma.so.1 (0x000000346b000000)
>>>>> libxml2.so.2 => /usr/lib64/libxml2.so.2 (0x0000003470800000)
>>>>> libibmad.so.5 => /opt/ofed/lib64/libibmad.so.5
>>>>> (0x00002b42c9e57000)
>>>>> librdmacm.so.1 => /opt/ofed/lib64/librdmacm.so.1
>>>>> (0x00002b42ca06e000)
>>>>> libibumad.so.3 => /opt/ofed/lib64/libibumad.so.3
>>>>> (0x00002b42ca276000)
>>>>> libibverbs.so.1 => /opt/ofed/lib64/libibverbs.so.1
>>>>> (0x00002b42ca47d000)
>>>>> libdl.so.2 => /lib64/libdl.so.2 (0x000000346a000000)
>>>>> librt.so.1 => /lib64/librt.so.1 (0x000000346a400000)
>>>>> libgfortran.so.3 => /opt/apps/gcc/4.9.1/lib64/libgfortran.so.3
>>>>> (0x00002b42ca68c000)
>>>>> libm.so.6 => /lib64/libm.so.6 (0x0000003469800000)
>>>>> libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003469c00000)
>>>>> libgcc_s.so.1 => /opt/apps/gcc/4.9.1/lib64/libgcc_s.so.1
>>>>> (0x00002b42ca9a8000)
>>>>> libquadmath.so.0 => /opt/apps/gcc/4.9.1/lib64/libquadmath.so.0
>>>>> (0x00002b42cabbe000)
>>>>> /lib64/ld-linux-x86-64.so.2 (0x0000003469000000)
>>>>> libz.so.1 => /lib64/libz.so.1 (0x000000346a800000)
>>>>>
>>>>> I think affinity is somehow involved. It succeeds with this setting:
>>>>> MV2_ENABLE_AFFINITY=0 ibrun -np 12 ../bin/dt.S.x SH
>>>>>
>>>>> but not with the default.
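To put a number on how often verification fails with and without the setting, each run's output can be captured to a log and the UNSUCCESSFUL lines counted. A sketch of the tally step, using two illustrative stand-in logs in place of real captured benchmark output:

```shell
# Sketch: tally verification results from captured benchmark logs.
# run_1.log / run_2.log are illustrative stand-ins for real captured output.
cat > run_1.log <<'EOF'
 Verification    =               SUCCESSFUL
EOF
cat > run_2.log <<'EOF'
 Verification    =             UNSUCCESSFUL
EOF
fails=$(grep -l 'UNSUCCESSFUL' run_*.log | wc -l | tr -d ' ')
total=$(ls run_*.log | wc -l | tr -d ' ')
echo "verification failed in $fails of $total runs"
```

Running the same tally over logs from affinity-enabled and affinity-disabled batches would make the correlation concrete.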
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Apr 29, 2016 at 3:31 PM, Sourav Chakraborty <
>>>>> chakraborty.52 at buckeyemail.osu.edu> wrote:
>>>>>
>>>>>> Hi Hoang-Vu,
>>>>>>
>>>>>> We were unable to reproduce the issue you mentioned. Can you please
>>>>>> give some more details about the configuration/build parameters used to
>>>>>> build MVAPICH2 and NPB? You can obtain this information by running mpiname
>>>>>> -a.
>>>>>>
>>>>>> Also, does the error occur only with class A and SH? How frequently
>>>>>> have you noticed the issue?
>>>>>>
>>>>>> Thanks,
>>>>>> Sourav
>>>>>>
>>>>>>
>>>>>> On Fri, Apr 29, 2016 at 11:11 AM, Hoang-Vu Dang <dang.hvu at gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> The benchmark is the MPI version of DT inside this tarball:
>>>>>>>
>>>>>>> http://www.nas.nasa.gov/assets/npb/NPB3.3.1.tar.gz
>>>>>>>
>>>>>>> It's built with mvapich2 2.1 (gcc/4.9.1) on the Stampede cluster with:
>>>>>>>
>>>>>>> cd ~/NPB3.3.1/NPB3.3-MPI/DT
>>>>>>> make CLASS=A
>>>>>>>
>>>>>>> Run with graph SH, for example:
>>>>>>>
>>>>>>> MV2_USE_SHARED_MEM=0 ibrun -np 80 ./dt SH
>>>>>>>
>>>>>>> Sometimes it gives correct results:
>>>>>>>
>>>>>>> DT_SH.A L2 Norm = 610856482.000000
>>>>>>> Deviation = 0.000000
>>>>>>>
>>>>>>> Sometimes it gives wrong results:
>>>>>>>
>>>>>>> DT_SH.A L2 Norm = 571204151.000000
>>>>>>> The correct verification value = 610856482.000000
>>>>>>> Got value = 571204151.000000
>>>>>>>
>>>>>>> Is there anything I can do to debug? Is it reproducible?
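One low-effort reproducibility check is to loop the run and tally outcomes. In the sketch below, `run_bench` is a hypothetical stub standing in for the real launch (`MV2_USE_SHARED_MEM=0 ibrun -np 80 ./dt SH`, which needs the cluster); the stub alternates pass/fail only so the tallying logic itself is runnable:

```shell
# Reproducibility-check sketch: rerun N times, count verification failures.
# run_bench is a stub for: MV2_USE_SHARED_MEM=0 ibrun -np 80 ./dt SH
run_bench() {
  # Stub alternates pass/fail so the tally logic can be exercised.
  if [ $(( $1 % 2 )) -eq 0 ]; then
    echo " Verification    =               SUCCESSFUL"
  else
    echo " Verification    =             UNSUCCESSFUL"
  fi
}
fail=0
for i in 1 2 3 4 5 6 7 8 9 10; do
  out=$(run_bench "$i")
  case "$out" in *UNSUCCESSFUL*) fail=$((fail+1)) ;; esac
done
echo "failures: $fail/10"
```

Swapping the stub body for the real launch line gives a rough failure rate to report alongside the configuration details.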
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> mvapich-discuss mailing list
>>>>>>> mvapich-discuss at cse.ohio-state.edu
>>>>>>> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>