[mvapich-discuss] NPB Data Traffic Verification Fail
Sourav Chakraborty
chakraborty.52 at buckeyemail.osu.edu
Mon Jun 6 14:57:44 EDT 2016
Hi Hoang-Vu,
Can you apply the attached patch on MVAPICH2 2.2rc1 and try it out? You can
download MVAPICH2 2.2rc1 from the following URL:
http://mvapich.cse.ohio-state.edu/download/mvapich/mv2/mvapich2-2.2rc1.tar.gz
Please let us know if this fixes the issue for you.
Thanks,
Sourav
On Sat, Jun 4, 2016 at 1:59 PM, Hoang-Vu Dang <dang.hvu at gmail.com> wrote:
> Any update on this issue, please?
>
> On Mon, May 16, 2016 at 2:24 PM, Hari Subramoni <subramoni.1 at osu.edu>
> wrote:
>
>> Hello,
>>
>> We are actively debugging this issue. We will get back to you soon.
>>
>> Regards,
>> Hari.
>>
>> On Wed, May 11, 2016 at 8:36 PM, Hoang-Vu Dang <dang.hvu at gmail.com>
>> wrote:
>>
>>> Ping! Any update on this issue, please?
>>>
>>> Vu
>>>
>>> On Mon, May 2, 2016 at 5:26 PM, Sourav Chakraborty <
>>> chakraborty.52 at buckeyemail.osu.edu> wrote:
>>>
>>>> Hi Hoang-Vu,
>>>>
>>>> We are able to reproduce the issue and are investigating it. Right now it
>>>> looks like an issue with the benchmark itself, but we need some more time
>>>> to figure out exactly what's going on.
>>>>
>>>> In the meantime, you can set MV2_ENABLE_AFFINITY=0, since the failure does
>>>> not seem to occur with affinity disabled. We will let you know once we
>>>> have a proper solution.
>>>>
>>>> Thanks,
>>>> Sourav
>>>>
>>>>
>>>> On Mon, May 2, 2016 at 6:18 PM, Hoang-Vu Dang <dang.hvu at gmail.com>
>>>> wrote:
>>>>
>>>>> Is there any news?
>>>>>
>>>>> On Fri, Apr 29, 2016 at 5:39 PM, Hoang-Vu Dang <dang.hvu at gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I assume you are able to reproduce it? Is there anything I can do to
>>>>>> work around it?
>>>>>>
>>>>>> On Fri, Apr 29, 2016 at 4:04 PM, Sourav Chakraborty <
>>>>>> chakraborty.52 at buckeyemail.osu.edu> wrote:
>>>>>>
>>>>>>> Hi Hoang-Vu,
>>>>>>>
>>>>>>> Thanks for providing the details. We will take a look and get back
>>>>>>> to you.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Sourav
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Apr 29, 2016 at 4:59 PM, Hoang-Vu Dang <dang.hvu at gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I can reproduce it for class S too, on SH (quite frequently). I
>>>>>>>> haven't seen it on BH or WH yet.
>>>>>>>>
>>>>>>>> ibrun -np 12 ../bin/dt.S.x SH
>>>>>>>>
>>>>>>>> DT_SH.S Benchmark Completed
>>>>>>>> Class = S
>>>>>>>> Size = 6912
>>>>>>>> Iterations = 12
>>>>>>>> Time in seconds = 0.00
>>>>>>>> Total processes = 12
>>>>>>>> Mop/s total = 56.89
>>>>>>>> Mop/s/process = 4.74
>>>>>>>> Operation type = bytes transmitted
>>>>>>>> Verification = UNSUCCESSFUL
>>>>>>>> Version = 3.3.1
>>>>>>>> Compile date = 28 Apr 2016
>>>>>>>>
>>>>>>>> Compile options:
>>>>>>>> MPICC = mpicc
>>>>>>>> CLINK = $(MPICC)
>>>>>>>> CMPI_LIB = -L/usr/local/lib #-lmpi
>>>>>>>> CMPI_INC = -I/usr/local/include
>>>>>>>> CFLAGS = -O3
>>>>>>>> CLINKFLAGS = -O3
>>>>>>>>
>>>>>>>> Here is some more information from mpiname -a:
>>>>>>>>
>>>>>>>> MVAPICH2 2.1 Fri Apr 03 20:00:00 EDT 2015 ch3:mrail
>>>>>>>>
>>>>>>>> Compilation
>>>>>>>> CC: gcc -pipe -g -O3
>>>>>>>> CXX: g++ -pipe -g -O3
>>>>>>>> F77: gfortran -L/opt/ofed/lib64/ -L/lib -L/lib -pipe -g -O3
>>>>>>>> FC: gfortran -g -O3
>>>>>>>>
>>>>>>>> Configuration
>>>>>>>> --prefix=/opt/apps/gcc4_9/mvapich2/2.1
>>>>>>>> --with-ib-libpath=/opt/ofed/lib64/ --with-ib-include=/opt/ofed/include/
>>>>>>>> --enable-cxx --enable-romio --enable-fast=O3 --enable-g=dbg
>>>>>>>> --enable-sharedlibs=gcc --enable-shared --with-ch3-rank-bits=32
>>>>>>>> --with-file-system=lustre --enable-mcast --enable-hybrid
>>>>>>>>
>>>>>>>> ldd dtS
>>>>>>>> linux-vdso.so.1 => (0x00007fff0d0c6000)
>>>>>>>> libmpi.so.12 =>
>>>>>>>> /opt/apps/gcc4_9/mvapich2/2.1/lib/libmpi.so.12 (0x00002b42c96dc000)
>>>>>>>> libc.so.6 => /lib64/libc.so.6 (0x0000003469400000)
>>>>>>>> libnuma.so.1 => /usr/lib64/libnuma.so.1 (0x000000346b000000)
>>>>>>>> libxml2.so.2 => /usr/lib64/libxml2.so.2 (0x0000003470800000)
>>>>>>>> libibmad.so.5 => /opt/ofed/lib64/libibmad.so.5
>>>>>>>> (0x00002b42c9e57000)
>>>>>>>> librdmacm.so.1 => /opt/ofed/lib64/librdmacm.so.1
>>>>>>>> (0x00002b42ca06e000)
>>>>>>>> libibumad.so.3 => /opt/ofed/lib64/libibumad.so.3
>>>>>>>> (0x00002b42ca276000)
>>>>>>>> libibverbs.so.1 => /opt/ofed/lib64/libibverbs.so.1
>>>>>>>> (0x00002b42ca47d000)
>>>>>>>> libdl.so.2 => /lib64/libdl.so.2 (0x000000346a000000)
>>>>>>>> librt.so.1 => /lib64/librt.so.1 (0x000000346a400000)
>>>>>>>> libgfortran.so.3 =>
>>>>>>>> /opt/apps/gcc/4.9.1/lib64/libgfortran.so.3 (0x00002b42ca68c000)
>>>>>>>> libm.so.6 => /lib64/libm.so.6 (0x0000003469800000)
>>>>>>>> libpthread.so.0 => /lib64/libpthread.so.0
>>>>>>>> (0x0000003469c00000)
>>>>>>>> libgcc_s.so.1 => /opt/apps/gcc/4.9.1/lib64/libgcc_s.so.1
>>>>>>>> (0x00002b42ca9a8000)
>>>>>>>> libquadmath.so.0 =>
>>>>>>>> /opt/apps/gcc/4.9.1/lib64/libquadmath.so.0 (0x00002b42cabbe000)
>>>>>>>> /lib64/ld-linux-x86-64.so.2 (0x0000003469000000)
>>>>>>>> libz.so.1 => /lib64/libz.so.1 (0x000000346a800000)
>>>>>>>>
>>>>>>>> I think affinity is somehow involved. The run succeeds with this setting:
>>>>>>>>
>>>>>>>> MV2_ENABLE_AFFINITY=0 ibrun -np 12 ../bin/dt.S.x SH
>>>>>>>>
>>>>>>>> but not with the default settings.
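A minimal sketch for measuring how often the verification fails with and without the affinity setting, using the commands quoted in this message. The helper name `count_failures` and the run count are illustrative assumptions and are not part of MVAPICH2 or NPB. The only thing it relies on is the `Verification = UNSUCCESSFUL` line from the benchmark output shown earlier in this message.

```shell
# count_failures N CMD [ARGS...]
# Hypothetical helper: runs CMD N times and prints how many runs
# printed a "Verification = UNSUCCESSFUL" line.
count_failures() {
    cf_runs=$1
    shift
    cf_failures=0
    cf_i=0
    while [ "$cf_i" -lt "$cf_runs" ]; do
        # Capture stdout and stderr; a non-zero exit status is ignored here.
        cf_output=$("$@" 2>&1) || true
        if printf '%s\n' "$cf_output" |
            grep -E 'Verification[[:space:]]*=[[:space:]]*UNSUCCESSFUL' >/dev/null; then
            cf_failures=$((cf_failures + 1))
        fi
        cf_i=$((cf_i + 1))
    done
    echo "$cf_failures"
}

# Example usage with the commands from this thread (20 runs is arbitrary):
#   count_failures 20 ibrun -np 12 ../bin/dt.S.x SH
#   count_failures 20 env MV2_ENABLE_AFFINITY=0 ibrun -np 12 ../bin/dt.S.x SH
```

Comparing the two counts over the same number of runs gives a rough failure rate for each setting.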
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Apr 29, 2016 at 3:31 PM, Sourav Chakraborty <
>>>>>>>> chakraborty.52 at buckeyemail.osu.edu> wrote:
>>>>>>>>
>>>>>>>>> Hi Hoang-Vu,
>>>>>>>>>
>>>>>>>>> We were unable to reproduce the issue you mentioned. Can you
>>>>>>>>> please give some more details about the configuration/build parameters used
>>>>>>>>> to build MVAPICH2 and NPB? You can obtain this information by running
>>>>>>>>> mpiname -a.
>>>>>>>>>
>>>>>>>>> Also, does the error occur only with class A and SH? How
>>>>>>>>> frequently have you noticed the issue?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Sourav
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Apr 29, 2016 at 11:11 AM, Hoang-Vu Dang <
>>>>>>>>> dang.hvu at gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> The benchmark is the DT MPI version inside this tarball:
>>>>>>>>>>
>>>>>>>>>> http://www.nas.nasa.gov/assets/npb/NPB3.3.1.tar.gz
>>>>>>>>>>
>>>>>>>>>> It's built with mvapich2 2.1 (gcc/4.9.1) on the Stampede cluster with:
>>>>>>>>>>
>>>>>>>>>> cd ~/NPB3.3.1/NPB3.3-MPI/DT
>>>>>>>>>> make CLASS=A
>>>>>>>>>>
>>>>>>>>>> Run with problem SH, for example:
>>>>>>>>>>
>>>>>>>>>> MV2_USE_SHARED_MEM=0 ibrun -np 80 ./dt SH
>>>>>>>>>>
>>>>>>>>>> Sometimes it gives correct results:
>>>>>>>>>>
>>>>>>>>>> DT_SH.A L2 Norm = 610856482.000000
>>>>>>>>>> Deviation = 0.000000
>>>>>>>>>>
>>>>>>>>>> Sometimes it gives wrong results:
>>>>>>>>>>
>>>>>>>>>> DT_SH.A L2 Norm = 571204151.000000
>>>>>>>>>> The correct verification value = 610856482.000000
>>>>>>>>>> Got value = 571204151.000000
>>>>>>>>>>
>>>>>>>>>> Is there anything I can do to debug? Is it reproducible?
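To put a number on how far a bad run is from the reference, the two values printed in the failing output can be compared directly. This is a rough sketch that assumes the reported deviation is a relative error, |got - reference| / reference. The helper name `relative_deviation` is hypothetical, and `awk` is used only for the floating-point arithmetic.

```shell
# relative_deviation GOT REFERENCE
# Hypothetical helper: prints |GOT - REFERENCE| / REFERENCE to four decimals.
relative_deviation() {
    awk -v got="$1" -v ref="$2" 'BEGIN {
        diff = got - ref
        if (diff < 0) diff = -diff
        printf "%.4f\n", diff / ref
    }'
}

# Values from the failing run quoted in this message:
#   relative_deviation 571204151.000000 610856482.000000
```

For the failing run quoted here, the result is roughly 6.5%, which is far larger than floating-point rounding noise.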
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> mvapich-discuss mailing list
>>>>>>>>>> mvapich-discuss at cse.ohio-state.edu
>>>>>>>>>> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: dt_verification_hwloc.patch
Type: text/x-diff
Size: 654 bytes
Desc: not available
URL: <http://mailman.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20160606/375c9639/attachment-0001.bin>