[mvapich-discuss] NPB Data Traffic Verification Failure

Hoang-Vu Dang dang.hvu at gmail.com
Sun Jul 3 10:29:34 EDT 2016


No, this doesn't fix it. I can actually sometimes reproduce the issue without
binding as well. I suspect a protocol implementation error.
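
In case it helps with reproducing, here is a minimal sketch of a loop that
reruns the benchmark and counts verification failures (the dt.S.x path and the
ibrun launcher are taken from my setup quoted below; adjust both for your
system):

#!/bin/bash
# Rerun the NPB DT SH benchmark and count how often verification fails.
# Assumes the class S binary and the ibrun launcher described below.
PASS=0
FAIL=0
for i in $(seq 1 50); do
    if ibrun -np 12 ../bin/dt.S.x SH | grep -q UNSUCCESSFUL; then
        FAIL=$((FAIL + 1))
    else
        PASS=$((PASS + 1))
    fi
done
echo "passed=$PASS failed=$FAIL"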

On Mon, Jun 6, 2016 at 1:57 PM, Sourav Chakraborty <
chakraborty.52 at buckeyemail.osu.edu> wrote:

> Hi Hoang-Vu,
>
> Can you apply the attached patch to MVAPICH2 2.2rc1 and try it out? You
> can download MVAPICH2 2.2rc1 from the following URL:
> http://mvapich.cse.ohio-state.edu/download/mvapich/mv2/mvapich2-2.2rc1.tar.gz
>
> Please let us know if this fixes the issue for you.
>
> Thanks,
> Sourav
>
>
>
> On Sat, Jun 4, 2016 at 1:59 PM, Hoang-Vu Dang <dang.hvu at gmail.com> wrote:
>
>> Any update on this issue, please?
>>
>> On Mon, May 16, 2016 at 2:24 PM, Hari Subramoni <subramoni.1 at osu.edu>
>> wrote:
>>
>>> Hello,
>>>
>>> We are actively debugging this issue. We will get back to you soon.
>>>
>>> Regards,
>>> Hari.
>>>
>>> On Wed, May 11, 2016 at 8:36 PM, Hoang-Vu Dang <dang.hvu at gmail.com>
>>> wrote:
>>>
>>>> ping! Update on this issue please!
>>>>
>>>> Vu
>>>>
>>>> On Mon, May 2, 2016 at 5:26 PM, Sourav Chakraborty <
>>>> chakraborty.52 at buckeyemail.osu.edu> wrote:
>>>>
>>>>> Hi Hoang-Vu,
>>>>>
>>>>> We are able to reproduce the issue and are investigating it. Right now it
>>>>> looks like an issue with the benchmark itself, but we need some more time
>>>>> to figure out exactly what's going on.
>>>>>
>>>>> In the meantime, you can set MV2_ENABLE_AFFINITY=0, since the issue does
>>>>> not seem to occur with affinity disabled. We will let you know once we
>>>>> have a proper solution.
>>>>>
>>>>> Thanks,
>>>>> Sourav
>>>>>
>>>>>
>>>>> On Mon, May 2, 2016 at 6:18 PM, Hoang-Vu Dang <dang.hvu at gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Is there any news?
>>>>>>
>>>>>> On Fri, Apr 29, 2016 at 5:39 PM, Hoang-Vu Dang <dang.hvu at gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I assume you are able to reproduce it? Is there anything I can do to
>>>>>>> work around it?
>>>>>>>
>>>>>>> On Fri, Apr 29, 2016 at 4:04 PM, Sourav Chakraborty <
>>>>>>> chakraborty.52 at buckeyemail.osu.edu> wrote:
>>>>>>>
>>>>>>>> Hi Hoang-Vu,
>>>>>>>>
>>>>>>>> Thanks for providing the details. We will take a look and get back
>>>>>>>> to you.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Sourav
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Apr 29, 2016 at 4:59 PM, Hoang-Vu Dang <dang.hvu at gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I can reproduce it for class S too, on SH (quite frequently); I
>>>>>>>>> haven't seen it on BH or WH yet.
>>>>>>>>>
>>>>>>>>> ibrun -np 12 ../bin/dt.S.x SH
>>>>>>>>>
>>>>>>>>>  DT_SH.S Benchmark Completed
>>>>>>>>>  Class           =                        S
>>>>>>>>>  Size            =                     6912
>>>>>>>>>  Iterations      =                       12
>>>>>>>>>  Time in seconds =                     0.00
>>>>>>>>>  Total processes =                       12
>>>>>>>>>  Mop/s total     =                    56.89
>>>>>>>>>  Mop/s/process   =                     4.74
>>>>>>>>>  Operation type  =        bytes transmitted
>>>>>>>>>  Verification    =             UNSUCCESSFUL
>>>>>>>>>  Version         =                    3.3.1
>>>>>>>>>  Compile date    =              28 Apr 2016
>>>>>>>>>
>>>>>>>>>  Compile options:
>>>>>>>>>     MPICC        = mpicc
>>>>>>>>>     CLINK        = $(MPICC)
>>>>>>>>>     CMPI_LIB     = -L/usr/local/lib #-lmpi
>>>>>>>>>     CMPI_INC     = -I/usr/local/include
>>>>>>>>>     CFLAGS       = -O3
>>>>>>>>>     CLINKFLAGS   = -O3
>>>>>>>>>
>>>>>>>>> Here is some more information: mpiname -a
>>>>>>>>>
>>>>>>>>> MVAPICH2 2.1 Fri Apr 03 20:00:00 EDT 2015 ch3:mrail
>>>>>>>>>
>>>>>>>>> Compilation
>>>>>>>>> CC: gcc   -pipe   -g -O3
>>>>>>>>> CXX: g++   -pipe  -g -O3
>>>>>>>>> F77: gfortran -L/opt/ofed/lib64/ -L/lib -L/lib   -pipe  -g -O3
>>>>>>>>> FC: gfortran    -g -O3
>>>>>>>>>
>>>>>>>>> Configuration
>>>>>>>>> --prefix=/opt/apps/gcc4_9/mvapich2/2.1
>>>>>>>>> --with-ib-libpath=/opt/ofed/lib64/ --with-ib-include=/opt/ofed/include/
>>>>>>>>> --enable-cxx --enable-romio --enable-fast=O3 --enable-g=dbg
>>>>>>>>> --enable-sharedlibs=gcc --enable-shared --with-ch3-rank-bits=32
>>>>>>>>> --with-file-system=lustre --enable-mcast --enable-hybrid
>>>>>>>>>
>>>>>>>>> ldd dtS
>>>>>>>>>         linux-vdso.so.1 =>  (0x00007fff0d0c6000)
>>>>>>>>>         libmpi.so.12 =>
>>>>>>>>> /opt/apps/gcc4_9/mvapich2/2.1/lib/libmpi.so.12 (0x00002b42c96dc000)
>>>>>>>>>         libc.so.6 => /lib64/libc.so.6 (0x0000003469400000)
>>>>>>>>>         libnuma.so.1 => /usr/lib64/libnuma.so.1
>>>>>>>>> (0x000000346b000000)
>>>>>>>>>         libxml2.so.2 => /usr/lib64/libxml2.so.2
>>>>>>>>> (0x0000003470800000)
>>>>>>>>>         libibmad.so.5 => /opt/ofed/lib64/libibmad.so.5
>>>>>>>>> (0x00002b42c9e57000)
>>>>>>>>>         librdmacm.so.1 => /opt/ofed/lib64/librdmacm.so.1
>>>>>>>>> (0x00002b42ca06e000)
>>>>>>>>>         libibumad.so.3 => /opt/ofed/lib64/libibumad.so.3
>>>>>>>>> (0x00002b42ca276000)
>>>>>>>>>         libibverbs.so.1 => /opt/ofed/lib64/libibverbs.so.1
>>>>>>>>> (0x00002b42ca47d000)
>>>>>>>>>         libdl.so.2 => /lib64/libdl.so.2 (0x000000346a000000)
>>>>>>>>>         librt.so.1 => /lib64/librt.so.1 (0x000000346a400000)
>>>>>>>>>         libgfortran.so.3 =>
>>>>>>>>> /opt/apps/gcc/4.9.1/lib64/libgfortran.so.3 (0x00002b42ca68c000)
>>>>>>>>>         libm.so.6 => /lib64/libm.so.6 (0x0000003469800000)
>>>>>>>>>         libpthread.so.0 => /lib64/libpthread.so.0
>>>>>>>>> (0x0000003469c00000)
>>>>>>>>>         libgcc_s.so.1 => /opt/apps/gcc/4.9.1/lib64/libgcc_s.so.1
>>>>>>>>> (0x00002b42ca9a8000)
>>>>>>>>>         libquadmath.so.0 =>
>>>>>>>>> /opt/apps/gcc/4.9.1/lib64/libquadmath.so.0 (0x00002b42cabbe000)
>>>>>>>>>         /lib64/ld-linux-x86-64.so.2 (0x0000003469000000)
>>>>>>>>>         libz.so.1 => /lib64/libz.so.1 (0x000000346a800000)
>>>>>>>>>
>>>>>>>>> I think affinity is somehow involved.
>>>>>>>>> It succeeds with this setting: MV2_ENABLE_AFFINITY=0 ibrun
>>>>>>>>> -np 12 ../bin/dt.S.x SH
>>>>>>>>>
>>>>>>>>> but not with the default settings.
>>>>>>>>>
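>>>>>>>>> To check what binding is actually being applied in the failing runs,
>>>>>>>>> the CPU mapping can be printed with MV2_SHOW_CPU_BINDING (documented
>>>>>>>>> in the MVAPICH2 user guide); a sketch of the invocation:
>>>>>>>>>
>>>>>>>>> MV2_SHOW_CPU_BINDING=1 ibrun -np 12 ../bin/dt.S.x SH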
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Apr 29, 2016 at 3:31 PM, Sourav Chakraborty <
>>>>>>>>> chakraborty.52 at buckeyemail.osu.edu> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Hoang-Vu,
>>>>>>>>>>
>>>>>>>>>> We were unable to reproduce the issue you mentioned. Can you
>>>>>>>>>> please give some more details about the configuration/build parameters used
>>>>>>>>>> to build MVAPICH2 and NPB? You can obtain this information by running
>>>>>>>>>> mpiname -a.
>>>>>>>>>>
>>>>>>>>>> Also, does the error occur only with class A and SH? How
>>>>>>>>>> frequently have you noticed the issue?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Sourav
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Fri, Apr 29, 2016 at 11:11 AM, Hoang-Vu Dang <
>>>>>>>>>> dang.hvu at gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> The benchmark is the MPI version of DT inside this tarball:
>>>>>>>>>>>
>>>>>>>>>>> http://www.nas.nasa.gov/assets/npb/NPB3.3.1.tar.gz
>>>>>>>>>>>
>>>>>>>>>>> It's built with MVAPICH2 2.1 (gcc/4.9.1) on the Stampede cluster with:
>>>>>>>>>>>
>>>>>>>>>>> cd ~/NPB3.3.1/NPB3.3-MPI/DT
>>>>>>>>>>> make CLASS=A
>>>>>>>>>>>
>>>>>>>>>>> Run with problem SH, for example:
>>>>>>>>>>>
>>>>>>>>>>> MV2_USE_SHARED_MEM=0 ibrun -np 80 ./dt SH
>>>>>>>>>>>
>>>>>>>>>>> Sometimes it gives correct results:
>>>>>>>>>>>
>>>>>>>>>>>  DT_SH.A L2 Norm = 610856482.000000
>>>>>>>>>>>  Deviation = 0.000000
>>>>>>>>>>>
>>>>>>>>>>> Sometimes it gives wrong results:
>>>>>>>>>>>
>>>>>>>>>>>  DT_SH.A L2 Norm = 571204151.000000
>>>>>>>>>>>  The correct verification value = 610856482.000000
>>>>>>>>>>>  Got value = 571204151.000000
>>>>>>>>>>>
>>>>>>>>>>> Is there anything I can do to debug this? Is it reproducible?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>
>