[mvapich-discuss] Errors in BSBR when running xCbtest and xFbtest of the BLACS

Jeff Hammond jhammond at alcf.anl.gov
Mon May 20 21:53:03 EDT 2013


BLACS does all sorts of terrible things with MPI, including
non-deterministic reductions that break the eigensolver at scale.
Instead of complaining about an MPI implementation, you should just
use a dense linear algebra library that uses MPI properly.  Elemental
(code.google.com/p/elemental/) is such a library.  Elemental is used
with great success in a number of quantum mechanics codes, including
QBox, which you appear to be using.
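
For what it's worth, the underlying issue is easy to demonstrate:
floating-point addition is not associative, so a reduction whose
operand order is not fixed can return different bits from run to run.
A minimal C sketch (my own illustration, not BLACS code):

    /* Floating-point addition is not associative: the same three
       values summed in two different orders give two different
       answers.  A non-deterministic MPI reduction has exactly this
       freedom in how it orders its operands. */
    #include <stdio.h>

    int main(void)
    {
        float a = 1.0e8f, b = -1.0e8f, c = 1.0f;
        float left  = (a + b) + c;  /* (0) + 1 -> 1           */
        float right = a + (b + c);  /* b + c rounds to b -> 0 */
        printf("left = %g, right = %g\n", left, right);
        return 0;
    }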

Jeff

On Mon, May 20, 2013 at 3:07 PM, Margulis, Claudio J
<claudio-margulis at uiowa.edu> wrote:
> Dear Krishna, I very much appreciate you looking into this and passing it to
> developers. I am replying to the mailing list so that the thread does not
> remain without conclusion. Hopefully MVAPICH2 will be able to properly deal
> with ScaLAPACK and the BLACS in future releases. I think it is important for
> people compiling their codes with MVAPICH2 to know that this release version
> does not pass the BLACS tests.
>
> Is there a way we can get informed when a patch is released that will
> resolve this issue? Many quantum mechanics codes use these linear algebra
> routines.
>
> Thanks,
> cheers
> Claudio
>
> ________________________________
> From: krishna.kandalla at gmail.com [krishna.kandalla at gmail.com] on behalf of
> Krishna Kandalla [kandalla at cse.ohio-state.edu]
> Sent: Thursday, May 16, 2013 10:19 AM
> To: Margulis, Claudio J
> Cc: MVAPICH-Core
> Subject: Re: [mvapich-discuss] Errors in BSBR when running xCbtest and
> xFbtest of the BLACS
>
> Hi Claudio,
>           Thanks for sharing the details. We see the same error message with
> the xCbtest. We will continue working on this issue.
> (I am CC'ing our internal developer list)
>
> Thanks,
> Krishna
>
> On Thu, May 16, 2013 at 10:28 AM, Claudio J. Margulis
> <claudio-margulis at uiowa.edu> wrote:
>>
>> I guess it would be useful if I also paste my SLmake.inc for ScaLAPACK:
>>
>>
>> ############################################################################
>> #
>> #  Program:         ScaLAPACK
>> #
>> #  Module:          SLmake.inc
>> #
>> #  Purpose:         Top-level Definitions
>> #
>> #  Creation date:   February 15, 2000
>> #
>> #  Modified:        October 13, 2011
>> #
>> #  Send bug reports, comments or suggestions to scalapack at cs.utk.edu
>> #
>>
>> ############################################################################
>> #
>> #  C preprocessor definitions:  set CDEFS to one of the following:
>> #
>> #     -DNoChange (fortran subprogram names are lower case without any suffix)
>> #     -DUpCase   (fortran subprogram names are upper case without any suffix)
>> #     -DAdd_     (fortran subprogram names are lower case with "_" appended)
>>
>> CDEFS         = -DAdd_
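>>
>> (Aside: -DAdd_ just tells ScaLAPACK's C sources which symbol name the
>> Fortran compiler emits, so that C and Fortran objects link together.
>> A minimal C sketch, with a hypothetical routine name purely for
>> illustration:
>>
>>     /* With -DAdd_, Fortran's  SUBROUTINE FOO(N)  is referenced
>>        from C as foo_: lower case, trailing underscore, and all
>>        arguments passed by reference. */
>>     extern void foo_(int *n);  /* hypothetical Fortran routine */
>>
>>     void call_foo_from_c(int n)
>>     {
>>         foo_(&n);
>>     }
>>
>> This must match what the Fortran compiler produces; gfortran emits
>> lower case plus a trailing underscore by default.)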
>>
>> #
>> #  The fortran and C compilers, loaders, and their flags
>> #
>>
>> FC            = /usr/local/chemistry_software/mvapich2-1.9/gcc-4.5.1/bin/mpif90
>> CC            = /usr/local/chemistry_software/mvapich2-1.9/gcc-4.5.1/bin/mpicc
>> NOOPT         = -O0
>> FCFLAGS       = -O3
>> CCFLAGS       = -O3
>> FCLOADER      = $(FC)
>> CCLOADER      = $(CC)
>> FCLOADFLAGS   = $(FCFLAGS)
>> CCLOADFLAGS   = $(CCFLAGS)
>>
>> #
>> #  The archiver and the flag(s) to use when building archive (library)
>> #  Also the ranlib routine.  If your system has no ranlib, set RANLIB = echo
>> #
>>
>> ARCH          = ar
>> ARCHFLAGS     = cr
>> RANLIB        = ranlib
>>
>> #
>> #  The name of the ScaLAPACK library to be created
>> #
>>
>> SCALAPACKLIB  = libscalapack.a
>>
>> #
>> #  BLAS, LAPACK (and possibly other) libraries needed for linking test programs
>> #
>>
>> #BLASLIB       =
>> LAPACKLIB     =
>> LIBS          = /shared/acml-4.4.0/gfortran64/lib/libacml.a
>>
>>
>>
>>
>> Claudio J. Margulis wrote:
>>>
>>> Dear Krishna, I don't think there are any special options. These were the
>>> commands:
>>>
>>>
>>> gunzip mvapich2-1.9.tgz
>>> tar -xvf mvapich2-1.9.tar
>>> cd mvapich2-1.9
>>> export LD_LIBRARY_PATH=/shared/gcc-4.5.1/lib64:/shared/gcc-4.5.1/lib:/shared/mpc-0.8.2/lib:/shared/mpfr-3.0.0/lib:/shared/gmp-4.3.2/lib
>>> ./configure --prefix=/usr/local/chemistry_software/mvapich2-1.9/gcc-4.5.1 \
>>>     CC=/shared/gcc-4.5.1/bin/gcc CXX=/shared/gcc-4.5.1/bin/g++ \
>>>     F77=/shared/gcc-4.5.1/bin/gfortran FC=/shared/gcc-4.5.1/bin/gfortran
>>> make -j 16 >&make.log &
>>> make install
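>>>
>>> (A minimal MPI program, my own sketch and not from any test suite,
>>> can serve as a sanity check of the freshly installed wrappers before
>>> building ScaLAPACK:
>>>
>>>     /* hello.c -- compile and run with the new wrappers:
>>>        mpicc hello.c -o hello && mpirun -np 4 ./hello */
>>>     #include <mpi.h>
>>>     #include <stdio.h>
>>>
>>>     int main(int argc, char **argv)
>>>     {
>>>         int rank, size;
>>>         MPI_Init(&argc, &argv);
>>>         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>         MPI_Comm_size(MPI_COMM_WORLD, &size);
>>>         printf("rank %d of %d\n", rank, size);
>>>         MPI_Finalize();
>>>         return 0;
>>>     }
>>>
>>> so that any failure below can be pinned on ScaLAPACK/BLACS rather
>>> than on the base install.)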
>>>
>>>
>>> cd scalapack-mvapich2-1.9/
>>> tar -xvf scalapack-2.0.2.tar
>>> cd scalapack-2.0.2
>>> export LD_LIBRARY_PATH=/usr/local/chemistry_software/mvapich2-1.9/gcc-4.5.1/lib:$LD_LIBRARY_PATH
>>> make all
>>> cd BLACS/TESTING/
>>> /usr/local/chemistry_software/mvapich2-1.9/gcc-4.5.1/bin/mpirun -np 16 ./xCbtest
>>>
>>> I don't want to paste all the errors I get but a sample follows for the
>>> BSBR section:
>>>
>>> INTEGER BSBR TESTS: BEGIN.
>>>
>>> PROCESS {   0,   1} REPORTS ERRORS IN TEST#  2161:
>>>    Invalid element at A(   2,   1):
>>>    Expected=     -995413; Received=          -2
>>>    Complementory triangle overwrite at A(   1,   1):
>>>    Expected=          -2; Received=          -1
>>> PROCESS {   0,   1} DONE ERROR REPORT FOR TEST#  2161.
>>>
>>> PROCESS {   0,   1} REPORTS ERRORS IN TEST#  2162:
>>>    Invalid element at A(   2,   1):
>>>    Expected=     -219319; Received=          -2
>>> PROCESS {   0,   1} DONE ERROR REPORT FOR TEST#  2162.
>>>
>>> PROCESS {   0,   1} REPORTS ERRORS IN TEST#  3761:
>>>    Invalid element at A(   2,   1):
>>>    Expected=      574430; Received=          -2
>>>    Complementory triangle overwrite at A(   1,   1):
>>>    Expected=          -2; Received=          -1
>>> PROCESS {   0,   1} DONE ERROR REPORT FOR TEST#  3761.
>>>
>>> PROCESS {   0,   1} REPORTS ERRORS IN TEST#  4561:
>>>    Invalid element at A(   2,   1):
>>>    Expected=      716842; Received=          -2
>>>    Complementory triangle overwrite at A(   1,   1):
>>>    Expected=          -2; Received=          -1
>>> PROCESS {   0,   1} DONE ERROR REPORT FOR TEST#  4561.
>>>
>>> PROCESS {   1,   0} REPORTS ERRORS IN TEST#  1361:
>>>    Invalid element at A(   2,   1):
>>>    Expected=      862174; Received=          -2
>>>    Complementory triangle overwrite at A(   1,   1):
>>>    Expected=          -2; Received=          -1
>>> PROCESS {   1,   0} DONE ERROR REPORT FOR TEST#  1361.
>>>
>>> PROCESS {   1,   0} REPORTS ERRORS IN TEST#  2161:
>>>    Invalid element at A(   2,   1):
>>>    Expected=     -995413; Received=          -2
>>>    Complementory triangle overwrite at A(   1,   1):
>>>    Expected=          -2; Received=          -1
>>> PROCESS {   1,   0} DONE ERROR REPORT FOR TEST#  2161.
>>>
>>> PROCESS {   1,   0} REPORTS ERRORS IN TEST#  3761:
>>>    Invalid element at A(   2,   1):
>>>    Expected=      574430; Received=          -2
>>>
>>>
>>> These errors do not occur when using the old broadcast method (i.e., with
>>> the environment variable MV2_USE_OLD_BCAST set to 1; for example,
>>> export MV2_USE_OLD_BCAST=1 before the mpirun line above). There is also
>>> the issue of timing, but let's deal with one thing at a time.
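>>>
>>> A stand-alone check along these lines (my own sketch, not taken from
>>> the BLACS tester) might help isolate whether a plain MPI_Bcast
>>> already misbehaves; it uses the same -2 pad value that shows up in
>>> the failures above:
>>>
>>>     /* bcast_check.c -- mpicc bcast_check.c -o bcast_check
>>>                         mpirun -np 16 ./bcast_check */
>>>     #include <mpi.h>
>>>     #include <stdio.h>
>>>     #include <stdlib.h>
>>>
>>>     int main(int argc, char **argv)
>>>     {
>>>         int rank, i, n = 1024, errs = 0;
>>>         int *buf;
>>>         MPI_Init(&argc, &argv);
>>>         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>         buf = malloc(n * sizeof(int));
>>>         /* Root fills the buffer; everyone else pre-fills with the
>>>            pad value, so undelivered elements are easy to spot. */
>>>         for (i = 0; i < n; i++)
>>>             buf[i] = (rank == 0) ? i : -2;
>>>         MPI_Bcast(buf, n, MPI_INT, 0, MPI_COMM_WORLD);
>>>         for (i = 0; i < n; i++)
>>>             if (buf[i] != i) errs++;
>>>         if (errs)
>>>             printf("rank %d: %d bad elements\n", rank, errs);
>>>         free(buf);
>>>         MPI_Finalize();
>>>         return errs != 0;
>>>     }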
>>>
>>> Furthermore, it seems I am not the only one getting these errors. If
>>> you look at my original posting, there is a link to:
>>> http://fpmd.ucdavis.edu/qbox-list/viewtopic.php?p=290
>>> which reports exactly the same issues.
>>>
>>> Do you have any special setting that I may not be aware of that might
>>> result in successful output in your case?
>>>
>>> Thanks for your help.
>>> Cheers,
>>> Claudio
>>>
>>> Krishna Kandalla wrote:
>>>>
>>>> Hello Claudio,
>>>>
>>>>     I just tried running the xdqr test with 16 processes (one node) on
>>>> the TACC Stampede cluster. The overall execution time for this test, with
>>>> or without this flag, does not seem to vary much: I am seeing about
>>>> 1.02 - 1.06 s total. This test also completes correctly without the env
>>>> variable that we had discussed. And, if it helps, this test takes about
>>>> 1.7 s with Open MPI 1.6.4.
>>>>     If you are using any specific configure/run-time options for the
>>>> MVAPICH2-1.9 library, could you please share the details?
>>>>
>>>> Thanks,
>>>> Krishna
>>>>
>>>> On Wed, May 15, 2013 at 10:31 AM, Claudio J. Margulis
>>>> <claudio-margulis at uiowa.edu <mailto:claudio-margulis at uiowa.edu>> wrote:
>>>>
>>>>     It seems that my mail didn't go through so I am resending it.
>>>>     Please read below.
>>>>     Claudio
>>>>
>>>>
>>>>     Claudio J. Margulis wrote:
>>>>
>>>>         Dear Krishna, thanks for responding.
>>>>         Yes, with that environment variable the errors are gone.
>>>>         However, run times for the tests become extremely long.
>>>>         As an example, a typical ScaLAPACK test,
>>>>         mpirun -np 16 ./xdqr < QR.dat, which takes a second to run
>>>>         with Open MPI, takes on the order of minutes with MVAPICH2.
>>>>
>>>>         Claudio
>>>>
>>>>
>>>>     --
>>>>     Claudio J. Margulis
>>>>
>>>>     Associate Professor of Chemistry
>>>>     The University of Iowa
>>>>     Margulis Group Page
>>>> <http://www.chem.uiowa.edu/faculty/margulis/group/first.html>
>>>>
>>>>
>>>
>>
>> --
>> Claudio J. Margulis
>> Associate Professor of Chemistry
>> The University of Iowa
>> Margulis Group Page
>> <http://www.chem.uiowa.edu/faculty/margulis/group/first.html>
>>
>
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>



-- 
Jeff Hammond
Argonne Leadership Computing Facility
University of Chicago Computation Institute
jhammond at alcf.anl.gov / (630) 252-5381
http://www.linkedin.com/in/jeffhammond
https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond
ALCF docs: http://www.alcf.anl.gov/user-guides

