[mvapich-discuss] Errors in BSBR when running xCbtest and xFbtest of the BLACS

Margulis, Claudio J claudio-margulis at uiowa.edu
Tue May 21 09:26:50 EDT 2013


Hi Jeff, thanks for your post. I am not using QBox; I just found a thread on their mailing list pointing to the same errors I was seeing, which is why I forwarded it. I am instead using SIESTA and CP2K, both of which use the BLACS and are major chemistry software packages.

At this point, switching applications is not a viable option, particularly since these errors do not occur when using openmpi. I have other reasons for wanting to try MVAPICH2, but until it can handle the BLACS correctly I can't.

On a side note, I am also concerned that users who obtain versions of ScaLAPACK precompiled against MVAPICH2 may not realize that they can get incorrect results. The problem is that codes compile fine and run without giving you any clue that the results are wrong; it is only when you build everything yourself, and take care to compile the BLACS test utilities, that you see the erroneous results. One might argue that everyone should run these checks before using an executable, but realistically many users don't know how to do so, or simply assume that a library obtained from a reputable source is safe to use.
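
For anyone who has one of these prebuilt libraries, the check I am referring to is roughly the following, taken from the steps further down in this thread (adjust the paths, the compilers in SLmake.inc, and the process count for your own setup):

cd scalapack-2.0.2
make all                  # builds libscalapack.a together with the BLACS/ScaLAPACK test drivers
cd BLACS/TESTING
mpirun -np 16 ./xCbtest   # C-interface BLACS tester
mpirun -np 16 ./xFbtest   # Fortran-interface BLACS tester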

Cheers,
Claudio


________________________________________
From: jeff.science at gmail.com [jeff.science at gmail.com] on behalf of Jeff Hammond [jhammond at alcf.anl.gov]
Sent: Monday, May 20, 2013 8:53 PM
To: Margulis, Claudio J
Cc: Krishna Kandalla; mvapich-discuss at cse.ohio-state.edu
Subject: Re: [mvapich-discuss] Errors in BSBR when running xCbtest and xFbtest of the BLACS

BLACS does all sorts of terrible things with MPI, including
non-deterministic reductions that break the eigensolver at scale.
Instead of complaining about an MPI implementation, you should just
use a dense linear algebra library that uses MPI properly.  Elemental
(code.google.com/p/elemental/) is such a library.  Elemental is used
with great success in a number of quantum mechanics codes, including
QBox, which you appear to be using.

Jeff

On Mon, May 20, 2013 at 3:07 PM, Margulis, Claudio J
<claudio-margulis at uiowa.edu> wrote:
> Dear Krishna, I very much appreciate you looking into this and passing it on to
> the developers. I am replying to the mailing list so that the thread does not
> remain without a conclusion. Hopefully MVAPICH2 will be able to handle
> ScaLAPACK and the BLACS properly in future releases. I think it is important for
> people compiling their codes with MVAPICH2 to know that the current release
> does not pass the BLACS tests.
>
> Is there a way we can be notified when a patch that resolves this issue is
> released? Many quantum mechanics codes use these linear algebra
> routines.
>
> Thanks,
> cheers
> Claudio
>
> ________________________________
> From: krishna.kandalla at gmail.com [krishna.kandalla at gmail.com] on behalf of
> Krishna Kandalla [kandalla at cse.ohio-state.edu]
> Sent: Thursday, May 16, 2013 10:19 AM
> To: Margulis, Claudio J
> Cc: MVAPICH-Core
> Subject: Re: [mvapich-discuss] Errors in BSBR when running xCbtest and
> xFbtest of the BLACS
>
> Hi Claudio,
>           Thanks for sharing the details. We see the same error message with
> the xCbtest. We will continue working on this issue.
> (I am CC'ing our internal developer list)
>
> Thanks,
> Krishna
>
> On Thu, May 16, 2013 at 10:28 AM, Claudio J. Margulis
> <claudio-margulis at uiowa.edu> wrote:
>>
>> I guess it would be useful if I also paste my SLmake.inc for ScaLAPACK:
>>
>>
>> ############################################################################
>> #
>> #  Program:         ScaLAPACK
>> #
>> #  Module:          SLmake.inc
>> #
>> #  Purpose:         Top-level Definitions
>> #
>> #  Creation date:   February 15, 2000
>> #
>> #  Modified:        October 13, 2011
>> #
>> #  Send bug reports, comments or suggestions to scalapack at cs.utk.edu
>> #
>>
>> ############################################################################
>> #
>> #  C preprocessor definitions:  set CDEFS to one of the following:
>> #
>> #     -DNoChange (fortran subprogram names are lower case without any suffix)
>> #     -DUpCase   (fortran subprogram names are upper case without any suffix)
>> #     -DAdd_     (fortran subprogram names are lower case with "_" appended)
>>
>> CDEFS         = -DAdd_
>>
>> #
>> #  The fortran and C compilers, loaders, and their flags
>> #
>>
>> FC            = /usr/local/chemistry_software/mvapich2-1.9/gcc-4.5.1/bin/mpif90
>> CC            = /usr/local/chemistry_software/mvapich2-1.9/gcc-4.5.1/bin/mpicc
>> NOOPT         = -O0
>> FCFLAGS       = -O3
>> CCFLAGS       = -O3
>> FCLOADER      = $(FC)
>> CCLOADER      = $(CC)
>> FCLOADFLAGS   = $(FCFLAGS)
>> CCLOADFLAGS   = $(CCFLAGS)
>>
>> #
>> #  The archiver and the flag(s) to use when building archive (library)
>> #  Also the ranlib routine.  If your system has no ranlib, set RANLIB = echo
>> #
>>
>> ARCH          = ar
>> ARCHFLAGS     = cr
>> RANLIB        = ranlib
>>
>> #
>> #  The name of the ScaLAPACK library to be created
>> #
>>
>> SCALAPACKLIB  = libscalapack.a
>>
>> #
>> #  BLAS, LAPACK (and possibly other) libraries needed for linking test programs
>> #
>>
>> #BLASLIB       =
>> LAPACKLIB     =
>> LIBS          = /shared/acml-4.4.0/gfortran64/lib/libacml.a
>>
>>
>>
>>
>> Claudio J. Margulis wrote:
>>>
>>> Dear Krishna, I don't think there are any special options. These were the
>>> commands:
>>>
>>>
>>> gunzip mvapich2-1.9.tgz
>>> tar -xvf mvapich2-1.9.tar
>>> cd mvapich2-1.9
>>> export LD_LIBRARY_PATH=/shared/gcc-4.5.1/lib64:/shared/gcc-4.5.1/lib:/shared/mpc-0.8.2/lib:/shared/mpfr-3.0.0/lib:/shared/gmp-4.3.2/lib
>>> ./configure --prefix=/usr/local/chemistry_software/mvapich2-1.9/gcc-4.5.1 CC=/shared/gcc-4.5.1/bin/gcc CXX=/shared/gcc-4.5.1/bin/g++ F77=/shared/gcc-4.5.1/bin/gfortran FC=/shared/gcc-4.5.1/bin/gfortran
>>>  make -j 16 >&make.log &
>>> make install
>>>
>>>
>>> cd scalapack-mvapich2-1.9/
>>> tar -xvf scalapack-2.0.2.tar
>>> cd scalapack-2.0.2
>>> export LD_LIBRARY_PATH=/usr/local/chemistry_software/mvapich2-1.9/gcc-4.5.1/lib:$LD_LIBRARY_PATH
>>> make all
>>> cd BLACS/TESTING/
>>> /usr/local/chemistry_software/mvapich2-1.9/gcc-4.5.1/bin/mpirun -np 16 ./xCbtest
>>>
>>> I don't want to paste all the errors I get but a sample follows for the
>>> BSBR section:
>>>
>>> INTEGER BSBR TESTS: BEGIN.
>>>
>>> PROCESS {   0,   1} REPORTS ERRORS IN TEST#  2161:
>>>    Invalid element at A(   2,   1):
>>>    Expected=     -995413; Received=          -2
>>>    Complementory triangle overwrite at A(   1,   1):
>>>    Expected=          -2; Received=          -1
>>> PROCESS {   0,   1} DONE ERROR REPORT FOR TEST#  2161.
>>>
>>> PROCESS {   0,   1} REPORTS ERRORS IN TEST#  2162:
>>>    Invalid element at A(   2,   1):
>>>    Expected=     -219319; Received=          -2
>>> PROCESS {   0,   1} DONE ERROR REPORT FOR TEST#  2162.
>>>
>>> PROCESS {   0,   1} REPORTS ERRORS IN TEST#  3761:
>>>    Invalid element at A(   2,   1):
>>>    Expected=      574430; Received=          -2
>>>    Complementory triangle overwrite at A(   1,   1):
>>>    Expected=          -2; Received=          -1
>>> PROCESS {   0,   1} DONE ERROR REPORT FOR TEST#  3761.
>>>
>>> PROCESS {   0,   1} REPORTS ERRORS IN TEST#  4561:
>>>    Invalid element at A(   2,   1):
>>>    Expected=      716842; Received=          -2
>>>    Complementory triangle overwrite at A(   1,   1):
>>>    Expected=          -2; Received=          -1
>>> PROCESS {   0,   1} DONE ERROR REPORT FOR TEST#  4561.
>>>
>>> PROCESS {   1,   0} REPORTS ERRORS IN TEST#  1361:
>>>    Invalid element at A(   2,   1):
>>>    Expected=      862174; Received=          -2
>>>    Complementory triangle overwrite at A(   1,   1):
>>>    Expected=          -2; Received=          -1
>>> PROCESS {   1,   0} DONE ERROR REPORT FOR TEST#  1361.
>>>
>>> PROCESS {   1,   0} REPORTS ERRORS IN TEST#  2161:
>>>    Invalid element at A(   2,   1):
>>>    Expected=     -995413; Received=          -2
>>>    Complementory triangle overwrite at A(   1,   1):
>>>    Expected=          -2; Received=          -1
>>> PROCESS {   1,   0} DONE ERROR REPORT FOR TEST#  2161.
>>>
>>> PROCESS {   1,   0} REPORTS ERRORS IN TEST#  3761:
>>>    Invalid element at A(   2,   1):
>>>    Expected=      574430; Received=          -2
>>>
>>>
>>> These errors do not occur when using the old broadcast method (i.e. with the
>>> environment variable MV2_USE_OLD_BCAST set to 1). There is also the issue
>>> of timing, but let's deal with one thing at a time.
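>>>
>>> For reference, the workaround amounts to something like the following (same
>>> mpirun and test binary as above; depending on the launcher, the variable may
>>> need to be passed explicitly rather than just exported):
>>>
>>> export MV2_USE_OLD_BCAST=1
>>> /usr/local/chemistry_software/mvapich2-1.9/gcc-4.5.1/bin/mpirun -np 16 ./xCbtest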
>>>
>>> Furthermore, it seems like I am not the only one getting these errors. If
>>> you look at my original posting there is a link to:
>>> http://fpmd.ucdavis.edu/qbox-list/viewtopic.php?p=290
>>> which reports exactly the same issues.
>>>
>>> Do you have any special setting that I may not be aware of that might
>>> result in successful output in your case?
>>>
>>> Thanks for your help.
>>> Cheers,
>>> Claudio
>>>
>>> Krishna Kandalla wrote:
>>>>
>>>> Hello Claudio,
>>>>
>>>>     I just tried running the xdqr test with 16 processes (one node) on
>>>> the TACC Stampede cluster. The overall execution time for this test, with or
>>>> without this flag, does not seem to vary much; I am seeing about 1.02-1.06 s
>>>> as the total time. This test also completes correctly without the env
>>>> variable that we had discussed. And, if it helps, I am also seeing that this
>>>> test takes about 1.7 s with Open-MPI-1.6.4.
>>>>     If you are using any specific configure/run-time options for the
>>>> MVAPICH2-1.9 library, could you please share the details?
>>>>
>>>> Thanks,
>>>> Krishna
>>>>
>>>> On Wed, May 15, 2013 at 10:31 AM, Claudio J. Margulis
>>>> <claudio-margulis at uiowa.edu <mailto:claudio-margulis at uiowa.edu>> wrote:
>>>>
>>>>     It seems that my mail didn't go through, so I am resending it.
>>>>     Please read below.
>>>>     Claudio
>>>>
>>>>
>>>>     Claudio J. Margulis wrote:
>>>>
>>>>         Dear Krishna, thanks for responding.
>>>>         Yes, with that environment variable the errors are gone.
>>>>         However, run time for the tests becomes extremely long.
>>>>         As an example, a typical ScaLAPACK test such as
>>>>         mpirun -np 16 ./xdqr <QR.dat, which takes about a second to run
>>>>         with openmpi, takes on the order of minutes with mvapich2.
>>>>
>>>>         Claudio
>>>>
>>>>
>>>>     --
>>>>     Claudio J. Margulis
>>>>
>>>>     Associate Professor of Chemistry
>>>>     The University of Iowa
>>>>     Margulis Group Page
>>>> <http://www.chem.uiowa.edu/faculty/margulis/group/first.html>
>>>>
>>>>
>>>
>>
>> --
>> Claudio J. Margulis
>> Associate Professor of Chemistry
>> The University of Iowa
>> Margulis Group Page
>> <http://www.chem.uiowa.edu/faculty/margulis/group/first.html>
>>
>
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>



--
Jeff Hammond
Argonne Leadership Computing Facility
University of Chicago Computation Institute
jhammond at alcf.anl.gov / (630) 252-5381
http://www.linkedin.com/in/jeffhammond
https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond
ALCF docs: http://www.alcf.anl.gov/user-guides


