[mvapich-discuss] Errors in BSBR when running xCbtest and xFbtest
of the BLACS
Jeff Hammond
jhammond at alcf.anl.gov
Mon May 20 21:53:03 EDT 2013
BLACS does all sorts of terrible things with MPI, including
non-deterministic reductions that break the eigensolver at scale.
Instead of complaining about an MPI implementation, you should just
use a dense linear algebra library that uses MPI properly. Elemental
(code.google.com/p/elemental/) is such a library. Elemental is used
with great success in a number of quantum mechanics codes, including
QBox, which you appear to be using.
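[Editor's note, not part of the original message: the failure mode Jeff describes — a reduction whose result depends on the order in which partial sums are combined — follows from floating-point addition not being associative. A minimal sketch, using awk only as a convenient double-precision calculator (no MPI required):]

```shell
# Floating-point addition is not associative, so an MPI reduction that
# combines partial results in a run-dependent order can change the answer.
awk 'BEGIN {
    a = 1.0e16; b = -1.0e16; c = 1.0
    printf "%.1f\n", (a + b) + c   # prints 1.0
    printf "%.1f\n", a + (b + c)   # prints 0.0 (b + c rounds back to -1.0e16)
}'
```

At scale, an eigensolver that assumes bitwise-reproducible reductions can therefore diverge or fail even though every individual result is a valid rounding.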
Jeff
On Mon, May 20, 2013 at 3:07 PM, Margulis, Claudio J
<claudio-margulis at uiowa.edu> wrote:
> Dear Krishna, I very much appreciate you looking into this and passing it to
> developers. I am replying to the mailing list so that the thread does not
> remain without conclusion. Hopefully MVAPICH2 will be able to properly deal
> with ScaLAPACK and the BLACS in future releases. I think it is important for
> people compiling their codes with MVAPICH2 to know that this release version
> does not pass the BLACS tests.
>
> Is there a way we can get informed when a patch is released that will
> resolve this issue? Many quantum mechanics codes use these linear algebra
> routines.
>
> Thanks,
> cheers
> Claudio
>
> ________________________________
> From: krishna.kandalla at gmail.com [krishna.kandalla at gmail.com] on behalf of
> Krishna Kandalla [kandalla at cse.ohio-state.edu]
> Sent: Thursday, May 16, 2013 10:19 AM
> To: Margulis, Claudio J
> Cc: MVAPICH-Core
> Subject: Re: [mvapich-discuss] Errors in BSBR when running xCbtest and
> xFbtest of the BLACS
>
> Hi Claudio,
> Thanks for sharing the details. We see the same error message with
> the xCbtest. We will continue working on this issue.
> (I am CC'ing our internal developer list)
>
> Thanks,
> Krishna
>
> On Thu, May 16, 2013 at 10:28 AM, Claudio J. Margulis
> <claudio-margulis at uiowa.edu> wrote:
>>
>> I guess it would be useful if I also paste my SLmake.inc for ScaLAPACK:
>>
>>
>> ############################################################################
>> #
>> # Program: ScaLAPACK
>> #
>> # Module: SLmake.inc
>> #
>> # Purpose: Top-level Definitions
>> #
>> # Creation date: February 15, 2000
>> #
>> # Modified: October 13, 2011
>> #
>> # Send bug reports, comments or suggestions to scalapack at cs.utk.edu
>> #
>>
>> ############################################################################
>> #
>> # C preprocessor definitions: set CDEFS to one of the following:
>> #
>> # -DNoChange (fortran subprogram names are lower case without any suffix)
>> # -DUpCase (fortran subprogram names are upper case without any suffix)
>> # -DAdd_ (fortran subprogram names are lower case with "_" appended)
>>
>> CDEFS = -DAdd_
>>
>> #
>> # The fortran and C compilers, loaders, and their flags
>> #
>>
>> FC = /usr/local/chemistry_software/mvapich2-1.9/gcc-4.5.1/bin/mpif90
>> CC = /usr/local/chemistry_software/mvapich2-1.9/gcc-4.5.1/bin/mpicc
>> NOOPT = -O0
>> FCFLAGS = -O3
>> CCFLAGS = -O3
>> FCLOADER = $(FC)
>> CCLOADER = $(CC)
>> FCLOADFLAGS = $(FCFLAGS)
>> CCLOADFLAGS = $(CCFLAGS)
>>
>> #
>> # The archiver and the flag(s) to use when building archive (library)
>> # Also the ranlib routine. If your system has no ranlib, set RANLIB = echo
>> #
>>
>> ARCH = ar
>> ARCHFLAGS = cr
>> RANLIB = ranlib
>>
>> #
>> # The name of the ScaLAPACK library to be created
>> #
>>
>> SCALAPACKLIB = libscalapack.a
>>
>> #
>> # BLAS, LAPACK (and possibly other) libraries needed for linking test programs
>> #
>>
>> #BLASLIB =
>> LAPACKLIB =
>> LIBS = /shared/acml-4.4.0/gfortran64/lib/libacml.a
>>
>>
>>
>>
>> Claudio J. Margulis wrote:
>>>
>>> Dear Krishna, I don't think there are any special options. These were
>>> the commands:
>>>
>>>
>>> gunzip mvapich2-1.9.tgz
>>> tar -xvf mvapich2-1.9.tar
>>> cd mvapich2-1.9
>>> export LD_LIBRARY_PATH=/shared/gcc-4.5.1/lib64:/shared/gcc-4.5.1/lib:/shared/mpc-0.8.2/lib:/shared/mpfr-3.0.0/lib:/shared/gmp-4.3.2/lib
>>> ./configure --prefix=/usr/local/chemistry_software/mvapich2-1.9/gcc-4.5.1
>>> CC=/shared/gcc-4.5.1/bin/gcc CXX=/shared/gcc-4.5.1/bin/g++
>>> F77=/shared/gcc-4.5.1/bin/gfortran FC=/shared/gcc-4.5.1/bin/gfortran
>>> make -j 16 >&make.log &
>>> make install
>>>
>>>
>>> cd scalapack-mvapich2-1.9/
>>> tar -xvf scalapack-2.0.2.tar
>>> cd scalapack-2.0.2
>>> export LD_LIBRARY_PATH=/usr/local/chemistry_software/mvapich2-1.9/gcc-4.5.1/lib:$LD_LIBRARY_PATH
>>> make all
>>> cd BLACS/TESTING/
>>> /usr/local/chemistry_software/mvapich2-1.9/gcc-4.5.1/bin/mpirun -np 16
>>> ./xCbtest
>>>
>>> I don't want to paste all the errors I get but a sample follows for the
>>> BSBR section:
>>>
>>> INTEGER BSBR TESTS: BEGIN.
>>>
>>> PROCESS { 0, 1} REPORTS ERRORS IN TEST# 2161:
>>> Invalid element at A( 2, 1):
>>> Expected= -995413; Received= -2
>>> Complementory triangle overwrite at A( 1, 1):
>>> Expected= -2; Received= -1
>>> PROCESS { 0, 1} DONE ERROR REPORT FOR TEST# 2161.
>>>
>>> PROCESS { 0, 1} REPORTS ERRORS IN TEST# 2162:
>>> Invalid element at A( 2, 1):
>>> Expected= -219319; Received= -2
>>> PROCESS { 0, 1} DONE ERROR REPORT FOR TEST# 2162.
>>>
>>> PROCESS { 0, 1} REPORTS ERRORS IN TEST# 3761:
>>> Invalid element at A( 2, 1):
>>> Expected= 574430; Received= -2
>>> Complementory triangle overwrite at A( 1, 1):
>>> Expected= -2; Received= -1
>>> PROCESS { 0, 1} DONE ERROR REPORT FOR TEST# 3761.
>>>
>>> PROCESS { 0, 1} REPORTS ERRORS IN TEST# 4561:
>>> Invalid element at A( 2, 1):
>>> Expected= 716842; Received= -2
>>> Complementory triangle overwrite at A( 1, 1):
>>> Expected= -2; Received= -1
>>> PROCESS { 0, 1} DONE ERROR REPORT FOR TEST# 4561.
>>>
>>> PROCESS { 1, 0} REPORTS ERRORS IN TEST# 1361:
>>> Invalid element at A( 2, 1):
>>> Expected= 862174; Received= -2
>>> Complementory triangle overwrite at A( 1, 1):
>>> Expected= -2; Received= -1
>>> PROCESS { 1, 0} DONE ERROR REPORT FOR TEST# 1361.
>>>
>>> PROCESS { 1, 0} REPORTS ERRORS IN TEST# 2161:
>>> Invalid element at A( 2, 1):
>>> Expected= -995413; Received= -2
>>> Complementory triangle overwrite at A( 1, 1):
>>> Expected= -2; Received= -1
>>> PROCESS { 1, 0} DONE ERROR REPORT FOR TEST# 2161.
>>>
>>> PROCESS { 1, 0} REPORTS ERRORS IN TEST# 3761:
>>> Invalid element at A( 2, 1):
>>> Expected= 574430; Received= -2
>>>
>>>
>>> These errors do not occur when using the old broadcast method (i.e., with
>>> the environment variable MV2_USE_OLD_BCAST set to 1). There is also the
>>> issue of timing, but let's deal with one thing at a time.
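[Editor's note: for a reader reproducing the workaround described above, the invocation would look like the following. The mpirun path is the one used earlier in the thread; MV2_USE_OLD_BCAST is the MVAPICH2 run-time parameter Claudio names.]

```shell
# Workaround from the thread: fall back to MVAPICH2's old broadcast
# algorithm before running the BLACS C-interface tester.
export MV2_USE_OLD_BCAST=1
/usr/local/chemistry_software/mvapich2-1.9/gcc-4.5.1/bin/mpirun -np 16 ./xCbtest
```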
>>>
>>> Furthermore, it seems like I am not the only one getting these errors. If
>>> you look at my original posting there is a link to:
>>> http://fpmd.ucdavis.edu/qbox-list/viewtopic.php?p=290
>>> which reports exactly the same issues.
>>>
>>> Do you have any special setting that I may not be aware of that might
>>> result in successful output in your case?
>>>
>>> Thanks for your help.
>>> Cheers,
>>> Claudio
>>>
>>> Krishna Kandalla wrote:
>>>>
>>>> Hello Claudio,
>>>>
>>>> I just tried running the xdqr test with 16 processes (one node) on
>>>> the TACC Stampede cluster. The overall execution time for this test,
>>>> with or without this flag, does not seem to vary much. I am seeing
>>>> about 1.02 - 1.06s as the total time. This test also completes
>>>> correctly without the env variable that we had discussed. And, if it
>>>> helps, I am also seeing that this test takes about 1.7s with
>>>> Open-MPI-1.6.4.
>>>> If you are using any specific configure/run-time options for the
>>>> MVAPICH2-1.9 library, could you please share the details?
>>>>
>>>> Thanks,
>>>> Krishna
>>>>
>>>> On Wed, May 15, 2013 at 10:31 AM, Claudio J. Margulis
>>>> <claudio-margulis at uiowa.edu <mailto:claudio-margulis at uiowa.edu>> wrote:
>>>>
>>>> It seems that my mail didn't go through, so I am resending it.
>>>> Please read below.
>>>> Claudio
>>>>
>>>>
>>>> Claudio J. Margulis wrote:
>>>>
>>>> Dear Krishna, thanks for responding.
>>>> Yes, with that environment variable the errors are gone.
>>>> However, the run time for the tests becomes extremely long.
>>>> As an example, a typical ScaLAPACK test,
>>>> mpirun -np 16 ./xdqr <QR.dat, that takes a second to run with
>>>> OpenMPI takes on the order of minutes with MVAPICH2.
>>>>
>>>> Claudio
>>>>
>>>>
>>>> --
>>>> Claudio J. Margulis
>>>>
>>>> Associate Professor of Chemistry
>>>> The University of Iowa
>>>> Margulis Group Page
>>>> <http://www.chem.uiowa.edu/faculty/margulis/group/first.html>
>>>>
>>>>
>>>
>>
>> --
>> Claudio J. Margulis
>> Associate Professor of Chemistry
>> The University of Iowa
>> Margulis Group Page
>> <http://www.chem.uiowa.edu/faculty/margulis/group/first.html>
>>
>
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>
--
Jeff Hammond
Argonne Leadership Computing Facility
University of Chicago Computation Institute
jhammond at alcf.anl.gov / (630) 252-5381
http://www.linkedin.com/in/jeffhammond
https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond
ALCF docs: http://www.alcf.anl.gov/user-guides