[mvapich-discuss] mvapich2 jobs stall after successful completion

Vlad Cojocaru vlad.cojocaru at mpi-muenster.mpg.de
Mon Jan 10 18:02:10 EST 2011


Dear Jonathan,

The build script I use is below.
I will test the rc2 version as soon as I have time and let you know.

Best wishes
Vlad

----------build_intel11.sh -------------------
#!/bin/bash

export COMP="intel-11.1.073"

export INTEL_HOME=/usr/global/compiler/intel/11.1.073
export MKL_HOME=/usr/global/compiler/intel/11.1.073/mkl

source $INTEL_HOME/bin/iccvars.sh intel64
source $INTEL_HOME/bin/ifortvars.sh intel64

export INCLUDE_PATH=$INTEL_HOME/include:\
$INTEL_HOME/include/intel64:\
$MKL_HOME/include:\
$MKL_HOME/include/em64t

export CC="icc"
export CXX="icpc"
export F77="ifort"
export F90="ifort"
export FC="ifort"
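# LD_LIBRARY_PATH is a colon-separated list, so give each entry its own -L flag
export LDFLAGS="-L${LD_LIBRARY_PATH//:/ -L}"
export CPP="icc -E"
export CXXCPP="icpc -E"
# Same for the colon-separated INCLUDE_PATH and -I
export CPPFLAGS="-I${INCLUDE_PATH//:/ -I}"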
export CFLAGS="-O3 -msse3"
export CXXFLAGS="-O3 -msse3"
export FPP="ifort -E"
export FFLAGS="-O3 -msse3"
export FCFLAGS="-O3 -msse3"
export F90FLAGS="-O3 -msse3"
export F77FLAGS="-O3 -msse3"

echo "LD_LIBRARY_PATH: $LD_LIBRARY_PATH"
echo "PATH: $PATH"
ICC=`which icc`
echo "ICC=$ICC"
ICPC=`which icpc`
echo "ICPC=$ICPC"
IFORT=`which ifort`
echo "IFORT=$IFORT"

make clean
make distclean

./configure --prefix=/usr/global/mpi/mvapich2/1.6rc1-$COMP \
            --enable-f77 \
            --enable-f90 \
            --enable-cxx \
            --enable-romio \
            --enable-threads=default \
            --enable-sharedlibs=gcc \
            --with-thread-package=pthreads

make 2>&1 | tee make_$COMP.log
make install 2>&1 | tee make_install_$COMP.log

exit
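
For the back trace of the mpispawn process requested in the quoted message below, a minimal sketch of how it could be captured the next time a job hangs. It assumes gdb is installed on the compute nodes and that the hung process belongs to the user running the script. The function name dump_backtraces, the GDB variable and the log file naming are illustrative only; none of them are part of MVAPICH2.

```shell
#!/bin/bash
# Sketch only: dump_backtraces and the GDB variable are illustrative names.
GDB=${GDB:-gdb}   # debugger binary; overridable, e.g. for a dry run

# Usage: dump_backtraces OUTDIR PID...
# Attaches to each PID in batch mode and writes the stacks of all
# threads to OUTDIR/bt_<nodename>_<pid>.log.
dump_backtraces() {
    local outdir=$1 pid
    shift
    mkdir -p "$outdir" || return 1
    for pid in "$@"; do
        "$GDB" -p "$pid" -batch -ex "thread apply all bt" \
            > "$outdir/bt_$(uname -n)_${pid}.log" 2>&1
    done
}
```

On a node with a leftover process it could be called as dump_backtraces "$HOME/mpispawn_bt" $(pgrep -u "$USER" -x mpispawn), assuming pgrep is available there.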


On 01/10/2011 10:30 PM, Jonathan Perkins wrote:
> Hi Vlad,
> Can you provide us the configuration options that you've used to build
> mvapich2?  Also, does this happen for mvapich2-1.6rc2?  If this is
> reproducible with mvapich2-1.6rc2 will it be possible for you to post
> a back trace of the mpispawn process?  Thanks in advance.
>
> On Mon, Jan 10, 2011 at 3:49 PM, Vlad Cojocaru
> <vlad.cojocaru at mpi-muenster.mpg.de> wrote:
>   
>> Dear MVAPICH2 users,
>>
>> I am running molecular dynamics programs (AMBER and NAMD) using
>> MVAPICH2, version 1.6rc1.
>> My jobs stall after successful completion. Basically, everything looks
>> fine: the job finishes with all the output complete, but then it does
>> not exit; it hangs (appears as a ghost job). If I kill the leftover
>> "mpispawn" process, everything is fine, but of course if I have the
>> parallel run as a step in a workflow, I always need to manually kill
>> this leftover process so that the subsequent jobs can run.
>>
>> Has anybody noticed such behavior? I also have to add that this is not
>> reproducible: it happens at random times, and submitting the same job
>> over and over again does not produce the same outcome.
>> Also, it happens even with the simple test provided with MVAPICH2. On
>> the same cluster, Open MPI 1.4.3 runs correctly. MVAPICH2 appears to
>> scale better, which is why I would like to use it.
>>
>> Here are details on my architecture:
>> cpu: AMD Opteron Istanbul
>> arch: Linux x86_64, CentOS 5.5
>> mpi: MVAPICH2 1.6 rc1 (the problem appeared also with version 1.5)
>> compiler: Intel 11.1.073 or GCC 4.5.1 (problem is seen with both
>> compilations)
>> interconnect: Mellanox InfiniBand
>> Oracle Grid Engine used for controlling the jobs (however the problem
>> appears also when jobs are run without the grid engine)
>>
>> If anybody has seen such behavior before and knows an elegant fix, I
>> would appreciate any advice.
>>
>> Thank you
>>
>> Best wishes
>> Vlad
>>
>>
>> --
>> Dr. Vlad Cojocaru
>> Max Planck Institute for Molecular Biomedicine
>> Department of Cellular and Developmental Biology
>> Roentgenstrasse 20
>> 48149 Muenster, Germany
>> tel: +49-251-70365-324
>> fax: +49-251-70365-399
>> email: vlad.cojocaru[at]mpi-muenster.mpg.de
>>
>
>
>   
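
The manual kill described in the original message above could be scripted as a workflow step until the cause of the hang is found. This is a minimal sketch of a workaround, not a fix: it gives a leftover process a grace period to exit by itself and only then terminates it. The helper name wait_or_kill and the grace period are assumptions for illustration; nothing here is an MVAPICH2 tool.

```shell
#!/bin/bash
# Sketch only: wait_or_kill is an illustrative helper, not an MVAPICH2 tool.

# Usage: wait_or_kill PID GRACE_SECONDS
# Returns 0 if PID exited on its own within the grace period,
# 1 if it was still alive afterwards and had to be killed.
wait_or_kill() {
    local pid=$1 grace=$2 i=0
    while [ "$i" -lt "$grace" ]; do
        kill -0 "$pid" 2>/dev/null || return 0   # already gone
        sleep 1
        i=$((i + 1))
    done
    kill -0 "$pid" 2>/dev/null || return 0
    kill -TERM "$pid" 2>/dev/null                # ask politely first
    sleep 1
    kill -0 "$pid" 2>/dev/null && kill -KILL "$pid" 2>/dev/null
    wait "$pid" 2>/dev/null                      # reap it if it is our child
    return 1
}
```

After the parallel step, the workflow could run something like: for pid in $(pgrep -u "$USER" -x mpispawn); do wait_or_kill "$pid" 30; done. A return value of 1 marks the runs in which the hang occurred, which may help in counting how often it happens.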

-- 
Dr. Vlad Cojocaru
Max Planck Institute for Molecular Biomedicine
Department of Cellular and Developmental Biology
Roentgenstrasse 20
48149 Muenster, Germany
tel: +49-251-70365-324
fax: +49-251-70365-399
email: vlad.cojocaru[at]mpi-muenster.mpg.de



