[mvapich-discuss] mvapich2 jobs stall after successful completion

Vlad Cojocaru vlad.cojocaru at mpi-muenster.mpg.de
Mon Jan 10 15:49:42 EST 2011


Dear MVAPICH2 users,

I am running molecular dynamics programs (AMBER and NAMD) using
MVAPICH2, version 1.6rc1.
My jobs stall after successful completion. Basically, everything looks
fine: the job finishes with all the output complete, but then the job does
not exit; it hangs (appears as a ghost job). If I kill the leftover
"mpispawn" process, everything is fine, but of course if I have the
parallel run as a step in a workflow, I always need to manually kill
this leftover process so that the subsequent jobs can run.
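
As a stopgap, the manual kill could be scripted as a cleanup step in the
workflow. The following is only a minimal sketch, not a fix for the
underlying hang. It assumes the leftover process is named "mpispawn"
(change LEFTOVER_NAME if it differs) and that a procps-style "ps" is
available; the function names parse_ps and kill_leftovers are made up
for illustration.

```python
import os
import signal
import subprocess

# Assumption: name of the leftover launcher process; adjust if it differs.
LEFTOVER_NAME = "mpispawn"


def parse_ps(ps_output, name=LEFTOVER_NAME):
    """Return the PIDs in 'PID COMMAND' lines whose command equals name."""
    pids = []
    for line in ps_output.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[0].isdigit() and parts[1].strip() == name:
            pids.append(int(parts[0]))
    return pids


def kill_leftovers(name=LEFTOVER_NAME):
    """Send SIGTERM to the current user's processes called name."""
    # Empty column headers ("pid=", "comm=") suppress the ps header line.
    out = subprocess.run(
        ["ps", "-u", str(os.getuid()), "-o", "pid=", "-o", "comm="],
        capture_output=True, text=True, check=True,
    ).stdout
    pids = parse_ps(out, name)
    for pid in pids:
        try:
            os.kill(pid, signal.SIGTERM)
        except ProcessLookupError:
            pass  # the process exited on its own in the meantime
    return pids
```

Calling kill_leftovers() only after the output of the parallel step has
been checked for completeness would mimic the manual procedure. Note that
it matches by name only, so on a shared node it would also terminate the
launcher processes of any other MVAPICH2 job the same user is still
running there.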

Has anybody noticed such behavior? I should also add that this is not
reproducible: it happens at random times, and submitting the same job over
and over again does not produce the same outcome.
Also, it happens even with the simple test provided with MVAPICH2. On
the same cluster, Open MPI 1.4.3 runs correctly. MVAPICH2 appears to
scale better, which is why I would like to use it.

Here are details on my architecture:
cpu: AMD Opteron Istanbul
arch: Linux x86_64, CentOS 5.5
mpi: MVAPICH2 1.6rc1 (the problem also appeared with version 1.5)
compiler: Intel 11.0.073 or GCC 4.5.1 (the problem is seen with both
compilations)
interconnect: Mellanox InfiniBand
Oracle Grid Engine used for controlling the jobs (however the problem
appears also when jobs are run without the grid engine)

If anybody has seen such behavior before and knows an elegant fix, I
would appreciate any advice.

Thank you

Best wishes
Vlad


-- 
Dr. Vlad Cojocaru
Max Planck Institute for Molecular Biomedicine
Department of Cellular and Developmental Biology
Roentgenstrasse 20
48149 Muenster, Germany
tel: +49-251-70365-324
fax: +49-251-70365-399
email: vlad.cojocaru[at]mpi-muenster.mpg.de



