[mvapich-discuss] Jobs run slowly with >1 job on the same nodes

Nick Holway nick.holway at gmail.com
Thu Apr 30 10:08:27 EDT 2009


Dear all,

I'm running a 64-bit Rocks 5.1 cluster (i.e. CentOS 5.2) with Voltaire
OFED 1.4 and SGE 6.1u5. I compiled MVAPICH 1.2 with ifort 10 and
configured it with F77 and F90 bindings. Each node has two quad-core
Xeon CPUs (8 cores per node).

We've compiled PMEMD and sander.MPI and see the same problem with
both. When one job is run at a time (32 CPUs on 8 nodes), it runs
well with good performance. If two jobs (e.g. 32 CPUs each, on the
same 8 nodes) are launched at the same time, both run an order of
magnitude slower. A single 64-CPU run on the same nodes runs normally.
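
As a sanity check on the numbers above: 32 processes over 8 nodes is 4
per node, so two such jobs together fill the 8 cores of a node without
oversubscribing it. One possible explanation, which is only a guess at
this point, is that both jobs pin their processes to the same cores. A
minimal Python sketch for testing that guess follows. The function
names are made up for illustration, and os.sched_getaffinity needs
Linux and Python 3.3 or later.

```python
import os


def ranks_per_node(total_ranks, nodes):
    """Processes per node when ranks are spread evenly over the nodes."""
    return total_ranks // nodes


def read_affinities(pids):
    """Return {pid: set of cores the process is allowed to run on}.

    Linux only; the PIDs must belong to running processes.
    """
    return {pid: set(os.sched_getaffinity(pid)) for pid in pids}


def doubly_pinned(affinities):
    """Find cores that more than one process is pinned to exclusively.

    Only processes restricted to a single core are counted; processes
    that may run on several cores are ignored, since the scheduler can
    still move them elsewhere.
    """
    by_core = {}
    for pid, cores in affinities.items():
        if len(cores) == 1:
            core = next(iter(cores))
            by_core.setdefault(core, []).append(pid)
    return {core: sorted(pids)
            for core, pids in by_core.items() if len(pids) > 1}


if __name__ == "__main__":
    # 32 ranks on 8 nodes: 4 per node, so two jobs need 8 cores per node.
    per_job = ranks_per_node(32, 8)
    print("ranks per node, one job:", per_job)
    print("ranks per node, two jobs:", 2 * per_job)

    # Hypothetical affinities: two jobs whose ranks both sit on cores 0-3.
    example = {101: {0}, 102: {1}, 103: {2}, 104: {3},
               201: {0}, 202: {1}, 203: {2}, 204: {3}}
    print("cores with more than one pinned process:", doubly_pinned(example))
```

On a compute node, passing the PIDs of the MPI processes of both jobs
(for example as listed by ps) to read_affinities and then to
doubly_pinned would show whether several processes are tied to the same
core. An empty result would argue against this explanation.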

We're also seeing problems with jobs disappearing from SGE and qdel not
deleting the jobs properly.

Does anyone know what might be causing the above issues? FWIW, I've run
the OSU benchmarks and subounce on the cluster without issue.

I originally raised this on the Amber mailing list, where it was
suggested that this is more likely a system problem than a problem
with their software
(http://structbio.vanderbilt.edu/archives/amber-archive/2009/1410.php).

Regards

Nick

