[mvapich-discuss] Problems running MPI jobs with large (?) numbers of processors

Webb, Michael Michael.Webb at atk.com
Thu Jan 18 12:55:53 EST 2007


I'm new to the list, mostly because I'm having a problem running codes
with large (?) numbers of processors.
 
I'm using mvapich-0.9.7, but the problem also happens with
mvapich-0.9.5, which is also installed on our cluster.
 
Our cluster has ~440 processors. My job runs fine with anywhere from 4
to roughly 32 processors. When I run a job on 64 or more processors, it
fails. STDERR gives nothing more than "done"; the code just quits. My
tests show that the code isn't getting past MPI::Init().
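 
One way to pin that down (a minimal sketch, not my exact test code) is
to bracket MPI::Init() with writes to stderr, which is unbuffered and
so should show up even if the process dies immediately afterwards:
 
#include <cstdio>
#include "mpi.h"
 
int main(int argc, char *argv[])
{
    // stderr is unbuffered, so this should appear even if Init aborts
    std::fprintf(stderr, "before MPI::Init\n");
    MPI::Init(argc, argv);
    std::fprintf(stderr, "after MPI::Init, rank %d\n",
                 MPI::COMM_WORLD.Get_rank());
    MPI::COMM_WORLD.Barrier();
    MPI::Finalize();
    return 0;
}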
 
To illustrate the problem, I've run very similar code written in
Fortran90 and C++ concurrently (within seconds of each other, using the
same nodes on the cluster). The Fortran90 code always succeeds; the C++
code always fails.
 
Here is the Fortran90 code:
 
program fortrantest
implicit none
include 'mpif.h'
integer ierr

call mpi_init(ierr)                       ! initialize MPI
write(*,*) '.'                            ! newline-terminated, flushed output
call mpi_barrier(MPI_COMM_WORLD, ierr)    ! synchronize all ranks
call mpi_finalize(ierr)                   ! clean shutdown

end program fortrantest

 
and here is the C++ code:
 
#include <cstdlib>
#include <iostream>

#include "mpi.h"

int main(int argc, char *argv[])
{
    MPI::Init(argc, argv);            // initialize MPI (C++ bindings)
    std::cout << "." << std::endl;    // endl flushes, like the Fortran write
    MPI::COMM_WORLD.Barrier();        // synchronize all ranks
    MPI::Finalize();                  // clean shutdown
    return 0;
}
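 
If the failure is inside MPI::Init() itself there may be nothing to
catch, but for anything after Init, switching COMM_WORLD's error
handler to MPI::ERRORS_THROW_EXCEPTIONS should at least produce an
error string instead of a silent exit. A minimal sketch (assuming the
MPI-2 C++ exception machinery is available in this build):
 
#include <iostream>
#include "mpi.h"
 
int main(int argc, char *argv[])
{
    MPI::Init(argc, argv);
    // Raise C++ exceptions instead of aborting on post-Init errors
    MPI::COMM_WORLD.Set_errhandler(MPI::ERRORS_THROW_EXCEPTIONS);
    try {
        std::cout << "." << std::endl;
        MPI::COMM_WORLD.Barrier();
    } catch (MPI::Exception &e) {
        std::cerr << "MPI error: " << e.Get_error_string() << std::endl;
    }
    MPI::Finalize();
    return 0;
}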

 
Here is the tracejob output for the Fortran90 code:
 
Job: 19347.head
 
01/18/2007 10:18:29  L    Considering job to run
01/18/2007 10:18:29  S    enqueuing into workq, state 1 hop 1
01/18/2007 10:18:29  S    Job Queued at request of
e35689 at head.default.domain, owner = e35689 at head.default.domain, job name
= webb-x8701, queue = workq
01/18/2007 10:18:29  S    Job Run at request of
Scheduler at head.default.domain on hosts
n022c:ncpus=4+n023c:ncpus=4+n044c:ncpus=4+n047c:ncpus=4+n048c:ncpus=4+
n049c:ncpus=4+n050c:ncpus=4+n051c:ncpus=4+n052c:ncpus=4+n053c:ncpus=4+
n054c:ncpus=4+n055c:ncpus=4+n057c:ncpus=4+n058c:ncpus=4+n066c:ncpus=4+
n073c:ncpus=4
01/18/2007 10:18:29  S    Job Modified at request of
Scheduler at head.default.domain
01/18/2007 10:18:29  L    Job run
01/18/2007 10:18:49  S    Obit received
01/18/2007 10:18:49  S    Exit_status=0 resources_used.cpupercent=0
resources_used.cput=00:00:02 resources_used.mem=167704kb
resources_used.ncpus=64 resources_used.vmem=1554300kb
resources_used.walltime=00:00:20
01/18/2007 10:18:50  S    dequeuing from workq, state 5

 
and here is the tracejob output for the C++ code:
 
Job: 19346.head
 
01/18/2007 10:17:40  S    enqueuing into workq, state 1 hop 1
01/18/2007 10:17:41  L    Considering job to run
01/18/2007 10:17:41  S    Job Queued at request of
e35689 at head.default.domain, owner = e35689 at head.default.domain, job name
= photontest, queue = workq
01/18/2007 10:17:41  S    Job Run at request of
Scheduler at head.default.domain on hosts
n022c:ncpus=4+n023c:ncpus=4+n044c:ncpus=4+n047c:ncpus=4+n048c:ncpus=4+
n049c:ncpus=4+n050c:ncpus=4+n051c:ncpus=4+n052c:ncpus=4+n053c:ncpus=4+
n054c:ncpus=4+n055c:ncpus=4+n057c:ncpus=4+n058c:ncpus=4+n066c:ncpus=4+
n073c:ncpus=4
01/18/2007 10:17:41  S    Job Modified at request of
Scheduler at head.default.domain
01/18/2007 10:17:41  L    Job run
01/18/2007 10:18:13  S    Obit received
01/18/2007 10:18:13  S    Exit_status=1 resources_used.cpupercent=0
resources_used.cput=00:00:00 resources_used.mem=157584kb
resources_used.ncpus=64 resources_used.vmem=1459504kb
resources_used.walltime=00:00:22
01/18/2007 10:18:13  S    dequeuing from workq, state 5
 
You'll notice that I'm using the exact same nodes in each case.
 
Can somebody please help me diagnose and solve this problem? I presume
that there's something wrong (?) with our comm backbone, but the admins
tell me the cluster is working just fine.
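 
In case it helps rule the fabric in or out, a simple pairwise test
along these lines (just a sketch) could be run across the same node set
to look for a bad link between rank 0 and any other rank:
 
#include <iostream>
#include <vector>
#include "mpi.h"
 
int main(int argc, char *argv[])
{
    MPI::Init(argc, argv);
    const int rank = MPI::COMM_WORLD.Get_rank();
    const int size = MPI::COMM_WORLD.Get_size();
    std::vector<char> buf(1024);
 
    if (rank == 0) {
        // Ping each rank in turn and report which links answer
        for (int r = 1; r < size; ++r) {
            MPI::COMM_WORLD.Send(&buf[0], 1024, MPI::CHAR, r, 0);
            MPI::COMM_WORLD.Recv(&buf[0], 1024, MPI::CHAR, r, 1);
            std::cout << "rank " << r << " ok" << std::endl;
        }
    } else {
        MPI::COMM_WORLD.Recv(&buf[0], 1024, MPI::CHAR, 0, 0);
        MPI::COMM_WORLD.Send(&buf[0], 1024, MPI::CHAR, 0, 1);
    }
    MPI::Finalize();
    return 0;
}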
 
Thanks,
 
Michael Webb
Scientist
ATK Launch Systems

