[mvapich-discuss] Problems running MPI jobs with large (?) numbers of processors
Sayantan Sur
surs at cse.ohio-state.edu
Thu Jan 18 18:16:20 EST 2007
Hello Michael,
* On Jan 18, Webb, Michael <Michael.Webb at atk.com> wrote:
> I'm new to the list, mostly because I'm having a problem running codes with
> large (?) numbers of processors.
Welcome to the list :-) Thanks for reporting your problem.
> To illustrate the problem, I've run very similar code written in Fortran90 and
> C++ concurrently (within seconds of each other, using the same nodes on the
> cluster). The Fortran90 code always succeeds, the C++ code always fails.
I see that both the Fortran and the C++ codes lack a call to
MPI_Finalize (which is required to make a valid MPI program). With
larger numbers of processes, there may be a slight race in which some
processes quit before their barrier messages actually complete --
leading to incorrect termination.
Can you add the call to MPI_Finalize() and report the results? I'm
hoping that both the Fortran and the C++ codes will then succeed :-)
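For reference, here is a rough sketch of what your C++ program would look
like with the finalize call added (I've also added a flush so the output
is not lost in stdio buffering -- untested on my end, so treat it as a
sketch rather than a verified fix):

```cpp
#include <iostream>
#include "mpi.h"

int main(int argc, char *argv[])
{
    MPI::Init(argc, argv);

    std::cout << "." << std::flush;   // flush so buffered output isn't dropped

    MPI::COMM_WORLD.Barrier();

    MPI::Finalize();                  // required: completes pending communication
                                      // before the process exits
    return 0;
}
```

The Fortran version needs the analogous `call mpi_finalize(ierr)` before
the end of the program.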
Thanks,
Sayantan.
>
> Here is the Fortran90 code:
>
> program fortrantest
> implicit none
> include 'mpif.h'
> integer ierr
>
> call mpi_init(ierr)
> write(*,*)'.'
> call mpi_barrier(MPI_COMM_WORLD, ierr)
>
> end
>
> and here is the C++ code:
>
> #include <cstdlib>
> #include <iostream>
> #include "mpi.h"
>
> int main ( int argc, char *argv[] )
>
> {
> MPI::Init(argc, argv);
> std::cout<<".";
> MPI::COMM_WORLD.Barrier();
> }
>
> Here is the tracejob output for the Fortran90 code:
>
> Job: 19347.head
>
> 01/18/2007 10:18:29 L Considering job to run
> 01/18/2007 10:18:29 S enqueuing into workq, state 1 hop 1
> 01/18/2007 10:18:29 S Job Queued at request of e35689 at head.default.domain,
> owner = e35689 at head.default.domain, job name = webb-x8701, queue = workq
> 01/18/2007 10:18:29 S Job Run at request of Scheduler at head.default.domain
> on hosts n022c:ncpus=4+n023c:ncpus=4+n044c:ncpus=4+n047c:ncpus=4+n048c:ncpus=
> 4+n049c:ncpus=4+n050c:ncpus=4+n051c:ncpus=4+n052c:ncpus=4+n053c:ncpus=
> 4+n054c:ncpus=4+n055c:ncpus=4+n057c:ncpus=4+n058c:ncpus=4+n066c:ncpus=
> 4+n073c:ncpus=4
> 01/18/2007 10:18:29 S Job Modified at request of
> Scheduler at head.default.domain
> 01/18/2007 10:18:29 L Job run
> 01/18/2007 10:18:49 S Obit received
> 01/18/2007 10:18:49 S Exit_status=0 resources_used.cpupercent=0
> resources_used.cput=00:00:02 resources_used.mem=167704kb resources_used.ncpus=
> 64 resources_used.vmem=1554300kb resources_used.walltime=00:00:20
> 01/18/2007 10:18:50 S dequeuing from workq, state 5
>
> and here is the tracejob output for the C++ code:
>
> Job: 19346.head
>
> 01/18/2007 10:17:40 S enqueuing into workq, state 1 hop 1
> 01/18/2007 10:17:41 L Considering job to run
> 01/18/2007 10:17:41 S Job Queued at request of e35689 at head.default.domain,
> owner = e35689 at head.default.domain, job name = photontest, queue = workq
> 01/18/2007 10:17:41 S Job Run at request of Scheduler at head.default.domain
> on hosts n022c:ncpus=4+n023c:ncpus=4+n044c:ncpus=4+n047c:ncpus=4+n048c:ncpus=
> 4+n049c:ncpus=4+n050c:ncpus=4+n051c:ncpus=4+n052c:ncpus=4+n053c:ncpus=
> 4+n054c:ncpus=4+n055c:ncpus=4+n057c:ncpus=4+n058c:ncpus=4+n066c:ncpus=
> 4+n073c:ncpus=4
> 01/18/2007 10:17:41 S Job Modified at request of
> Scheduler at head.default.domain
> 01/18/2007 10:17:41 L Job run
> 01/18/2007 10:18:13 S Obit received
> 01/18/2007 10:18:13 S Exit_status=1 resources_used.cpupercent=0
> resources_used.cput=00:00:00 resources_used.mem=157584kb resources_used.ncpus=
> 64 resources_used.vmem=1459504kb resources_used.walltime=00:00:22
> 01/18/2007 10:18:13 S dequeuing from workq, state 5
>
> You'll notice that I'm using the exact same nodes in each case.
>
> Can somebody please help me diagnose and solve this problem? I presume that
> there's something wrong (?) with our comm backbone, but the admins tell me the
> cluster is working just fine.
>
> Thanks,
>
> Michael Webb
> Scientist
> ATK Launch Systems
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
--
http://www.cse.ohio-state.edu/~surs