[mvapich-discuss] Problems running MPI jobs with large (?) numbers of processors

Sayantan Sur surs at cse.ohio-state.edu
Thu Jan 18 18:16:20 EST 2007


Hello Michael,

* On Jan 18, Webb, Michael <Michael.Webb at atk.com> wrote:
> I'm new to the list, mostly because I'm having a problem running codes with
> large (?) numbers of processors.


Welcome to the list :-) Thanks for reporting your problem.

> To illustrate the problem, I've run very similar code written in Fortran90 and
> C++ concurrently (within seconds of each other, using the same nodes on the
> cluster). The Fortran90 code always succeeds, the C++ code always fails.

I see that both the Fortran and the C++ codes lack a call to
MPI_Finalize (which is required for a valid MPI program). At larger
process counts, there can be a slight race in which some processes quit
before their barrier messages actually complete -- leading to incorrect
termination.

Can you add the call to MPI_Finalize() and report the results? I'm
hoping that both the Fortran and the C++ codes will then succeed :-)
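
For reference, here is a rough sketch of what the corrected C++ test
could look like (untested on my end; the std::endl is only there to
flush the output, and the Fortran test would similarly need a
call mpi_finalize(ierr) before its end statement):

#include <cstdlib>
#include <iostream>
#include "mpi.h"

int main ( int argc, char *argv[] )
{
    MPI::Init(argc, argv);            // start up the MPI environment
    std::cout << "." << std::endl;    // print and flush
    MPI::COMM_WORLD.Barrier();        // wait for all ranks
    MPI::Finalize();                  // clean shutdown -- the missing call
    return 0;
}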

Thanks,
Sayantan.

>  
> Here is the Fortran90 code:
>  
> program fortrantest
> implicit none
> include 'mpif.h'
> integer ierr
> 
> call mpi_init(ierr)
> write(*,*)'.'
> call mpi_barrier(MPI_COMM_WORLD, ierr)
>  
> end
>  
> and here is the C++ code:
>  
> #include <cstdlib>
> #include <iostream>
> #include "mpi.h"
>  
> int main ( int argc, char *argv[] )
>  
> {
>     MPI::Init(argc, argv);
>     std::cout<<".";
>     MPI::COMM_WORLD.Barrier();
> }
>  
> Here is the tracejob output for the Fortran90 code:
>  
> Job: 19347.head
>  
> 01/18/2007 10:18:29  L    Considering job to run
> 01/18/2007 10:18:29  S    enqueuing into workq, state 1 hop 1
> 01/18/2007 10:18:29  S    Job Queued at request of e35689 at head.default.domain,
> owner = e35689 at head.default.domain, job name = webb-x8701, queue = workq
> 01/18/2007 10:18:29  S    Job Run at request of Scheduler at head.default.domain
> on hosts n022c:ncpus=4+n023c:ncpus=4+n044c:ncpus=4+n047c:ncpus=4+n048c:ncpus=
> 4+n049c:ncpus=4+n050c:ncpus=4+n051c:ncpus=4+n052c:ncpus=4+n053c:ncpus=
> 4+n054c:ncpus=4+n055c:ncpus=4+n057c:ncpus=4+n058c:ncpus=4+n066c:ncpus=
> 4+n073c:ncpus=4
> 01/18/2007 10:18:29  S    Job Modified at request of
> Scheduler at head.default.domain
> 01/18/2007 10:18:29  L    Job run
> 01/18/2007 10:18:49  S    Obit received
> 01/18/2007 10:18:49  S    Exit_status=0 resources_used.cpupercent=0
> resources_used.cput=00:00:02 resources_used.mem=167704kb resources_used.ncpus=
> 64 resources_used.vmem=1554300kb resources_used.walltime=00:00:20
> 01/18/2007 10:18:50  S    dequeuing from workq, state 5
>  
> and here is the tracejob output for the C++ code:
>  
> Job: 19346.head
>  
> 01/18/2007 10:17:40  S    enqueuing into workq, state 1 hop 1
> 01/18/2007 10:17:41  L    Considering job to run
> 01/18/2007 10:17:41  S    Job Queued at request of e35689 at head.default.domain,
> owner = e35689 at head.default.domain, job name = photontest, queue = workq
> 01/18/2007 10:17:41  S    Job Run at request of Scheduler at head.default.domain
> on hosts n022c:ncpus=4+n023c:ncpus=4+n044c:ncpus=4+n047c:ncpus=4+n048c:ncpus=
> 4+n049c:ncpus=4+n050c:ncpus=4+n051c:ncpus=4+n052c:ncpus=4+n053c:ncpus=
> 4+n054c:ncpus=4+n055c:ncpus=4+n057c:ncpus=4+n058c:ncpus=4+n066c:ncpus=
> 4+n073c:ncpus=4
> 01/18/2007 10:17:41  S    Job Modified at request of
> Scheduler at head.default.domain
> 01/18/2007 10:17:41  L    Job run
> 01/18/2007 10:18:13  S    Obit received
> 01/18/2007 10:18:13  S    Exit_status=1 resources_used.cpupercent=0
> resources_used.cput=00:00:00 resources_used.mem=157584kb resources_used.ncpus=
> 64 resources_used.vmem=1459504kb resources_used.walltime=00:00:22
> 01/18/2007 10:18:13  S    dequeuing from workq, state 5
>  
> You'll notice that I'm using the exact same nodes in each case.
>  
> Can somebody please help me diagnose and solve this problem? I presume that
> there's something wrong (?) with our comm backbone, but the admins tell me the
> cluster is working just fine.
>  
> Thanks,
>  
> Michael Webb
> Scientist
> ATK Launch Systems

> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss


-- 
http://www.cse.ohio-state.edu/~surs

