[mvapich-discuss] process limits

Dhabaleswar Panda panda at cse.ohio-state.edu
Wed Aug 29 13:21:56 EDT 2007


Hi Mark,

Thanks for providing us the details. There appears to be a `time out'
issue with the new mpirun_rsh. We are taking a look at it and will
send you a solution soon.

Thanks,

DK

On Tue, 28 Aug 2007, Mark Potts wrote:

> Hi,
>     I tried VIADEV_USE_SHMEM_COLL=0 and separately tried
>     VIADEV_USE_BLOCKING=1, with no change in results.  During task
>     startup I get one or more of "Unable to find child nnnn!", "Child
>     died. Timeout while waiting", or simply "done."
>
>     I tried repeatedly but was never able to consistently run more
>     than 10 ranks (-np 10) on a single node.  I am, of course, able
>     to run many more ranks when I spread the processes across more
>     nodes.
>
>     My experiment is to start a very simple code with multiple
>     processes on a single node (a minimal sketch of such a test
>     program follows the table below).  The specific details of my
>     setup on the two machines, which gave the same results, are:
>
>     Machine  CPUs per  Cores per  Avail.  CPU      MVAPICH      MVAPICH
>               node      CPU       nodes   type     version      device
>        A        1         2          3    x86_64   0.9.9-1168   ch_gen2
>        B        2         4         16    x86_64   0.9.9-1326   ch_gen2
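>
>     A minimal sketch of the kind of test program used here (purely
>     illustrative -- not the exact code, and the launch line is only an
>     example):
>
>       /* minimal MPI test: each rank reports itself and its host.
>        * Launched e.g. as:  mpirun_rsh -np 16 <host list> ./a.out  */
>       #include <mpi.h>
>       #include <stdio.h>
>
>       int main(int argc, char **argv)
>       {
>           int rank, size, namelen;
>           char host[MPI_MAX_PROCESSOR_NAME];
>
>           MPI_Init(&argc, &argv);
>           MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>           MPI_Comm_size(MPI_COMM_WORLD, &size);
>           MPI_Get_processor_name(host, &namelen);
>           printf("rank %d of %d on %s\n", rank, size, host);
>           MPI_Barrier(MPI_COMM_WORLD);   /* exercise one collective */
>           MPI_Finalize();
>           return 0;
>       }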
>
>     The MVAPICH code, which was obtained from the OFED 1.2 installation,
>     has two patches applied:
>      (1) to mpirun_rsh.c, from Sayantan Sur (10 Jul), for MVAPICH errant
>          process/job cleanup.
>      (2) to comm_free.c, from Amith Rajith Mamidala (11 Jul), for an
>          MVAPICH segmentation fault during MPI_Finalize() in large jobs.
>
>     Is it possible that the mpirun_rsh.c patch is prematurely killing
>     tasks when it determines that the processes on the oversubscribed
>     node are not responding fast enough?  Or is there another
>     clean explanation?  As I understand the note from DK this morning,
>     oversubscription should work...
>           regards,
>
> amith rajith mamidala wrote:
> > Hi Mark,
> >
> > Can you check if you get this error by setting the environment variable
> > VIADEV_USE_SHMEM_COLL to 0, e.g. mpirun_rsh -np N VIADEV_USE_SHMEM_COLL=0
> > ./a.out
> >
> > -thanks,
> > Amith
> >
> > On Tue, 28 Aug 2007, Mark Potts wrote:
> >
> >> Hi,
> >>     Is there an effective or hard limit on the number of MVAPICH
> >>     processes that can be run on a single node?
> >>
> >>     Given N CPUs, each having M cores, on a single node, I've been told
> >>     that one cannot run more than N*M MVAPICH processes on that node.
> >>     In fact, if I even approach this number with "-np 16" (for a node
> >>     with N=8 and M=4), I get an "Unable to find child nnnn!" or "Child
> >>     died" message.  Is this a configuration problem with this system or
> >>     somehow expected behavior?
> >>
> >>     More pointedly, should oversubscription of cores, np > N*M, on a
> >>     single node work in MVAPICH?  How about in MVAPICH2?
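> >>
> >>     (Purely as an illustrative sketch, not code from this thread: a
> >>     rank can flag the single-node oversubscription case at run time
> >>     by comparing the job size with the processor count the OS reports.)
> >>
> >>       /* sketch: warn when a single-node job has more ranks than
> >>        * online processors (N*M) */
> >>       #include <mpi.h>
> >>       #include <stdio.h>
> >>       #include <unistd.h>
> >>
> >>       int main(int argc, char **argv)
> >>       {
> >>           int rank, size;
> >>           long cores;
> >>
> >>           MPI_Init(&argc, &argv);
> >>           MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> >>           MPI_Comm_size(MPI_COMM_WORLD, &size);
> >>           cores = sysconf(_SC_NPROCESSORS_ONLN);  /* N*M as seen by the OS */
> >>           if (rank == 0 && size > cores)
> >>               printf("oversubscribed: %d ranks, %ld cores\n", size, cores);
> >>           MPI_Finalize();
> >>           return 0;
> >>       }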
> >>
> >>             regards,
> >> --
> >> ***********************************
> >>  >> Mark J. Potts, PhD
> >>  >>
> >>  >> HPC Applications Inc.
> >>  >> phone: 410-992-8360 Bus
> >>  >>        410-313-9318 Home
> >>  >>        443-418-4375 Cell
> >>  >> email: potts at hpcapplications.com
> >>  >>        potts at excray.com
> >> ***********************************
> >> _______________________________________________
> >> mvapich-discuss mailing list
> >> mvapich-discuss at cse.ohio-state.edu
> >> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> >>
>
> --
> ***********************************
>  >> Mark J. Potts, PhD
>  >>
>  >> HPC Applications Inc.
>  >> phone: 410-992-8360 Bus
>  >>        410-313-9318 Home
>  >>        443-418-4375 Cell
>  >> email: potts at hpcapplications.com
>  >>        potts at excray.com
> ***********************************
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>


