[mvapich-discuss] process limits
Dhabaleswar Panda
panda at cse.ohio-state.edu
Wed Aug 29 13:21:56 EDT 2007
Hi Mark,
Thanks for providing the details. There appears to be a
timeout issue with the new mpirun_rsh. We are taking a look at it
and will send you a solution soon.
Thanks,
DK
On Tue, 28 Aug 2007, Mark Potts wrote:
> Hi,
> I tried VIADEV_USE_SHMEM_COLL=0 and separately tried
> VIADEV_USE_BLOCKING=1, with no change in results. During task
> startup I get "Unable to find child nnnn!", "Child died.
> Timeout while waiting", and/or simply "done."
>
> I tried repeatedly but was never able to consistently run more
> than 10 ranks (-np 10) on a single node. I, of course, am
> able to run many more ranks, when I spread the targets across
> more nodes.
>
> My experiment is to start a very simple code with multiple
> processes on a single node. The specific details of my setup on
> two machines are below. The results were the same on both:
>
> Machine  Cpus per  Cores per  Avail  Cpu     MVAPICH      MVAPICH
>          Node      Cpu        Nodes  Type    version      Device
> A        1         2          3      X86_64  -0.9.9-1168  ch_gen2
> B        2         4          16     X86_64  -0.9.9-1326  ch_gen2
>
> The MVAPICH code, which was obtained from the OFED 1.2 installation,
> has two patches as follows:
> (1) for mpirun_rsh.c from Sayantan Sur of 10 Jul for MVAPICH errant
>     process/job cleanup.
> (2) for comm_free.c from Amith Rajith Mamidala of 11 Jul for MVAPICH
>     segmentation fault during MPI_Finalize() in large jobs.
>
> Is it possible that the mpirun_rsh.c patch is prematurely killing
> tasks when it determines that the processes on the oversubscribed
> node are not responding fast enough? Or is there another
> clean explanation? As I understand the note from DK this morning,
> oversubscription should work...
> regards,
>
> amith rajith mamidala wrote:
> > Hi Mark,
> >
> > Can you check if you still get this error after setting the
> > environment variable VIADEV_USE_SHMEM_COLL to 0? For example:
> >
> >   mpirun_rsh -np N VIADEV_USE_SHMEM_COLL=0 ./a.out
> >
> > -thanks,
> > Amith
> >
> > On Tue, 28 Aug 2007, Mark Potts wrote:
> >
> >> Hi,
> >> Is there an effective or hard limit on the number of MVAPICH
> >> processes that can be run on a single node?
> >>
> >> Given N cpus, each having M cores, on a single node, I've been told
> >> that one cannot run more than N*M MVAPICH processes on a single
> >> node. In fact, if I even try to approach this number with "-np 16"
> >> (for a node with N=8 and M=4), I observe an "unable to find child
> >> nnnn!" or "Child died" message. Is this a configuration problem
> >> with this system or somehow an expected behavior?
> >>
> >> More pointedly, should oversubscription of cores, np > N*M, on a
> >> single node work in MVAPICH? How about in MVAPICH2?
> >>
> >> regards,
> >> --
> >> ***********************************
> >> >> Mark J. Potts, PhD
> >> >>
> >> >> HPC Applications Inc.
> >> >> phone: 410-992-8360 Bus
> >> >> 410-313-9318 Home
> >> >> 443-418-4375 Cell
> >> >> email: potts at hpcapplications.com
> >> >> potts at excray.com
> >> ***********************************
> >> _______________________________________________
> >> mvapich-discuss mailing list
> >> mvapich-discuss at cse.ohio-state.edu
> >> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> >>
>
>
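As a rough aid for the oversubscription question raised in this thread
(np > N*M on a single node), the sketch below computes N*M for the two
machines listed in the table above and reports whether a given -np on one
node exceeds it. The function names and the MACHINES layout are
illustrative assumptions only; nothing here is part of MVAPICH or
mpirun_rsh, and it says nothing about why the startup timeout occurs.

```python
# Hypothetical helper names; values for machines A and B are taken from
# the table in the quoted message (Cpus per Node, Cores per Cpu).
MACHINES = {"A": (1, 2), "B": (2, 4)}


def cores_per_node(cpus_per_node, cores_per_cpu):
    """Return N*M, the number of cores available on one node."""
    return cpus_per_node * cores_per_cpu


def is_oversubscribed(np_ranks, cpus_per_node, cores_per_cpu):
    """True when the ranks placed on one node exceed its core count."""
    return np_ranks > cores_per_node(cpus_per_node, cores_per_cpu)


def report(np_ranks):
    """Build one summary line per machine for a single-node run."""
    lines = []
    for name, (n, m) in sorted(MACHINES.items()):
        total = cores_per_node(n, m)
        status = "oversubscribed" if is_oversubscribed(np_ranks, n, m) else "fits"
        lines.append(f"machine {name}: {total} cores, -np {np_ranks} {status}")
    return lines


if __name__ == "__main__":
    for line in report(10):
        print(line)
```

If the table is read this way, "-np 10" on a single node already exceeds
N*M on both machines (2 and 8 cores respectively).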