[mvapich-discuss] process limits
Mark Potts
potts at hpcapplications.com
Tue Aug 28 22:19:45 EDT 2007
Hi,
I tried VIADEV_USE_SHMEM_COLL=0 and separately tried
VIADEV_USE_BLOCKING=1, with no change in results. During task
startup I get one or more of "Unable to find child nnnn!",
"Child died. Timeout while waiting", or simply "done."
I tried repeatedly but was never able to consistently run more
than 10 ranks (-np 10) on a single node. I, of course, am
able to run many more ranks, when I spread the targets across
more nodes.
My experiment is to start a very simple code with multiple
processes on a single node. The specific details of my setup on
two machines are below; the results were the same on both:
Machine  CPUs per  Cores per  Avail   CPU     MVAPICH     MVAPICH
         Node      CPU        Nodes   Type    version     Device
   A        1         2          3    x86_64  0.9.9-1168  ch_gen2
   B        2         4         16    x86_64  0.9.9-1326  ch_gen2
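For reference, the "very simple code" is essentially equivalent to the
sketch below (this is an illustrative reproducer, not the exact source;
the names are mine):

```c
/* Minimal MPI test: each rank reports in and exits.
 * Built with mpicc, launched via mpirun_rsh -np N ... ./a.out.
 * Failures appear during startup, before any rank prints. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    printf("rank %d of %d alive\n", rank, size);

    MPI_Finalize();
    return 0;
}
```

With all N target hostnames pointing at the same node, this runs
cleanly up to -np 10 and fails intermittently beyond that, as
described above.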
The MVAPICH code, which was obtained from the OFED 1.2 installation,
has two patches as follows:
(1) for mpirun_rsh.c, from Sayantan Sur on 10 Jul, fixing errant
MVAPICH process/job cleanup.
(2) for comm_free.c, from Amith Rajith Mamidala on 11 Jul, fixing a
MVAPICH segmentation fault during MPI_Finalize() in large jobs.
Is it possible that the mpirun_rsh.c patch is prematurely killing
tasks when it determines that the processes on the oversubscribed
node are not responding fast enough? Or is there some other
explanation? As I understand DK's note from this morning,
oversubscription should work...
regards,
amith rajith mamidala wrote:
> Hi Mark,
>
> Can you check if you get this error by setting the environment variable:
> VIADEV_USE_SHMEM_COLL to 0 e.g. mpirun_rsh -np N VIADEV_USE_SHMEM_COLL=0
> ./a.out
>
> -thanks,
> Amith
>
> On Tue, 28 Aug 2007, Mark Potts wrote:
>
>> Hi,
>> Is there an effective or hard limit on the number of MVAPICH
>> processes that can be run on a single node?
>>
>> Given N cpus, each having M cores, on a single node, I've been told
>> that one can not run more than N*M MVAPICH processes on a single
>> node. In fact, if I try to even approach this number with
>> "-np 16" (for a node with N=8 and M=4), I get an "unable to
>> find child nnnn!" or "Child died" message. Is this a configuration
>> problem with this system or somehow an expected behavior?
>>
>> More pointedly, should oversubscription of cores, np > N*M, on a
>> single node work in MVAPICH? How about in MVAPICH2?
>>
>> regards,
>> --
>> ***********************************
>>  Mark J. Potts, PhD
>>
>>  HPC Applications Inc.
>>  phone: 410-992-8360 Bus
>>         410-313-9318 Home
>>         443-418-4375 Cell
>>  email: potts at hpcapplications.com
>>         potts at excray.com
>> ***********************************
>> _______________________________________________
>> mvapich-discuss mailing list
>> mvapich-discuss at cse.ohio-state.edu
>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>
--
***********************************
 Mark J. Potts, PhD

 HPC Applications Inc.
 phone: 410-992-8360 Bus
        410-313-9318 Home
        443-418-4375 Cell
 email: potts at hpcapplications.com
        potts at excray.com
***********************************