[mvapich-discuss] process limits

Jonathan Perkins perkinjo at cse.ohio-state.edu
Sat Sep 1 09:26:28 EDT 2007


Mark Potts wrote:
> Jonathan, That patch seems to solve the non-bash shell problems. I've
> asked some others to try it for themselves, and if I learn anything
> new, I'll let you know.  But for now, it appears that problem can be
> checked off.
> regards,

That's good to hear.  Below my quoted response you'll find the solution
to your problem when trying to run many processes on one node.

> 
> Jonathan L. Perkins wrote:
>> Mark Potts wrote:
>>> DK, Thanks for the quick reply.
>>> 
>>> I'm maybe jumping ahead here, but assuming the "timeout" is the 
>>> one that I suggested from the 10 July patch (mentioned below) I 
>>> have a question related to that patch. Could you/your people 
>>> suggest why this patch seems to fail for users running tcsh (as 
>>> opposed to bash)?  My initial testing of this patch, done under
>>> bash, found that it worked quite nicely to clean up jobs when one
>>> or more of the processes killed themselves.  An mpirun_rsh
>>> thread detected this condition and then killed the remaining
>>> remote tasks.  However, we now find that users employing tcsh
>>> login shell are not so lucky.  The detection part of the patch
>>> works successfully and the remote sshd tasks are killed
>>> successfully but the remote MPI processes continue to run.  Quite
>>> interesting if it weren't so bad.  We can be left with large
>>> numbers of CPU burning tasks that are difficult to find and kill.
>>> 
>> 
>> Mark: Attached you find a patch that should solve the issue with 
>> remote processes not being killed when using tcsh.  The problem 
>> arose from using the signal name with the kill command.  Since in 
>> many cases the kill command is a shell built-in, there can be some 
>> minor differences, such as SIGKILL in bash but just KILL in tcsh.
>> 
>> After applying this patch to the MVAPICH source you can simply 
>> re-install using the proper make script.  Let me know if you have 
>> any further questions regarding this.
>> 
>> As for your original question, it seems that the problem isn't the 
>> timeout while waiting for the other child process to end.  More 
>> troublesome is that the child died in the first place.  We have 
>> reproduced this issue and are working on finding the best solution 
>> to resolve it.  We'll keep you posted.
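The bash/tcsh difference described above can be sketched in a small
portable-shell example (the sleep process here is just a stand-in for a
stray remote MPI task; this is an illustration of the signal-name issue,
not the patch itself):

```shell
#!/bin/sh
# bash's builtin kill accepts "kill -s SIGKILL", but tcsh's builtin kill
# only accepts the short name "KILL".  The short signal name (or the
# numeric signal) works in bash, tcsh, and /bin/kill, so it is the
# portable choice when the command may run under an arbitrary login shell.
sleep 60 &            # stand-in for a stray remote process
pid=$!
kill -KILL "$pid"     # portable: short signal name, no SIG prefix
wait "$pid" 2>/dev/null
echo "exit status: $?"    # 137 = 128 + 9 (SIGKILL)
```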

Can you try setting the following in /etc/ssh/sshd_config:

MaxStartups 20

>  From the sshd_config manual page:
> 
>      MaxStartups
>              Specifies the maximum number of concurrent unauthenticated
>              connections to the sshd daemon.  Additional connections will
>              be dropped until authentication succeeds or the LoginGraceTime
>              expires for a connection.  The default is 10.

We were able to narrow the problem down to the area where we spawn ssh 
processes and it seemed to consistently have problems whenever this 
limit was reached.  After changing this configuration and restarting 
sshd the problem went away.
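To recap, the fix is a one-line sshd configuration change (the value 20
comes from the suggestion above; in general, pick a limit comfortably
above the number of ranks you launch on a single node):

```shell
# /etc/ssh/sshd_config
# Maximum number of concurrent *unauthenticated* connections sshd will
# accept; connections beyond this are dropped, which is what made
# mpirun_rsh lose children once it opened more than ~10 ssh sessions
# to one node at the same time.
MaxStartups 20

# Then restart sshd so the new limit takes effect, e.g.:
#   /etc/init.d/sshd restart
```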


>>> Shell experts' ideas welcome.
>>> 
>>> regards,
>>> 
>>> Dhabaleswar Panda wrote:
>>>> Hi Mark,
>>>> 
>>>> Thanks for providing us the details. There appears to be some 
>>>> `time out' with the new mpirun_rsh. We are taking a look at it
>>>>  and will be able to send you some solution soon.
>>>> 
>>>> Thanks,
>>>> 
>>>> DK
>>>> 
>>>> On Tue, 28 Aug 2007, Mark Potts wrote:
>>>> 
>>>>> Hi, I tried VIADEV_USE_SHMEM_COLL=0 and separately tried 
>>>>> VIADEV_USE_BLOCKING=1, with no change in results.  During 
>>>>> task startup I get "Unable to find child nnnn!", 
>>>>> "Child died. Timeout while waiting", and/or simply "done."
>>>>> 
>>>>> I tried repeatedly but was never able to consistently run 
>>>>> more than 10 ranks (-np 10) on a single node.  I, of course, 
>>>>> am able to run many more ranks, when I spread the targets 
>>>>> across more nodes.
>>>>> 
>>>>> My experiment is to start a very simple code with multiple 
>>>>> processes on a single node.  Specific details of my setup on 
>>>>> two machines are below; the results were the same:
>>>>> 
>>>>> Machine  CPUs/Node  Cores/CPU  Avail Nodes  CPU Type  MVAPICH version  MVAPICH Device
>>>>> A        1          2          3            X86-64    0.9.9-1168       ch_gen2
>>>>> B        2          4          16           X86_64    0.9.9-1326       ch_gen2
>>>>> 
>>>>> The MVAPICH code, which was obtained from the OFED 1.2 
>>>>> installation, has two patches as follows: (1) for 
>>>>> mpirun_rsh.c from Sayantan Sur, 10 Jul, for MVAPICH errant 
>>>>> process/job cleanup; (2) for comm_free.c from Amith Rajith 
>>>>> Mamidala, 11 Jul, for an MVAPICH segmentation fault during 
>>>>> MPI_Finalize() in large jobs.
>>>>> 
>>>>> Is it possible that the mpirun_rsh.c patch is prematurely 
>>>>> killing tasks when it determines that the processes on the 
>>>>> oversubscribed node are not responding fast enough?  Or is 
>>>>> there another clean explanation?  As I understand the note 
>>>>> from DK this morning, oversubscription should work... 
>>>>> regards,
>>>>> 
>>>>> amith rajith mamidala wrote:
>>>>>> Hi Mark,
>>>>>> 
>>>>>> Can you check if you get this error by setting the 
>>>>>> environment variable: VIADEV_USE_SHMEM_COLL to 0 e.g. 
>>>>>> mpirun_rsh -np N VIADEV_USE_SHMEM_COLL=0 ./a.out
>>>>>> 
>>>>>> -thanks, Amith
>>>>>> 
>>>>>> On Tue, 28 Aug 2007, Mark Potts wrote:
>>>>>> 
>>>>>>> Hi, Is there an effective or hard limit on the number of 
>>>>>>> MVAPICH processes that can be run on a single node?
>>>>>>> 
>>>>>>> Given N cpus, each having M cores, on a single node, I've
>>>>>>>  been told that one can not run more than N*M MVAPICH 
>>>>>>> processes on a single node.  In fact, I observe that if I
>>>>>>>  even approach this number with "-np 16" (for a 
>>>>>>> node with N=8 and M=4), I get an "unable to find child
>>>>>>>  nnnn!" or "Child died" message.  Is this a configuration
>>>>>>>  problem with this system or somehow an expected
>>>>>>> behavior?
>>>>>>> 
>>>>>>> 
>>>>>>> More pointedly, should oversubscription of cores, np > 
>>>>>>> N*M, on a single node work in MVAPICH?  How about in 
>>>>>>> MVAPICH2?
>>>>>>> 
>>>>>>> regards,
>>>>>>> -- 
>>>>>>> ***********************************
>>>>>>> Mark J. Potts, PhD
>>>>>>> HPC Applications Inc.
>>>>>>> phone: 410-992-8360 Bus / 410-313-9318 Home / 443-418-4375 Cell
>>>>>>> email: potts at hpcapplications.com / potts at excray.com
>>>>>>> ***********************************
>>>>>>> _______________________________________________ 
>>>>>>> mvapich-discuss mailing list 
>>>>>>> mvapich-discuss at cse.ohio-state.edu 
>>>>>>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>> -- ***********************************
>>>>>>> Mark J. Potts, PhD
>>>>>>> 
>>>>>>> HPC Applications Inc. phone: 410-992-8360 Bus 
>>>>>>> 410-313-9318 Home 443-418-4375 Cell email: 
>>>>>>> potts at hpcapplications.com potts at excray.com
>>>>> *********************************** 
>>>>> _______________________________________________ 
>>>>> mvapich-discuss mailing list 
>>>>> mvapich-discuss at cse.ohio-state.edu 
>>>>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>>>> 
>>>>> 
>>>>> 
>>> 
>> 
>> 
> 


-- 
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo


