[mvapich-discuss] mvapich2 integrate with Torque

Shenglong Wang sw77 at nyu.edu
Tue Apr 29 11:25:46 EDT 2014


This is the output from the first node

[sw77 at compute-14-2 ~]$ ps xf -u sw77
  PID TTY      STAT   TIME COMMAND
25410 ?        Ss     0:00 -bash
25436 ?        Sl     0:00  \_ pbs_demux
25484 ?        S      0:00  \_ /bin/bash /opt/torque/mom_priv/jobs/1580.soho.es.its.nyu.edu.SC
25490 ?        S      0:00      \_ /share/apps/mvapich2/2.0rc1/intel/bin/mpiexec /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
25491 ?        Ss     0:00          \_ /share/apps/mvapich2/2.0rc1/intel/bin/hydra_pmi_proxy --control-port compute-14-2.local:41010 --rmk pbs --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 0
25494 ?        RLsl   6:40          |   \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
25495 ?        RLsl   6:41          |   \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
25496 ?        RLsl   6:41          |   \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
25497 ?        RLsl   6:41          |   \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
25498 ?        RLsl   6:41          |   \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
25499 ?        RLsl   6:41          |   \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
25500 ?        RLsl   6:41          |   \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
25502 ?        RLsl   6:41          |   \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
25503 ?        RLsl   6:41          |   \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
25504 ?        RLsl   6:41          |   \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
25505 ?        RLsl   6:41          |   \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
25506 ?        RLsl   6:41          |   \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
25507 ?        RLsl   6:41          |   \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
25508 ?        RLsl   6:41          |   \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
25509 ?        RLsl   6:41          |   \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
25510 ?        RLsl   6:41          |   \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
25511 ?        RLsl   6:41          |   \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
25512 ?        RLsl   6:41          |   \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
25513 ?        RLsl   6:41          |   \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
25514 ?        RLsl   6:41          |   \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
25492 ?        Ss     0:00          \_ /usr/bin/ssh -x compute-14-3.local "/share/apps/mvapich2/2.0rc1/intel/bin/hydra_pmi_proxy" --control-port compute-14-2.local:41010 --rmk pbs --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 1
25493 ?        Ss     0:00          \_ /usr/bin/ssh -x compute-14-4.local "/share/apps/mvapich2/2.0rc1/intel/bin/hydra_pmi_proxy" --control-port compute-14-2.local:41010 --rmk pbs --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 2
25544 ?        S      0:00 sshd: sw77 at pts/0 
25549 pts/0    Ss     0:00  \_ -bash
25932 pts/0    R+     0:00      \_ ps xf -u sw77
[sw77 at compute-14-2 ~]$ 
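
Note that even though hydra detected the PBS environment (--rmk pbs on the
proxy command lines), it still starts the proxies on the other nodes with
--launcher ssh, as the two /usr/bin/ssh ... hydra_pmi_proxy children above
show. A quick way to see what this particular hydra build supports (the
grep just trims the -info output; adjust the path to your install) is:

/share/apps/mvapich2/2.0rc1/intel/bin/mpiexec -info | grep -i available

If a Torque/TM-capable launcher is listed there, it can be requested
explicitly with mpiexec's -launcher option or the HYDRA_LAUNCHER
environment variable instead of the ssh fallback seen above.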


This is the output from the second node

[sw77 at compute-14-3 ~]$ ps xf -u sw77
  PID TTY      STAT   TIME COMMAND
44090 ?        S      0:00 sshd: sw77 at pts/0 
44095 pts/0    Ss     0:00  \_ -bash
44444 pts/0    R+     0:00      \_ ps xf -u sw77
43926 ?        S      0:00 sshd: sw77 at notty 
43927 ?        Ss     0:00  \_ /share/apps/mvapich2/2.0rc1/intel/bin/hydra_pmi_proxy --control-port compute-14-2.local:41010 --rmk pbs --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 1
43978 ?        RLsl   7:08      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
43979 ?        RLsl   7:08      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
43980 ?        RLsl   7:08      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
43981 ?        RLsl   7:08      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
43982 ?        RLsl   7:08      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
43983 ?        RLsl   7:08      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
43984 ?        RLsl   7:08      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
43985 ?        RLsl   7:08      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
43986 ?        RLsl   7:08      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
43987 ?        RLsl   7:08      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
43988 ?        RLsl   7:08      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
43989 ?        RLsl   7:08      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
43990 ?        RLsl   7:08      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
43991 ?        RLsl   7:08      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
43992 ?        RLsl   7:08      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
43993 ?        RLsl   7:08      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
43994 ?        RLsl   7:08      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
43995 ?        RLsl   7:08      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
43996 ?        RLsl   7:08      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
43998 ?        RLsl   7:08      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
[sw77 at compute-14-3 ~]$ 


And this is the output from the third node

[sw77 at compute-14-4 ~]$  ps xf -u sw77
  PID TTY      STAT   TIME COMMAND
18784 ?        S      0:00 sshd: sw77 at pts/0 
18789 pts/0    Ss     0:00  \_ -bash
18845 pts/0    R+     0:00      \_ ps xf -u sw77
18328 ?        S      0:00 sshd: sw77 at notty 
18329 ?        Ss     0:00  \_ /share/apps/mvapich2/2.0rc1/intel/bin/hydra_pmi_proxy --control-port compute-14-2.local:41010 --rmk pbs --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 2
18380 ?        RLsl   7:37      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
18381 ?        RLsl   7:37      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
18382 ?        RLsl   7:37      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
18383 ?        RLsl   7:36      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
18384 ?        RLsl   7:36      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
18385 ?        RLsl   7:37      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
18386 ?        RLsl   7:37      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
18387 ?        RLsl   7:37      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
18388 ?        RLsl   7:37      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
18389 ?        RLsl   7:37      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
18390 ?        RLsl   7:37      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
18391 ?        RLsl   7:37      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
18392 ?        RLsl   7:37      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
18393 ?        RLsl   7:37      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
18394 ?        RLsl   7:37      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
18395 ?        RLsl   7:36      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
18396 ?        RLsl   7:37      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
18397 ?        RLsl   7:37      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
18398 ?        RLsl   7:37      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
18399 ?        RLsl   7:37      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
[sw77 at compute-14-4 ~]$ 
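
On the second and third nodes the hydra_pmi_proxy and pmemd.MPI processes
sit under "sshd: sw77 at notty" rather than under the Torque job script's
process tree (as they do on compute-14-2), which is consistent with Torque
only accounting for the processes on the first node. While the job is
running, the per-job numbers Torque does see can be watched with qstat
(job ID taken from the job script path above):

qstat -f 1580.soho.es.its.nyu.edu | grep resources_used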


Best,

Shenglong


On Apr 29, 2014, at 10:08 AM, Jonathan Perkins <perkinjo at cse.ohio-state.edu> wrote:

> Thanks for the report.  It's possible that what you are seeing is due to an
> outstanding issue with hydra and Torque (PBS) integration
> (https://trac.mpich.org/projects/mpich/ticket/1812#no1).  Can you send
> us the relevant output of ps axf from each node as the job is running
> to help verify?
> 
> On Tue, Apr 29, 2014 at 9:49 AM, Shenglong Wang <sw77 at nyu.edu> wrote:
>> 
>> Hi Jonathan,
>> 
>> Thanks a lot for the reply.  I'm running MVAPICH2 2.0rc1 and using mpiexec
>> to launch the MPI processes.
>> 
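>> For reference, the job is driven by a PBS script along these lines (the
>> #PBS resource line matches the numbers below, and the mpiexec line is the
>> one visible in the ps output; the rest is just a minimal sketch):
>>
>>   #!/bin/bash
>>   #PBS -l nodes=6:ppn=20,walltime=12:00:00
>>   cd $PBS_O_WORKDIR
>>   /share/apps/mvapich2/2.0rc1/intel/bin/mpiexec \
>>       /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI \
>>       -O -i mdin -o md.log -p prmtop -c inpcrd
>>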
>> I'm running a job with 120 MPI processes across 6 compute nodes, 20 cores
>> per node.  This is the compute resource usage reported by Torque:
>> 
>> Aborted by PBS Server
>> Job exceeded its walltime limit. Job was aborted
>> See Administrator for help
>> Exit_status=-11
>> resources_used.cput=239:36:39
>> resources_used.mem=1984640kb
>> resources_used.vmem=8092716kb
>> resources_used.walltime=12:00:16
>> 
>> The walltime is 12 hours, but the reported CPU time is only about 240
>> hours, which is just the sum for the 20 processes on the first node.
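>>
>> As a rough check against those numbers:
>>
>>   20 cores  x 12 hours ~  240 CPU-hours  (matches resources_used.cput=239:36:39)
>>   120 cores x 12 hours = 1440 CPU-hours  (what full accounting of all 6 nodes would show)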
>> 
>> Open MPI can be tightly integrated with Torque, so Torque reports the
>> total CPU time and memory usage from all of the compute nodes.  I'm not
>> sure whether MVAPICH2 has similar integration with Torque.
>> 
>> Best,
>> 
>> Shenglong
>> 
>> On Apr 29, 2014, at 9:18 AM, Jonathan Perkins <perkinjo at cse.ohio-state.edu>
>> wrote:
>> 
>> Hello.  I believe that this is already available when using the hydra
>> process manager (i.e., mpiexec or mpiexec.hydra).  Are you using this
>> launcher within your Torque environment?  If this isn't working, it
>> may be a matter of the Torque development files not being found
>> when MVAPICH2 was compiled.  Also, please tell us which version of
>> MVAPICH2 you're using.
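>>
>> If the Torque development files (tm.h and the matching library) were not
>> visible at configure time, the PBS/Torque-related configure options for a
>> given MVAPICH2 release can be discovered with something like the following
>> (option names vary between releases, so this is a discovery step rather
>> than a recipe), run from the mvapich2 source tree:
>>
>>   ./configure --help | grep -i -e pbs -e torque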
>> 
>> On Tue, Apr 29, 2014 at 9:07 AM, Shenglong Wang <sw77 at nyu.edu> wrote:
>> 
>> 
>> Hi,
>> 
>> Is it possible to tightly integrate MVAPICH2 with Torque to get the correct
>> total CPU time and memory usage from all the compute nodes?
>> 
>> Best,
>> 
>> Shenglong
>> 
>> 
>> --
>> Jonathan Perkins
>> http://www.cse.ohio-state.edu/~perkinjo
>> 
>> 
> 
> 
> 
> -- 
> Jonathan Perkins
> http://www.cse.ohio-state.edu/~perkinjo
> 





