[mvapich-discuss] mvapich2 integrate with Torque

Jonathan Perkins perkinjo at cse.ohio-state.edu
Tue Apr 29 12:21:16 EDT 2014


Thanks for providing the output.  It does look like mpiexec is using
the ssh launcher instead of the pbs launcher.  That is most likely why
Torque is only reporting the resources used on the first node.  I
suggest downloading the standalone hydra package and trying to
configure it with torque support.

It is available at
http://www.mpich.org/static/downloads/3.1/hydra-3.1.tar.gz.  I'm not
sure where your tm.h and libtorque.so files are located, but once you
find them, make sure those directories are either /usr/include and
/usr/lib[64] or are added to CPPFLAGS and LDFLAGS, respectively, when
you configure hydra.
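
For example, a rough sketch (the torque include/lib paths and the
install prefix below are only placeholders; substitute wherever tm.h
and libtorque.so actually live on your system):

    wget http://www.mpich.org/static/downloads/3.1/hydra-3.1.tar.gz
    tar xzf hydra-3.1.tar.gz
    cd hydra-3.1

    # point the build at the torque headers and libraries (placeholder paths)
    ./configure --prefix=$HOME/hydra-3.1-tm \
        CPPFLAGS="-I/opt/torque/include" \
        LDFLAGS="-L/opt/torque/lib64"

    make && make install

If configure picks up tm, the resulting mpiexec should start the remote
proxies through the tm interface (under pbs_mom) rather than over ssh,
so Torque can account for all of the nodes.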

Please let us know if this helps.

On Tue, Apr 29, 2014 at 11:25 AM, Shenglong Wang <sw77 at nyu.edu> wrote:
>
> This is the output from the first node
>
> [sw77 at compute-14-2 ~]$ ps xf -u sw77
>   PID TTY      STAT   TIME COMMAND
> 25410 ?        Ss     0:00 -bash
> 25436 ?        Sl     0:00  \_ pbs_demux
> 25484 ?        S      0:00  \_ /bin/bash /opt/torque/mom_priv/jobs/1580.soho.es.its.nyu.edu.SC
> 25490 ?        S      0:00      \_ /share/apps/mvapich2/2.0rc1/intel/bin/mpiexec /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 25491 ?        Ss     0:00          \_ /share/apps/mvapich2/2.0rc1/intel/bin/hydra_pmi_proxy --control-port compute-14-2.local:41010 --rmk pbs --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 0
> 25494 ?        RLsl   6:40          |   \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 25495 ?        RLsl   6:41          |   \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 25496 ?        RLsl   6:41          |   \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 25497 ?        RLsl   6:41          |   \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 25498 ?        RLsl   6:41          |   \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 25499 ?        RLsl   6:41          |   \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 25500 ?        RLsl   6:41          |   \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 25502 ?        RLsl   6:41          |   \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 25503 ?        RLsl   6:41          |   \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 25504 ?        RLsl   6:41          |   \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 25505 ?        RLsl   6:41          |   \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 25506 ?        RLsl   6:41          |   \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 25507 ?        RLsl   6:41          |   \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 25508 ?        RLsl   6:41          |   \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 25509 ?        RLsl   6:41          |   \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 25510 ?        RLsl   6:41          |   \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 25511 ?        RLsl   6:41          |   \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 25512 ?        RLsl   6:41          |   \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 25513 ?        RLsl   6:41          |   \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 25514 ?        RLsl   6:41          |   \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 25492 ?        Ss     0:00          \_ /usr/bin/ssh -x compute-14-3.local "/share/apps/mvapich2/2.0rc1/intel/bin/hydra_pmi_proxy" --control-port compute-14-2.local:41010 --rmk pbs --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 1
> 25493 ?        Ss     0:00          \_ /usr/bin/ssh -x compute-14-4.local "/share/apps/mvapich2/2.0rc1/intel/bin/hydra_pmi_proxy" --control-port compute-14-2.local:41010 --rmk pbs --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 2
> 25544 ?        S      0:00 sshd: sw77 at pts/0
> 25549 pts/0    Ss     0:00  \_ -bash
> 25932 pts/0    R+     0:00      \_ ps xf -u sw77
> [sw77 at compute-14-2 ~]$
>
>
> the output from 2nd node
>
> [sw77 at compute-14-3 ~]$ ps xf -u sw77
>   PID TTY      STAT   TIME COMMAND
> 44090 ?        S      0:00 sshd: sw77 at pts/0
> 44095 pts/0    Ss     0:00  \_ -bash
> 44444 pts/0    R+     0:00      \_ ps xf -u sw77
> 43926 ?        S      0:00 sshd: sw77 at notty
> 43927 ?        Ss     0:00  \_ /share/apps/mvapich2/2.0rc1/intel/bin/hydra_pmi_proxy --control-port compute-14-2.local:41010 --rmk pbs --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 1
> 43978 ?        RLsl   7:08      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 43979 ?        RLsl   7:08      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 43980 ?        RLsl   7:08      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 43981 ?        RLsl   7:08      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 43982 ?        RLsl   7:08      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 43983 ?        RLsl   7:08      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 43984 ?        RLsl   7:08      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 43985 ?        RLsl   7:08      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 43986 ?        RLsl   7:08      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 43987 ?        RLsl   7:08      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 43988 ?        RLsl   7:08      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 43989 ?        RLsl   7:08      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 43990 ?        RLsl   7:08      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 43991 ?        RLsl   7:08      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 43992 ?        RLsl   7:08      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 43993 ?        RLsl   7:08      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 43994 ?        RLsl   7:08      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 43995 ?        RLsl   7:08      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 43996 ?        RLsl   7:08      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 43998 ?        RLsl   7:08      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> [sw77 at compute-14-3 ~]$
>
>
> the 3rd node
>
> [sw77 at compute-14-4 ~]$  ps xf -u sw77
>   PID TTY      STAT   TIME COMMAND
> 18784 ?        S      0:00 sshd: sw77 at pts/0
> 18789 pts/0    Ss     0:00  \_ -bash
> 18845 pts/0    R+     0:00      \_ ps xf -u sw77
> 18328 ?        S      0:00 sshd: sw77 at notty
> 18329 ?        Ss     0:00  \_ /share/apps/mvapich2/2.0rc1/intel/bin/hydra_pmi_proxy --control-port compute-14-2.local:41010 --rmk pbs --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 2
> 18380 ?        RLsl   7:37      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 18381 ?        RLsl   7:37      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 18382 ?        RLsl   7:37      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 18383 ?        RLsl   7:36      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 18384 ?        RLsl   7:36      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 18385 ?        RLsl   7:37      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 18386 ?        RLsl   7:37      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 18387 ?        RLsl   7:37      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 18388 ?        RLsl   7:37      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 18389 ?        RLsl   7:37      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 18390 ?        RLsl   7:37      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 18391 ?        RLsl   7:37      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 18392 ?        RLsl   7:37      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 18393 ?        RLsl   7:37      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 18394 ?        RLsl   7:37      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 18395 ?        RLsl   7:36      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 18396 ?        RLsl   7:37      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 18397 ?        RLsl   7:37      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 18398 ?        RLsl   7:37      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 18399 ?        RLsl   7:37      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> [sw77 at compute-14-4 ~]$
>
>
> Best,
>
> Shenglong
>
>
> On Apr 29, 2014, at 10:08 AM, Jonathan Perkins <perkinjo at cse.ohio-state.edu> wrote:
>
>> Thanks for the report.  It's possible that this reporting is due to an
>> outstanding issue with hydra and torque(pbs) integration
>> (https://trac.mpich.org/projects/mpich/ticket/1812#no1).  Can you send
>> us the relevant output of ps axf from each node as the job is running
>> to help verify?
>>
>> On Tue, Apr 29, 2014 at 9:49 AM, Shenglong Wang <sw77 at nyu.edu> wrote:
>>>
>>> Hi Jonathan,
>>>
>>> Thanks a lot for the reply. I'm running mvapich2 2.0rc1 and using mpiexec to
>>> launch the MPI processes.
>>>
>>> I'm running a job with 120 MPI processes on 6 compute nodes, 20 cores per node.
>>> This is the compute resource usage reported from Torque
>>>
>>> Aborted by PBS Server
>>> Job exceeded its walltime limit. Job was aborted
>>> See Administrator for help
>>> Exit_status=-11
>>> resources_used.cput=239:36:39
>>> resources_used.mem=1984640kb
>>> resources_used.vmem=8092716kb
>>> resources_used.walltime=12:00:16
>>>
>>> The walltime is 12 hours and the CPU time is about 240 hours, which is only
>>> the sum from the first node (12 hours x 20 cores); the full 120-core job
>>> would be closer to 1440 CPU-hours.
>>>
>>> OpenMPI can be tightly integrated with Torque, in which case Torque reports
>>> the total CPU time and memory usage from all the compute nodes. I'm not sure
>>> if MVAPICH2 has a similar integration with Torque.
>>>
>>> Best,
>>>
>>> Shenglong
>>>
>>> On Apr 29, 2014, at 9:18 AM, Jonathan Perkins <perkinjo at cse.ohio-state.edu>
>>> wrote:
>>>
>>> Hello.  I believe that this is already available when using the hydra
>>> process manager (i.e., mpiexec or mpiexec.hydra).  Are you using this
>>> launcher within your torque environment?  If this isn't working then
>>> it may be a matter of the torque development files not being found
>>> when mvapich2 was compiled.  Also, please tell us which version of
>>> MVAPICH2 you're using.
>>>
>>> On Tue, Apr 29, 2014 at 9:07 AM, Shenglong Wang <sw77 at nyu.edu> wrote:
>>>
>>>
>>> Hi,
>>>
>>> Is it possible to tightly integrate MVAPICH2 with Torque to get the correct
>>> total CPU time and memory usage from all the compute nodes?
>>>
>>> Best,
>>>
>>> Shenglong
>>>
>>>
>>> _______________________________________________
>>> mvapich-discuss mailing list
>>> mvapich-discuss at cse.ohio-state.edu
>>> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>>
>>>
>>>
>>>
>>> --
>>> Jonathan Perkins
>>> http://www.cse.ohio-state.edu/~perkinjo
>>>
>>>
>>
>>
>>
>> --
>> Jonathan Perkins
>> http://www.cse.ohio-state.edu/~perkinjo
>>
>
>
>



-- 
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo


