[mvapich-discuss] mvapich2 integrate with Torque

Jonathan Perkins perkinjo at cse.ohio-state.edu
Tue Apr 29 14:04:49 EDT 2014


Thanks for the note.  I'm glad things are working correctly for you now.

On Tue, Apr 29, 2014 at 1:49 PM, Shenglong Wang <sw77 at nyu.edu> wrote:
>
> It turns out hydra was not configured correctly. I rebuilt hydra 3.1 with Torque integration, and now the compute resource usage report is correct.
>
> Thanks a lot for the help.
>
> Best,
>
> Shenglong
>
> Job exceeded its walltime limit. Job was aborted
> See Administrator for help
> Exit_status=-11
> resources_used.cput=05:07:18
> resources_used.mem=5414208kb
> resources_used.vmem=22891256kb
> resources_used.walltime=00:05:31
>
> The wall time is about 5.5 minutes with 60 MPI processes, so the expected total CPU time is about 330 minutes; Torque now reports 5 hours and 7 minutes, which matches.
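>
> (Side note: while a job is still running, the same counters can be checked
> live from Torque, for example:)
>
>     qstat -f 1587.soho.es.its.nyu.edu | grep resources_used   # job id taken from the ps output below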
>
> [sw77 at compute-14-4 ~]$ ps xf -u sw77
>   PID TTY      STAT   TIME COMMAND
> 25752 ?        Ss     0:00 /share/apps/hydra/3.1/intel/bin/hydra_pmi_proxy --control-port compute-14-4.local:32901 --rmk pbs --launcher pbs --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 0
> 25753 ?        RLsl   2:11  \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 25754 ?        RLsl   2:12  \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 25755 ?        RLsl   2:12  \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 25756 ?        RLsl   2:12  \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 25757 ?        RLsl   2:12  \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 25758 ?        RLsl   2:12  \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 25759 ?        RLsl   2:12  \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 25760 ?        RLsl   2:12  \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 25761 ?        RLsl   2:12  \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 25762 ?        RLsl   2:12  \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 25763 ?        RLsl   2:12  \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 25764 ?        RLsl   2:12  \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 25765 ?        RLsl   2:12  \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 25766 ?        RLsl   2:12  \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 25767 ?        RLsl   2:12  \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 25768 ?        RLsl   2:12  \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 25769 ?        RLsl   2:12  \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 25770 ?        RLsl   2:12  \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 25771 ?        RLsl   2:12  \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 25772 ?        RLsl   2:12  \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 25647 ?        Ss     0:00 -bash
> 25673 ?        Sl     0:00  \_ pbs_demux
> 25723 ?        S      0:00  \_ /bin/bash /opt/torque/mom_priv/jobs/1587.soho.es.its.nyu.edu.SC
> 25751 ?        S      0:00      \_ /share/apps/hydra/3.1/intel/bin/mpiexec -rmk pbs /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 25842 ?        S      0:00 sshd: sw77 at pts/0
> 25847 pts/0    Ss     0:00  \_ -bash
> 25984 pts/0    R+     0:00      \_ ps xf -u sw77
> [sw77 at compute-14-4 ~]$
>
> [sw77 at compute-14-5 ~]$ ps xf -u sw77
>   PID TTY      STAT   TIME COMMAND
> 47642 ?        S      0:00 sshd: sw77 at pts/0
> 47647 pts/0    Ss     0:00  \_ -bash
> 47769 pts/0    R+     0:00      \_ ps xf -u sw77
> 47503 ?        Ss     0:00 /share/apps/hydra/3.1/intel/bin/hydra_pmi_proxy --control-port compute-14-4.local:32901 --rmk pbs --launcher pbs --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 1
> 47511 ?        RLsl   2:30  \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 47512 ?        RLsl   2:30  \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 47513 ?        RLsl   2:30  \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 47514 ?        RLsl   2:30  \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 47515 ?        RLsl   2:30  \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 47516 ?        RLsl   2:30  \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 47517 ?        RLsl   2:30  \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 47518 ?        RLsl   2:30  \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 47519 ?        RLsl   2:29  \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 47520 ?        RLsl   2:30  \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 47521 ?        RLsl   2:30  \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 47522 ?        RLsl   2:30  \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 47523 ?        RLsl   2:30  \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 47524 ?        RLsl   2:29  \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 47525 ?        RLsl   2:30  \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 47526 ?        RLsl   2:30  \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 47527 ?        RLsl   2:30  \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 47528 ?        RLsl   2:29  \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 47529 ?        RLsl   2:30  \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 47530 ?        RLsl   2:30  \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> [sw77 at compute-14-5 ~]$
>
> [sw77 at compute-14-6 ~]$ ps xf -u sw77
>   PID TTY      STAT   TIME COMMAND
> 32865 ?        S      0:00 sshd: sw77 at pts/0
> 32870 pts/0    Ss     0:00  \_ -bash
> 32978 pts/0    R+     0:00      \_ ps xf -u sw77
> 32742 ?        Ss     0:00 /share/apps/hydra/3.1/intel/bin/hydra_pmi_proxy --control-port compute-14-4.local:32901 --rmk pbs --launcher pbs --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 2
> 32750 ?        RLsl   2:04  \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 32751 ?        RLsl   2:05  \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 32752 ?        RLsl   2:05  \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 32753 ?        RLsl   2:05  \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 32754 ?        RLsl   2:05  \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 32755 ?        RLsl   2:05  \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 32756 ?        RLsl   2:05  \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 32757 ?        RLsl   2:05  \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 32758 ?        RLsl   2:05  \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 32759 ?        RLsl   2:05  \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 32760 ?        RLsl   2:05  \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 32761 ?        RLsl   2:05  \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 32762 ?        RLsl   2:05  \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 32763 ?        RLsl   2:05  \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 32764 ?        RLsl   2:05  \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 32765 ?        RLsl   2:05  \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 32766 ?        RLsl   2:05  \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 32767 ?        RLsl   2:05  \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 32768 ?        RLsl   2:05  \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> 32769 ?        RLsl   2:04  \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
> [sw77 at compute-14-6 ~]$
>
>
> On Apr 29, 2014, at 1:32 PM, Shenglong Wang <sw77 at nyu.edu> wrote:
>
>>
>> Hi Jonathan,
>>
>> Thanks for the suggestions.
>>
>> I rebuilt hydra 3.1 with Torque linked
>>
>> [sw77 at compute-14-1 ~]$ ldd -r /share/apps/hydra/3.1/intel/bin/mpiexec
>>        linux-vdso.so.1 =>  (0x00007ffffe7ff000)
>>        libmpl.so.1 => /share/apps/hydra/3.1/intel/lib/libmpl.so.1 (0x00002ad9d8a8b000)
>>        libnsl.so.1 => /lib64/libnsl.so.1 (0x00000036dde00000)
>>        libcr.so.0 => /usr/lib64/libcr.so.0 (0x00000036dc200000)
>>        libdl.so.2 => /lib64/libdl.so.2 (0x00000036dba00000)
>>        libtorque.so.2 => /opt/torque/lib/libtorque.so.2 (0x00002ad9d8ca4000)
>>        libxml2.so.2 => /share/apps/libxml2/2.9.1/intel/lib/libxml2.so.2 (0x00002ad9d9575000)
>>        libz.so.1 => /share/apps/zlib/1.2.8/intel/lib/libz.so.1 (0x00002ad9d9a8e000)
>>        libcrypto.so.10 => /usr/lib64/libcrypto.so.10 (0x00000036e3200000)
>>        libssl.so.10 => /usr/lib64/libssl.so.10 (0x00000036e4a00000)
>>        libpthread.so.0 => /lib64/libpthread.so.0 (0x00000036dbe00000)
>>        librt.so.1 => /lib64/librt.so.1 (0x00000036dca00000)
>>        libimf.so => /share/apps/intel/14.0.2/composer_xe_2013_sp1.2.144/compiler/lib/intel64/libimf.so (0x00002ad9d9ca9000)
>>        libsvml.so => /share/apps/intel/14.0.2/composer_xe_2013_sp1.2.144/compiler/lib/intel64/libsvml.so (0x00002ad9da16d000)
>>        libirng.so => /share/apps/intel/14.0.2/composer_xe_2013_sp1.2.144/compiler/lib/intel64/libirng.so (0x00002ad9dad68000)
>>        libm.so.6 => /lib64/libm.so.6 (0x00002ad9daf6f000)
>>        libiomp5.so => /share/apps/intel/14.0.2/composer_xe_2013_sp1.2.144/compiler/lib/intel64/libiomp5.so (0x00002ad9db1f4000)
>>        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00002ad9db50c000)
>>        libintlc.so.5 => /share/apps/intel/14.0.2/composer_xe_2013_sp1.2.144/compiler/lib/intel64/libintlc.so.5 (0x00002ad9db722000)
>>        libc.so.6 => /lib64/libc.so.6 (0x00000036db600000)
>>        /lib64/ld-linux-x86-64.so.2 (0x00000036db200000)
>>        libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x00000036dee00000)
>>        libgssapi_krb5.so.2 => /lib64/libgssapi_krb5.so.2 (0x00000036e2600000)
>>        libkrb5.so.3 => /lib64/libkrb5.so.3 (0x00000036e3600000)
>>        libcom_err.so.2 => /lib64/libcom_err.so.2 (0x00000036e0200000)
>>        libk5crypto.so.3 => /lib64/libk5crypto.so.3 (0x00000036e2200000)
>>        libkrb5support.so.0 => /lib64/libkrb5support.so.0 (0x00000036e4200000)
>>        libkeyutils.so.1 => /lib64/libkeyutils.so.1 (0x00000036e2e00000)
>>        libresolv.so.2 => /lib64/libresolv.so.2 (0x00000036dda00000)
>>        libselinux.so.1 => /lib64/libselinux.so.1 (0x00000036dce00000)
>> [sw77 at compute-14-1 ~]$
>>
>> but it still uses the ssh launcher, and Torque only reports the CPU time from the first node:
>>
>> [sw77 at compute-14-1 ~]$ ps xf -u sw77
>>  PID TTY      STAT   TIME COMMAND
>> 17864 ?        Ss     0:00 -bash
>> 17876 ?        Sl     0:00  \_ pbs_demux
>> 17926 ?        S      0:00  \_ /bin/bash /opt/torque/mom_priv/jobs/1585.soho.es.its.nyu.edu.SC
>> 17932 ?        S      0:00      \_ /share/apps/hydra/3.1/intel/bin/mpiexec -rmk pbs /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>> 17933 ?        Ss     0:00          \_ /share/apps/hydra/3.1/intel/bin/hydra_pmi_proxy --control-port compute-14-1.local:60085 --rmk pbs --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 0
>> 17936 ?        RLsl   0:07          |   \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>> 17937 ?        RLsl   0:07          |   \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>> 17938 ?        RLsl   0:07          |   \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>> 17939 ?        RLsl   0:07          |   \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>> 17940 ?        RLsl   0:07          |   \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>> 17941 ?        RLsl   0:07          |   \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>> 17942 ?        RLsl   0:06          |   \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>> 17943 ?        RLsl   0:06          |   \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>> 17944 ?        RLsl   0:07          |   \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>> 17945 ?        RLsl   0:07          |   \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>> 17946 ?        RLsl   0:07          |   \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>> 17947 ?        RLsl   0:07          |   \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>> 17948 ?        RLsl   0:07          |   \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>> 17949 ?        RLsl   0:07          |   \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>> 17950 ?        RLsl   0:06          |   \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>> 17951 ?        RLsl   0:06          |   \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>> 17952 ?        RLsl   0:06          |   \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>> 17953 ?        RLsl   0:07          |   \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>> 17954 ?        RLsl   0:07          |   \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>> 17955 ?        RLsl   0:07          |   \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>> 17934 ?        Ss     0:00          \_ /usr/bin/ssh -x compute-14-2.local "/share/apps/hydra/3.1/intel/bin/hydra_pmi_proxy" --control-port compute-14-1.local:60085 --rmk pbs --launcher ssh --demux poll --pgid 0 --retries 1
>> 17935 ?        Ss     0:00          \_ /usr/bin/ssh -x compute-14-3.local "/share/apps/hydra/3.1/intel/bin/hydra_pmi_proxy" --control-port compute-14-1.local:60085 --rmk pbs --launcher ssh --demux poll --pgid 0 --retries 1
>> 17403 ?        S      0:00 sshd: sw77 at pts/0
>> 17408 pts/0    Ss     0:00  \_ -bash
>> 17980 pts/0    R+     0:00      \_ ps xf -u sw77
>> [sw77 at compute-14-1 ~]$
>>
>>
>> Job exceeded its walltime limit. Job was aborted
>> See Administrator for help
>> Exit_status=-11
>> resources_used.cput=01:40:16
>> resources_used.mem=1833964kb
>> resources_used.vmem=8126520kb
>> resources_used.walltime=00:05:09
>>
>> There are 20 cores per node, so the CPU time is 5 x 20 ~ 100 minutes, from the first node only.
>>
>> Are we still missing something on the Torque side?
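>>
>> (One check that might narrow this down, assuming the rebuilt mpiexec is the
>> one on PATH: list the launchers hydra was built with and try forcing the
>> pbs launcher explicitly, for example:)
>>
>>     mpiexec -info | grep -i launch
>>     mpiexec -launcher pbs -rmk pbs /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>>
>> (Setting HYDRA_LAUNCHER=pbs in the environment should have the same effect
>> as the -launcher option.)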
>>
>> Thanks.
>>
>> Shenglong
>>
>>
>> On Apr 29, 2014, at 12:21 PM, Jonathan Perkins <perkinjo at cse.ohio-state.edu> wrote:
>>
>>> Thanks for providing the output.  It does look like mpiexec is using
>>> the ssh launcher instead of the pbs launcher, which is likely why you
>>> only see the resources used for the first node.  I suggest downloading
>>> the standalone hydra package and trying to configure it with Torque
>>> support.
>>>
>>> It is available at
>>> http://www.mpich.org/static/downloads/3.1/hydra-3.1.tar.gz.  I'm not
>>> sure where your tm.h and libtorque.so files are located, but once you
>>> find them, make sure their directories are either the standard
>>> /usr/include and /usr/lib[64] locations or are added to CPPFLAGS and
>>> LDFLAGS, respectively, when you configure hydra.
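>>>
>>> (As a rough sketch, with example paths that should be adjusted to wherever
>>> tm.h and libtorque.so actually live on your system, the build could look
>>> like this:)
>>>
>>>     # install prefix and Torque paths below are examples only
>>>     tar xzf hydra-3.1.tar.gz && cd hydra-3.1
>>>     ./configure --prefix=/share/apps/hydra/3.1/intel \
>>>                 CPPFLAGS="-I/opt/torque/include" \
>>>                 LDFLAGS="-L/opt/torque/lib -Wl,-rpath,/opt/torque/lib"
>>>     make && make install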
>>>
>>> Please let us know if this helps.
>>>
>>> On Tue, Apr 29, 2014 at 11:25 AM, Shenglong Wang <sw77 at nyu.edu> wrote:
>>>>
>>>> This is the output from the first node
>>>>
>>>> [sw77 at compute-14-2 ~]$ ps xf -u sw77
>>>> PID TTY      STAT   TIME COMMAND
>>>> 25410 ?        Ss     0:00 -bash
>>>> 25436 ?        Sl     0:00  \_ pbs_demux
>>>> 25484 ?        S      0:00  \_ /bin/bash /opt/torque/mom_priv/jobs/1580.soho.es.its.nyu.edu.SC
>>>> 25490 ?        S      0:00      \_ /share/apps/mvapich2/2.0rc1/intel/bin/mpiexec /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>>>> 25491 ?        Ss     0:00          \_ /share/apps/mvapich2/2.0rc1/intel/bin/hydra_pmi_proxy --control-port compute-14-2.local:41010 --rmk pbs --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 0
>>>> 25494 ?        RLsl   6:40          |   \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>>>> 25495 ?        RLsl   6:41          |   \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>>>> 25496 ?        RLsl   6:41          |   \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>>>> 25497 ?        RLsl   6:41          |   \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>>>> 25498 ?        RLsl   6:41          |   \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>>>> 25499 ?        RLsl   6:41          |   \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>>>> 25500 ?        RLsl   6:41          |   \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>>>> 25502 ?        RLsl   6:41          |   \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>>>> 25503 ?        RLsl   6:41          |   \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>>>> 25504 ?        RLsl   6:41          |   \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>>>> 25505 ?        RLsl   6:41          |   \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>>>> 25506 ?        RLsl   6:41          |   \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>>>> 25507 ?        RLsl   6:41          |   \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>>>> 25508 ?        RLsl   6:41          |   \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>>>> 25509 ?        RLsl   6:41          |   \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>>>> 25510 ?        RLsl   6:41          |   \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>>>> 25511 ?        RLsl   6:41          |   \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>>>> 25512 ?        RLsl   6:41          |   \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>>>> 25513 ?        RLsl   6:41          |   \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>>>> 25514 ?        RLsl   6:41          |   \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>>>> 25492 ?        Ss     0:00          \_ /usr/bin/ssh -x compute-14-3.local "/share/apps/mvapich2/2.0rc1/intel/bin/hydra_pmi_proxy" --control-port compute-14-2.local:41010 --rmk pbs --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 1
>>>> 25493 ?        Ss     0:00          \_ /usr/bin/ssh -x compute-14-4.local "/share/apps/mvapich2/2.0rc1/intel/bin/hydra_pmi_proxy" --control-port compute-14-2.local:41010 --rmk pbs --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 2
>>>> 25544 ?        S      0:00 sshd: sw77 at pts/0
>>>> 25549 pts/0    Ss     0:00  \_ -bash
>>>> 25932 pts/0    R+     0:00      \_ ps xf -u sw77
>>>> [sw77 at compute-14-2 ~]$
>>>>
>>>>
>>>> the output from 2nd node
>>>>
>>>> [sw77 at compute-14-3 ~]$ ps xf -u sw77
>>>> PID TTY      STAT   TIME COMMAND
>>>> 44090 ?        S      0:00 sshd: sw77 at pts/0
>>>> 44095 pts/0    Ss     0:00  \_ -bash
>>>> 44444 pts/0    R+     0:00      \_ ps xf -u sw77
>>>> 43926 ?        S      0:00 sshd: sw77 at notty
>>>> 43927 ?        Ss     0:00  \_ /share/apps/mvapich2/2.0rc1/intel/bin/hydra_pmi_proxy --control-port compute-14-2.local:41010 --rmk pbs --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 1
>>>> 43978 ?        RLsl   7:08      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>>>> 43979 ?        RLsl   7:08      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>>>> 43980 ?        RLsl   7:08      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>>>> 43981 ?        RLsl   7:08      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>>>> 43982 ?        RLsl   7:08      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>>>> 43983 ?        RLsl   7:08      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>>>> 43984 ?        RLsl   7:08      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>>>> 43985 ?        RLsl   7:08      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>>>> 43986 ?        RLsl   7:08      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>>>> 43987 ?        RLsl   7:08      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>>>> 43988 ?        RLsl   7:08      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>>>> 43989 ?        RLsl   7:08      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>>>> 43990 ?        RLsl   7:08      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>>>> 43991 ?        RLsl   7:08      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>>>> 43992 ?        RLsl   7:08      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>>>> 43993 ?        RLsl   7:08      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>>>> 43994 ?        RLsl   7:08      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>>>> 43995 ?        RLsl   7:08      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>>>> 43996 ?        RLsl   7:08      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>>>> 43998 ?        RLsl   7:08      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>>>> [sw77 at compute-14-3 ~]$
>>>>
>>>>
>>>> the 3rd node
>>>>
>>>> [sw77 at compute-14-4 ~]$  ps xf -u sw77
>>>> PID TTY      STAT   TIME COMMAND
>>>> 18784 ?        S      0:00 sshd: sw77 at pts/0
>>>> 18789 pts/0    Ss     0:00  \_ -bash
>>>> 18845 pts/0    R+     0:00      \_ ps xf -u sw77
>>>> 18328 ?        S      0:00 sshd: sw77 at notty
>>>> 18329 ?        Ss     0:00  \_ /share/apps/mvapich2/2.0rc1/intel/bin/hydra_pmi_proxy --control-port compute-14-2.local:41010 --rmk pbs --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 2
>>>> 18380 ?        RLsl   7:37      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>>>> 18381 ?        RLsl   7:37      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>>>> 18382 ?        RLsl   7:37      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>>>> 18383 ?        RLsl   7:36      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>>>> 18384 ?        RLsl   7:36      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>>>> 18385 ?        RLsl   7:37      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>>>> 18386 ?        RLsl   7:37      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>>>> 18387 ?        RLsl   7:37      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>>>> 18388 ?        RLsl   7:37      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>>>> 18389 ?        RLsl   7:37      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>>>> 18390 ?        RLsl   7:37      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>>>> 18391 ?        RLsl   7:37      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>>>> 18392 ?        RLsl   7:37      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>>>> 18393 ?        RLsl   7:37      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>>>> 18394 ?        RLsl   7:37      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>>>> 18395 ?        RLsl   7:36      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>>>> 18396 ?        RLsl   7:37      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>>>> 18397 ?        RLsl   7:37      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>>>> 18398 ?        RLsl   7:37      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>>>> 18399 ?        RLsl   7:37      \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>>>> [sw77 at compute-14-4 ~]$
>>>>
>>>>
>>>> Best,
>>>>
>>>> Shenglong
>>>>
>>>>
>>>> On Apr 29, 2014, at 10:08 AM, Jonathan Perkins <perkinjo at cse.ohio-state.edu> wrote:
>>>>
>>>>> Thanks for the report.  It's possible that this reporting is due to an
>>>>> outstanding issue with hydra and Torque (PBS) integration
>>>>> (https://trac.mpich.org/projects/mpich/ticket/1812#no1).  Can you send
>>>>> us the relevant output of ps axf from each node while the job is
>>>>> running to help verify?
>>>>>
>>>>> On Tue, Apr 29, 2014 at 9:49 AM, Shenglong Wang <sw77 at nyu.edu> wrote:
>>>>>>
>>>>>> Hi Jonathan,
>>>>>>
>>>>>> Thanks a lot for the reply. I'm running mvapich2 2.0rc1 and using mpiexec to
>>>>>> launch the MPI processes.
>>>>>>
>>>>>> I'm running a job with 120 MPI processes across 6 compute nodes, 20 cores per node.
>>>>>> This is the compute resource usage reported by Torque:
>>>>>>
>>>>>> Aborted by PBS Server
>>>>>> Job exceeded its walltime limit. Job was aborted
>>>>>> See Administrator for help
>>>>>> Exit_status=-11
>>>>>> resources_used.cput=239:36:39
>>>>>> resources_used.mem=1984640kb
>>>>>> resources_used.vmem=8092716kb
>>>>>> resources_used.walltime=12:00:16
>>>>>>
>>>>>> The wall time is 12 hours and the CPU time is about 240 hours, which is
>>>>>> the sum from the first node only (12 hours x 20 cores).
>>>>>>
>>>>>> Open MPI can be tightly integrated with Torque so that the total CPU time
>>>>>> and memory usage from all the compute nodes are reported. I'm not sure
>>>>>> whether MVAPICH2 has similar integration with Torque.
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Shenglong
>>>>>>
>>>>>> On Apr 29, 2014, at 9:18 AM, Jonathan Perkins <perkinjo at cse.ohio-state.edu>
>>>>>> wrote:
>>>>>>
>>>>>> Hello.  I believe that this is already available when using the hydra
>>>>>> process manager (i.e., mpiexec or mpiexec.hydra).  Are you using this
>>>>>> launcher within your Torque environment?  If this isn't working, it may
>>>>>> be that the Torque development files were not found when MVAPICH2 was
>>>>>> compiled.  Also, please tell us which version of MVAPICH2 you're using.
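>>>>>>
>>>>>> (A quick way to check whether Torque support was compiled in is to see
>>>>>> if mpiexec links against the Torque TM library, for example:)
>>>>>>
>>>>>>     ldd $(which mpiexec) | grep -i torque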
>>>>>>
>>>>>> On Tue, Apr 29, 2014 at 9:07 AM, Shenglong Wang <sw77 at nyu.edu> wrote:
>>>>>>
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Is it possible to tightly integrate MVAPICH2 with Torque to get the correct
>>>>>> total CPU time and memory usage from all the compute nodes?
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Shenglong
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> mvapich-discuss mailing list
>>>>>> mvapich-discuss at cse.ohio-state.edu
>>>>>> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Jonathan Perkins
>>>>>> http://www.cse.ohio-state.edu/~perkinjo
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Jonathan Perkins
>>>>> http://www.cse.ohio-state.edu/~perkinjo
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Jonathan Perkins
>>> http://www.cse.ohio-state.edu/~perkinjo
>>>
>>
>
>
>



-- 
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo



