[mvapich-discuss] mvapich2 integrate with Torque
Shenglong Wang
sw77 at nyu.edu
Tue Apr 29 13:32:18 EDT 2014
Hi Jonathan,
Thanks for the suggestions.
I rebuilt hydra 3.1 with libtorque linked in:
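For reference, the rebuild looked roughly like this (a sketch only; the Torque prefix /opt/torque and the install prefix reflect our local layout and are assumptions, adjust for your site):

```shell
# Sketch of rebuilding the standalone hydra 3.1 with Torque support.
# Paths below are local assumptions, not part of the hydra defaults.
tar xzf hydra-3.1.tar.gz
cd hydra-3.1
./configure --prefix=/share/apps/hydra/3.1/intel \
    CPPFLAGS="-I/opt/torque/include" \
    LDFLAGS="-L/opt/torque/lib"
make && make install

# Verify that mpiexec picked up libtorque:
ldd /share/apps/hydra/3.1/intel/bin/mpiexec | grep torque
```

The ldd output below confirms the resulting mpiexec does resolve libtorque.so.2.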
[sw77 at compute-14-1 ~]$ ldd -r /share/apps/hydra/3.1/intel/bin/mpiexec
linux-vdso.so.1 => (0x00007ffffe7ff000)
libmpl.so.1 => /share/apps/hydra/3.1/intel/lib/libmpl.so.1 (0x00002ad9d8a8b000)
libnsl.so.1 => /lib64/libnsl.so.1 (0x00000036dde00000)
libcr.so.0 => /usr/lib64/libcr.so.0 (0x00000036dc200000)
libdl.so.2 => /lib64/libdl.so.2 (0x00000036dba00000)
libtorque.so.2 => /opt/torque/lib/libtorque.so.2 (0x00002ad9d8ca4000)
libxml2.so.2 => /share/apps/libxml2/2.9.1/intel/lib/libxml2.so.2 (0x00002ad9d9575000)
libz.so.1 => /share/apps/zlib/1.2.8/intel/lib/libz.so.1 (0x00002ad9d9a8e000)
libcrypto.so.10 => /usr/lib64/libcrypto.so.10 (0x00000036e3200000)
libssl.so.10 => /usr/lib64/libssl.so.10 (0x00000036e4a00000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00000036dbe00000)
librt.so.1 => /lib64/librt.so.1 (0x00000036dca00000)
libimf.so => /share/apps/intel/14.0.2/composer_xe_2013_sp1.2.144/compiler/lib/intel64/libimf.so (0x00002ad9d9ca9000)
libsvml.so => /share/apps/intel/14.0.2/composer_xe_2013_sp1.2.144/compiler/lib/intel64/libsvml.so (0x00002ad9da16d000)
libirng.so => /share/apps/intel/14.0.2/composer_xe_2013_sp1.2.144/compiler/lib/intel64/libirng.so (0x00002ad9dad68000)
libm.so.6 => /lib64/libm.so.6 (0x00002ad9daf6f000)
libiomp5.so => /share/apps/intel/14.0.2/composer_xe_2013_sp1.2.144/compiler/lib/intel64/libiomp5.so (0x00002ad9db1f4000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00002ad9db50c000)
libintlc.so.5 => /share/apps/intel/14.0.2/composer_xe_2013_sp1.2.144/compiler/lib/intel64/libintlc.so.5 (0x00002ad9db722000)
libc.so.6 => /lib64/libc.so.6 (0x00000036db600000)
/lib64/ld-linux-x86-64.so.2 (0x00000036db200000)
libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x00000036dee00000)
libgssapi_krb5.so.2 => /lib64/libgssapi_krb5.so.2 (0x00000036e2600000)
libkrb5.so.3 => /lib64/libkrb5.so.3 (0x00000036e3600000)
libcom_err.so.2 => /lib64/libcom_err.so.2 (0x00000036e0200000)
libk5crypto.so.3 => /lib64/libk5crypto.so.3 (0x00000036e2200000)
libkrb5support.so.0 => /lib64/libkrb5support.so.0 (0x00000036e4200000)
libkeyutils.so.1 => /lib64/libkeyutils.so.1 (0x00000036e2e00000)
libresolv.so.2 => /lib64/libresolv.so.2 (0x00000036dda00000)
libselinux.so.1 => /lib64/libselinux.so.1 (0x00000036dce00000)
[sw77 at compute-14-1 ~]$
but it still uses the ssh launcher, and Torque only reports the CPU time from the first node:
[sw77 at compute-14-1 ~]$ ps xf -u sw77
PID TTY STAT TIME COMMAND
17864 ? Ss 0:00 -bash
17876 ? Sl 0:00 \_ pbs_demux
17926 ? S 0:00 \_ /bin/bash /opt/torque/mom_priv/jobs/1585.soho.es.its.nyu.edu.SC
17932 ? S 0:00 \_ /share/apps/hydra/3.1/intel/bin/mpiexec -rmk pbs /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
17933 ? Ss 0:00 \_ /share/apps/hydra/3.1/intel/bin/hydra_pmi_proxy --control-port compute-14-1.local:60085 --rmk pbs --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 0
17936 ? RLsl 0:07 | \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
17937 ? RLsl 0:07 | \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
17938 ? RLsl 0:07 | \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
17939 ? RLsl 0:07 | \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
17940 ? RLsl 0:07 | \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
17941 ? RLsl 0:07 | \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
17942 ? RLsl 0:06 | \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
17943 ? RLsl 0:06 | \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
17944 ? RLsl 0:07 | \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
17945 ? RLsl 0:07 | \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
17946 ? RLsl 0:07 | \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
17947 ? RLsl 0:07 | \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
17948 ? RLsl 0:07 | \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
17949 ? RLsl 0:07 | \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
17950 ? RLsl 0:06 | \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
17951 ? RLsl 0:06 | \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
17952 ? RLsl 0:06 | \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
17953 ? RLsl 0:07 | \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
17954 ? RLsl 0:07 | \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
17955 ? RLsl 0:07 | \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
17934 ? Ss 0:00 \_ /usr/bin/ssh -x compute-14-2.local "/share/apps/hydra/3.1/intel/bin/hydra_pmi_proxy" --control-port compute-14-1.local:60085 --rmk pbs --launcher ssh --demux poll --pgid 0 --retries 1
17935 ? Ss 0:00 \_ /usr/bin/ssh -x compute-14-3.local "/share/apps/hydra/3.1/intel/bin/hydra_pmi_proxy" --control-port compute-14-1.local:60085 --rmk pbs --launcher ssh --demux poll --pgid 0 --retries 1
17403 ? S 0:00 sshd: sw77 at pts/0
17408 pts/0 Ss 0:00 \_ -bash
17980 pts/0 R+ 0:00 \_ ps xf -u sw77
[sw77 at compute-14-1 ~]$
Job exceeded its walltime limit. Job was aborted
See Administrator for help
Exit_status=-11
resources_used.cput=01:40:16
resources_used.mem=1833964kb
resources_used.vmem=8126520kb
resources_used.walltime=00:05:09
With 20 cores per node, the reported CPU time is roughly 5 min x 20 ≈ 100 min, i.e. the first node only.
Are we missing something on the Torque side?
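The hydra_pmi_proxy command line above still shows `--launcher ssh`, so hydra detected PBS as the resource-management kernel (`--rmk pbs`) but kept ssh for launching. As an experiment I could try forcing the launcher explicitly; whether `pbs` is an accepted value depends on how hydra was configured, so this is just a guess:

```shell
# Hedged sketch: try to force hydra's native PBS/TM launcher instead of ssh.
# "pbs" is only a valid launcher if hydra was actually built with tm support.
mpiexec -launcher pbs \
    /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI \
    -O -i mdin -o md.log -p prmtop -c inpcrd

# Equivalently, via hydra's environment variable:
export HYDRA_LAUNCHER=pbs
mpiexec ./a.out
```

If hydra accepts this, the remote proxies should appear as children of pbs_mom rather than of ssh, which is what lets Torque account for their CPU and memory.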
Thanks.
Shenglong
On Apr 29, 2014, at 12:21 PM, Jonathan Perkins <perkinjo at cse.ohio-state.edu> wrote:
> Thanks for providing the output. It does look like mpiexec is using
> the ssh launcher instead of the pbs launcher. This should be the
> reason why you're only seeing the resources used for the first node.
> I suggest downloading the standalone hydra package and trying to
> configure it with torque support.
>
> It is available at
> http://www.mpich.org/static/downloads/3.1/hydra-3.1.tar.gz. I'm not
> sure where your tm.h and libtorque.so files are located but if you're
> able to locate them make sure these directories are either in
> /usr/include and /usr/lib[64] or added in CPPFLAGS and LDFLAGS
> respectively when you configure hydra.
>
> Please let us know if this helps.
>
> On Tue, Apr 29, 2014 at 11:25 AM, Shenglong Wang <sw77 at nyu.edu> wrote:
>>
>> This is the output from the first node
>>
>> [sw77 at compute-14-2 ~]$ ps xf -u sw77
>> PID TTY STAT TIME COMMAND
>> 25410 ? Ss 0:00 -bash
>> 25436 ? Sl 0:00 \_ pbs_demux
>> 25484 ? S 0:00 \_ /bin/bash /opt/torque/mom_priv/jobs/1580.soho.es.its.nyu.edu.SC
>> 25490 ? S 0:00 \_ /share/apps/mvapich2/2.0rc1/intel/bin/mpiexec /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>> 25491 ? Ss 0:00 \_ /share/apps/mvapich2/2.0rc1/intel/bin/hydra_pmi_proxy --control-port compute-14-2.local:41010 --rmk pbs --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 0
>> 25494 ? RLsl 6:40 | \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>> 25495 ? RLsl 6:41 | \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>> 25496 ? RLsl 6:41 | \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>> 25497 ? RLsl 6:41 | \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>> 25498 ? RLsl 6:41 | \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>> 25499 ? RLsl 6:41 | \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>> 25500 ? RLsl 6:41 | \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>> 25502 ? RLsl 6:41 | \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>> 25503 ? RLsl 6:41 | \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>> 25504 ? RLsl 6:41 | \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>> 25505 ? RLsl 6:41 | \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>> 25506 ? RLsl 6:41 | \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>> 25507 ? RLsl 6:41 | \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>> 25508 ? RLsl 6:41 | \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>> 25509 ? RLsl 6:41 | \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>> 25510 ? RLsl 6:41 | \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>> 25511 ? RLsl 6:41 | \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>> 25512 ? RLsl 6:41 | \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>> 25513 ? RLsl 6:41 | \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>> 25514 ? RLsl 6:41 | \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>> 25492 ? Ss 0:00 \_ /usr/bin/ssh -x compute-14-3.local "/share/apps/mvapich2/2.0rc1/intel/bin/hydra_pmi_proxy" --control-port compute-14-2.local:41010 --rmk pbs --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 1
>> 25493 ? Ss 0:00 \_ /usr/bin/ssh -x compute-14-4.local "/share/apps/mvapich2/2.0rc1/intel/bin/hydra_pmi_proxy" --control-port compute-14-2.local:41010 --rmk pbs --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 2
>> 25544 ? S 0:00 sshd: sw77 at pts/0
>> 25549 pts/0 Ss 0:00 \_ -bash
>> 25932 pts/0 R+ 0:00 \_ ps xf -u sw77
>> [sw77 at compute-14-2 ~]$
>>
>>
>> the output from 2nd node
>>
>> [sw77 at compute-14-3 ~]$ ps xf -u sw77
>> PID TTY STAT TIME COMMAND
>> 44090 ? S 0:00 sshd: sw77 at pts/0
>> 44095 pts/0 Ss 0:00 \_ -bash
>> 44444 pts/0 R+ 0:00 \_ ps xf -u sw77
>> 43926 ? S 0:00 sshd: sw77 at notty
>> 43927 ? Ss 0:00 \_ /share/apps/mvapich2/2.0rc1/intel/bin/hydra_pmi_proxy --control-port compute-14-2.local:41010 --rmk pbs --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 1
>> 43978 ? RLsl 7:08 \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>> 43979 ? RLsl 7:08 \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>> 43980 ? RLsl 7:08 \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>> 43981 ? RLsl 7:08 \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>> 43982 ? RLsl 7:08 \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>> 43983 ? RLsl 7:08 \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>> 43984 ? RLsl 7:08 \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>> 43985 ? RLsl 7:08 \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>> 43986 ? RLsl 7:08 \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>> 43987 ? RLsl 7:08 \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>> 43988 ? RLsl 7:08 \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>> 43989 ? RLsl 7:08 \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>> 43990 ? RLsl 7:08 \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>> 43991 ? RLsl 7:08 \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>> 43992 ? RLsl 7:08 \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>> 43993 ? RLsl 7:08 \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>> 43994 ? RLsl 7:08 \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>> 43995 ? RLsl 7:08 \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>> 43996 ? RLsl 7:08 \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>> 43998 ? RLsl 7:08 \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>> [sw77 at compute-14-3 ~]$
>>
>>
>> the 3rd node
>>
>> [sw77 at compute-14-4 ~]$ ps xf -u sw77
>> PID TTY STAT TIME COMMAND
>> 18784 ? S 0:00 sshd: sw77 at pts/0
>> 18789 pts/0 Ss 0:00 \_ -bash
>> 18845 pts/0 R+ 0:00 \_ ps xf -u sw77
>> 18328 ? S 0:00 sshd: sw77 at notty
>> 18329 ? Ss 0:00 \_ /share/apps/mvapich2/2.0rc1/intel/bin/hydra_pmi_proxy --control-port compute-14-2.local:41010 --rmk pbs --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 2
>> 18380 ? RLsl 7:37 \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>> 18381 ? RLsl 7:37 \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>> 18382 ? RLsl 7:37 \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>> 18383 ? RLsl 7:36 \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>> 18384 ? RLsl 7:36 \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>> 18385 ? RLsl 7:37 \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>> 18386 ? RLsl 7:37 \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>> 18387 ? RLsl 7:37 \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>> 18388 ? RLsl 7:37 \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>> 18389 ? RLsl 7:37 \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>> 18390 ? RLsl 7:37 \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>> 18391 ? RLsl 7:37 \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>> 18392 ? RLsl 7:37 \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>> 18393 ? RLsl 7:37 \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>> 18394 ? RLsl 7:37 \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>> 18395 ? RLsl 7:36 \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>> 18396 ? RLsl 7:37 \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>> 18397 ? RLsl 7:37 \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>> 18398 ? RLsl 7:37 \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>> 18399 ? RLsl 7:37 \_ /share/apps/amber/12/mvapich2/intel/amber12/bin/pmemd.MPI -O -i mdin -o md.log -p prmtop -c inpcrd
>> [sw77 at compute-14-4 ~]$
>>
>>
>> Best,
>>
>> Shenglong
>>
>>
>> On Apr 29, 2014, at 10:08 AM, Jonathan Perkins <perkinjo at cse.ohio-state.edu> wrote:
>>
>>> Thanks for the report. It's possible that this reporting is due to an
>>> outstanding issue with hydra and torque(pbs) integration
>>> (https://trac.mpich.org/projects/mpich/ticket/1812#no1). Can you send
>>> us the relevant output of ps axf from each node as the job is running
>>> to help verify?
>>>
>>> On Tue, Apr 29, 2014 at 9:49 AM, Shenglong Wang <sw77 at nyu.edu> wrote:
>>>>
>>>> Hi Jonathan,
>>>>
>>>> Thanks a lot for the reply. I'm running MVAPICH2 2.0rc1 and using mpiexec to
>>>> launch the MPI processes.
>>>>
>>>> I'm running a job with 120 MPI threads, 6 compute nodes, 20 cores per node.
>>>> This is the compute resource usage reported from Torque
>>>>
>>>> Aborted by PBS Server
>>>> Job exceeded its walltime limit. Job was aborted
>>>> See Administrator for help
>>>> Exit_status=-11
>>>> resources_used.cput=239:36:39
>>>> resources_used.mem=1984640kb
>>>> resources_used.vmem=8092716kb
>>>> resources_used.walltime=12:00:16
>>>>
>>>> The wall time is 12 hours, CPU time is about 240 hours, which is only the
>>>> sum of the first node.
>>>>
>>>> Open MPI can be tightly integrated with Torque, so that Torque reports the
>>>> total CPU time and memory usage across all compute nodes. I'm not sure
>>>> whether MVAPICH2 has similar integration with Torque.
>>>>
>>>> Best,
>>>>
>>>> Shenglong
>>>>
>>>> On Apr 29, 2014, at 9:18 AM, Jonathan Perkins <perkinjo at cse.ohio-state.edu>
>>>> wrote:
>>>>
>>>> Hello. I believe that this is already available when using the hydra
>>>> process manager (i.e. mpiexec or mpiexec.hydra). Are you using this
>>>> launcher within your Torque environment? If this isn't working then
>>>> it may be a matter of the torque development files not being found
>>>> when mvapich2 was compiled. Also, please tell us which version of
>>>> MVAPICH2 you're using.
>>>>
>>>> On Tue, Apr 29, 2014 at 9:07 AM, Shenglong Wang <sw77 at nyu.edu> wrote:
>>>>
>>>>
>>>> Hi,
>>>>
>>>> Is it possible to tightly integrate MVAPICH2 with Torque to get the correct
>>>> total CPU time and memory usage from all the compute nodes?
>>>>
>>>> Best,
>>>>
>>>> Shenglong
>>>>
>>>>
>>>> _______________________________________________
>>>> mvapich-discuss mailing list
>>>> mvapich-discuss at cse.ohio-state.edu
>>>> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Jonathan Perkins
>>>> http://www.cse.ohio-state.edu/~perkinjo
>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Jonathan Perkins
>>> http://www.cse.ohio-state.edu/~perkinjo
>>>
>>
>>
>>
>
>
>
> --
> Jonathan Perkins
> http://www.cse.ohio-state.edu/~perkinjo
>