[mvapich-discuss] mvapich2 integrate with Torque

Brock Palen brockp at umich.edu
Wed Apr 30 14:28:31 EDT 2014


Sorry for the delay; I have written this up (roughly):

http://www.failureasaservice.com/2014/04/mpich-and-mvapich-with-torque.html

Brock Palen
www.umich.edu/~brockp
CAEN Advanced Computing
XSEDE Campus Champion
brockp at umich.edu
(734)936-1985



On Apr 29, 2014, at 4:56 PM, Shenglong Wang <sw77 at nyu.edu> wrote:

> 
> Hi Brock,
> 
> Yes, you are right: "pbs" appears twice after integrating with Torque.
> 
> [sw77 at login-0-1 ~]$ which mpiexec
> /share/apps/hydra/3.1/intel/bin/mpiexec
> [sw77 at login-0-1 ~]$ mpiexec --help | grep pbs
>    -launcher                        launcher to use (ssh rsh fork slurm ll lsf sge pbs manual persist)
>    -rmk                             resource management kernel to use (user slurm ll lsf sge pbs cobalt)
> [sw77 at login-0-1 ~]$
> 
> Thanks.
> 
> Shenglong
> 
> On Apr 29, 2014, at 10:16 AM, Brock Palen <brockp at umich.edu> wrote:
> 
>> Shenglong,
>> 
>> I am not a regular MVAPICH user; I mostly use it with MATLAB PCT.
>> 
>> That said, I have always found it difficult to get the configure script for mvapich/mpich to pick up libtorque.so, so the PBS launcher doesn't get built correctly and processes are still spawned over ssh.
>> 
>> I always end up downloading the standalone hydra package instead, where I can explicitly pass the location of libtorque.so.
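>> 
>> Roughly, the build looks like this (the paths here are just placeholders, and I'm working from memory on the flag name -- check ./configure --help on your hydra version):
>> 
>>   # Point hydra's configure at the Torque install so it finds tm.h and libtorque.so
>>   ./configure --prefix=/opt/hydra-3.1 --with-pbs=/opt/torque
>>   make && make install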
>> 
>> Working from memory, run:
>> 
>> mpiexec -help
>> 
>> If both parts of the PBS functionality were enabled, you should see the string "pbs" appear twice. One part always gets built; it is how hydra gets the list of hosts.
>> 
>> The other is the questionable one: it requires libtorque.so so that the TM API can be used to spawn processes on the other nodes. Also check your defaults; you can control them with environment variables, which is what I do for our cluster.
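>> 
>> As a sketch, our module file sets something like this (these are the hydra variable names as I remember them; verify against your build's mpiexec -help output):
>> 
>>   # Make the TM-based launcher and the PBS resource-management kernel the defaults
>>   export HYDRA_LAUNCHER=pbs
>>   export HYDRA_RMK=pbs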
>> 
>> A quick test: while your job is running, ssh to a sister node (a node other than the first) and look at pstree.
>> 
>> hydra_proxy should be a child of pbs_mom if it is working right. If not, it will be a child of init.
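>> 
>> For example, on the sister node (assuming pgrep and pstree are available there):
>> 
>>   # hydra's proxy should show up beneath pbs_mom; if it hangs off init, TM isn't being used
>>   pstree -ap $(pgrep pbs_mom)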
>> 
>> If you have more trouble, let me know. I didn't take notes the last time I built it, and since this has come up a few times, I'll blog it if people would find it useful.
>> 
>> Brock Palen
>> www.umich.edu/~brockp
>> CAEN Advanced Computing
>> XSEDE Campus Champion
>> brockp at umich.edu
>> (734)936-1985
>> 
>> 
>> 
>> On Apr 29, 2014, at 9:49 AM, Shenglong Wang <sw77 at nyu.edu> wrote:
>> 
>>> 
>>> Hi Jonathan,
>>> 
>>> Thanks a lot for the reply. I'm running mvapich2 2.0rc1 and using mpiexec to launch the MPI processes.
>>> 
>>> I'm running a job with 120 MPI processes across 6 compute nodes, 20 cores per node. This is the compute resource usage reported by Torque:
>>> 
>>> Aborted by PBS Server
>>> Job exceeded its walltime limit. Job was aborted
>>> See Administrator for help
>>> Exit_status=-11
>>> resources_used.cput=239:36:39
>>> resources_used.mem=1984640kb
>>> resources_used.vmem=8092716kb
>>> resources_used.walltime=12:00:16
>>> 
>>> The walltime is 12 hours, but the CPU time is only about 240 hours, which is just the sum for the first node (20 cores x 12 hours = 240 core-hours).
>>> 
>>> Open MPI can be tightly integrated with Torque, so Torque reports the total CPU time and memory usage from all the compute nodes. I'm not sure whether MVAPICH2 has similar integration with Torque.
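>>> 
>>> For context, the job is submitted roughly like this (a sketch; the resource line matches the job above, but the application name is a placeholder):
>>> 
>>>   #PBS -l nodes=6:ppn=20
>>>   #PBS -l walltime=12:00:00
>>>   cd $PBS_O_WORKDIR
>>>   mpiexec -n 120 ./my_app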
>>> 
>>> Best,
>>> 
>>> Shenglong
>>> 
>>> On Apr 29, 2014, at 9:18 AM, Jonathan Perkins <perkinjo at cse.ohio-state.edu> wrote:
>>> 
>>>> Hello.  I believe that this is already available when using the hydra
>>>> process manager (i.e. mpiexec or mpiexec.hydra).  Are you using this
>>>> launcher within your Torque environment?  If this isn't working, then
>>>> it may be a matter of the Torque development files not being found
>>>> when MVAPICH2 was compiled.  Also, please tell us which version of
>>>> MVAPICH2 you're using.
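>>>> 
>>>> One rough check (assuming libtorque was linked dynamically) is to see
>>>> whether the launcher picked it up at build time:
>>>> 
>>>>   # no output here suggests TM support wasn't compiled in
>>>>   ldd $(which mpiexec.hydra) | grep -i torque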
>>>> 
>>>> On Tue, Apr 29, 2014 at 9:07 AM, Shenglong Wang <sw77 at nyu.edu> wrote:
>>>>> 
>>>>> Hi,
>>>>> 
>>>>> Is it possible to tightly integrate MVAPICH2 with Torque to get the correct total CPU time and memory usage from all the compute nodes?
>>>>> 
>>>>> Best,
>>>>> 
>>>>> Shenglong
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> -- 
>>>> Jonathan Perkins
>>>> http://www.cse.ohio-state.edu/~perkinjo
>>>> 
>>> 
>> 
> 
