[mvapich-discuss] mvapich2 integrate with Torque

Panda, Dhabaleswar panda at cse.ohio-state.edu
Tue Apr 29 11:35:40 EDT 2014


TACC has used MVAPICH2 with SGE in the past. You may contact some of their staff members for 
details. 

Thanks, 

DK


________________________________________
From: mvapich-discuss-bounces at cse.ohio-state.edu on behalf of Albert Everett [aeeverett at ualr.edu]
Sent: Tuesday, April 29, 2014 10:21 AM
To: Brock Palen
Cc: mvapich-discuss at cse.ohio-state.edu
Subject: Re: [mvapich-discuss] mvapich2 integrate with Torque

Thanks for this note. I run torque/maui now, but the next incarnation of the cluster will probably have SGE instead. Any notes about mvapich and SGE would also be useful.

Albert

On Apr 29, 2014, at 9:16 AM, Brock Palen <brockp at umich.edu> wrote:

> Shenglong,
>
> I am not a regular MVAPICH user; I mostly use it with MATLAB PCT.
>
> That said, I always found it difficult to get the configure step for mvapich/mpich to pick up libtorque.so; as a result the PBS launcher doesn't get built correctly and mpiexec still uses ssh to spawn processes.
>
> I always end up downloading the standalone hydra package; with it I can explicitly pass the location of libtorque.so.
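> Working from memory, and with paths that are assumptions for your site, the standalone hydra build against Torque looks roughly like this:

```shell
# Sketch only: build the standalone hydra launcher against Torque's TM
# library so mpiexec can spawn ranks through pbs_mom instead of ssh.
# /opt/torque is an assumed install prefix -- adjust for your cluster.
tar xzf hydra-3.1.tar.gz
cd hydra-3.1
./configure --prefix=$HOME/hydra \
            --with-pbs=/opt/torque   # where tm.h and libtorque.so live
make && make install
```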
>
> Working from memory run:
>
> mpiexec -help
>
> If both parts of the PBS functionality were enabled, you should see the string "pbs" appear twice.  One part always gets built; it is how hydra gets the list of hosts.
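> For example, a quick way to check (output format is approximate and from memory):

```shell
# Hypothetical check: look for "pbs" in the launcher and resource-management
# kernel (RMK) lists printed by hydra's mpiexec.  Seeing it in both lists
# suggests the TM-based launcher was built in, not just host-list support.
mpiexec -help 2>&1 | grep -i pbs
```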
>
> The other is the questionable one, which requires libtorque.so to use the TM API to spawn processes on other nodes.  Also check your defaults; you can control them with environment variables, which is what I do for our cluster.
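> As a sketch (variable names are hydra's, but whether the "pbs" values are accepted depends on what was compiled in):

```shell
# Assumed example: make hydra's PBS/TM integration the default for a job.
export HYDRA_RMK=pbs        # get the host list from Torque
export HYDRA_LAUNCHER=pbs   # spawn remote proxies via the TM API, not ssh
mpiexec -n 120 ./a.out
```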
>
> A quick test: while your job is running, ssh to a sister node (a node other than the first) and look at pstree.
>
> If it is working right, hydra_proxy should be a child of pbs_mom.  If not, it will be a child of init.
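> Something like this, where node02 is a made-up sister-node name:

```shell
# While the job runs, inspect the process tree on a sister node.
# Under TM integration the hydra proxy hangs off pbs_mom; with plain
# ssh spawning it appears under init/sshd instead.
ssh node02 'pstree -ap | grep -A2 pbs_mom'
```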
>
> If you have more trouble let me know, I didn't make notes the last time I built it and it has come up a few times, so I would blog it if people would find it useful.
>
> Brock Palen
> www.umich.edu/~brockp
> CAEN Advanced Computing
> XSEDE Campus Champion
> brockp at umich.edu
> (734)936-1985
>
>
>
> On Apr 29, 2014, at 9:49 AM, Shenglong Wang <sw77 at nyu.edu> wrote:
>
>>
>> Hi Jonathan,
>>
>> Thanks a lot for the reply. I'm running mvapich2 2.0rc1 and using mpiexec to launch MPI processes.
>>
>> I'm running a job with 120 MPI processes, 6 compute nodes, 20 cores per node. This is the compute resource usage reported from Torque
>>
>> Aborted by PBS Server
>> Job exceeded its walltime limit. Job was aborted
>> See Administrator for help
>> Exit_status=-11
>> resources_used.cput=239:36:39
>> resources_used.mem=1984640kb
>> resources_used.vmem=8092716kb
>> resources_used.walltime=12:00:16
>>
>> The walltime is 12 hours, but the reported CPU time is only about 240 hours (12 h x 20 cores), which is just the usage of the first node.
>>
>> Open MPI can be tightly integrated with Torque so that Torque reports the total CPU time and memory usage from all the compute nodes. I'm not sure whether MVAPICH2 has similar integration with Torque.
>>
>> Best,
>>
>> Shenglong
>>
>> On Apr 29, 2014, at 9:18 AM, Jonathan Perkins <perkinjo at cse.ohio-state.edu> wrote:
>>
>>> Hello.  I believe that this is already available when using the hydra
>>> process manager (i.e. mpiexec or mpiexec.hydra).  Are you using this
>>> launcher within your torque environment?  If this isn't working then
>>> it may be a matter of the torque development files not being found
>>> when mvapich2 was compiled.  Also, please tell us which version of
>>> MVAPICH2 you're using.
>>>
>>> On Tue, Apr 29, 2014 at 9:07 AM, Shenglong Wang <sw77 at nyu.edu> wrote:
>>>>
>>>> Hi,
>>>>
>>>> Is it possible to tightly integrate MVAPICH2 with Torque to get the correct total CPU time and memory usage from all the compute nodes?
>>>>
>>>> Best,
>>>>
>>>> Shenglong
>>>>
>>>>
>>>> _______________________________________________
>>>> mvapich-discuss mailing list
>>>> mvapich-discuss at cse.ohio-state.edu
>>>> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>>>
>>>
>>>
>>>
>>> --
>>> Jonathan Perkins
>>> http://www.cse.ohio-state.edu/~perkinjo
>>>
>>
>

-----------------------------------------
Albert Everett | Computational Specialist
University of Arkansas at Little Rock | Graduate Institute of Technology
501.569.8346 | aeeverett at ualr.edu | git.ualr.edu | ualr.edu




