[mvapich-discuss] configuring MVAPICH2 with PBS Pro for tight integration

Subramoni, Hari subramoni.1 at osu.edu
Thu Jul 2 11:25:33 EDT 2020


Hi, Matt.

We are very glad to hear that you are using MVAPICH2 at INL and like it!

Many thanks for working on the patch and sending it to us. We appreciate it. Let us take a look at the patch and get back to you shortly.

Best,
Hari.

From: mvapich-discuss-bounces at cse.ohio-state.edu <mvapich-discuss-bounces at mailman.cse.ohio-state.edu> On Behalf Of Matthew W. Anderson
Sent: Thursday, July 2, 2020 9:57 AM
To: mvapich-discuss at cse.ohio-state.edu <mvapich-discuss at mailman.cse.ohio-state.edu>
Cc: Peter P. Cebull <peter.cebull at inl.gov>
Subject: [mvapich-discuss] configuring MVAPICH2 with PBS Pro for tight integration


  Hello,

  First, thanks for creating and supporting MVAPICH -- we use it every day and love it!

  We are experimenting with configuring MVAPICH2 2.3.4 to enable tighter integration with PBS Pro.  The key issue we have found is that MVAPICH2 only tries to use Torque and has no "--with-pbs" option to indicate that integration with PBS Pro is desired; the hydra configure script (src/pm/hydra/configure) appears to link only against libtorque.so.  This prevents the PBS MoMs from collecting information we like to have (e.g., CPU and memory usage on the worker nodes), and the worker-node processes are not assigned to the appropriate cgroups, which makes it difficult to allow job sharing on GPU nodes.
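
  For reference, one quick way to see which resource-manager library Hydra actually linked (the install path below is only an example and will differ per prefix):

# If mpiexec.hydra was linked dynamically, ldd shows whether it picked up
# Torque's libtorque or PBS Pro's libpbs (path is an example; adjust to your install).
ldd /opt/mvapich2-2.3.4/bin/mpiexec.hydra | grep -iE 'torque|pbs'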

  We created a patch to tightly integrate MVAPICH2 with PBS Pro by modifying the hydra configure script (src/pm/hydra/configure) as follows:


-    { $as_echo "$as_me:${as_lineno-$LINENO}: checking for tm_init in -ltorque" >&5
-$as_echo_n "checking for tm_init in -ltorque... " >&6; }
-if ${ac_cv_lib_torque_tm_init+:} false; then :
-  $as_echo_n "(cached) " >&6
-else
-  ac_check_lib_save_LIBS=$LIBS
-LIBS="-ltorque  $LIBS"
+    { $as_echo "$as_me:${as_lineno-$LINENO}: checking for tm_init in -lpbs" >&5
+$as_echo_n "checking for tm_init in -lpbs... " >&6; }
+ac_check_lib_save_LIBS=$LIBS
+LIBS="-lpbs  $LIBS"
 cat confdefs.h - <<_ACEOF >conftest.$ac_ext
 /* end confdefs.h.  */

@@ -15819,7 +15816,7 @@
 rm -f core conftest.err conftest.$ac_objext \
    conftest$ac_exeext conftest.$ac_ext
 LIBS=$ac_check_lib_save_LIBS
-fi
+
 { $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_lib_torque_tm_init" >&5
 $as_echo "$ac_cv_lib_torque_tm_init" >&6; }
 if test "x$ac_cv_lib_torque_tm_init" = xyes; then :
@@ -15827,7 +15824,7 @@
 #define HAVE_LIBTORQUE 1
 _ACEOF

-  LIBS="-ltorque $LIBS"
+  LIBS="-lpbs $LIBS"

 else
   failure=yes
@@ -15848,11 +15845,11 @@
     available_launchers="$available_launchers pbs"


- if echo "$WRAPPER_LIBS" | $FGREP -e "\<-ltorque\>" >/dev/null 2>&1; then :
-  echo "WRAPPER_LIBS(='$WRAPPER_LIBS') contains '-ltorque', not appending" >&5
+ if echo "$WRAPPER_LIBS" | $FGREP -e "\<-lpbs\>" >/dev/null 2>&1; then :
+  echo "WRAPPER_LIBS(='$WRAPPER_LIBS') contains '-lpbs', not appending" >&5
 else
-  echo "WRAPPER_LIBS(='$WRAPPER_LIBS') does not contain '-ltorque', appending" >&5
- WRAPPER_LIBS="$WRAPPER_LIBS -ltorque"
+  echo "WRAPPER_LIBS(='$WRAPPER_LIBS') does not contain '-lpbs', appending" >&5
+ WRAPPER_LIBS="$WRAPPER_LIBS -lpbs"

 fi
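
For anyone who wants to try this, a rough sketch of applying the patch and rebuilding (the patch file name, -p level, and the /opt/pbs paths are only illustrative and will vary by site):

# Apply the patch to the MVAPICH2 2.3.4 source tree and point the build
# at PBS Pro's headers and libraries (example locations).
cd mvapich2-2.3.4
patch -p0 < hydra-pbspro.patch        # adjust -p to match how the diff was generated
./configure CPPFLAGS="-I/opt/pbs/include" LDFLAGS="-L/opt/pbs/lib"
make && make install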


This patch results in tighter integration with PBS Pro and gives the PBS MoMs the information we like to collect on the worker nodes.  In general, this solution works quite well for us.  However, an unintended side effect we just found is that MPI_ABORT no longer works!  A similar problem seems to have been fixed in the past, in version 1.8a2 (from the release notes):


- Fix a process cleanup issue in Hydra when MPI_ABORT is called (upstream MPICH2 patch)
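
A minimal test of the MPI_ABORT behavior can look roughly like this (compiler wrapper, file name, and process count are just examples); with a working launcher, every rank should be terminated promptly after rank 0 aborts:

# Build and run a tiny program in which rank 0 calls MPI_Abort; all ranks
# (and the job) should be cleaned up shortly afterwards.
cat > abort_test.c <<'EOF'
#include <mpi.h>
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        MPI_Abort(MPI_COMM_WORLD, 1);
    MPI_Finalize();
    return 0;
}
EOF
mpicc abort_test.c -o abort_test
mpiexec -n 4 ./abort_test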

   Do you have any suggestions on how we might address the MPI_ABORT issue when we integrate MVAPICH2 with PBS Pro?

   Many thanks!

  Matt

--------------------------------------------------------------------------------------------------------------------------
Office:  208-526-4104
   Cell:  812-320-4818

