[mvapich-discuss] configuring MVAPICH2 with PBS Pro for tight integration

Matthew W. Anderson Matthew.Anderson2 at inl.gov
Thu Jul 2 09:56:41 EDT 2020


  Hello,

  First, thanks for creating and supporting MVAPICH -- we use it every day and love it!

  We are experimenting with configuring MVAPICH2 2.3.4 to enable tighter integration with PBS Pro.  The key issue we have found is that MVAPICH2 only tries to use Torque and has no "with-pbs" option to indicate that integration with PBS Pro is desired.  The hydra configure file (src/pm/hydra/configure) appears to link only libtorque.so.  This prevents the PBS MoMs from collecting information we like to have (e.g. CPU and memory usage on worker nodes), and the worker-node processes don't get assigned to the appropriate cgroups, which makes it difficult to allow job sharing on GPU nodes.
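
  For reference, a quick way to sanity-check that PBS Pro's libpbs exports the TM symbols hydra's configure probes for (the /opt/pbs/lib path is only an assumed install location and will differ per site):

# Check that libpbs provides the tm_* entry points configure looks for
nm -D /opt/pbs/lib/libpbs.so | grep -E 'tm_(init|spawn|finalize)'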

  A patch was created to tightly integrate MVAPICH2 with PBS Pro by modifying the hydra configure file (src/pm/hydra/configure) as follows:


-    { $as_echo "$as_me:${as_lineno-$LINENO}: checking for tm_init in -ltorque" >&5
-$as_echo_n "checking for tm_init in -ltorque... " >&6; }
-if ${ac_cv_lib_torque_tm_init+:} false; then :
-  $as_echo_n "(cached) " >&6
-else
-  ac_check_lib_save_LIBS=$LIBS
-LIBS="-ltorque  $LIBS"
+    { $as_echo "$as_me:${as_lineno-$LINENO}: checking for tm_init in -lpbs" >&5
+$as_echo_n "checking for tm_init in -lpbs... " >&6; }
+ac_check_lib_save_LIBS=$LIBS
+LIBS="-lpbs  $LIBS"
 cat confdefs.h - <<_ACEOF >conftest.$ac_ext
 /* end confdefs.h.  */

@@ -15819,7 +15816,7 @@
 rm -f core conftest.err conftest.$ac_objext \
     conftest$ac_exeext conftest.$ac_ext
 LIBS=$ac_check_lib_save_LIBS
-fi
+
 { $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_lib_torque_tm_init" >&5
 $as_echo "$ac_cv_lib_torque_tm_init" >&6; }
 if test "x$ac_cv_lib_torque_tm_init" = xyes; then :
@@ -15827,7 +15824,7 @@
 #define HAVE_LIBTORQUE 1
 _ACEOF

-  LIBS="-ltorque $LIBS"
+  LIBS="-lpbs $LIBS"

 else
   failure=yes
@@ -15848,11 +15845,11 @@
      available_launchers="$available_launchers pbs"


- if echo "$WRAPPER_LIBS" | $FGREP -e "\<-ltorque\>" >/dev/null 2>&1; then :
-  echo "WRAPPER_LIBS(='$WRAPPER_LIBS') contains '-ltorque', not appending" >&5
+ if echo "$WRAPPER_LIBS" | $FGREP -e "\<-lpbs\>" >/dev/null 2>&1; then :
+  echo "WRAPPER_LIBS(='$WRAPPER_LIBS') contains '-lpbs', not appending" >&5
 else
-  echo "WRAPPER_LIBS(='$WRAPPER_LIBS') does not contain '-ltorque', appending" >&5
- WRAPPER_LIBS="$WRAPPER_LIBS -ltorque"
+  echo "WRAPPER_LIBS(='$WRAPPER_LIBS') does not contain '-lpbs', appending" >&5
+ WRAPPER_LIBS="$WRAPPER_LIBS -lpbs"

 fi
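
  With this patch in place the build itself is just the usual autoconf flow; the PBS Pro prefix and install prefix below are assumptions for illustration and will differ per site:

# Point the patched configure at the PBS Pro headers and libraries
# (/opt/pbs and the --prefix are assumed, site-specific paths)
./configure --prefix=/opt/mvapich2-2.3.4-pbspro \
    CPPFLAGS="-I/opt/pbs/include" \
    LDFLAGS="-L/opt/pbs/lib"
make -j 8 && make install

# 'mpiexec -info' should then list pbs among the available launchers
/opt/mvapich2-2.3.4-pbspro/bin/mpiexec -info | grep -i launcher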


This patch results in a tighter integration with PBS Pro and gives the PBS MoMs the information we like to collect on worker nodes.  In general, this solution works quite well for us.  However, an unintended side effect we just found is that MPI_ABORT no longer works!  There seems to have been a similar problem in the past that was resolved by a fix in version 1.8a2 (from the release notes):


- Fix a process cleanup issue in Hydra when MPI_ABORT is called (upstream
      MPICH2 patch)
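
  For what it is worth, this is a crude way we can check whether the remote ranks get cleaned up after an abort (the abort_test binary and node names are placeholders):

# Launch a small job whose ranks call MPI_Abort, then look for leftover
# processes on the worker nodes
mpiexec -np 4 -hosts node01,node02 ./abort_test ; echo "mpiexec exit: $?"
pdsh -w node01,node02 'pgrep -l abort_test || echo "no leftover processes"'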

   Do you have any suggestions on how we might address the MPI_ABORT issue when we integrate MVAPICH with PBS Pro?

   Many thanks!

  Matt

--------------------------------------------------------------------------------------------------------------------------
Office:  208-526-4104
   Cell:  812-320-4818