[mvapich-discuss] Docs re: MVAPICH2 2.0.x and PBS/Torque

Novosielski, Ryan novosirj at ca.rutgers.edu
Tue Mar 31 14:32:34 EDT 2015


I took a look myself at what differs between the two log files, and I see the following:

< configure:16424: icc -o conftest    -DNDEBUG -DNVALGRIND -O2    -I/scratch/novosirj/install-files/mvapich2-2.0.1/src/mpl/include -I/scratch/novosirj/install-files/mvapich2-2.0.1/src/mpl/include -I/scratch/novosirj/install-files/mvapich2-2.0.1/src/openpa/src -I/scratch/novosirj/install-files/mvapich2-2.0.1/src/openpa/src -I/scratch/novosirj/install-files/mvapich2-2.0.1/src/mpi/romio/include -I/include -I/include -I/include -I/include -I/opt/sw/admin/torque/2.5.13/include -Wl,-rpath,/opt/intel/composer_xe_2015.2.164/compiler/lib/intel64 -L/opt/sw/admin/torque/2.5.13/lib64 -L/opt/sw/admin/torque/2.5.13/lib conftest.c -ltorque  >&5
< configure:16424: $? = 0
< configure:16424: ./conftest
< ./conftest: error while loading shared libraries: libtorque.so.2: cannot open shared object file: No such file or directory
< configure:16424: $? = 127
< configure: program exited with status 127
< configure: failed program was:
< | /* confdefs.h */

…and then also later in the file:

< configure:16682: icc -o conftest    -DNDEBUG -DNVALGRIND -O2    -I/scratch/novosirj/install-files/mvapich2-2.0.1/src/mpl/include -I/scratch/novosirj/install-files/mvapich2-2.0.1/src/mpl/include -I/scratch/novosirj/install-files/mvapich2-2.0.1/src/openpa/src -I/scratch/novosirj/install-files/mvapich2-2.0.1/src/openpa/src -I/scratch/novosirj/install-files/mvapich2-2.0.1/src/mpi/romio/include -I/include -I/include -I/include -I/include -I/opt/sw/admin/torque/2.5.13/include -Wl,-rpath,/opt/intel/composer_xe_2015.2.164/compiler/lib/intel64 -L/opt/sw/admin/torque/2.5.13/lib64 -L/opt/sw/admin/torque/2.5.13/lib conftest.c -ltorque  >&5
< configure:16682: $? = 0
< configure:16682: ./conftest
< ./conftest: error while loading shared libraries: libtorque.so.2: cannot open shared object file: No such file or directory
< configure:16682: $? = 127
< configure: program exited with status 127
< configure: failed program was:
< | /* confdefs.h */

That probably should have blown up the configure step, shouldn’t it? For the record, that file is present:

root at newton /tmp (1292) # ls -la /opt/sw/admin/torque/2.5.13/lib/
total 2504
drwxr-xr-x 4 root root    4096 Nov 27  2013 ./
drwxr-xr-x 9 root root    4096 Nov 27  2013 ../
-rw-r--r-- 1 root root 1782968 Nov 27  2013 libtorque.a
-rwxr-xr-x 1 root root     821 Nov 27  2013 libtorque.la*
lrwxrwxrwx 1 root root      18 Nov 27  2013 libtorque.so -> libtorque.so.2.0.0*
lrwxrwxrwx 1 root root      18 Nov 27  2013 libtorque.so.2 -> libtorque.so.2.0.0*
-rwxr-xr-x 1 root root  748039 Nov 27  2013 libtorque.so.2.0.0*
drwxr-xr-x 5 root root    4096 Nov 27  2013 xpbs/
drwxr-xr-x 4 root root    4096 Nov 27  2013 xpbsmon/
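
For what it is worth, the compile step returned 0 and only the run step returned 127, so conftest linked against libtorque fine but the runtime loader could not find libtorque.so.2 in its search path. The only -Wl,-rpath on that command line points at the Intel compiler directory, so the file being present on disk is not enough by itself. Below is a minimal shell sketch for pulling this kind of failure out of a config.log. It assumes a standard autoconf log, and the function names are made up for illustration. The second helper lists any sizeof checks that configure cached as 0; it is only a guess that this is connected to the hwloc "unknown size for unsigned int" error quoted further down.

```shell
#!/bin/sh
# Hypothetical helpers for reading an autoconf config.log. The function names
# are illustrative and are not part of MVAPICH2, hydra, or Torque.

# Print each shared library that a configure test program failed to load.
missing_shared_libs() {
    sed -n 's/.*error while loading shared libraries: \([^:]*\):.*/\1/p' "$1" | sort -u
}

# Print cached sizeof results that configure recorded as 0, i.e. checks whose
# test program did not produce a usable size.
zero_sizeof_results() {
    grep -E '^ac_cv_sizeof_[A-Za-z0-9_]+=0$' "$1" | sort -u
}

# Example:
#   missing_shared_libs config.log
#   zero_sizeof_results config.log
#
# Untested workaround to try if libtorque.so.2 is listed (an assumption, not
# something confirmed in this thread): make the Torque library directory
# visible to the runtime loader before running configure, for example
#   export LD_LIBRARY_PATH=/opt/sw/admin/torque/2.5.13/lib:$LD_LIBRARY_PATH
# or add a second -Wl,-rpath,/opt/sw/admin/torque/2.5.13/lib entry to LDFLAGS.
```

If the first helper prints libtorque.so.2 for the 2.0.1 log and nothing for the 2.1rc2 log, that would at least narrow down where the two builds diverge.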

> On Mar 31, 2015, at 2:25 PM, Novosielski, Ryan <novosirj at ca.rutgers.edu> wrote:
> 
> Huh, so, 2.1rc2 does not fail. Can we figure out what’s wrong with 2.0 and 2.0.1 that causes them to fail, so that I can patch them or use a workaround to recompile them to be Torque-aware? These versions are still in use, and it would take a while to rebuild all the software I’d have to rebuild in order to stop using those versions entirely.
> 
> I’ve attached the config.log, in case you were expecting this would not fail and that’s what you meant. Just because I suspect it might help, I’m also attaching the config.log from a failed attempt at compiling 2.0.1 with the exact same flags. Ignore the --prefix in the 2.1rc2 log that might lead one to believe it’s 2.0.1. <config.log-2.1rc2> <config.log-2.0.1>
> 
>> On Mar 31, 2015, at 12:44 PM, Jonathan Perkins <perkinjo at cse.ohio-state.edu> wrote:
>> 
>> Hi Ryan.  I'm very sorry that we missed this.  Can you try with MVAPICH2 2.1rc2 and send the config.log from the src/pm/hydra subdirectory?
>> 
>> My guess is that something was not being detected correctly at configure time that led to this.
>> 
>> On Fri, Mar 27, 2015 at 8:14 PM Novosielski, Ryan <novosirj at ca.rutgers.edu> wrote:
>>>> Thank you for pointing this out.  This portion of the user guide needs to
>>>> be updated.  It may get updated to something along the following lines...
>>>> 
>>>>   Both mpirun_rsh and mpiexec can take information from the PBS/Torque
>>>>   environment to launch jobs (i.e., launch on nodes found in
>>>>   PBS_NODEFILE).
>>>> 
>>>>   You can also use MVAPICH2 in a tightly integrated manner with PBS.
>>>>   To do this, build MVAPICH2 with the --with-pbs option passed to
>>>>   configure. Below is a snippet from ./configure --help of the hydra
>>>>   process manager (mpiexec) that you will use with PBS/Torque.
>>>> 
>>>>   --with-pbs=PATH         specify path where pbs include directory
>>>>                           and lib directory can be found
>>>>   --with-pbs-include=PATH specify path where pbs include directory
>>>>                           can be found
>>>>   --with-pbs-lib=PATH     specify path where pbs lib directory can
>>>>                           be found
>>>> 
>>>>   For more information on using hydra, please visit the following url:
>>>>   http://wiki.mpich.org/mpich/index.php/Using_the_Hydra_Process_Manager
>>> 
>>> As it happens, MVAPICH2 will not build if I specify --with-pbs. I tried 2.0 and 2.0.1 and I have TORQUE 2.5.13. Does it have to be real PBS?
>>> 
>>> The place it fails is a little surprising to me, so I was suspecting I'd maybe broken something else in the meantime. But removing --with-pbs makes it work again. The directory is right:
>>> 
>>> root at newton /scratch/novosirj/install-files/mvapich2-2.0.1 (1421) # ls -la /opt/sw/admin/torque/current/
>>> total 36
>>> drwxr-xr-x  9 root root 4096 Nov 27  2013 ./
>>> drwxr-xr-x  4 root root 4096 Dec 10  2013 ../
>>> drwxr-xr-x  2 root root 4096 Nov 27  2013 bin/
>>> drwxr-xr-x  2 root root 4096 Nov 27  2013 include/
>>> drwxr-xr-x  4 root root 4096 Nov 27  2013 lib/
>>> drwxr-xr-x  6 root root 4096 Nov 27  2013 man/
>>> drwxr-xr-x  2 root root 4096 Nov 27  2013 sbin/
>>> drwxr-xr-x 13 root root 4096 Feb  7  2013 var/
>>> drwxr-xr-x 13 root root 4096 Nov 27  2013 var.orig/
>>> 
>>> CC=icc CXX=icpc FC=ifort LDFLAGS='-Wl,-rpath,/opt/intel/composer_xe_2015.1.133/compiler/lib/intel64' ./configure --without-cma --prefix=/opt/sw/mpi/mvapich2/2.0.1_intel-15.0.1 --with-pbs=/opt/sw/admin/torque/current
>>> ...
>>> make -j12
>>> ...
>>>  CC       topology-synthetic.lo
>>> In file included from traversal.c(12):
>>> /scratch/novosirj/install-files/mvapich2-2.0/src/pm/hydra/tools/topo/hwloc/hwloc/include/private/misc.h(28): error: #error directive: "unknown size for unsigned int."
>>>  #error "unknown size for unsigned int."
>>>   ^
>>> 
>>> Internal error: null pointer
>>> 
>>> compilation aborted for traversal.c (code 4)
>>> make[4]: *** [traversal.lo] Error 1
>>> make[4]: *** Waiting for unfinished jobs....
>>> In file included from bitmap.c(12):
>>> /scratch/novosirj/install-files/mvapich2-2.0/src/pm/hydra/tools/topo/hwloc/hwloc/include/private/misc.h(28): error: #error directive: "unknown size for unsigned int."
>>>  #error "unknown size for unsigned int."
>>>   ^
>>> 
>>> In file included from topology-synthetic.c(12):
>>> /scratch/novosirj/install-files/mvapich2-2.0/src/pm/hydra/tools/topo/hwloc/hwloc/include/private/misc.h(28): error: #error directive: "unknown size for unsigned int."
>>>  #error "unknown size for unsigned int."
>>>   ^
>>> 
>>> In file included from diff.c(8):
>>> /scratch/novosirj/install-files/mvapich2-2.0/src/pm/hydra/tools/topo/hwloc/hwloc/include/private/misc.h(28): error: #error directive: "unknown size for unsigned int."
>>>  #error "unknown size for unsigned int."
>>>   ^
>>> 
>>> In file included from misc.c(11):
>>> /scratch/novosirj/install-files/mvapich2-2.0/src/pm/hydra/tools/topo/hwloc/hwloc/include/private/misc.h(28): error: #error directive: "unknown size for unsigned int."
>>>  #error "unknown size for unsigned int."
>>>   ^
>>> 
>>> Internal error: null pointer
>>> 
>>> Internal error: null pointer
>>> 
>>> compilation aborted for diff.c (code 4)
>>> compilation aborted for topology-synthetic.c (code 4)
>>> make[4]: *** [diff.lo] Error 1
>>> make[4]: *** [topology-synthetic.lo] Error 1
>>> Internal error: null pointer
>>> 
>>> compilation aborted for misc.c (code 4)
>>> make[4]: *** [misc.lo] Error 1
>>> Internal error: null pointer
>>> 
>>> compilation aborted for bitmap.c (code 4)
>>> make[4]: *** [bitmap.lo] Error 1
>>> make[4]: Leaving directory `/HPCTMP_NOBKUP/novosirj/install-files/mvapich2-2.0/src/pm/hydra/tools/topo/hwloc/hwloc/src'
>>> make[3]: *** [all-recursive] Error 1
>>> make[3]: Leaving directory `/HPCTMP_NOBKUP/novosirj/install-files/mvapich2-2.0/src/pm/hydra/tools/topo/hwloc/hwloc'
>>> make[2]: *** [all-recursive] Error 1
>>> make[2]: Leaving directory `/HPCTMP_NOBKUP/novosirj/install-files/mvapich2-2.0/src/pm/hydra'
>>> make[1]: *** [all-recursive] Error 1
>>> make[1]: Leaving directory `/HPCTMP_NOBKUP/novosirj/install-files/mvapich2-2.0'
>>> make: *** [all] Error 2
>>> 
>>> Any ideas?
>> 
>> I am still interested in getting this to work, but it does not with the Intel Composer XE Compiler 15.0.2 (you can see earlier I tried 15.0.1) and Torque 2.5.13. I can build MVAPICH2 2.0.1 with the same compiler just fine if I do not provide --with-pbs. Any ideas here? It also seems like a strange place to fail (e.g., not that related to PBS).
>> 
>> --
>> ____ *Note: UMDNJ is now Rutgers-Biomedical and Health Sciences*
>> || \\UTGERS      |---------------------*O*---------------------
>> ||_// Biomedical | Ryan Novosielski - Senior Technologist
>> || \\ and Health | novosirj at rutgers.edu - 973/972.0922 (2x0922)
>> ||  \\  Sciences | OIRT/High Perf & Res Comp - MSB C630, Newark
>>     `'

____ *Note: UMDNJ is now Rutgers-Biomedical and Health Sciences*
|| \\UTGERS      |---------------------*O*---------------------
||_// Biomedical | Ryan Novosielski - Senior Technologist
|| \\ and Health | novosirj at rutgers.edu - 973/972.0922 (2x0922)
||  \\  Sciences | OIRT/High Perf & Res Comp - MSB C630, Newark
     `'



