[mvapich-discuss] Docs re: MVAPICH2 2.0.x and PBS/Torque
Novosielski, Ryan
novosirj at ca.rutgers.edu
Tue Mar 31 14:32:34 EDT 2015
Looking at this myself, here is what differs between the two log files:
< configure:16424: icc -o conftest -DNDEBUG -DNVALGRIND -O2 -I/scratch/novosirj/install-files/mvapich2-2.0.1/src/mpl/include -I/scratch/novosirj/install-files/mvapich2-2.0.1/src/mpl/include -I/scratch/novosirj/install-files/mvapich2-2.0.1/src/openpa/src -I/scratch/novosirj/install-files/mvapich2-2.0.1/src/openpa/src -I/scratch/novosirj/install-files/mvapich2-2.0.1/src/mpi/romio/include -I/include -I/include -I/include -I/include -I/opt/sw/admin/torque/2.5.13/include -Wl,-rpath,/opt/intel/composer_xe_2015.2.164/compiler/lib/intel64 -L/opt/sw/admin/torque/2.5.13/lib64 -L/opt/sw/admin/torque/2.5.13/lib conftest.c -ltorque >&5
< configure:16424: $? = 0
< configure:16424: ./conftest
< ./conftest: error while loading shared libraries: libtorque.so.2: cannot open shared object file: No such file or directory
< configure:16424: $? = 127
< configure: program exited with status 127
< configure: failed program was:
< | /* confdefs.h */
…and then also later in the file:
< configure:16682: icc -o conftest -DNDEBUG -DNVALGRIND -O2 -I/scratch/novosirj/install-files/mvapich2-2.0.1/src/mpl/include -I/scratch/novosirj/install-files/mvapich2-2.0.1/src/mpl/include -I/scratch/novosirj/install-files/mvapich2-2.0.1/src/openpa/src -I/scratch/novosirj/install-files/mvapich2-2.0.1/src/openpa/src -I/scratch/novosirj/install-files/mvapich2-2.0.1/src/mpi/romio/include -I/include -I/include -I/include -I/include -I/opt/sw/admin/torque/2.5.13/include -Wl,-rpath,/opt/intel/composer_xe_2015.2.164/compiler/lib/intel64 -L/opt/sw/admin/torque/2.5.13/lib64 -L/opt/sw/admin/torque/2.5.13/lib conftest.c -ltorque >&5
< configure:16682: $? = 0
< configure:16682: ./conftest
< ./conftest: error while loading shared libraries: libtorque.so.2: cannot open shared object file: No such file or directory
< configure:16682: $? = 127
< configure: program exited with status 127
< configure: failed program was:
< | /* confdefs.h */
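For anyone wanting to locate the same kind of failure in their own log, a minimal sketch follows. It assumes only that the dynamic loader's message has the same wording as in the excerpts above; the function name find_loader_failures and the config.log path are hypothetical.

```shell
# Hypothetical helper: print the line numbers and text of every place in a
# configure log where a test program could not start because a shared
# library was not found at run time.
find_loader_failures() {
    grep -n 'error while loading shared libraries' "$1"
}

# Example usage; the log location is an assumption, so check it exists first.
if [ -f config.log ]; then
    find_loader_failures config.log
fi
```

Each match identifies a configure check whose test program linked successfully (status 0) but then failed to execute, which is the pattern shown in both excerpts.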
A test program that fails to run like this probably should have aborted the configure step, shouldn’t it? For the record, libtorque.so.2 is present:
root@newton /tmp (1292) # ls -la /opt/sw/admin/torque/2.5.13/lib/
total 2504
drwxr-xr-x 4 root root 4096 Nov 27 2013 ./
drwxr-xr-x 9 root root 4096 Nov 27 2013 ../
-rw-r--r-- 1 root root 1782968 Nov 27 2013 libtorque.a
-rwxr-xr-x 1 root root 821 Nov 27 2013 libtorque.la*
lrwxrwxrwx 1 root root 18 Nov 27 2013 libtorque.so -> libtorque.so.2.0.0*
lrwxrwxrwx 1 root root 18 Nov 27 2013 libtorque.so.2 -> libtorque.so.2.0.0*
-rwxr-xr-x 1 root root 748039 Nov 27 2013 libtorque.so.2.0.0*
drwxr-xr-x 5 root root 4096 Nov 27 2013 xpbs/
drwxr-xr-x 4 root root 4096 Nov 27 2013 xpbsmon/
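One possible reading of the excerpts above is that the link line passes -L for the Torque directories but an rpath only for the Intel runtime, so the test program links yet cannot find libtorque.so.2 when it runs. A minimal sketch of a workaround under that assumption is to add the Torque lib directory as an rpath in LDFLAGS. The helper name rpath_ldflags is hypothetical, and this has not been verified to fix the hwloc build failure.

```shell
# Hypothetical helper: turn a list of library directories into linker flags
# that embed each directory as a runtime search path (rpath).
rpath_ldflags() {
    flags=""
    for dir in "$@"; do
        # Separate successive flags with a single space.
        flags="${flags:+$flags }-Wl,-rpath,$dir"
    done
    printf '%s\n' "$flags"
}

# Directories taken from the log excerpts above.
INTEL_LIB=/opt/intel/composer_xe_2015.2.164/compiler/lib/intel64
TORQUE_LIB=/opt/sw/admin/torque/2.5.13/lib

# Print an LDFLAGS value that could be passed to ./configure (assumption:
# configure forwards LDFLAGS to the hydra subconfigure, as the existing
# Intel rpath in the log suggests).
echo "LDFLAGS='$(rpath_ldflags "$INTEL_LIB" "$TORQUE_LIB")'"
```

An alternative with the same intent would be to add the Torque lib directory to LD_LIBRARY_PATH before running configure. That affects only the current shell session rather than the built binaries.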
> On Mar 31, 2015, at 2:25 PM, Novosielski, Ryan <novosirj at ca.rutgers.edu> wrote:
>
> Huh, so, 2.1rc2 does not fail. Can we figure out what’s wrong with 2.0 and 2.0.1 that causes them to fail, so that I can patch them or use a workaround to recompile them to be Torque-aware? These versions are still in use, and it would take a while to rebuild the software that I’d have to rebuild in order to stop using those versions entirely.
>
> I’ve attached the config.log, in case you were expecting this would not fail and that’s what you meant. Because I suspect it might help, I’m also attaching the config.log from a failed attempt at compiling 2.0.1 with the exact same flags. Ignore the --prefix in the 2.1rc2 log, which might lead one to believe it’s 2.0.1. <config.log-2.1rc2> <config.log-2.0.1>
>
>> On Mar 31, 2015, at 12:44 PM, Jonathan Perkins <perkinjo at cse.ohio-state.edu> wrote:
>>
>> Hi Ryan. I'm very sorry that we missed this. Can you try with MVAPICH2 2.1rc2 and send the config.log from the src/pm/hydra subdirectory?
>>
>> My guess is that something was not being detected correctly at configure time that led to this.
>>
>> On Fri, Mar 27, 2015 at 8:14 PM Novosielski, Ryan <novosirj at ca.rutgers.edu> wrote:
>>>> Thank you for pointing this out. This portion of the user guide needs to
>>>> be updated. It may get updated to something along the following lines...
>>>>
>>>> Both mpirun_rsh and mpiexec can take information from the PBS/Torque
>>>> environment to launch jobs (i.e., launch on nodes found in
>>>> PBS_NODEFILE).
>>>>
>>>> You can also use MVAPICH2 in a tightly integrated manner with PBS.
>>>> To do this, build MVAPICH2 with the --with-pbs configure option.
>>>> Below is a snippet from ./configure --help of the hydra
>>>> process manager (mpiexec) that you will use with PBS/Torque.
>>>>
>>>>   --with-pbs=PATH          specify path where pbs include directory
>>>>                            and lib directory can be found
>>>>   --with-pbs-include=PATH  specify path where pbs include directory
>>>>                            can be found
>>>>   --with-pbs-lib=PATH      specify path where pbs lib directory can
>>>>                            be found
>>>>
>>>> For more information on using hydra, please visit the following url:
>>>> http://wiki.mpich.org/mpich/index.php/Using_the_Hydra_Process_Manager
>>>
>>> As it happens, MVAPICH2 will not build if I specify --with-pbs. I tried 2.0 and 2.0.1 and I have TORQUE 2.5.13. Does it have to be real PBS?
>>>
>>> The place it fails is a little surprising to me, so I was suspecting I'd maybe broken something else in the meantime. But removing --with-pbs makes it work again. The directory is right:
>>>
>>> root@newton /scratch/novosirj/install-files/mvapich2-2.0.1 (1421) # ls -la /opt/sw/admin/torque/current/
>>> total 36
>>> drwxr-xr-x 9 root root 4096 Nov 27 2013 ./
>>> drwxr-xr-x 4 root root 4096 Dec 10 2013 ../
>>> drwxr-xr-x 2 root root 4096 Nov 27 2013 bin/
>>> drwxr-xr-x 2 root root 4096 Nov 27 2013 include/
>>> drwxr-xr-x 4 root root 4096 Nov 27 2013 lib/
>>> drwxr-xr-x 6 root root 4096 Nov 27 2013 man/
>>> drwxr-xr-x 2 root root 4096 Nov 27 2013 sbin/
>>> drwxr-xr-x 13 root root 4096 Feb 7 2013 var/
>>> drwxr-xr-x 13 root root 4096 Nov 27 2013 var.orig/
>>>
>>> CC=icc CXX=icpc FC=ifort LDFLAGS='-Wl,-rpath,/opt/intel/composer_xe_2015.1.133/compiler/lib/intel64' ./configure --without-cma --prefix=/opt/sw/mpi/mvapich2/2.0.1_intel-15.0.1 --with-pbs=/opt/sw/admin/torque/current
>>> ...
>>> make -j12
>>> ...
>>> CC topology-synthetic.lo
>>> In file included from traversal.c(12):
>>> /scratch/novosirj/install-files/mvapich2-2.0/src/pm/hydra/tools/topo/hwloc/hwloc/include/private/misc.h(28): error: #error directive: "unknown size for unsigned int."
>>> #error "unknown size for unsigned int."
>>> ^
>>>
>>> Internal error: null pointer
>>>
>>> compilation aborted for traversal.c (code 4)
>>> make[4]: *** [traversal.lo] Error 1
>>> make[4]: *** Waiting for unfinished jobs....
>>> In file included from bitmap.c(12):
>>> /scratch/novosirj/install-files/mvapich2-2.0/src/pm/hydra/tools/topo/hwloc/hwloc/include/private/misc.h(28): error: #error directive: "unknown size for unsigned int."
>>> #error "unknown size for unsigned int."
>>> ^
>>>
>>> In file included from topology-synthetic.c(12):
>>> /scratch/novosirj/install-files/mvapich2-2.0/src/pm/hydra/tools/topo/hwloc/hwloc/include/private/misc.h(28): error: #error directive: "unknown size for unsigned int."
>>> #error "unknown size for unsigned int."
>>> ^
>>>
>>> In file included from diff.c(8):
>>> /scratch/novosirj/install-files/mvapich2-2.0/src/pm/hydra/tools/topo/hwloc/hwloc/include/private/misc.h(28): error: #error directive: "unknown size for unsigned int."
>>> #error "unknown size for unsigned int."
>>> ^
>>>
>>> In file included from misc.c(11):
>>> /scratch/novosirj/install-files/mvapich2-2.0/src/pm/hydra/tools/topo/hwloc/hwloc/include/private/misc.h(28): error: #error directive: "unknown size for unsigned int."
>>> #error "unknown size for unsigned int."
>>> ^
>>>
>>> Internal error: null pointer
>>>
>>> Internal error: null pointer
>>>
>>> compilation aborted for diff.c (code 4)
>>> compilation aborted for topology-synthetic.c (code 4)
>>> make[4]: *** [diff.lo] Error 1
>>> make[4]: *** [topology-synthetic.lo] Error 1
>>> Internal error: null pointer
>>>
>>> compilation aborted for misc.c (code 4)
>>> make[4]: *** [misc.lo] Error 1
>>> Internal error: null pointer
>>>
>>> compilation aborted for bitmap.c (code 4)
>>> make[4]: *** [bitmap.lo] Error 1
>>> make[4]: Leaving directory `/HPCTMP_NOBKUP/novosirj/install-files/mvapich2-2.0/src/pm/hydra/tools/topo/hwloc/hwloc/src'
>>> make[3]: *** [all-recursive] Error 1
>>> make[3]: Leaving directory `/HPCTMP_NOBKUP/novosirj/install-files/mvapich2-2.0/src/pm/hydra/tools/topo/hwloc/hwloc'
>>> make[2]: *** [all-recursive] Error 1
>>> make[2]: Leaving directory `/HPCTMP_NOBKUP/novosirj/install-files/mvapich2-2.0/src/pm/hydra'
>>> make[1]: *** [all-recursive] Error 1
>>> make[1]: Leaving directory `/HPCTMP_NOBKUP/novosirj/install-files/mvapich2-2.0'
>>> make: *** [all] Error 2
>>>
>>> Any ideas?
>>
>> I am still interested in getting this to work, but it does not with the Intel Composer XE Compiler 15.0.2 (you can see earlier I tried 15.0.1) and Torque 2.5.13. I can build MVAPICH2 2.0.1 with the same compiler just fine if I do not provide --with-pbs. Any ideas here? It also seems like a strange place to fail (e.g., not that related to PBS).
>>
>> --
>> ____ *Note: UMDNJ is now Rutgers-Biomedical and Health Sciences*
>> || \\UTGERS |---------------------*O*---------------------
>> ||_// Biomedical | Ryan Novosielski - Senior Technologist
>> || \\ and Health | novosirj at rutgers.edu - 973/972.0922 (2x0922)
>> || \\ Sciences | OIRT/High Perf & Res Comp - MSB C630, Newark
>> `'
>