[mvapich-discuss] Problems installing mvapich2/2.3 with Slurm

Carlson, Timothy S Timothy.Carlson at pnnl.gov
Tue Mar 12 13:18:12 EDT 2019


That should work.  Can you check that those nodes actually have a working HFI?  Does "opainfo" report that the card is there?
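
For example, something along these lines run on one of the compute nodes (not the front end) should show whether the adapter and its device files are actually visible; the exact device names, and whether you have opainfo or the older QLogic tools, will depend on your adapter and driver stack:

[tim at n0001 ~]$ opainfo
[tim at n0001 ~]$ lspci | grep -i -e omni -e qlogic
[tim at n0001 ~]$ ls -l /dev/hfi1* /dev/ipath*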

This example is pretty much your setup.

[tim at compy-admin ~]$ module purge
[tim at compy-admin ~]$ module load intel/19.0.3
[tim at compy-admin ~]$ module load mvapich2/2.3.1
[tim at compy-admin ~]$ mpicc hello_node.c -o hello.mvapich2
[tim at compy-admin ~]$ srun --ntasks=4 --ntasks-per-node=2 ./hello.mvapich2
Hello world!  I am process number: 0 on host n0002.local
Hello world!  I am process number: 0 on host n0002.local
Hello world!  I am process number: 0 on host n0001.local
Hello world!  I am process number: 0 on host n0001.local

From: Raghu Reddy <raghu.reddy at noaa.gov>
Sent: Tuesday, March 12, 2019 5:26 AM
To: 'Subramoni, Hari' <subramoni.1 at osu.edu>; Carlson, Timothy S <Timothy.Carlson at pnnl.gov>; 'mvapich-discuss at cse.ohio-state.edu' <mvapich-discuss at mailman.cse.ohio-state.edu>
Cc: 'Brian Osmond' <brian.osmond at noaa.gov>; 'Kyle Stern' <kstern at redlineperf.com>; Raghu Reddy <raghu.reddy at noaa.gov>
Subject: RE: [mvapich-discuss] Problems installing mvapich2/2.3 with Slurm

Hi Hari,

Thank you very much for this information!  That was helpful in getting past the building stage.  Once it was built, I was able to compile the hello world code successfully.  However, I am running into the run-time error shown below.

Here is the current configure line:

./configure --prefix=$INSTALLDIR --with-device=ch3:psm --enable-romio=yes --enable-shared -enable-fortran=yes --with-pm=slurm --with-pmi=pmi2 --with-slurm=/apps/slurm/default CC=icc CXX=icpc F77=ifort FC=ifort | & tee configure-ch3.out-rr

This was followed by the usual make, make check, and make install, all of which completed successfully.
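
Concretely, that was roughly:

sfe01% make
sfe01% make check
sfe01% make install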

I just want to mention that we are new to the Slurm environment.  We have been running Intel MPI applications successfully in this new environment, but we have not exercised mvapich2 much.  Initially we tried to use our old mvapich2 installation (not specifically configured for Slurm), and because that was not working we are trying to install the latest version.  We had been planning to upgrade to the latest libraries anyway.

Any suggestions on what we may have missed?

Thanks again for the quick feedback!

sfe01% module load slurm intel/18.0.3.222 mvapich2/2.3
sfe01%

sfe01% mpicc hello_mpi_c.c
sfe01%

sfe01% srun --ntasks=4 --ntasks-per-node=2 ./a.out
s0014.83669PSM2 no hfi units are available (err=23)
s0015.134286PSM2 no hfi units are available (err=23)
s0014.83668PSM2 no hfi units are available (err=23)
s0015.134287PSM2 no hfi units are available (err=23)
[s0015:mpi_rank_3][mv2_psm_err_handler] PSM error handler: Failure in initializing endpoint : PSM2 no hfi units are available
[s0015:mpi_rank_2][mv2_psm_err_handler] PSM error handler: Failure in initializing endpoint : PSM2 no hfi units are available
[s0015:mpi_rank_2][psm_doinit] MV2_WARNING: Failed to open an end-point: Failure in initializing endpoint, retry attempt 1 of 10 in 10 seconds
[s0015:mpi_rank_3][psm_doinit] MV2_WARNING: Failed to open an end-point: Failure in initializing endpoint, retry attempt 1 of 10 in 10 seconds
[s0014:mpi_rank_0][mv2_psm_err_handler] PSM error handler: Failure in initializing endpoint : PSM2 no hfi units are available
[s0014:mpi_rank_0][psm_doinit] MV2_WARNING: Failed to open an end-point: Failure in initializing endpoint, retry attempt 1 of 10 in 10 seconds
[s0014:mpi_rank_1][mv2_psm_err_handler] PSM error handler: Failure in initializing endpoint : PSM2 no hfi units are available
[s0014:mpi_rank_1][psm_doinit] MV2_WARNING: Failed to open an end-point: Failure in initializing endpoint, retry attempt 1 of 10 in 10 seconds
^Csrun: interrupt (one more within 1 sec to abort)
srun: step:126522.0 tasks 0-3: running
^Csrun: interrupt (one more within 1 sec to abort)
srun: step:126522.0 tasks 0-3: running
^Csrun: interrupt (one more within 1 sec to abort)
srun: step:126522.0 tasks 0-3: running
^Csrun: sending Ctrl-C to job 126522.0
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 126522.0 ON s0014 CANCELLED AT 2019-03-12T12:16:42 ***
sfe01%


Thanks,
Raghu



From: Subramoni, Hari [mailto:subramoni.1 at osu.edu]
Sent: Monday, March 11, 2019 5:48 PM
To: Carlson, Timothy S <Timothy.Carlson at pnnl.gov>; Raghu Reddy <raghu.reddy at noaa.gov>; mvapich-discuss at cse.ohio-state.edu <mvapich-discuss at mailman.cse.ohio-state.edu>
Cc: Brian Osmond <brian.osmond at noaa.gov>; 'Kyle Stern' <kstern at redlineperf.com>; Subramoni, Hari <subramoni.1 at osu.edu>
Subject: RE: [mvapich-discuss] Problems installing mvapich2/2.3 with Slurm

Hi, Raghu.

--with-device=ch3:psm and --with-rdma=gen2 are not compatible. Can you specify just --with-device=ch3:psm for QLogic/Omni-Path and see if it helps?
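
That is, something like this, keeping the rest of your options as they are (this also drops --with-ib-libpath, which should not be needed for the PSM device):

./configure --prefix=$INSTALLDIR --with-device=ch3:psm --enable-romio=yes --enable-shared --enable-fortran=yes --with-pm=slurm --with-pmi=pmi2 --with-slurm=/apps/slurm/default CC=icc CXX=icpc F77=ifort FC=ifort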

Thx,
Hari.

From: mvapich-discuss <mvapich-discuss-bounces at cse.ohio-state.edu> On Behalf Of Carlson, Timothy S
Sent: Monday, March 11, 2019 5:21 PM
To: Raghu Reddy <raghu.reddy at noaa.gov>; mvapich-discuss at cse.ohio-state.edu <mvapich-discuss at mailman.cse.ohio-state.edu>
Cc: Brian Osmond <brian.osmond at noaa.gov>; 'Kyle Stern' <kstern at redlineperf.com>
Subject: Re: [mvapich-discuss] Problems installing mvapich2/2.3 with Slurm

That include file should be in the tarball. At least it is there in 2.3.1:

# tar ztf mvapich2-2.3.1.tar.gz | grep ibv_param.h
mvapich2-2.3.1/src/mpid/ch3/channels/mrail/src/gen2/ibv_param.h

From: mvapich-discuss <mvapich-discuss-bounces at cse.ohio-state.edu> On Behalf Of Raghu Reddy
Sent: Monday, March 11, 2019 2:17 PM
To: mvapich-discuss at cse.ohio-state.edu <mvapich-discuss at mailman.cse.ohio-state.edu>
Cc: Brian Osmond <brian.osmond at noaa.gov>; 'Kyle Stern' <kstern at redlineperf.com>
Subject: [mvapich-discuss] Problems installing mvapich2/2.3 with Slurm

Hi all,

Here is the info on the hardware:


  *   Intel Haswell processors, with two 12-core sockets (for a total of 24 cores/node)
  *   Intel TruScale IB network

I am using the following configure line for building with the Intel compiler (intel/18.0.3.222):

./configure --prefix=$INSTALLDIR --with-device=ch3:psm --with-ib-libpath=/usr/lib64 --with-rdma=gen2 --enable-romio=yes --enable-shared -enable-fortran=yes --with-pm=slurm --with-pmi=pmi2 --with-slurm=/apps/slurm/default CC=icc CXX=icpc F77=ifort FC=ifort | & tee configure-ch3.out-rr

I get the following error at make:

----------------
  CC       src/mpid/ch3/channels/common/src/util/lib_libmpi_la-mv2_config.lo
  CC       src/mpid/ch3/channels/common/src/util/lib_libmpi_la-error_handling.lo
  CC       src/mpid/ch3/channels/common/src/util/lib_libmpi_la-debug_utils.lo
  CC       src/mpid/ch3/channels/common/src/util/lib_libmpi_la-mv2_clock.lo
  CC       src/mpid/ch3/channels/common/src/ft/lib_libmpi_la-cr.lo
src/mpid/ch3/channels/common/src/ft/cr.c(19): catastrophic error: cannot open source file "ibv_param.h"
  #include "ibv_param.h"
                        ^

compilation aborted for src/mpid/ch3/channels/common/src/ft/cr.c (code 4)
make[2]: *** [src/mpid/ch3/channels/common/src/ft/lib_libmpi_la-cr.lo] Error 1
make[2]: *** Waiting for unfinished jobs....
make[2]: Leaving directory `/tds_scratch3/SYSADMIN/nesccmgmt/Raghu.Reddy/apps/mvapich2-2.3'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/tds_scratch3/SYSADMIN/nesccmgmt/Raghu.Reddy/apps/mvapich2-2.3'
make: *** [all] Error 2
sfe01%
----------------

If I leave out "--with-device=ch3:psm", the build completes, but when I run a test code I get the following error:

sfe01% srun --ntasks=4 --ntasks-per-node=2 ./a.out
[s0014:mpi_rank_0][rdma_find_network_type] QLogic IB card detected in system
[s0014:mpi_rank_0][rdma_find_network_type] Please re-configure the library with the '--with-device=ch3:psm' configure option for best performance
[s0014:mpi_rank_1][rdma_find_network_type] QLogic IB card detected in system
[s0014:mpi_rank_1][rdma_find_network_type] Please re-configure the library with the '--with-device=ch3:psm' configure option for best performance
[s0015:mpi_rank_2][rdma_find_network_type] QLogic IB card detected in system
[s0015:mpi_rank_2][rdma_find_network_type] Please re-configure the library with the '--with-device=ch3:psm' configure option for best performance
[s0015:mpi_rank_3][rdma_find_network_type] QLogic IB card detected in system
[s0015:mpi_rank_3][rdma_find_network_type] Please re-configure the library with the '--with-device=ch3:psm' configure option for best performance
Warning: RDMA CM Initialization failed. Continuing without RDMA CM support. Please set MV2_USE_RDMA_CM=0 to disable RDMA CM.
Hello from rank 00 out of 4; procname = s0014, cpuid = 0
Hello from rank 02 out of 4; procname = s0015, cpuid = 0
Hello from rank 01 out of 4; procname = s0014, cpuid = 1
Hello from rank 03 out of 4; procname = s0015, cpuid = 1
sfe01%
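
For reference, the test code is just a trivial MPI hello world; it is essentially along these lines (a sketch of what it does, not the exact source):

#define _GNU_SOURCE
#include <mpi.h>
#include <sched.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, namelen;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &namelen);

    /* Report the rank, the host name, and the core this rank is running on */
    printf("Hello from rank %02d out of %d; procname = %s, cpuid = %d\n",
           rank, size, name, sched_getcpu());

    MPI_Finalize();
    return 0;
}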

I believe "--with-device=ch3:psm" is the right thing to do for this architecture, but I am not able to get past the step above.

I do see that the file exists in the distribution, so I am not sure why the build is not finding it:

sfe01% find . -name ibv_param.h
./src/mpid/ch3/channels/mrail/src/gen2/ibv_param.h
sfe01%

Any suggestions on what I may be doing wrong?

Thanks,
Raghu
