[mvapich-discuss] Problems installing mvapich2/2.3 with Slurm

Raghu Reddy raghu.reddy at noaa.gov
Tue Mar 12 08:26:25 EDT 2019


Hi Hari,

 

Thank you very much for this information!  That was helpful in getting past
the build stage.  Once it was built, I was able to compile the hello
world code successfully.  However, I am now running into the runtime error
shown below.

 

Here is the current configure line:

 

./configure --prefix=$INSTALLDIR --with-device=ch3:psm --enable-romio=yes
--enable-shared -enable-fortran=yes --with-pm=slurm --with-pmi=pmi2
--with-slurm=/apps/slurm/default CC=icc CXX=icpc F77=ifort FC=ifort | & tee
configure-ch3.out-rr

 

Followed by the usual make, make check, and make install, and all of them
completed successfully.

 

I just want to mention that we are new to the Slurm environment.  We have
been running Intel MPI applications successfully in this new environment,
but we have not exercised mvapich2 much.  Initially we tried to use our
old mvapich2 installation (not specifically configured for Slurm), and
because that was not working we are trying to install the latest version.
We had been planning to upgrade to the latest libraries anyway.

 

Any suggestions on what we may have missed?

 

Thanks again for the quick feedback!

 

sfe01% module load slurm intel/18.0.3.222 mvapich2/2.3
sfe01%

sfe01% mpicc hello_mpi_c.c
sfe01%

sfe01% srun --ntasks=4 --ntasks-per-node=2 ./a.out
s0014.83669PSM2 no hfi units are available (err=23)
s0015.134286PSM2 no hfi units are available (err=23)
s0014.83668PSM2 no hfi units are available (err=23)
s0015.134287PSM2 no hfi units are available (err=23)
[s0015:mpi_rank_3][mv2_psm_err_handler] PSM error handler: Failure in initializing endpoint : PSM2 no hfi units are available
[s0015:mpi_rank_2][mv2_psm_err_handler] PSM error handler: Failure in initializing endpoint : PSM2 no hfi units are available
[s0015:mpi_rank_2][psm_doinit] MV2_WARNING: Failed to open an end-point: Failure in initializing endpoint, retry attempt 1 of 10 in 10 seconds
[s0015:mpi_rank_3][psm_doinit] MV2_WARNING: Failed to open an end-point: Failure in initializing endpoint, retry attempt 1 of 10 in 10 seconds
[s0014:mpi_rank_0][mv2_psm_err_handler] PSM error handler: Failure in initializing endpoint : PSM2 no hfi units are available
[s0014:mpi_rank_0][psm_doinit] MV2_WARNING: Failed to open an end-point: Failure in initializing endpoint, retry attempt 1 of 10 in 10 seconds
[s0014:mpi_rank_1][mv2_psm_err_handler] PSM error handler: Failure in initializing endpoint : PSM2 no hfi units are available
[s0014:mpi_rank_1][psm_doinit] MV2_WARNING: Failed to open an end-point: Failure in initializing endpoint, retry attempt 1 of 10 in 10 seconds
^Csrun: interrupt (one more within 1 sec to abort)
srun: step:126522.0 tasks 0-3: running
^Csrun: interrupt (one more within 1 sec to abort)
srun: step:126522.0 tasks 0-3: running
^Csrun: interrupt (one more within 1 sec to abort)
srun: step:126522.0 tasks 0-3: running
^Csrun: sending Ctrl-C to job 126522.0
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 126522.0 ON s0014 CANCELLED AT 2019-03-12T12:16:42 ***
sfe01%
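
One check that might help narrow this down, given that the hardware here is
TruScale: "hfi" is the Omni-Path device naming and PSM2 (libpsm2) is the
Omni-Path library, while TruScale/QLogic adapters use the older PSM library
(libpsm_infinipath).  The sketch below classifies `ldd` output by which of
the two libraries appears.  The function name psm_flavor and the libmpi.so
path in the usage comment are assumptions, and this is only a diagnostic
aid, not a confirmed cause of the error above.

```shell
#!/bin/sh
# psm_flavor: hypothetical helper, not part of MVAPICH2 or Slurm.
# Reads `ldd` output on stdin and prints:
#   psm2  if libpsm2 (Omni-Path) is linked,
#   psm   if only libpsm_infinipath (TruScale/QLogic) is linked,
#   none  if neither library appears.
psm_flavor() {
    flavor=none
    while IFS= read -r line; do
        case "$line" in
            *libpsm2.so*)
                flavor=psm2 ;;
            *libpsm_infinipath.so*)
                if [ "$flavor" = none ]; then flavor=psm; fi ;;
        esac
    done
    echo "$flavor"
}

# Example use (the library path is an assumption about the install layout):
#   ldd "$INSTALLDIR/lib/libmpi.so" | psm_flavor
```

If this prints psm2 on a TruScale system, the build may have picked up the
Omni-Path library rather than the one matching the installed adapters.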

 

 

Thanks,

Raghu

 

 

 

From: Subramoni, Hari [mailto:subramoni.1 at osu.edu] 
Sent: Monday, March 11, 2019 5:48 PM
To: Carlson, Timothy S <Timothy.Carlson at pnnl.gov>; Raghu Reddy
<raghu.reddy at noaa.gov>; mvapich-discuss at cse.ohio-state.edu
<mvapich-discuss at mailman.cse.ohio-state.edu>
Cc: Brian Osmond <brian.osmond at noaa.gov>; 'Kyle Stern'
<kstern at redlineperf.com>; Subramoni, Hari <subramoni.1 at osu.edu>
Subject: RE: [mvapich-discuss] Problems installing mvapich2/2.3 with Slurm

 

Hi, Raghu.

 

--with-device=ch3:psm and --with-rdma=gen2 are not compatible. Can you just
mention --with-device=ch3:psm for QLogic/Omni-Path and see if it helps?
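
As a rough illustration of that rule, here is a small shell sketch that
scans a list of configure arguments and reports the combination above.  The
helper name check_mv2_flags is hypothetical (it is not part of MVAPICH2),
and it covers only the one conflict mentioned in this thread.

```shell
#!/bin/sh
# check_mv2_flags: hypothetical helper, not shipped with MVAPICH2.
# Prints "ok" and returns 0 unless both --with-device=ch3:psm and
# --with-rdma=gen2 are present, in which case it prints a message
# and returns 1.
check_mv2_flags() {
    has_psm=no
    has_gen2=no
    for arg in "$@"; do
        case "$arg" in
            --with-device=ch3:psm) has_psm=yes ;;
            --with-rdma=gen2)      has_gen2=yes ;;
        esac
    done
    if [ "$has_psm" = yes ] && [ "$has_gen2" = yes ]; then
        echo "conflict: drop --with-rdma=gen2 when using --with-device=ch3:psm"
        return 1
    fi
    echo "ok"
}

# Example use:
#   check_mv2_flags --with-device=ch3:psm --with-pm=slurm --with-pmi=pmi2
```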

 

Thx,

Hari.

 

From: mvapich-discuss <mvapich-discuss-bounces at cse.ohio-state.edu> On
Behalf Of Carlson, Timothy S
Sent: Monday, March 11, 2019 5:21 PM
To: Raghu Reddy <raghu.reddy at noaa.gov>;
mvapich-discuss at cse.ohio-state.edu
<mvapich-discuss at mailman.cse.ohio-state.edu>
Cc: Brian Osmond <brian.osmond at noaa.gov>; 'Kyle Stern'
<kstern at redlineperf.com>
Subject: Re: [mvapich-discuss] Problems installing mvapich2/2.3 with Slurm

 

That include file should be in the tarball. At least it is there in 2.3.1:

# tar ztf mvapich2-2.3.1.tar.gz | grep ibv_param.h
mvapich2-2.3.1/src/mpid/ch3/channels/mrail/src/gen2/ibv_param.h

 

From: mvapich-discuss <mvapich-discuss-bounces at cse.ohio-state.edu> On
Behalf Of Raghu Reddy
Sent: Monday, March 11, 2019 2:17 PM
To: mvapich-discuss at cse.ohio-state.edu
<mvapich-discuss at mailman.cse.ohio-state.edu>
Cc: Brian Osmond <brian.osmond at noaa.gov>; 'Kyle Stern'
<kstern at redlineperf.com>
Subject: [mvapich-discuss] Problems installing mvapich2/2.3 with Slurm

 

Hi all,

 

Here is the info on the hardware:

 

-        Intel Haswell processors, with two 12-core sockets (for a total of
24 cores/node)

-        Intel TruScale IB network

 

I am using the following configure line for building with the Intel compiler
(intel/18.0.3.222):

 

./configure --prefix=$INSTALLDIR --with-device=ch3:psm
--with-ib-libpath=/usr/lib64 --with-rdma=gen2 --enable-romio=yes
--enable-shared -enable-fortran=yes --with-pm=slurm --with-pmi=pmi2
--with-slurm=/apps/slurm/default CC=icc CXX=icpc F77=ifort FC=ifort | & tee
configure-ch3.out-rr

 

I get the following error at make:

 

----------------
  CC       src/mpid/ch3/channels/common/src/util/lib_libmpi_la-mv2_config.lo
  CC       src/mpid/ch3/channels/common/src/util/lib_libmpi_la-error_handling.lo
  CC       src/mpid/ch3/channels/common/src/util/lib_libmpi_la-debug_utils.lo
  CC       src/mpid/ch3/channels/common/src/util/lib_libmpi_la-mv2_clock.lo
  CC       src/mpid/ch3/channels/common/src/ft/lib_libmpi_la-cr.lo
src/mpid/ch3/channels/common/src/ft/cr.c(19): catastrophic error: cannot open source file "ibv_param.h"
  #include "ibv_param.h"
                        ^

compilation aborted for src/mpid/ch3/channels/common/src/ft/cr.c (code 4)
make[2]: *** [src/mpid/ch3/channels/common/src/ft/lib_libmpi_la-cr.lo] Error 1
make[2]: *** Waiting for unfinished jobs....
make[2]: Leaving directory `/tds_scratch3/SYSADMIN/nesccmgmt/Raghu.Reddy/apps/mvapich2-2.3'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/tds_scratch3/SYSADMIN/nesccmgmt/Raghu.Reddy/apps/mvapich2-2.3'
make: *** [all] Error 2
sfe01%
----------------

 

If I leave out "--with-device=ch3:psm" it completes the build process, but
when I run a test code I get the following error:

 

sfe01% srun --ntasks=4 --ntasks-per-node=2 ./a.out
[s0014:mpi_rank_0][rdma_find_network_type] QLogic IB card detected in system
[s0014:mpi_rank_0][rdma_find_network_type] Please re-configure the library with the '--with-device=ch3:psm' configure option for best performance
[s0014:mpi_rank_1][rdma_find_network_type] QLogic IB card detected in system
[s0014:mpi_rank_1][rdma_find_network_type] Please re-configure the library with the '--with-device=ch3:psm' configure option for best performance
[s0015:mpi_rank_2][rdma_find_network_type] QLogic IB card detected in system
[s0015:mpi_rank_2][rdma_find_network_type] Please re-configure the library with the '--with-device=ch3:psm' configure option for best performance
[s0015:mpi_rank_3][rdma_find_network_type] QLogic IB card detected in system
[s0015:mpi_rank_3][rdma_find_network_type] Please re-configure the library with the '--with-device=ch3:psm' configure option for best performance
Warning: RDMA CM Initialization failed. Continuing without RDMA CM support. Please set MV2_USE_RDMA_CM=0 to disable RDMA CM.
Hello from rank 00 out of 4; procname = s0014, cpuid = 0
Hello from rank 02 out of 4; procname = s0015, cpuid = 0
Hello from rank 01 out of 4; procname = s0014, cpuid = 1
Hello from rank 03 out of 4; procname = s0015, cpuid = 1
sfe01%

 

I believe "--with-device=ch3:psm" is the right thing to do for this
architecture, but I am not able to get past the step above.

 

I do see that the file exists in the distribution, so I am not sure why the
build is not finding it:

 

sfe01% find . -name ibv_param.h
./src/mpid/ch3/channels/mrail/src/gen2/ibv_param.h
sfe01%

 

Any suggestions on what I may be doing wrong?

 

Thanks,

Raghu

 
