[mvapich-discuss] Help using MVAPICH2 with Omni-Path

Subramoni, Hari subramoni.1 at osu.edu
Fri Jul 17 12:27:44 EDT 2020


Hi, Matt.

Sorry to hear that you have been consistently facing issues with MVAPICH2.

Can you please let me know how many processes per node you are running?

Can you also try setting MV2_SHMEM_COLL_MAX_MSG_SIZE=4096 to see if it allows you to use a larger number of shared memory windows?
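
With the tcsh setup in your message below, that would be, for example:

    setenv MV2_SHMEM_COLL_MAX_MSG_SIZE 4096

set in the environment before the mpirun_rsh launch.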

Thx,
Hari.

From: mvapich-discuss-bounces at cse.ohio-state.edu <mvapich-discuss-bounces at mailman.cse.ohio-state.edu> On Behalf Of Thompson, Matt (GSFC-610.1)[SCIENCE SYSTEMS AND APPLICATIONS INC]
Sent: Friday, July 17, 2020 10:15 AM
To: mvapich-discuss at cse.ohio-state.edu <mvapich-discuss at mailman.cse.ohio-state.edu>
Subject: [mvapich-discuss] Help using MVAPICH2 with Omni-Path

MVAPICH2 Support,

I recently decided to try to build MVAPICH2 2.3.4 for Omni-Path. The build seemed to work using:

  ./configure --with-device=ch3:psm CC=icc CXX=icpc FC=ifort

with Intel 19.1.1.217, and 'make check' seemed happy. I could also run a hello world, so I didn't screw things up too much. (I did try ch3:psm2 first and realized my mistake.)
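
(For the record, the hello-world check was roughly:

    mpicc helloworld.c -o helloworld
    mpirun_rsh -export -hostfile $PBS_NODEFILE -np 2 ./helloworld

where helloworld.c is just a trivial MPI_Init / MPI_Comm_rank / print / MPI_Finalize program, nothing fancier.)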

I then built the base libraries for our model and then the climate model (GEOS) itself. And that's when things got interesting.

First, I tried to run it with just:

   setenv MV2_ENABLE_AFFINITY 0

which is my "usual" MVAPICH2 flag. We aren't running with OpenMP, so it's sort of moot, but I set it out of habit. When I did this I got:

[borgc003:mpi_rank_80][MPIDI_CH3I_Win_allocate_shm]
        [WARNING] Shared memory window cannot be created, for better performance, please consider increasing the value of MV2_SHMEM_COLL_NUM_COMM (current value 8)

So, okay, I started increasing it: MV2_SHMEM_COLL_NUM_COMM=10, then 25, 50, 100, 200, and eventually up to 204:

[borgc001:mpi_rank_0][MPIDI_CH3I_Win_allocate_shm]
        [WARNING] Shared memory window cannot be created, for better performance, please consider increasing the value of MV2_SHMEM_COLL_NUM_COMM (current value 204)

And then 205:

/discover/swdev/gmao_SIteam/MPI/mvapich2/2.3.4/intel-19.1.1.217-omnipath/bin/mpirun_rsh  -export -hostfile $PBS_NODEFILE -np 96 ./GEOSgcm.x
[cli_40]: aborting job:
Fatal error in PMPI_Init_thread: Internal MPI error!, error stack:
MPIR_Init_thread(490):
MPID_Init(396).......: channel initialization failed
(unknown)(): Internal MPI error!
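
(For reference, each of these attempts was essentially the same job script, with only the MV2_SHMEM_COLL_NUM_COMM value changing, roughly:

    setenv MV2_ENABLE_AFFINITY 0
    setenv MV2_SHMEM_COLL_NUM_COMM 205
    mpirun_rsh -export -hostfile $PBS_NODEFILE -np 96 ./GEOSgcm.x

so nothing else was varied between runs.)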

I guess my first question is, why might 205 fail on our system? I see webpages out there where people have set this number to 1024 and it works!

I then thought, well let's try:

   setenv MV2_USE_SHMEM_ALLREDUCE 0
   setenv MV2_USE_SHMEM_BARRIER   0
   setenv MV2_USE_SHMEM_BCAST     0
   setenv MV2_USE_SHMEM_COLL      0
   setenv MV2_USE_SHMEM_REDUCE    0

That turns off all the SHMEM options I could find (and understand) in the User's Guide, and I still got:

[borgc003:mpi_rank_80][MPIDI_CH3I_Win_allocate_shm]
        [WARNING] Shared memory window cannot be created, for better performance, please consider increasing the value of MV2_SHMEM_COLL_NUM_COMM (current value 8)

Aww.

So, I am in the "I need help" category, but I'm not sure where to go next. I'm pretty sure configure found the psm2 bits, as a grep on the configure output shows:

checking psm2.h usability... yes
checking psm2.h presence... yes
checking for psm2.h... yes
checking for psm2_init in -lpsm2... yes
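
If I understand the mpiname utility that comes with the install correctly, something like

    mpiname -a | grep -i device

should also echo back the configure options (and so the --with-device=ch3:psm choice), if that helps confirm things.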

I've had issues with MVAPICH2 on our system in recent years, but I'd never tried it on the Omni-Path part of our cluster (there is also an InfiniBand part). I'd like to have an MPI stack other than Intel MPI for comparison, so I'm hoping to get this working.

I freely admit I might have messed things up as early as the configure step. I'm currently going back to run careful 'make check' passes on our base libraries (HDF5, netCDF, etc.); maybe this is bleeding over from netCDF or something?

Thanks,
Matt
--
Matt Thompson, SSAI, Ld Scientific Programmer/Analyst
NASA GSFC,    Global Modeling and Assimilation Office
Code 610.1,  8800 Greenbelt Rd,  Greenbelt,  MD 20771
Phone: 301-614-6712                 Fax: 301-614-6246
http://science.gsfc.nasa.gov/sed/bio/matthew.thompson

