[mvapich-discuss] Behavior of MV2_USE_SHMEM flags

Thompson, Matt (GSFC-610.1)[SCIENCE SYSTEMS AND APPLICATIONS INC] matthew.thompson at nasa.gov
Mon Jun 16 14:16:09 EDT 2014


MVAPICH2 Discuss,

I'm currently investigating a problem with our code where running 12x4
on 4 Westmere nodes causes a divide-by-zero in an MPI_ALLREDUCE call,
but the error doesn't occur when running 12x4 on 3 Sandy Bridge nodes.
This is with MVAPICH2 2.0rc1:

> forrtl: error (73): floating divide by zero
> Image              PC                Routine            Line        Source
> libirc.so          00002AAFDA4EB2C9  Unknown               Unknown  Unknown
> libirc.so          00002AAFDA4E9B9E  Unknown               Unknown  Unknown
> GEOSgcm.x          000000000323C342  Unknown               Unknown  Unknown
> GEOSgcm.x          00000000031C7FE3  Unknown               Unknown  Unknown
> GEOSgcm.x          00000000031D23C1  Unknown               Unknown  Unknown
> libc.so.6          00002AAFDA7679E0  Unknown               Unknown  Unknown
> libm.so.6          00002AAFD689504F  Unknown               Unknown  Unknown
> libmpich.so.12     00002AAFD9F62A87  Unknown               Unknown  Unknown
> libmpich.so.12     00002AAFD9F05055  Unknown               Unknown  Unknown
> libmpich.so.12     00002AAFD9F04AD3  Unknown               Unknown  Unknown
> libmpich.so.12     00002AAFD9E3DBA5  Unknown               Unknown  Unknown
> GEOSgcm.x          0000000002012704  parutilitiesmodul        4319  parutilitiesmodule.F90

My first thought was "let's try toggling MV2_USE_SHMEM_ALLREDUCE",
because in the past we've often fixed weird MVAPICH2 collective behavior
with that toggle (or the other MV2_USE_SHMEM flags), but I'm having a
hard time figuring out the behavior of MV2_USE_SHMEM_ALLREDUCE. Namely,
the user guide says:

> 11.91 MV2_USE_SHMEM_ALLREDUCE
>
> Class: Run Time
> Applicable interface(s): OFA-IB-CH3 and OFA-iWARP-CH3
> This parameter can be used to turn off shared memory based
> MPI_Allreduce for OFA-IB-CH3 over IBA by setting this to 0.

And yet, when I set MV2_USE_SHMEM_ALLREDUCE to 0 and
MV2_SHOW_ENV_INFO to 2, the output shows:

> 	MV2_USE_SHMEM_ALLREDUCE             : 1

and vice versa: setting it to 1 shows 0.

Huh. So I look in the code and see, in src/mpi/coll/ch3_shmem_coll.c:

>     if ((value = getenv("MV2_USE_SHMEM_ALLREDUCE")) != NULL) {
>         flag = (int) atoi(value);
>         if (flag > 0)
>             mv2_disable_shmem_allreduce = 0;
>         else
>             mv2_disable_shmem_allreduce = 1;
>     }

Well, okay, unlike other flags, it sets a *disable* flag rather than
unsetting an *enable* flag. But why does MV2_SHOW_ENV_INFO seem to
indicate the opposite of what I set?
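
The inversion in the excerpt can be exercised on its own. What follows is a
minimal standalone sketch, not MVAPICH2 code: the function names
disable_flag_from_value, disable_flag_from_env, and use_value_to_report are
hypothetical and exist only for illustration. It assumes that a report
printing a USE-style name would have to negate the internal disable flag.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper mirroring the quoted logic: a value greater than 0
 * clears the disable flag, anything else sets it. atoi() returns 0 for
 * non-numeric text, so a value such as "yes" also disables the feature. */
int disable_flag_from_value(const char *value, int current_disable)
{
    if (value == NULL)
        return current_disable;   /* variable unset: keep the default */
    return (atoi(value) > 0) ? 0 : 1;
}

/* Hypothetical wrapper that reads the named environment variable. */
int disable_flag_from_env(const char *name, int current_disable)
{
    assert(name != NULL);
    return disable_flag_from_value(getenv(name), current_disable);
}

/* Assumption: a report keyed on the USE-style name must print the
 * negation of the internal disable flag. */
int use_value_to_report(int disable_flag)
{
    return !disable_flag;
}
```

With this sketch, a value of "0" yields a disable flag of 1 and a reported
USE value of 0. If a report printed the disable flag itself under the USE
name, 0 would display as 1, which would match the symptom; whether that is
what actually happens inside MVAPICH2 is not confirmed here.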

Now, in the end, it seems that setting MV2_USE_SHMEM_COLL=0 was the
solution (or at least a workaround). So, I wondered, does setting
MV2_USE_SHMEM_COLL to 0 turn off MV2_USE_SHMEM_ALLREDUCE,
MV2_USE_SHMEM_BARRIER, and MV2_USE_SHMEM_BCAST? I didn't see any logic
in the code to suggest that, but perhaps it's at a level below where I
am looking? Judging by the MV2_SHOW_ENV_INFO output, my guess is "no",
since with MV2_USE_SHMEM_COLL set to 0, MV2_USE_SHMEM_BCAST is still
reported as 1.[1]
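
The question above turns on the difference between the value reported for a
flag and the value that takes effect. The sketch below shows one way a
master switch could gate per-collective flags without changing what is
reported for them. It is purely hypothetical: the struct shmem_settings and
the effective_* functions are invented for illustration, and nothing here
is taken from the MVAPICH2 source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical settings record; the field names echo the environment
 * variables, but the layout is illustrative only. */
struct shmem_settings {
    bool use_shmem_coll;       /* master switch */
    bool use_shmem_allreduce;  /* per-collective flags, as reported */
    bool use_shmem_barrier;
    bool use_shmem_bcast;
};

/* Assumed gating rule: a collective takes the shared-memory path only
 * if both the master switch and its own flag are on. The stored
 * per-collective flag is never modified. */
bool effective_allreduce(const struct shmem_settings *s)
{
    assert(s != NULL);
    return s->use_shmem_coll && s->use_shmem_allreduce;
}

bool effective_barrier(const struct shmem_settings *s)
{
    assert(s != NULL);
    return s->use_shmem_coll && s->use_shmem_barrier;
}

bool effective_bcast(const struct shmem_settings *s)
{
    assert(s != NULL);
    return s->use_shmem_coll && s->use_shmem_bcast;
}
```

Under that assumed rule, a dump of the stored flags would still show the
broadcast flag as 1 with the master switch off, even though the
shared-memory path would not be taken. So a reported 1 would not by itself
rule out gating at a lower level.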

Matt

[1] And, weirdly, MV2_USE_SHMEM_BCAST=0 changes 
MV2_SHMEM_COLL_NUM_PROCS, but MV2_USE_SHMEM_COLL=0 does not. MPICH 
oddness, I guess.


-- 
Matt Thompson          SSAI, Sr Software Test Engr
NASA GSFC, Global Modeling and Assimilation Office
Code 610.1, 8800 Greenbelt Rd, Greenbelt, MD 20771
Phone: 301-614-6712              Fax: 301-614-6246



