[mvapich-discuss] Behavior of MV2_USE_SHMEM flags
Thompson, Matt (GSFC-610.1)[SCIENCE SYSTEMS AND APPLICATIONS INC]
matthew.thompson at nasa.gov
Mon Jun 16 14:16:09 EDT 2014
MVAPICH2 Discuss,
I'm currently investigating a problem in our code: running a 12x4 layout
on 4 Westmere nodes causes a divide-by-zero in an MPI_ALLREDUCE call,
but the error does not occur when the same 12x4 layout runs on 3 Sandy
Bridge nodes. This is with MVAPICH2 2.0rc1:
> forrtl: error (73): floating divide by zero
> Image              PC                Routine            Line     Source
> libirc.so          00002AAFDA4EB2C9  Unknown            Unknown  Unknown
> libirc.so          00002AAFDA4E9B9E  Unknown            Unknown  Unknown
> GEOSgcm.x          000000000323C342  Unknown            Unknown  Unknown
> GEOSgcm.x          00000000031C7FE3  Unknown            Unknown  Unknown
> GEOSgcm.x          00000000031D23C1  Unknown            Unknown  Unknown
> libc.so.6          00002AAFDA7679E0  Unknown            Unknown  Unknown
> libm.so.6          00002AAFD689504F  Unknown            Unknown  Unknown
> libmpich.so.12     00002AAFD9F62A87  Unknown            Unknown  Unknown
> libmpich.so.12     00002AAFD9F05055  Unknown            Unknown  Unknown
> libmpich.so.12     00002AAFD9F04AD3  Unknown            Unknown  Unknown
> libmpich.so.12     00002AAFD9E3DBA5  Unknown            Unknown  Unknown
> GEOSgcm.x          0000000002012704  parutilitiesmodul  4319     parutilitiesmodule.F90
My first thought was "let's try toggling MV2_USE_SHMEM_ALLREDUCE",
because in the past we've often fixed weird MVAPICH2 collective behavior
with that toggle (or one of the other MV2_USE_SHMEM flags). But I'm
having a hard time figuring out how MV2_USE_SHMEM_ALLREDUCE actually
behaves. Namely, the user guide says:
> 11.91 MV2_USE_SHMEM_ALLREDUCE
>
> Class: Run Time
> Applicable interface(s): OFA-IB-CH3 and OFA-iWARP-CH3
> This parameter can be used to turn off shared memory based
> MPI_Allreduce for OFA-IB-CH3 over IBA by setting this to 0.
And yet, when one sets MV2_USE_SHMEM_ALLREDUCE to 0 and
MV2_SHOW_ENV_INFO to 2, the output shows:
> MV2_USE_SHMEM_ALLREDUCE : 1
and, conversely, setting it to 1 shows 0.
Huh. So I looked in the code and found, in src/mpi/coll/ch3_shmem_coll.c
(lines 1583-1589):
>     if ((value = getenv("MV2_USE_SHMEM_ALLREDUCE")) != NULL) {
>         flag = (int) atoi(value);
>         if (flag > 0)
>             mv2_disable_shmem_allreduce = 0;
>         else
>             mv2_disable_shmem_allreduce = 1;
>     }
Well, okay: unlike the other flags, this one sets a *disable* flag rather
than unsetting an *enable* flag. But why does MV2_SHOW_ENV_INFO seem to
report the opposite of what I set?
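
To make the mismatch concrete, here is a minimal standalone C sketch. It
is not MVAPICH2 code. disable_flag_from_value() copies the branch logic
of the excerpt above; shown_if_disable_echoed() and shown_if_inverted()
are hypothetical helpers of mine. The idea that the MV2_SHOW_ENV_INFO
line might echo the internal disable flag unchanged is only a guess that
happens to match the output I see; I have not confirmed it in the source.

```c
#include <stddef.h>
#include <stdlib.h>

/* Same branch structure as the quoted excerpt: a positive value clears the
 * disable flag, anything else (0, negative, non-numeric) sets it.
 * `current` is returned when the variable is unset, as in the excerpt,
 * where the flag is simply left untouched. */
int disable_flag_from_value(const char *value, int current)
{
    if (value == NULL)
        return current;
    return (atoi(value) > 0) ? 0 : 1;
}

/* Convenience wrapper that reads the real environment variable. */
int disable_flag_from_env(int current)
{
    return disable_flag_from_value(getenv("MV2_USE_SHMEM_ALLREDUCE"), current);
}

/* HYPOTHETICAL: the value a "MV2_USE_SHMEM_ALLREDUCE : n" line would show
 * if the reporting code printed the disable flag as-is. This is an
 * assumption, not something taken from the MVAPICH2 source. */
int shown_if_disable_echoed(int disable_flag)
{
    return disable_flag;
}

/* HYPOTHETICAL: the value the same line would show if the reporting code
 * inverted the disable flag before printing, matching what the user set. */
int shown_if_inverted(int disable_flag)
{
    return !disable_flag;
}
```

With the variable set to 0, disable_flag_from_value("0", 0) yields 1, so
an echoing reporter would print 1 (what I observe) while an inverting
reporter would print 0 (what I expected).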
Now, in the end, setting MV2_USE_SHMEM_COLL=0 seems to have been the
solution (or at least a workaround). So I wondered: does setting
MV2_USE_SHMEM_COLL to 0 also turn off MV2_USE_SHMEM_ALLREDUCE,
MV2_USE_SHMEM_BARRIER, and MV2_USE_SHMEM_BCAST? I didn't see any logic
in the code that suggests so, but perhaps it happens at a level below
where I am looking? Judging by MV2_SHOW_ENV_INFO, my guess is "no", since
with MV2_USE_SHMEM_COLL set to 0, MV2_USE_SHMEM_BCAST is still reported
as 1.[1]
Matt
[1] And, weirdly, MV2_USE_SHMEM_BCAST=0 changes
MV2_SHMEM_COLL_NUM_PROCS, but MV2_USE_SHMEM_COLL=0 does not. MPICH
oddness, I guess.
--
Matt Thompson SSAI, Sr Software Test Engr
NASA GSFC, Global Modeling and Assimilation Office
Code 610.1, 8800 Greenbelt Rd, Greenbelt, MD 20771
Phone: 301-614-6712 Fax: 301-614-6246