[mvapich-discuss] Behavior of MV2_USE_SHMEM flags

Hari Subramoni subramoni.1 at osu.edu
Tue Jun 17 12:51:48 EDT 2014


Hello Matt,

Thanks for reporting the discrepancy in the way we report the
environment variables. We will fix this in our next release.

We tried to reproduce the divide-by-zero you mentioned, but we were unable
to do so with simple microbenchmarks. Could you please send us a reproducer
so that we can try it out locally?

Regards,
Hari.


On Mon, Jun 16, 2014 at 2:16 PM, Thompson, Matt (GSFC-610.1)[SCIENCE
SYSTEMS AND APPLICATIONS INC] <matthew.thompson at nasa.gov> wrote:

> MVAPICH2 Discuss,
>
> I'm currently investigating a problem with our code where running 12x4 on
> 4 Westmere nodes causes a divide-by-zero in an MPI_ALLREDUCE call, but the
> error doesn't occur when we run 12x4 on 3 Sandy Bridge nodes. This is using
> MVAPICH2 2.0rc1:
>
>> forrtl: error (73): floating divide by zero
>> Image              PC                Routine            Line        Source
>> libirc.so          00002AAFDA4EB2C9  Unknown               Unknown  Unknown
>> libirc.so          00002AAFDA4E9B9E  Unknown               Unknown  Unknown
>> GEOSgcm.x          000000000323C342  Unknown               Unknown  Unknown
>> GEOSgcm.x          00000000031C7FE3  Unknown               Unknown  Unknown
>> GEOSgcm.x          00000000031D23C1  Unknown               Unknown  Unknown
>> libc.so.6          00002AAFDA7679E0  Unknown               Unknown  Unknown
>> libm.so.6          00002AAFD689504F  Unknown               Unknown  Unknown
>> libmpich.so.12     00002AAFD9F62A87  Unknown               Unknown  Unknown
>> libmpich.so.12     00002AAFD9F05055  Unknown               Unknown  Unknown
>> libmpich.so.12     00002AAFD9F04AD3  Unknown               Unknown  Unknown
>> libmpich.so.12     00002AAFD9E3DBA5  Unknown               Unknown  Unknown
>> GEOSgcm.x          0000000002012704  parutilitiesmodul        4319  parutilitiesmodule.F90
>>
>
> My first thought was "let's try toggling MV2_USE_SHMEM_ALLREDUCE", because
> in the past we've often fixed weird MVAPICH2 collective behavior with that
> toggle (or the other MV2_USE_SHMEM flags), but I'm having a hard time
> figuring out the behavior of MV2_USE_SHMEM_ALLREDUCE. Namely, the user
> guide says:
>
>> 11.91 MV2_USE_SHMEM_ALLREDUCE
>>
>> Class: Run Time
>> Applicable interface(s): OFA-IB-CH3 and OFA-iWARP-CH3
>> This parameter can be used to turn off shared memory based
>> MPI_Allreduce for OFA-IB-CH3 over IBA by setting this to 0.
>>
>
> And yet, when I set MV2_USE_SHMEM_ALLREDUCE to 0 and MV2_SHOW_ENV_INFO
> to 2, I see:
>
>          MV2_USE_SHMEM_ALLREDUCE             : 1
>>
>
> and vice versa: setting it to 1 reports 0.
>
> Huh. So I look in the code and see, in src/mpi/coll/ch3_shmem_coll.c:
>
>>     if ((value = getenv("MV2_USE_SHMEM_ALLREDUCE")) != NULL) {
>>         flag = (int) atoi(value);
>>         if (flag > 0)
>>             mv2_disable_shmem_allreduce = 0;
>>         else
>>             mv2_disable_shmem_allreduce = 1;
>>     }
>>
>
> Well, okay, unlike other flags, it sets a *disable* flag rather than
> clearing an *enable* flag. But why does MV2_SHOW_ENV_INFO seem to report
> the opposite of what I set?
>
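
One plausible way for a display like this to come out inverted is for the
reporting path to print the internal *disable* flag under the *USE* label
without converting it back. The sketch below only illustrates that guess; it
is not MVAPICH2's reporting code. The parsing rule is taken from the snippet
quoted above, while `disable_shmem_allreduce`, `apply_use_shmem_allreduce`,
`report_raw`, `report_use`, and `show` are made-up names for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for mv2_disable_shmem_allreduce: 1 means "disabled". */
int disable_shmem_allreduce = 0;

/* Same rule as the quoted snippet: a value > 0 enables the feature,
 * anything else (0, negative, or non-numeric text) disables it. */
void apply_use_shmem_allreduce(const char *value)
{
    assert(value != NULL);
    disable_shmem_allreduce = (atoi(value) > 0) ? 0 : 1;
}

/* Assumed faulty reporter: returns the internal disable flag unchanged,
 * so a "USE" label would show 1 when the user asked for 0. */
int report_raw(void)
{
    return disable_shmem_allreduce;
}

/* Reporter that converts the disable flag back into "USE" terms. */
int report_use(void)
{
    return !disable_shmem_allreduce;
}

/* Applies a setting and prints both views of it. */
void show(const char *value)
{
    apply_use_shmem_allreduce(value);
    printf("set %s -> raw: %d, converted: %d\n",
           value, report_raw(), report_use());
}
```

Calling `show("0")` gives a raw value of 1 and a converted value of 0, which
matches the symptom described above if the raw flag is what gets printed.
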
> Now, in the end, it seems that setting MV2_USE_SHMEM_COLL=0 was the
> solution (or at least a workaround). So, I wondered, does setting
> MV2_USE_SHMEM_COLL to 0 also turn off MV2_USE_SHMEM_ALLREDUCE,
> MV2_USE_SHMEM_BARRIER, and MV2_USE_SHMEM_BCAST? I didn't see any logic in
> the code to suggest that but, perhaps, it's at a level below where I am
> looking? Judging by MV2_SHOW_ENV_INFO, my guess is "no", since if I set
> MV2_USE_SHMEM_COLL to 0, MV2_USE_SHMEM_BCAST is still reported as 1.[1]
>
> Matt
>
> [1] And, weirdly, MV2_USE_SHMEM_BCAST=0 changes MV2_SHMEM_COLL_NUM_PROCS,
> but MV2_USE_SHMEM_COLL=0 does not. MPICH oddness, I guess.
>
>
> --
> Matt Thompson          SSAI, Sr Software Test Engr
> NASA GSFC, Global Modeling and Assimilation Office
> Code 610.1, 8800 Greenbelt Rd, Greenbelt, MD 20771
> Phone: 301-614-6712              Fax: 301-614-6246
>
>
>