[mvapich-discuss] MVAPICH2 2.1a: Code stalls on Sandy Bridge, works on Westmere

Hari Subramoni subramoni.1 at osu.edu
Tue Jan 6 09:35:53 EST 2015


Hello Matt,

Sorry to hear that you're seeing issues with MVAPICH2-2.1rc1. Could you
please give us some more information about the experimental setup like
number of processes, number of nodes, processes per node as well as the
config flags and compilers used to build MVAPICH2? This will enable us to
debug the issue further.

Are you using CMA here? If not, could you please try using CMA
(MV2_SMP_USE_CMA=1) to see if the hang goes away?
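
One way to run that test is to pass the setting inline on the launch command. The sketch below only assembles the argument list; it assumes the job is started with mpirun_rsh, which accepts NAME=VALUE pairs between its options and the executable. The process count, hostfile name, and executable name are hypothetical placeholders, not details from this thread.

```python
import shlex


def build_launch_command(nprocs, hostfile, executable, mv2_settings):
    """Return an mpirun_rsh argument list with MV2_* settings passed inline."""
    cmd = ["mpirun_rsh", "-np", str(nprocs), "-hostfile", hostfile]
    # NAME=VALUE pairs go after the launcher options and before the executable.
    cmd += [f"{name}={value}" for name, value in mv2_settings.items()]
    cmd.append(executable)
    return cmd


# Hypothetical job: 4 processes, hosts listed in hosts.txt, binary ./model.x.
cmd = build_launch_command(4, "hosts.txt", "./model.x", {"MV2_SMP_USE_CMA": 1})
print(shlex.join(cmd))
```

With a batch system or a different launcher, exporting the variable in the job script before the launch line should serve the same purpose.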

Regards,
Hari.

On Mon, Jan 5, 2015 at 12:37 PM, Thompson, Matt (GSFC-610.1)[SCIENCE
SYSTEMS AND APPLICATIONS INC] <matthew.thompson at nasa.gov> wrote:

> All,
>
> I'm trying to diagnose an issue that is appearing in a model I work on:
> GEOS-5. The problem seems to be architecture-dependent and, most likely,
> due to MVAPICH2 (as the same code compiled with Intel MPI 5 and the same
> Fortran compiler seems to have no problem).
>
> I can try to go into more detail (for example if I start adding print
> statements to find the stall, it can sometimes cure it!), but my first
> question is:
>
>   Are there environment variables that control architecture-dependent
>   behaviour of MVAPICH2?
>
> I ask because I saw in the recent MVAPICH2 2.1rc1 announcement:
>
>   (NEW) MVAPICH2 2.1rc1 (based on MPICH 3.1.3) with ...
>    *optimization and tuning for Haswell architecture*
>
> (I tried searching the User's Guide for "Haswell", but no luck. Could you
> point me to possible switches?)
>
> Note also that the cause might not be Westmere/Sandy Bridge tuning at
> all, but the underlying fabric. Here at NCCS, the Westmeres, I believe,
> are on DDR interconnects, while the Sandy Bridges I was using are on FDR
> (which, I think, is actually connected to a QDR main switch) and some are
> on QDR.
>
> If I turn on MV2_SHOW_ENV_INFO=2, I see these differences (left, Sandy;
> right, Westmere):
>
> Setting                     | Sandy Bridge                   | Westmere
> --------------------------- | ------------------------------ | ----------------------------
> PROCESSOR ARCH NAME         | MV2_ARCH_INTEL_XEON_E5_2670_16 | MV2_ARCH_INTEL_XEON_X5650_12
> PROCESSOR MODEL NUMBER      | 45                             | 44
> HCA NAME                    | MV2_HCA_MLX_CX_FDR             | MV2_HCA_MLX_CX_DDR
> MV2_RDMA_FAST_PATH_BUF_SIZE | 5120                           | 9216
> MV2_EAGERSIZE_1SC           | 8192                           | 4096
> MV2_SMP_EAGERSIZE           | 32769                          | 65537
> MV2_SMPI_LENGTH_QUEUE       | 131072                         | 262144
> MV2_SMP_NUM_SEND_BUFFER     | 16                             | 32
> MPISPAWN_MPIRUN_HOST        | borg01y001                     | borgi117
> MPISPAWN_MPIRUN_ID          | 21662                          | 23359
> MPISPAWN_NNODES             | 6                              | 8
> PMI_PORT                    | borg01y001:44036               | borgi117:37003
> MV2_DEFAULT_MTU             | 4                              | 3
> MV2_DEFAULT_PKEY            | 393216                         | 524288
> MV2_NUM_NODES_IN_JOB        | 6                              | 8
>
> Now some of these can be ignored (MPISPAWN, PROCESSOR, etc.), but the
> remaining MV2_ flag differences look like candidates for the cause.
>
> Some testing showed that if we set:
>
>    MV2_SMP_NUM_SEND_BUFFER=32
>
> on the Sandy Bridge, the issue was avoided. Huzzah, right? Well, when an
> end-user tried it...it hung for him at some point. So...yeah. Should I
> perhaps use all 5 settings from the DDR run?
>
> Any ideas from the experts on why IMPI 5 would not be affected in the same
> situation?
>
> Matt
>
> --
> Matt Thompson          SSAI, Sr Software Test Engr
> NASA GSFC, Global Modeling and Assimilation Office
> Code 610.1, 8800 Greenbelt Rd, Greenbelt, MD 20771
> Phone: 301-614-6712              Fax: 301-614-6246
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>
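
The side-by-side comparison in the quoted message can also be produced mechanically. The following is a minimal sketch, assuming the MV2_SHOW_ENV_INFO=2 output of each run has been captured as text with one "NAME : VALUE" entry per line, as in the excerpt above. The function names are illustrative and not part of MVAPICH2; the sample values are taken from the quoted message.

```python
def parse_env_info(text):
    """Parse 'NAME : VALUE' lines into a dict, skipping lines without a colon."""
    settings = {}
    for line in text.splitlines():
        # Split on the first colon only, so values such as host:port survive.
        name, sep, value = line.partition(":")
        if sep and name.strip():
            settings[name.strip()] = value.strip()
    return settings


def diff_settings(left, right):
    """Return {name: (left_value, right_value)} for settings that differ."""
    names = sorted(set(left) | set(right))
    return {n: (left.get(n), right.get(n))
            for n in names if left.get(n) != right.get(n)}


sandy = parse_env_info("""\
MV2_EAGERSIZE_1SC           : 8192
MV2_SMP_NUM_SEND_BUFFER     : 16
PMI_PORT                    : borg01y001:44036
""")
westmere = parse_env_info("""\
MV2_EAGERSIZE_1SC           : 4096
MV2_SMP_NUM_SEND_BUFFER     : 32
PMI_PORT                    : borgi117:37003
""")

diffs = diff_settings(sandy, westmere)
for name, (a, b) in diffs.items():
    print(f"{name}: Sandy Bridge={a}  Westmere={b}")
```

Job-specific entries such as PMI_PORT will always differ between runs, so in practice the result would likely be filtered down to the MV2_ tunables before drawing conclusions.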