[mvapich-discuss] problems with getting both shared memory and tcp communication

Jonathan Perkins perkinjo at cse.ohio-state.edu
Fri Mar 30 18:49:23 EDT 2012


Hello Michael,
It sounds like you're using the correct configuration option (you may
use --with-device=ch3:nemesis:tcp to be more explicit).
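
For reference, a rebuild might look something like the sketch below.
Only the --with-device flag is the point here; the install prefix,
source directory, and PGI compiler variables are illustrative guesses
to adapt for your site, not your actual settings:

```shell
# Sketch of a rebuild with the explicit TCP netmod of nemesis.
# Prefix and compiler variables below are assumptions.
tar xzf mvapich2-latest.tar.gz
cd mvapich2-*
./configure --with-device=ch3:nemesis:tcp \
            --prefix=/apps/new_cluster/mvapich2-nemesis/install_dir \
            CC=pgcc CXX=pgCC F77=pgf77 FC=pgf90
make && make install
```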

You've mentioned that you're using mvapich2-1.7rc1.  You may want to
try upgrading to a newer version of MVAPICH2.  If you wish to stay on
the 1.7 series, please use the latest tarball from the 1.7 stable
branch, since a few bug fixes have been applied since the 1.7 release.
You can also try 1.8rc1 if you'd like to try our latest.

http://mvapich.cse.ohio-state.edu/nightly/mvapich2/branches/1.7/mvapich2-latest.tar.gz
http://mvapich.cse.ohio-state.edu/download/mvapich2/mvapich2-1.8rc1.tgz
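
Once a rebuild is installed, you can confirm which device it was
configured with without digging through config.log; if I recall
correctly, the mpiname utility that ships with MVAPICH2 prints the
configure line:

```shell
# Hypothetical check (the path is illustrative): mpiname -a prints
# the MVAPICH2 version and the options it was configured with, which
# should include --with-device=ch3:nemesis:tcp after the rebuild.
/path/to/install_dir/bin/mpiname -a
```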

Please let us know if this helps to resolve your problem.

On Fri, Mar 30, 2012 at 02:21:55PM -0700, Zulauf, Michael wrote:
> Hello all,
> 
> At my place of work for the last several months, we've been using
> mvapich2-1.7rc1 on hardware with an Infiniband interconnect, and it's
> been working extremely well.  Recently we got some new hardware, but
> this came with 10gigE (not my choice), and I've been trying to get
> things running on it.  In case it matters, I believe the network
> controllers are Intel 82598EB. Our nodes have dual Intel Xeon X5672 (so
> 8 cores per node).  We're primarily using the PGI 7.2-5 compilers.
> We've also got PGI 10.6 (I believe), but we don't use it as much.
> 
> From the documentation for 1.7rc1, it's not entirely clear to me how to
> configure and build things so that we can run jobs that will utilize
> both shared memory communication within nodes, and tcp communication
> across nodes.  We thought our best bet would be some flavor of Nemesis,
> but the mvapich2 documentation confused me a bit.  It did refer us to
> the mpich2 docs, which made it sound like we could configure for
> ch3:nemesis.
> 
>       ch3:nemesis  This method is our new, high performance method.
>       It has been made the default communication channel starting with
>       the 1.1 release of MPICH2.  It uses shared memory to send
>       messages between processes on the same node and the network for
>       processes on different nodes.
> 
> The configure/build steps appeared to go well (I used ./configure
> --with-device=ch3:nemesis).  But when I try running the OSU benchmark
> tests, they will only work if I attempt to run a single process on each
> node (specified using the -hosts option).  Interestingly, if I use our
> original mvapich2 installation (which was built for our Infiniband
> hardware), it works for shared memory tests, but not for tests across
> nodes - which is not surprising.
> 
> Here's an example of a failed test (new build, within a single node):
> 
> ----------------------------------------------------------------------
> 
> % /apps/new_cluster/mvapich2-1.7rc1_PGI7.2-5-nemesis/install_dir/bin/mpiexec \
>     -hosts compute-1-15,compute-1-15 -n 2 \
>     /apps/new_cluster/mvapich2-1.7rc1_PGI7.2-5-nemesis/osu_benchmarks/osu_bibw
> 
> # OSU MPI Bi-Directional Bandwidth Test v3.3
> # Size     Bi-Bandwidth (MB/s)
> [compute-1-15.local:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11)
> 
> ========================================================================
> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> =   EXIT CODE: 11
> =   CLEANING UP REMAINING PROCESSES
> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> ========================================================================
> 
> APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
> ----------------------------------------------------------------------
> 
> But if I run it between different nodes, it works:
> 
> ----------------------------------------------------------------------
> 
> % /apps/new_cluster/mvapich2-1.7rc1_PGI7.2-5-nemesis/install_dir/bin/mpiexec \
>     -hosts compute-1-14,compute-1-15 -n 2 \
>     /apps/new_cluster/mvapich2-1.7rc1_PGI7.2-5-nemesis/osu_benchmarks/osu_bibw
> 
> # OSU MPI Bi-Directional Bandwidth Test v3.3
> # Size     Bi-Bandwidth (MB/s)
> 1                         0.28
> 2                         0.57
> 4                         1.31
> 8                         3.37
> 16                        6.58
> 32                       13.14
> 64                       24.21
> 128                      48.28
> 256                      84.43
> 512                     122.64
> 1024                    180.40
> 2048                    331.52
> 4096                    562.93
> 8192                    597.75
> 16384                   604.92
> 32768                   608.34
> 65536                   607.86
> 131072                  626.89
> 262144                  623.12
> 524288                  611.56
> 1048576                 644.33
> 2097152                 661.96
> 4194304                 654.56
> ----------------------------------------------------------------------
> 
> And if I use our earlier install (setup for the Infiniband hardware) for
> a case within a single node, that works also:
> 
> ----------------------------------------------------------------------
> 
> % /apps/new_cluster/mvapich2-1.7rc1_PGI7.2-5/install_dir/bin/mpiexec \
>     -hosts compute-1-15,compute-1-15 -n 2 \
>     /apps/new_cluster/mvapich2-1.7rc1_PGI7.2-5/osu_benchmarks/osu_bibw
> 
> stty: standard input: Invalid argument
> 
> # OSU MPI Bi-Directional Bandwidth Test v3.3
> # Size     Bi-Bandwidth (MB/s)
> 1                         2.83
> 2                         5.74
> 4                        11.54
> 8                        22.66
> 16                       46.08
> 32                       90.02
> 64                      175.56
> 128                     346.18
> 256                     650.94
> 512                    1182.73
> 1024                   1994.11
> 2048                   3347.98
> 4096                   4738.25
> 8192                   5939.29
> 16384                  6443.79
> 32768                  6166.21
> 65536                  6238.25
> 131072                 6032.80
> 262144                 9905.82
> 524288                 9807.30
> 1048576                9789.05
> 2097152                9126.49
> 4194304                5609.62
> ----------------------------------------------------------------------
> 
> Any thoughts on the best way to configure and run this on our new
> hardware?
> 
> Thanks,
> 
> Mike
> 
> -- 
> Mike Zulauf
> Meteorologist, Lead Senior
> Asset Optimization
> Iberdrola Renewables
> 1125 NW Couch, Suite 700
> Portland, OR 97209
> Office: 503-478-6304  Cell: 503-913-0403
> 

> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss


-- 
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo

