[mvapich-discuss] performance problems on mpi/openmp hybrid code

Matthew Koop koop at cse.ohio-state.edu
Tue Mar 10 13:54:25 EDT 2009


Hi Susan,

Just to clarify -- this is my understanding: you are running 8 processes
per node, one of which acts as the master process. At some point the
master process spawns 8 OpenMP threads, so at that point we have 7 slave
processes that are single-threaded and 1 master process that has 8
threads.
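
Just to make sure we are talking about the same pattern, here is a minimal
free-form Fortran sketch of that structure as I understand it (the program
name, array size, iteration count, and the choice of rank 0 as the master
are all made up for illustration):

  program hybrid_sketch
    implicit none
    include 'mpif.h'
    integer :: ierr, rank, i, j
    double precision :: results(1000)

    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)

    do i = 1, 100
       ! ... MPI-based calculation involving all ranks ...

       if (rank == 0) then
          ! master rank only: OpenMP section using 8 threads
          !$omp parallel do
          do j = 1, 1000
             results(j) = dble(i) * dble(j)
          end do
          !$omp end parallel do
       end if

       ! the other ranks arrive here right away and wait for rank 0
       call MPI_BCAST(results, 1000, MPI_DOUBLE_PRECISION, 0, &
                      MPI_COMM_WORLD, ierr)
    end do

    call MPI_FINALIZE(ierr)
  end program hybrid_sketch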

If you are running in such a configuration, then you will need to turn on
"blocking" mode. This will prevent the slave processes from spinning on the
CPUs while the master's threads are working. You will want to set both
VIADEV_USE_AFFINITY=0 and VIADEV_USE_BLOCKING=1.
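
If you launch with mpirun_rsh, these can be passed on the command line, for
example (the hostfile name, rank count, and executable name here are just
placeholders; adjust them for your job):

  mpirun_rsh -np 16 -hostfile hosts VIADEV_USE_AFFINITY=0 VIADEV_USE_BLOCKING=1 ./your_app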

This is an interesting hybrid mode we have not seen before. I assume the
slave processes are doing useful work during the other parts of the code?

The alternative is to use OpenMP fully on the node (and just have one
master and no slaves), or 2 MPI tasks, each with 4 threads, etc. In those
cases you would not need VIADEV_USE_BLOCKING=1.
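
For example, one rank per node with 8 threads each might look something like
this (again, the file and executable names are placeholders, and this assumes
2 nodes listed once each in the hostfile):

  mpirun_rsh -np 2 -hostfile hosts OMP_NUM_THREADS=8 ./your_app

For 2 MPI tasks per node with 4 threads each, list each host twice in the
hostfile and use -np 4 with OMP_NUM_THREADS=4.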

Let us know if this helps,

Matt

On Tue, 10 Mar 2009, Susan A. Schwarz wrote:

> I am running an MPI/OpenMP code using mvapich on dual quad-core AMD nodes on a
> RHEL 5.3 cluster. Initially I found that the code took longer to run over
> InfiniBand than when I ran it with just Ethernet connections. I found the
> section in the MVAPICH User and Tuning Guide about setting VIADEV_USE_AFFINITY=0
> to allow the OpenMP threads to run on other CPUs. Now when I set
> VIADEV_USE_AFFINITY=0, the OpenMP section does use the other CPUs, but because
> the load on those CPUs is only about 50%, my code is still not running as fast
> as the Ethernet version. Here is the structure of the Fortran code, which I am
> compiling with the Intel v11.0 compilers:
>
> do i = 1, number_of_iterations
>
>     [perform MPI-based calculation]
>     if (master process) then
>        perform OpenMP-based calculation using 8 threads
>        mpi_bcast(broadcast results to the other processes)
>     else
>        mpi_bcast(obtain results from master)
>     end if
> end do
>
> So the slave processes do an mpi_bcast and wait for the master process to
> complete the OpenMP-based calculation and broadcast the result. When I run
> 'top', I see that the slave processes are using 50% of each of the CPUs while
> waiting for the master process to complete the OpenMP section of the code.
> During the OpenMP section of the code, top shows the master process running
> with a load of at most 400%.
>
> During the Ethernet-based run, the load on the slave processes is almost 0, and
> the master process has a load of 800% during the OpenMP section of the code,
> which is what I expected because I am using 8 threads. When I compare the
> elapsed times for the OpenMP section of the code, the InfiniBand version takes
> twice as long as the Ethernet version.
>
> My question is: why is the load on the slave processes 50% when I am using
> InfiniBand, given that they are doing nothing except waiting for the results
> to be broadcast to them, and why is my OpenMP section running at only 400%
> and not 800%? Is there any way to change either my code or the configuration
> of mvapich so this doesn't happen?
>
> thank you,
> Susan Schwarz
> Research Computing
> Dartmouth College
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>


