[mvapich-discuss] performance problems on mpi/openmp hybrid code

Susan A. Schwarz Susan.A.Schwarz at dartmouth.edu
Wed Mar 11 08:26:42 EDT 2009


Matt,

Thank you for your response. Setting VIADEV_USE_AFFINITY=0 and 
VIADEV_USE_BLOCKING=1 has fixed my problem, and my program is now running as I 
expected.

Susan



Matthew Koop wrote:
> Hi Susan,
> 
> Just to clarify -- this is my understanding: you are running 8 processes
> per node, one of which acts as the master process on that node. At some point
> the master process spawns 8 OpenMP threads, so we then have 7 slave
> processes that are single-threaded and 1 master process that has 8
> threads.
> 
> If you are running in such a configuration, then you will need to turn on
> "blocking" mode. This will prevent the slave processes from spinning on the
> CPU while the master's threads are working. You will want to set both
> VIADEV_USE_AFFINITY=0 and VIADEV_USE_BLOCKING=1.
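> 
> For example, assuming mpirun_rsh is your launcher (the process count,
> hostfile, and executable name below are just placeholders for your own),
> both variables can be passed on the command line:
> 
>     mpirun_rsh -np 16 -hostfile ./hosts VIADEV_USE_AFFINITY=0 VIADEV_USE_BLOCKING=1 ./your_app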
> 
> This is an interesting hybrid mode we have not seen. I assume the slave
> processes are working during the other parts of the code?
> 
> The alternative is to use OpenMP fully on the node (just one master per
> node and no slaves), or 2 MPI tasks per node, each with 4 threads, etc.
> In those cases you would not need VIADEV_USE_BLOCKING=1.
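> 
> For instance, the 2 MPI tasks with 4 OpenMP threads each could be set up
> by exporting OMP_NUM_THREADS=4 to every rank (again assuming mpirun_rsh;
> the names here are placeholders):
> 
>     mpirun_rsh -np 2 -hostfile ./hosts OMP_NUM_THREADS=4 ./your_app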
> 
> Let us know if this helps,
> 
> Matt
> 
> On Tue, 10 Mar 2009, Susan A. Schwarz wrote:
> 
>> I am running an MPI/OpenMP code with MVAPICH on dual quad-core AMD nodes of a
>> RHEL 5.3 cluster.  Initially I found that the code took longer to run over
>> InfiniBand than when I ran it with just ethernet connections. I then found the
>> section in the MVAPICH User and Tuning Guide about setting VIADEV_USE_AFFINITY=0
>> to allow the OpenMP threads to run on other CPUs. With
>> VIADEV_USE_AFFINITY=0 set, the OpenMP section does use the other CPUs,
>> but the load on those CPUs is only about 50%, so my code is still not
>> running as fast as the ethernet version. Here is the structure of
>> the Fortran code, which I am compiling with the Intel v11.0 compilers:
>>
>> do i = 1 to # of iterations
>>
>>     [ perform MPI-based calculation ]
>>     if master process
>>        perform OpenMP-based calculation using 8 threads
>>        mpi_bcast( broadcast results to the other processes )
>>     else (not the master)
>>        mpi_bcast( obtain results from master )
>>     end if
>> end do
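>>
>> In case it's useful, here is a minimal self-contained sketch of that
>> structure (this is not my actual code; the program name and the dummy
>> summation loop are arbitrary and just stand in for the real calculations):
>>
>>    program hybrid_sketch
>>       use mpi                ! older builds may need: include 'mpif.h'
>>       implicit none
>>       integer :: ierr, rank, i, j, niter
>>       double precision :: result
>>
>>       call mpi_init(ierr)
>>       call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
>>       niter = 10
>>       result = 0.0d0
>>
>>       do i = 1, niter
>>          ! ... MPI-based calculation involving all ranks goes here ...
>>
>>          if (rank == 0) then
>>             ! master runs the OpenMP-based calculation
>>             ! (thread count comes from OMP_NUM_THREADS)
>>             !$omp parallel do reduction(+:result)
>>             do j = 1, 1000000
>>                result = result + dble(j)
>>             end do
>>             !$omp end parallel do
>>          end if
>>
>>          ! the slaves sit in this broadcast while the master computes;
>>          ! without blocking mode they busy-poll here and consume CPU
>>          call mpi_bcast(result, 1, MPI_DOUBLE_PRECISION, 0, &
>>                         MPI_COMM_WORLD, ierr)
>>       end do
>>
>>       call mpi_finalize(ierr)
>>    end program hybrid_sketch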
>>
>> So the slave processes call mpi_bcast and wait for the master process to
>> complete the OpenMP-based calculation and broadcast the result. When I run
>> 'top', I see that the slave processes are using 50% of each of the CPUs while
>> they wait for the master to finish the OpenMP section of the code.
>> During the OpenMP section, top shows the master process running
>> with a load of at most 400%.
>>
>> During the ethernet-based run, the load on the slave processes is almost 0 and
>> the master process has a load of 800% during the OpenMP section, which is
>> what I expected because I am using 8 threads. When I compare the
>> elapsed times for the OpenMP section of the code, the InfiniBand version takes
>> twice as long as the ethernet version.
>>
>> My question is: why is the load on the slave processes 50% when I am using
>> InfiniBand, given that they are doing nothing except waiting for the results
>> to be broadcast to them, and why is my OpenMP section running only at 400%
>> and not 800%? Is there any way to change either my code or the configuration
>> of MVAPICH so this doesn't happen?
>>
>> thank you,
>> Susan Schwarz
>> Research Computing
>> Dartmouth College
>>
>> _______________________________________________
>> mvapich-discuss mailing list
>> mvapich-discuss at cse.ohio-state.edu
>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>
> 


