[mvapich-discuss] one-sided passive communications

Jim Dinan dinan at mcs.anl.gov
Fri Dec 14 11:47:44 EST 2012


Hi Maria,

The async progress threads (one is spawned for every MPI process) are 
implemented as a Wait() on an Irecv() control message that will be sent 
when MPI_Finalize is called.  Depending on the underlying implementation 
(maybe an MVAPICH expert could comment here), this could poll and 
consume a core.
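
Conceptually, the thread does something like the following (just an 
illustrative sketch, not the actual MVAPICH2/MPICH source; the 
communicator and tag names are made up):

    #include <mpi.h>

    #define CONTROL_TAG 999          /* illustrative tag, not a real internal value */
    static MPI_Comm control_comm;    /* a private communicator set up at init time  */

    static void *async_progress_thread(void *arg)
    {
        char dummy;
        MPI_Request req;
        MPI_Irecv(&dummy, 1, MPI_CHAR, MPI_ANY_SOURCE, CONTROL_TAG,
                  control_comm, &req);
        /* Blocking here drives progress for incoming messages and RMA
         * operations; depending on how MPI_Wait is implemented it may
         * poll and keep a core busy.  The matching send is not issued
         * until MPI_Finalize, which lets the thread finish and be joined. */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        return NULL;
    }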

Is it possible that the application is running very slowly, rather than 
deadlocking?  Another debugging suggestion would be to attach a debugger 
or generate a core file during the hang to give some more insight into 
where the processes are stuck.

  ~Jim.

On 12/14/12 6:54 AM, María J. Martín wrote:
> Hi Jim,
>
> Our signal handler is only used to set a flag. No MPI function is
> called inside it.  It is as simple as:
>
>     void Controller::signalHandler(int signo) {
>         Controller::instance().state.executedHandler = 1;
>     }
>
> The code runs successfully (without deadlock) if MPICH_ASYNC_PROGRESS is
> turned off and MPI_Init is replaced with MPI_Init_thread
> with required = MPI_THREAD_MULTIPLE.
>
> Thanks anyway for the feedback,
>
> María
>
> On 13/12/2012, at 16:52, Jim Dinan wrote:
>
>> Hi Maria,
>>
>> MPICH (and presumably also MVAPICH) is thread-safe, but not currently
>> multithreaded; only one thread can enter the MPI library at a time,
>> and this is arbitrated in the MPICH library via a mutex.  If the
>> signal handler is invoked while the main thread is already inside an
>> MPI call, it will block on that mutex, which is already held by the
>> main thread, and deadlock.
>>
>> My guess is that you have always had this problem, but you've been
>> getting lucky (especially lucky that you haven't had data corruption)
>> because MPICH's thread-safety locks are not enabled unless you call
>> MPI_Init_thread or turn on async progress threads.  An easy way to
>> confirm this diagnosis would be to leave MPICH_ASYNC_PROGRESS off, but
>> call MPI_Init_thread instead of MPI_Init and see if you still get
>> deadlock.
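>>
>> For instance, only the initialization call would change (a minimal
>> sketch of that diagnostic; everything else stays as in your current code):
>>
>>     int provided;
>>     /* request full thread support instead of calling MPI_Init() */
>>     MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
>>     if (provided < MPI_THREAD_MULTIPLE) {
>>         /* the library granted a lower level; the test is inconclusive */
>>     }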
>>
>> In general, it's not safe to make MPI calls (well, really, most
>> function calls) from a signal handler.  A better approach would be for
>> the signal handler to set a flag (of type 'volatile sig_atomic_t')
>> that the main process polls to detect when to initiate the checkpoint.
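>>
>> A minimal sketch of that pattern (checkpoint_requested and
>> take_checkpoint() are made-up names, just for illustration):
>>
>>     #include <signal.h>
>>
>>     static volatile sig_atomic_t checkpoint_requested = 0;
>>
>>     static void handler(int signo)
>>     {
>>         checkpoint_requested = 1;    /* only set the flag, no MPI calls */
>>     }
>>
>>     /* at a safe point in the main computation loop: */
>>     if (checkpoint_requested) {
>>         checkpoint_requested = 0;
>>         take_checkpoint();           /* MPI calls are fine here */
>>     }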
>>
>> Best,
>> ~Jim.
>>
>> On 12/12/12 8:29 AM, María J. Martín wrote:
>>> Thanks Sreeram. We are executing using the MVAPICH2 1.8.1 release with
>>> MV2_ENABLE_AFFINITY=0 and MPICH_ASYNC_PROGRESS=1.
>>>
>>> We are having some problems getting this configuration to work on our
>>> machine, a multicore cluster of HP RX7640 nodes, each with 16 IA64
>>> Itanium2 Montvale cores. We submit jobs through SGE.  Sometimes the
>>> jobs do not finish when they are submitted requesting the same number
>>> of cores as MPI processes (qsub -l num_procs=1 -pe mpi
>>> number_mpi_processes). If extra cores are requested for the helper
>>> threads (-l num_procs=2 -pe mpi number_mpi_processes), then all jobs
>>> run successfully.
>>>
>>> I am assuming extra hardware is not needed to get this configuration to
>>> work. Is that right?
>>>
>>> Our MPI applications call a signal handler when a checkpoint signal is
>>> received. Apparently the program freezes when returning from the signal
>>> handler.  Could the signal handler be a problem for this configuration?
>>>
>>> Any ideas or advice?
>>>
>>> Thanks again,
>>>
>>> María
>>>
>>>
>>>
>>>
>>> On 11/12/2012, at 18:30, sreeram potluri wrote:
>>>
>>>> Maria,
>>>>
>>>> As Jim pointed out, enabling MPI_THREAD_MULTIPLE in this case is taken
>>>> care of internally by the MPI library. So you will not need any
>>>> changes to your application. It should work with just MPI_Init.
>>>>
>>>> However, make sure to disable affinity using MV2_ENABLE_AFFINITY=0
>>>> along with MPICH_ASYNC_PROGRESS=1. This is required for MVAPICH2 to
>>>> launch the async progress thread.
>>>>
>>>> Sreeram Potluri
>>>>
>>>> On Tue, Dec 11, 2012 at 8:09 AM, "María J. Martín"
>>>> <maria.martin.santamaria at udc.es> wrote:
>>>>
>>>>    Hi Sreeram,
>>>>
>>>>    One more question. Is it necessary to replace MPI_Init with
>>>>    MPI_Init_thread with required = MPI_THREAD_MULTIPLE in order to
>>>>    make use of the helper threads?
>>>>
>>>>    Thanks,
>>>>
>>>>    María
>>>>
>>>>
>>>>
>>>>    On 05/12/2012, at 17:10, sreeram potluri wrote:
>>>>
>>>>>    Hi Maria,
>>>>>
>>>>>    Truly passive one-sided communication is currently supported at
>>>>>    the intra-node level (with LiMIC and shared memory-based
>>>>>    windows), but not at the inter-node level. Please refer to the
>>>>>    following sections of our user guide for further information on
>>>>>    the intra-node designs
>>>>>
>>>>>    LiMIC:
>>>>> http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.9a2.html#x1-540006.5
>>>>>    Shared Memory Based Windows:
>>>>> http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.9a2.html#x1-550006.6
>>>>>
>>>>>    But, you can enable asynchronous progress for inter-node
>>>>>    communication by using helper threads. This can be done using the
>>>>>    runtime parameters:
>>>>>
>>>>>    MPICH_ASYNC_PROGRESS=1 MV2_ENABLE_AFFINITY=0
>>>>>
>>>>>    However, as this involves a helper thread per process, you might
>>>>>    see a negative impact on performance when running MPI jobs in
>>>>>    fully subscribed mode, due to contention for cores. Do let us
>>>>>    know if you have further questions.
>>>>>
>>>>>    As a side note, we suggest that you move to our latest standard
>>>>>    release, MVAPICH2 1.8.1, as it has several new features and bug
>>>>>    fixes compared to 1.7.
>>>>>
>>>>>    Best
>>>>>    Sreeram Potluri
>>>>>
>>>>>    On Wed, Dec 5, 2012 at 7:15 AM, "María J. Martín"
>>>>>    <maria.martin.santamaria at udc.es> wrote:
>>>>>
>>>>>        Hello,
>>>>>
>>>>>        We are using MVAPICH2 1.7 to run an asynchronous algorithm
>>>>>        that uses one-sided passive communications on an InfiniBand
>>>>>        cluster. We observe that some unlocks take a long time to
>>>>>        progress. If extra MPI calls are inserted, the time spent in
>>>>>        some unlock calls decreases. It seems that the target of the
>>>>>        remote operation must enter the MPI library for the unlock
>>>>>        calls to progress. However, we had understood from this article
>>>>> http://nowlab.cse.ohio-state.edu/publications/conf-papers/2008/santhana-ipdps08.pdf
>>>>>        that this requirement was avoided through the use of RDMA data
>>>>>        transfers. We have run with the MV2_USE_RDMA_ONE_SIDED
>>>>>        parameter set to 1 and to 0, but no difference was observed
>>>>>        in the execution times. Any clarification about the behavior
>>>>>        of passive one-sided communications would be welcome.
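>>>>>
>>>>>        For reference, the operations in question follow the usual
>>>>>        passive-target sequence, along the lines of the sketch below
>>>>>        (buf, n, disp, target and win stand in for our actual
>>>>>        arguments):
>>>>>
>>>>>            MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);
>>>>>            MPI_Put(buf, n, MPI_DOUBLE, target, disp, n, MPI_DOUBLE, win);
>>>>>            /* it is this unlock that sometimes takes a long time */
>>>>>            MPI_Win_unlock(target, win);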
>>>>>
>>>>>        Thanks,
>>>>>
>>>>>        María
>>>>>
>>>>>        ---------------------------------------------
>>>>>        María J. Martín
>>>>>        Computer Architecture Group
>>>>>        University of A Coruña
>>>>>        Spain
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>>
>>>
>

