[mvapich-discuss] (no subject)

Nenad Vukicevic nenad at intrepid.com
Tue Mar 8 18:55:48 EST 2016


I am trying to track down my problem here.  While going through the
source I noticed the following code in:

src/mpid/ch3/src/ch3u_rma_ops.c

240 #if defined(CHANNEL_MRAIL)
241    MPID_Datatype_get_size_macro(target_datatype, target_type_size);
242    size = target_count * target_type_size;
243    if (MPIR_DATATYPE_IS_PREDEFINED(origin_datatype)
244         && MPIR_DATATYPE_IS_PREDEFINED(target_datatype)
245         && win_ptr->fall_back != 1 && win_ptr->enable_fast_path == 1
246         && win_ptr->use_rdma_path == 1
247         && ((win_ptr->is_active && win_ptr->post_flag[target_rank] == 1)
248         || (!win_ptr->is_active && win_ptr->using_lock == 0))
249         && size < rdma_large_msg_rail_sharing_threshold)
250     {
251         transfer_complete = MPIDI_CH3I_RDMA_try_rma_op_fast(MPIDI_RMA_PUT, (void *)origin_addr,
252                 origin_count, origin_datatype, target_rank, target_disp,
253                 target_count, target_datatype, NULL, NULL, win_ptr);
254     }
255     if (transfer_complete) {
256         goto fn_exit;
257     }
258     else
259 #endif

In my case, the call to ...try_rma_op_fast() is never executed because
'win_ptr->post_flag[target_rank]' is 0.  Is this normal?  This routine
would handle the fast RMA put/get/fop path.
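
If I understand the surrounding code, post_flag[target_rank] only becomes 1
once the target's exposure epoch (MPI_Win_post, or a fence) has been signalled
to the origin, so the fast path needs active-target synchronization to be in
place.  As a reference point, this is the kind of post/start (PSCW) exchange I
would expect to satisfy that condition (a minimal sketch with illustrative
ranks and sizes, not our actual code):

#include <mpi.h>
#include <stdio.h>

/* Minimal post/start (PSCW) exchange: rank 1 exposes its window with
 * MPI_Win_post, rank 0 opens an access epoch with MPI_Win_start and
 * issues one MPI_Put.  The 0/1 rank split and sizes are illustrative. */
int main(int argc, char **argv)
{
    int rank, buf = 42, winbuf = 0;
    int origin = 0, target = 1;
    MPI_Win win;
    MPI_Group world_group, peer_group;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_group(MPI_COMM_WORLD, &world_group);

    int peer = (rank == origin) ? target : origin;
    MPI_Group_incl(world_group, 1, &peer, &peer_group);

    MPI_Win_create(&winbuf, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    if (rank == target) {
        /* exposure epoch; this is where I would expect the origin's
         * post_flag[target] to end up being set */
        MPI_Win_post(peer_group, 0, win);
        MPI_Win_wait(win);
        printf("target got %d\n", winbuf);
    } else if (rank == origin) {
        MPI_Win_start(peer_group, 0, win);   /* access epoch */
        MPI_Put(&buf, 1, MPI_INT, target, 0, 1, MPI_INT, win);
        MPI_Win_complete(win);
    }

    MPI_Group_free(&peer_group);
    MPI_Group_free(&world_group);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}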


On Sat, Mar 5, 2016 at 12:22 PM, Mingzhe Li <li.2192 at osu.edu> wrote:
> Hi Nenad,
>
> Thanks for the note. We are not able to reproduce this issue on our systems.
> Both 2.1 and 2.2b give similar performance with your benchmarks. The 2.2b
> performance degradation might be due to the ptmalloc issue you indicated in a
> previous posting on your system. If you build 2.1 and 2.2b on the same system
> and run your benchmark, do you see the performance difference?
>
> Thanks,
> Mingzhe
>
> On Sat, Mar 5, 2016 at 1:17 AM, Nenad Vukicevic <nenad at intrepid.com> wrote:
>>
>> I am attaching a benchmark that shows one-sided Put/Get operations
>> being almost 50 times slower than in the previous release.  Both versions
>> are compiled the same way with --enable-fast=O3, though the same slowdown
>> shows up in the debug builds as well.  Note that the measurements of the
>> accumulate operations are comparable across the two versions.
>>
>> From the attached results you can see that version 2.1 executes 1.4
>> million operations per second, while 2.2b manages only about 30
>> thousand operations per second.
>>
>> Any idea what might be the cause?
>>
>> Here are the outputs of the runs:
>>
>> RUNNING 2.1
>> nranks = 4
>> Main table size = 2^18 = 262144 words
>> Number of updates = 1048576
>> init(c= 0.0000 w= 0.0000) up(c= 0.7229 w= 0.7300) mil. up/sec= 1.4364
>> Found 6469 errors in 262144 locations (passed).
>>
>> RUNNING 2.2b
>> nranks = 4
>> Main table size = 2^18 = 262144 words
>> Number of updates = 1048576
>> init(c= 0.0000 w= 0.0000) up(c= 15.6167 w= 35.1200) mil. up/sec= 0.0299
>> Found 6478 errors in 262144 locations (passed).
>>
>> --
>> Nenad
>>
>
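
For reference on the numbers in the quoted message: the benchmark is a
RandomAccess/GUPS-style random update of a table spread across all ranks,
done with MPI one-sided operations.  The timed loop is roughly of the shape
below (a simplified sketch with illustrative names and a fence-based epoch;
the attached code differs in the details):

#include <mpi.h>
#include <stdlib.h>

/* GUPS-style sketch of the kind of update loop being timed: each rank
 * issues MPI_Put to random offsets of a table spread across all ranks,
 * inside one fence epoch.  The names, the LCG and the single epoch are
 * illustrative only. */
void random_updates(MPI_Win win, int nranks, long local_words, long nupdates)
{
    long *values = malloc(nupdates * sizeof(long));
    unsigned long seed = 12345UL;                    /* illustrative seed */
    long i;

    MPI_Win_fence(0, win);                           /* open the epoch */
    for (i = 0; i < nupdates; i++) {
        seed = seed * 6364136223846793005UL + 1UL;   /* simple LCG */
        long global = (long)(seed % (unsigned long)(local_words * nranks));
        int target = (int)(global / local_words);
        MPI_Aint disp = (MPI_Aint)(global % local_words);

        values[i] = global;            /* keep the origin buffer live until the fence */
        MPI_Put(&values[i], 1, MPI_LONG, target, disp, 1, MPI_LONG, win);
    }
    MPI_Win_fence(0, win);                           /* complete all puts */
    free(values);
}

If every such put falls off the fast path (as the post_flag check above
suggests happens in my case), each update takes the slower RMA path, which
would be consistent with the drop from roughly 1.4 million to about 30
thousand updates per second.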



-- 
Nenad

