[mvapich-discuss] mvapich2-1.5.1p1: Send queue issue (ch3:mrail)

Sayantan Sur surs at cse.ohio-state.edu
Wed Dec 8 16:09:00 EST 2010


Hi Alexander,

Thanks for your detailed note. The RDMA_DEFAULT_MAX_SEND_WQE parameter
controls the maximum number of send queue elements (WQEs) for each QP.
Increasing it may result in higher memory utilization on large-scale clusters.
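
To give a rough sense of that scaling, here is a back-of-the-envelope
sketch; the job size, per-WQE footprint, and one-QP-per-peer layout are
illustrative assumptions, not values taken from the code:

#include <stdio.h>

int main(void)
{
    long nprocs        = 4096;  /* hypothetical fully connected job size */
    long wqe_per_qp    = 64;    /* RDMA_DEFAULT_MAX_SEND_WQE default     */
    long bytes_per_wqe = 128;   /* illustrative per-descriptor footprint */

    /* one QP per remote peer, so send-queue space grows with job size */
    long per_rank = (nprocs - 1) * wqe_per_qp * bytes_per_wqe;
    printf("approx. send-queue memory per rank: %ld KB\n", per_rank / 1024);
    return 0;
}

Doubling the WQE depth doubles that per-rank figure.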

We are investigating mechanisms to call the progress test from within
MVAPICH2 without hurting performance, and we aim to fix this in an
upcoming release.

Thanks again.

On Wed, Dec 8, 2010 at 10:33 AM, Alexander Alekhin
<alexander.alekhin at itseez.com> wrote:
> Hi,
>
> I use MVAPICH2 1.5.1p1 code from svn.
> My mpiname info:
> MVAPICH2 1.5.1p1 Unofficial Build ch3:mrail
>
> Compilation
> CC: gcc -g
> CXX: c++ -g
> F77: g77 -g
> F90: f95 -g
>
> Configuration
> --prefix=$HOME/mvapich2/install --enable-g=all --enable-error-messages=all
> --enable-fast=none
>
> Here is an example of code (based on examples/cpi.c):
> int i;
> for (i = 0; i < 100; i++) {
>     printf("%d: MPI_Reduce: %d\n", myid, i); fflush(stdout);
>     usleep(250000);
>     if (myid == 0) {
>         int result = 0;
>         if (MPI_Reduce(&myid, &result, 1, MPI_INT, MPI_SUM, 0,
>                        MPI_COMM_WORLD) != MPI_SUCCESS)
>             exit(1);
>     } else {
>         if (MPI_Reduce(&myid, NULL, 1, MPI_INT, MPI_SUM, 0,
>                        MPI_COMM_WORLD) != MPI_SUCCESS)
>             exit(1);
>     }
>     printf("%d: MPI_Reduce end: %d\n", myid, i); fflush(stdout);
>     usleep(250000);
> }
>
> This code generates the following output (two processes were launched on
> different nodes):
> ...
> 1: MPI_Reduce: 59
> 0: MPI_Reduce: 59
> 1: MPI_Reduce end: 59
> 0: MPI_Reduce end: 59
> 1: MPI_Reduce: 60
> 0: MPI_Reduce: 60
> 1: MPI_Reduce end: 60
> 0: MPI_Reduce end: 60
> 1: MPI_Reduce: 61
> 0: MPI_Reduce: 61
> 1: MPI_Reduce end: 61
> 0: MPI_Reduce end: 61
> 1: MPI_Reduce: 62
> 0: MPI_Reduce: 62
> 1: MPI_Reduce end: 62
> 0: MPI_Reduce end: 62
> 1: MPI_Reduce: 63               (message is enqueued on rank 1 and not really sent)
> 0: MPI_Reduce: 63         <---- Problem starts here
> 1: MPI_Reduce end: 63
> 1: MPI_Reduce: 64
> 1: MPI_Reduce end: 64
> 1: MPI_Reduce: 65
> 1: MPI_Reduce end: 65
> 1: MPI_Reduce: 66
> 1: MPI_Reduce end: 66
> 1: MPI_Reduce: 67
> 1: MPI_Reduce end: 67
> 1: MPI_Reduce: 68
> 1: MPI_Reduce end: 68
> 1: MPI_Reduce: 69
> 0: MPI_Reduce end: 63     <---- Problem ends here
> 0: MPI_Reduce: 64
> 0: MPI_Reduce end: 64
> 0: MPI_Reduce: 65
> 0: MPI_Reduce end: 65
> 0: MPI_Reduce: 66
> 0: MPI_Reduce end: 66
> 0: MPI_Reduce: 67
> 0: MPI_Reduce end: 67
> 0: MPI_Reduce: 68
> 0: MPI_Reduce end: 68
> 0: MPI_Reduce: 69
> 0: MPI_Reduce end: 69
> 1: MPI_Reduce end: 69
> 0: MPI_Reduce: 70
> 1: MPI_Reduce: 70
> 0: MPI_Reduce end: 70
> 1: MPI_Reduce end: 70
> 0: MPI_Reduce: 71
> 1: MPI_Reduce: 71
> 0: MPI_Reduce end: 71
> 1: MPI_Reduce end: 71
> 0: MPI_Reduce: 72
> 1: MPI_Reduce: 72
> 0: MPI_Reduce end: 72
> 1: MPI_Reduce end: 72
> 0: MPI_Reduce: 73
> 1: MPI_Reduce: 73
> 0: MPI_Reduce end: 73
> 1: MPI_Reduce end: 73
> ...
> The delay between these marks is more than 2 seconds, which is incorrect.
> There is no problem with the screen output itself. The marked MPI_Reduce
> call on rank 0 took a very long time because the corresponding messages on
> rank 1 were not actually sent; they were only queued.
>
> Below is detailed debug output for an "MPI_Barrier + usleep + MPI_Reduce + usleep"
> loop (MVAPICH2 was compiled with #define DEBUG in mpidpre.h):
> ...
> [1][ibv_send.c:774] buffer is 0x2b6368dc7000
> [1][ibv_send.c:775] pheader buffer is 0x2b6368dc7000
> [1][ibv_send.c:782] incrementing the outstanding eager vbufs: eager 64
> [1][ibv_send.c:800] coalesce not ok
> [1][ibv_send.c:838] [eager send]vbuf addr 0x32e0380, buffer: 0x2b6368dc7000
> [1][ibv_send.c:717] buffer: 0x2b6368dc7000, content size: 0
> [1][ibv_send.c:721] [fill buf]avail 16380, len 56
> [1][ibv_send.c:723] [fill buf] cpy ptr 0x2b6368dc7000
> [1][ibv_send.c:874] [eager send] len 56, selected rail hca 0, rail 0
> [1][ibv_send.c:79] [ibv_send] enqueue, head 0x32e0380, tail 0x32e0380
> [1][ch3_progress.c:166] Entering ch3 progress
> [1][ibv_channel_manager.c:354] Get one!
> [1][ibv_channel_manager.c:368] Get one exact seq: 72
> [1][ch3_progress.c:978] [handle read] buffer 0x3433d48
> [1][ch3_progress.c:1046] [handle read] pheader: 0x3450570
> [1][ibv_recv.c:58] [parse header] vbuf address 0x3433d48
> [1][ibv_recv.c:61] [parse header] header type 1
> [1][ibv_recv.c:99] [receiver side] cached credit 0
> [1][ibv_recv.c:107] [recv: parse header] faster headersize returned 4
> [1][ibv_recv.c:297] Before set credit, vc: 0x3297658, v->rail: 0, pkt:
> 0x7fff84b2ee50, pheader: 0x3450570
> [1][ch3_progress.c:1056] [handle read] header type 0
> [1][ch3_progress.c:1134] [handle read] header eager 0, headersize
> 4[1][ch3_progress.c:1148] [recv: progress] about to fill request,
> recv_active (nil)
> [1][ch3_progress.c:1263] exiting handle read
> [1][ch3_progress.c:1007] Finished with buffer – size: 4, consumed: 4
> [1][ch3_progress.c:1013] buffer set to: 0x344c578
> [1][ch3_progress.c:309] Exiting ch3 progress
> 1: MPI_Barrier: 59
> [1][ch3_istartmsgv.c:115] ch3_istartmsgv, header 0
> [1][ch3_istartmsgv.c:128] remote local nodes -1, myid 0
> [1][ch3_istartmsgv.c:174] [send], n_iov: 2, pkt_len 60
> [1][ibv_send.c:774] buffer is 0x2b6368dcb000
> [1][ibv_send.c:775] pheader buffer is 0x2b6368dcb000
> [1][ibv_send.c:782] incrementing the outstanding eager vbufs: eager 65
> [1][ibv_send.c:800] coalesce not ok
> [1][ibv_send.c:838] [eager send]vbuf addr 0x32e0440, buffer: 0x2b6368dcb000
> [1][ibv_send.c:717] buffer: 0x2b6368dcb000, content size: 0
> [1][ibv_send.c:721] [fill buf]avail 16380, len 56
> [1][ibv_send.c:723] [fill buf] cpy ptr 0x2b6368dcb000
> [1][ibv_send.c:721] [fill buf]avail 16324, len 4
> [1][ibv_send.c:723] [fill buf] cpy ptr 0x2b6368dcb038
> [1][ibv_send.c:874] [eager send] len 60, selected rail hca 0, rail 0
> [1][ibv_send.c:79] [ibv_send] enqueue, head 0x32e0380, tail 0x32e0440
> [1][ch3_istartmsgv.c:187] [istartmsgv] mpierr 0, nb 60
> [1][ch3_istartmsgv.c:190] [send path] eager send return 60 bytes
> 0: MPI_Reduce: 58
> [1][ibv_send.c:774] buffer is 0x2b6368dcf000
> [1][ibv_send.c:775] [0][ibv_channel_manager.c:637] [0] Possibly received a
> duplicate send completion event
> [0][ibv_channel_manager.c:718] [channel manager] get one with exact seqnum
> [0][ch3_read_progress.c:152] Get one packet with exact seq num
> [0][ch3_progress.c:978] [handle read] buffer 0x10296c0
> [0][ch3_progress.c:1046] [handle read] pheader: 0x2b3852837000
> [0][ibv_recv.c:58] [parse header] vbuf address 0x10296c0
> [0][ibv_recv.c:61] [parse header] header type 0
> [0][ibv_recv.c:113] [recv: parse header] pkt eager send
> [0][ibv_recv.c:129] [recv: parse header] headersize returned 56
> [0][ibv_recv.c:297] Before set credit, vc: 0xff5cb8, v->rail: 0, pkt:
> 0x7fff1c5c1780, pheader: 0x2b3852837000
> [0][ch3_progress.c:1056] [handle read] header type 0
> [0][ch3_progress.c:1134] [handle read] header eager 0, headersize
> 56[0][ch3_progress.c:1148] [recv: progress] about to fill request,
> recv_active (nil)
> [0][ch3_progress.c:1263] exiting handle read
> [0][ch3_progress.c:1007] Finished with buffer – size: 56pheader buffer is
> 0x2b6368dcf000
> [1][ibv_send.c:782] incrementing the outstanding eager vbufs: eager 66
> [1][ibv_send.c:800] coalesce not ok
> [1][ibv_send.c:838] [eager send]vbuf addr 0x32e0500, buffer: 0x2b6368dcf000
> [1][ibv_send.c:717] buffer: 0x2b6368dcf000, content size: 0
> [1][ibv_send.c:721] [fill buf]avail 16380, len 56
> [1][ibv_send.c:723] [fill buf] cpy ptr 0x2b6368dcf000
> [1][ibv_send.c:874] [eager send] len 56, selected rail hca 0, rail 0
> [1][ibv_send.c:79] [ibv_send] enqueue, head 0x32e0380, tail 0x32e0500
> [1][ch3_progress.c:166] Entering ch3 progress
> [1][ibv_channel_manager.c:655] [device_Check] process send, v 0x32dfb40
> [1][ibv_send.c:1124] Eager, decrementing to: 53345088
> [1][ibv_send.c:151] [ibv_send] dequeue, head 0x32e0440, tail 0x32e0500
> [1][ibv_send.c:1244] [process send] complete for eager msg, req (nil)
> [1][ibv_send.c:1254] [process send] normal flag, free vbuf
> [1][vbuf.c:383] release_vbuf: releasing 0x32dfb40 previous head = 0x32e05c0,
> padding 222
> [1][ibv_send.c:1406] return from process sen, consumed: 56
> [0][ch3_progress.c:1013] buffer set to: 0x2b3852837000
> [0][vbuf.c:383] release_vbuf: releasing 0x10296c0 previous head = 0x1029600,
> padding 222
> [0][ch3_progress.c:309] Exiting ch3 progress
> ...
> Note: the output of the two processes is not interleaved in order
> (fflush(stderr) was not used).
>
> If the application makes no MVAPICH2 IO calls for a long time (for example,
> during computation), the queued messages are never sent, and the other
> process sits idle waiting for them.
> Calling MPIDI_CH3I_Progress_test() after every MPI routine solves the send
> problem, but adds some performance penalty; a minimal sketch of that
> workaround follows.
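>
> Sketch of the workaround (assumptions: MPIDI_CH3I_Progress_test() is an
> MVAPICH2-internal entry point, not part of the MPI API, and its exact
> prototype/linkage depends on the tree, so the declaration below is only
> illustrative):
>
> /* internal MVAPICH2 progress hook; assumed prototype (may be a macro) */
> extern int MPIDI_CH3I_Progress_test(void);
>
> static void reduce_and_flush(int *sendval, int *recvval, MPI_Comm comm)
> {
>     MPI_Reduce(sendval, recvval, 1, MPI_INT, MPI_SUM, 0, comm);
>     MPIDI_CH3I_Progress_test(); /* give queued eager sends a chance to go out */
> }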
>
> 1. Why does MVAPICH2 enqueue outgoing messages? With the usleep() calls we
> guarantee that the IO device is not busy, yet the messages are enqueued
> anyway. Is this normal, or is something configured incorrectly?
> 2. How can messages be sent in the background, for example from a separate
> thread, when the application computes in its main thread and does not call
> any MVAPICH2 IO routines? (A rough sketch of this idea follows the list.)
> 3. What does the RDMA_DEFAULT_MAX_SEND_WQE=64 parameter
> (/src/mpid/ch3/channels/mrail/src/gen2/ibv_param.h) control? If the number
> of outstanding sends exceeds this value, the next 1-3 messages are enqueued
> and not posted to the device.
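>
> Regarding question 2, the kind of helper thread I have in mind would look
> roughly like the sketch below. This is hypothetical, not an existing
> MVAPICH2 feature: it needs MPI_Init_thread() with MPI_THREAD_MULTIPLE, and
> whether MPI_Iprobe() actually pushes queued sends forward is
> implementation-dependent.
>
> #include <mpi.h>
> #include <pthread.h>
> #include <unistd.h>
>
> static volatile int keep_polling = 1;
>
> /* helper thread: periodically pokes the MPI progress engine */
> static void *progress_thread(void *arg)
> {
>     MPI_Comm comm = *(MPI_Comm *)arg;
>     while (keep_polling) {
>         int flag;
>         MPI_Status status;
>         MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, comm, &flag, &status);
>         usleep(1000); /* poll roughly every millisecond */
>     }
>     return NULL;
> }
>
> The main thread would start this with pthread_create() before a long compute
> phase and clear keep_polling afterwards.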
>
> --
> Regards,
> Alexander Alekhin
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>
>



-- 
Sayantan Sur

Research Scientist
Department of Computer Science
http://www.cse.ohio-state.edu/~surs


