[mvapich-discuss] mvapich2-1.5.1p1: Send queue issue (ch3:mrail)

Alexander Alekhin alexander.alekhin at itseez.com
Wed Dec 8 05:33:35 EST 2010


Hi,

I am using the MVAPICH2 1.5.1p1 code from SVN.
My mpiname info:
MVAPICH2 1.5.1p1 Unofficial Build ch3:mrail

Compilation
CC: gcc -g
CXX: c++ -g
F77: g77 -g
F90: f95 -g

Configuration
--prefix=$HOME/mvapich2/install --enable-g=all --enable-error-messages=all --enable-fast=none

Here is an example of the code (based on examples/cpi.c; the loop runs inside main() after MPI_Init and MPI_Comm_rank have set myid):
int i;
for (i = 0; i < 100; i++) {
    printf("%d: MPI_Reduce: %d\n", myid, i); fflush(stdout);
    usleep(250000);
    if (myid == 0) {
        int result = 0;
        if (MPI_Reduce(&myid, &result, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD) != MPI_SUCCESS)
            exit(1);
    } else {
        if (MPI_Reduce(&myid, NULL, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD) != MPI_SUCCESS)
            exit(1);
    }
    printf("%d: MPI_Reduce end: %d\n", myid, i); fflush(stdout);
    usleep(250000);
}

This code produces the following output (two processes were launched on different nodes):
...
1: MPI_Reduce: 59
0: MPI_Reduce: 59
1: MPI_Reduce end: 59
0: MPI_Reduce end: 59
1: MPI_Reduce: 60
0: MPI_Reduce: 60
1: MPI_Reduce end: 60
0: MPI_Reduce end: 60
1: MPI_Reduce: 61
0: MPI_Reduce: 61
1: MPI_Reduce end: 61
0: MPI_Reduce end: 61
1: MPI_Reduce: 62
0: MPI_Reduce: 62
1: MPI_Reduce end: 62
0: MPI_Reduce end: 62
1: MPI_Reduce: 63               (message is enqueued on rank 1 and not really sent)
0: MPI_Reduce: 63         <---- Problem starts here
1: MPI_Reduce end: 63
1: MPI_Reduce: 64
1: MPI_Reduce end: 64
1: MPI_Reduce: 65
1: MPI_Reduce end: 65
1: MPI_Reduce: 66
1: MPI_Reduce end: 66
1: MPI_Reduce: 67
1: MPI_Reduce end: 67
1: MPI_Reduce: 68
1: MPI_Reduce end: 68
1: MPI_Reduce: 69
0: MPI_Reduce end: 63     <---- Problem ends here
0: MPI_Reduce: 64
0: MPI_Reduce end: 64
0: MPI_Reduce: 65
0: MPI_Reduce end: 65
0: MPI_Reduce: 66
0: MPI_Reduce end: 66
0: MPI_Reduce: 67
0: MPI_Reduce end: 67
0: MPI_Reduce: 68
0: MPI_Reduce end: 68
0: MPI_Reduce: 69
0: MPI_Reduce end: 69
1: MPI_Reduce end: 69
0: MPI_Reduce: 70
1: MPI_Reduce: 70
0: MPI_Reduce end: 70
1: MPI_Reduce end: 70
0: MPI_Reduce: 71
1: MPI_Reduce: 71
0: MPI_Reduce end: 71
1: MPI_Reduce end: 71
0: MPI_Reduce: 72
1: MPI_Reduce: 72
0: MPI_Reduce end: 72
1: MPI_Reduce end: 72
0: MPI_Reduce: 73
1: MPI_Reduce: 73
0: MPI_Reduce end: 73
1: MPI_Reduce end: 73
...
The delay between the marked lines is more than 2 seconds, which is wrong. The screen output itself is not the problem. The marked MPI_Reduce (#63) on rank 0 took a very long time to return because the messages on rank 1 were not actually sent; they were only queued.

The next trace is detailed debug output for an "MPI_Barrier + usleep + MPI_Reduce + usleep" variant of the loop (MVAPICH was compiled with #define DEBUG in mpidpre.h).
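The body of that loop was roughly the following (a sketch only; the printf/fflush statements around the calls and the error checks are omitted here):

MPI_Barrier(MPI_COMM_WORLD);
usleep(250000);
if (myid == 0) {
    int result = 0;
    MPI_Reduce(&myid, &result, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
} else {
    MPI_Reduce(&myid, NULL, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
}
usleep(250000);

The debug output: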
...
[1][ibv_send.c:774] buffer is 0x2b6368dc7000
[1][ibv_send.c:775] pheader buffer is 0x2b6368dc7000
[1][ibv_send.c:782] incrementing the outstanding eager vbufs: eager 64
[1][ibv_send.c:800] coalesce not ok
[1][ibv_send.c:838] [eager send]vbuf addr 0x32e0380, buffer: 0x2b6368dc7000
[1][ibv_send.c:717] buffer: 0x2b6368dc7000, content size: 0
[1][ibv_send.c:721] [fill buf]avail 16380, len 56
[1][ibv_send.c:723] [fill buf] cpy ptr 0x2b6368dc7000
[1][ibv_send.c:874] [eager send] len 56, selected rail hca 0, rail 0
[1][ibv_send.c:79] [ibv_send] enqueue, head 0x32e0380, tail 0x32e0380
[1][ch3_progress.c:166] Entering ch3 progress
[1][ibv_channel_manager.c:354] Get one!
[1][ibv_channel_manager.c:368] Get one exact seq: 72
[1][ch3_progress.c:978] [handle read] buffer 0x3433d48
[1][ch3_progress.c:1046] [handle read] pheader: 0x3450570
[1][ibv_recv.c:58] [parse header] vbuf address 0x3433d48
[1][ibv_recv.c:61] [parse header] header type 1
[1][ibv_recv.c:99] [receiver side] cached credit 0
[1][ibv_recv.c:107] [recv: parse header] faster headersize returned 4
[1][ibv_recv.c:297] Before set credit, vc: 0x3297658, v->rail: 0, pkt:  
0x7fff84b2ee50, pheader: 0x3450570
[1][ch3_progress.c:1056] [handle read] header type 0
[1][ch3_progress.c:1134] [handle read] header eager 0, headersize  
4[1][ch3_progress.c:1148] [recv: progress] about to fill request,  
recv_active (nil)
[1][ch3_progress.c:1263] exiting handle read
[1][ch3_progress.c:1007] Finished with buffer – size: 4, consumed: 4
[1][ch3_progress.c:1013] buffer set to: 0x344c578
[1][ch3_progress.c:309] Exiting ch3 progress
1: MPI_Barrier: 59
[1][ch3_istartmsgv.c:115] ch3_istartmsgv, header 0
[1][ch3_istartmsgv.c:128] remote local nodes -1, myid 0
[1][ch3_istartmsgv.c:174] [send], n_iov: 2, pkt_len 60
[1][ibv_send.c:774] buffer is 0x2b6368dcb000
[1][ibv_send.c:775] pheader buffer is 0x2b6368dcb000
[1][ibv_send.c:782] incrementing the outstanding eager vbufs: eager 65
[1][ibv_send.c:800] coalesce not ok
[1][ibv_send.c:838] [eager send]vbuf addr 0x32e0440, buffer: 0x2b6368dcb000
[1][ibv_send.c:717] buffer: 0x2b6368dcb000, content size: 0
[1][ibv_send.c:721] [fill buf]avail 16380, len 56
[1][ibv_send.c:723] [fill buf] cpy ptr 0x2b6368dcb000
[1][ibv_send.c:721] [fill buf]avail 16324, len 4
[1][ibv_send.c:723] [fill buf] cpy ptr 0x2b6368dcb038
[1][ibv_send.c:874] [eager send] len 60, selected rail hca 0, rail 0
[1][ibv_send.c:79] [ibv_send] enqueue, head 0x32e0380, tail 0x32e0440
[1][ch3_istartmsgv.c:187] [istartmsgv] mpierr 0, nb 60
[1][ch3_istartmsgv.c:190] [send path] eager send return 60 bytes
0: MPI_Reduce: 58
[1][ibv_send.c:774] buffer is 0x2b6368dcf000
[1][ibv_send.c:775] [0][ibv_channel_manager.c:637] [0] Possibly received a  
duplicate send completion event
[0][ibv_channel_manager.c:718] [channel manager] get one with exact seqnum
[0][ch3_read_progress.c:152] Get one packet with exact seq num
[0][ch3_progress.c:978] [handle read] buffer 0x10296c0
[0][ch3_progress.c:1046] [handle read] pheader: 0x2b3852837000
[0][ibv_recv.c:58] [parse header] vbuf address 0x10296c0
[0][ibv_recv.c:61] [parse header] header type 0
[0][ibv_recv.c:113] [recv: parse header] pkt eager send
[0][ibv_recv.c:129] [recv: parse header] headersize returned 56
[0][ibv_recv.c:297] Before set credit, vc: 0xff5cb8, v->rail: 0, pkt:  
0x7fff1c5c1780, pheader: 0x2b3852837000
[0][ch3_progress.c:1056] [handle read] header type 0
[0][ch3_progress.c:1134] [handle read] header eager 0, headersize  
56[0][ch3_progress.c:1148] [recv: progress] about to fill request,  
recv_active (nil)
[0][ch3_progress.c:1263] exiting handle read
[0][ch3_progress.c:1007] Finished with buffer – size: 56pheader buffer is  
0x2b6368dcf000
[1][ibv_send.c:782] incrementing the outstanding eager vbufs: eager 66
[1][ibv_send.c:800] coalesce not ok
[1][ibv_send.c:838] [eager send]vbuf addr 0x32e0500, buffer: 0x2b6368dcf000
[1][ibv_send.c:717] buffer: 0x2b6368dcf000, content size: 0
[1][ibv_send.c:721] [fill buf]avail 16380, len 56
[1][ibv_send.c:723] [fill buf] cpy ptr 0x2b6368dcf000
[1][ibv_send.c:874] [eager send] len 56, selected rail hca 0, rail 0
[1][ibv_send.c:79] [ibv_send] enqueue, head 0x32e0380, tail 0x32e0500
[1][ch3_progress.c:166] Entering ch3 progress
[1][ibv_channel_manager.c:655] [device_Check] process send, v 0x32dfb40
[1][ibv_send.c:1124] Eager, decrementing to: 53345088
[1][ibv_send.c:151] [ibv_send] dequeue, head 0x32e0440, tail 0x32e0500
[1][ibv_send.c:1244] [process send] complete for eager msg, req (nil)
[1][ibv_send.c:1254] [process send] normal flag, free vbuf
[1][vbuf.c:383] release_vbuf: releasing 0x32dfb40 previous head =  
0x32e05c0, padding 222
[1][ibv_send.c:1406] return from process sen, consumed: 56
[0][ch3_progress.c:1013] buffer set to: 0x2b3852837000
[0][vbuf.c:383] release_vbuf: releasing 0x10296c0 previous head =  
0x1029600, padding 222
[0][ch3_progress.c:309] Exiting ch3 progress
...
Note: the output of the two processes is not strictly ordered relative to each other (fflush(stderr) was not used).

If the application makes no MVAPICH I/O calls for a long time (for example, while it is busy computing), the messages stay queued and are not sent. The other process does nothing but wait for those messages. Calling MPIDI_CH3I_Progress_test() after every MPI routine solves the send problem, but it adds some performance penalty.
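
For illustration, the workaround looks roughly like this (a sketch; MPIDI_CH3I_Progress_test() is an internal MVAPICH/MPICH function, not part of the public MPI API, so the declaration below is an assumption and the real prototype from the MVAPICH headers should be used instead):

/* internal progress routine; assumed prototype, not declared in mpi.h */
extern int MPIDI_CH3I_Progress_test(void);

/* after each MPI call in the loop, e.g. on rank 0: */
if (MPI_Reduce(&myid, &result, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD) != MPI_SUCCESS)
    exit(1);
MPIDI_CH3I_Progress_test();  /* enter the progress engine so the queued eager send goes out immediately */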

1. Why does MVAPICH enqueue outgoing messages? (With the usleep calls we guarantee that the I/O device is not busy, yet the messages are enqueued anyway.) Is this normal, or is something configured incorrectly?
2. How can messages be sent in the background, for example from a separate thread, while the application does its calculations in the main thread and does not call any MVAPICH I/O routines? (A rough sketch of what I have in mind follows after these questions.)
3. What does the RDMA_DEFAULT_MAX_SEND_WQE=64 parameter (/src/mpid/ch3/channels/mrail/src/gen2/ibv_param.h) control? When the number of outstanding send messages exceeds this value, the next 1-3 messages are enqueued and not submitted to the device.
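
For question 2, this is the kind of approach I have in mind (only a sketch; it assumes the library can be initialized with MPI_THREAD_MULTIPLE, and the helper thread does nothing except enter the MPI progress engine periodically via MPI_Iprobe while the main thread computes):

#include <mpi.h>
#include <pthread.h>
#include <unistd.h>

static volatile int keep_polling = 1;

/* hypothetical helper thread: its only job is to drive MPI progress so
   queued sends are flushed while the main thread is busy computing */
static void *progress_thread(void *arg)
{
    (void)arg;
    while (keep_polling) {
        int flag;
        /* any non-blocking MPI call that enters the progress engine will do */
        MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &flag, MPI_STATUS_IGNORE);
        usleep(1000);  /* poll interval chosen arbitrarily */
    }
    return NULL;
}

/* in main():
     int provided;
     MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
     pthread_t t;
     pthread_create(&t, NULL, progress_thread, NULL);
     ... computation and MPI calls ...
     keep_polling = 0;
     pthread_join(t, NULL);
*/

The obvious drawbacks are the MPI_THREAD_MULTIPLE requirement and the extra CPU spent polling, which is why a built-in way to do this would be preferable.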

--
Regards,
Alexander Alekhin