[mvapich-discuss] FW: Abort: fail to register rdma memory

sreeram potluri potluri at cse.ohio-state.edu
Thu Jan 7 14:26:48 EST 2010


Hi,

Can you please try using MV2_USE_RDMA_FAST_PATH=0 ?
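
One possible way to pass this setting is sketched below, assuming the same `-env NAME VALUE` form and the same machinefile, process count, and binary as the command quoted below. Whether the three earlier MV2_* settings should stay on the command line is not stated in this thread, so they are left out here. The sketch only prints the launch line; it does not run anything.

```shell
# Sketch only: rebuilds the launch line from the quoted message with the
# RDMA fast path disabled. The -env NAME VALUE form is copied from that
# message; adjust it if your mpiexec expects a different syntax.
MV2_OPTS="-env MV2_USE_RDMA_FAST_PATH 0"
CMD="mpiexec -machinefile ./machinefile -n 128 $MV2_OPTS cdp_if"

# Print the command for inspection instead of launching the job.
echo "$CMD"
```

Printing the line first makes it easy to confirm that the variable actually reaches mpiexec before committing 128 cores to a run.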

Thank you
Sreeram Potluri
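
As a sanity check on the quoted output, the size in the new abort message can be reproduced from the register_memory call quoted further down, which registers rdma_vbuf_total_size * num_rdma_buffer bytes. The sketch below assumes those two C variables are set by MV2_VBUF_TOTAL_SIZE and MV2_NUM_RDMA_BUFFER; that mapping is inferred from the names, not confirmed in this thread.

```shell
# Values suggested earlier in the thread.
MV2_VBUF_TOTAL_SIZE=2048
MV2_NUM_RDMA_BUFFER=4

# Size per the quoted call: rdma_vbuf_total_size * num_rdma_buffer
# (assumed to correspond to the two environment variables above).
REG_SIZE=$((MV2_VBUF_TOTAL_SIZE * MV2_NUM_RDMA_BUFFER))

echo "$REG_SIZE"   # prints 8192
```

The result matches the "size 8192" in the quoted abort message, which suggests the new values were picked up by the run even though registration still failed.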

On Thu, Jan 7, 2010 at 1:57 PM, Jean-Christophe Ducom <jcducom at gmail.com> wrote:

> Sreeram-
> Thanks for the quick reply.
> After adding the variables in the .bashrc:
> mpiexec -machinefile ./machinefile -n 128 -env MV2_VBUF_TOTAL_SIZE 2048
> -env MV2_NUM_RDMA_BUFFER 4 -env MV2_IBA_EAGER_THRESHOLD 2044 cdp_if
> returns the error message:
>
> [96] Abort: fail to register rdma memory, size 8192
>
>  at line 105 in file ibv_priv.c
> [98] Abort: fail to register rdma memory, size 8192
>
>  at line 105 in file ibv_priv.c
> [97] Abort: fail to register rdma memory, size 8192
>
>  at line 105 in file ibv_priv.c
> [92] Abort: fail to register rdma memory, size 8192
>
>  at line 105 in file ibv_priv.c
> send desc error
> [89] Abort: [] Got completion with error 12, vendor code=0, dest rank=92
>
>  at line 581 in file ibv_channel_manager.c
> send desc error
> [67] Abort: [] Got completion with error 12, vendor code=0, dest rank=97
>
>  at line 581 in file ibv_channel_manager.c
> send desc error
> [66] Abort: [] Got completion with error 12, vendor code=0, dest rank=97
>
>  at line 581 in file ibv_channel_manager.c
> send desc error
> [106] Abort: [] Got completion with error 12, vendor code=0, dest rank=98
>
>  at line 581 in file ibv_channel_manager.c
> send desc error
> [24] Abort: [] Got completion with error 12, vendor code=0, dest rank=92
>
>  at line 581 in file ibv_channel_manager.c
> send desc error
> [104] Abort: [] Got completion with error 12, vendor code=0, dest rank=98
>
>  at line 581 in file ibv_channel_manager.c
> send desc error
> [76] Abort: [] Got completion with error 12, vendor code=0, dest rank=83
>
>  at line 581 in file ibv_channel_manager.c
> send desc error
> [80] Abort: [] Got completion with error 12, vendor code=0, dest rank=86
>
>  at line 581 in file ibv_channel_manager.c
> send desc error
> [73] Abort: [] Got completion with error 12, vendor code=0, dest rank=93
>
>  at line 581 in file ibv_channel_manager.c
>
>
> Just in case:
> # ulimit -a
> core file size          (blocks, -c) 0
> data seg size           (kbytes, -d) unlimited
> scheduling priority             (-e) 0
> file size               (blocks, -f) unlimited
> pending signals                 (-i) 204800
> max locked memory       (kbytes, -l) unlimited
> max memory size         (kbytes, -m) unlimited
> open files                      (-n) 1024
> pipe size            (512 bytes, -p) 8
> POSIX message queues     (bytes, -q) 819200
> real-time priority              (-r) 0
> stack size              (kbytes, -s) unlimited
> cpu time               (seconds, -t) unlimited
> max user processes              (-u) 204800
> virtual memory          (kbytes, -v) unlimited
> file locks                      (-x) unlimited
>
>
>
> JC
>
>
> sreeram potluri wrote:
>
>> Hi,
>>
>> Please try these parameters and values:
>>
>> MV2_VBUF_TOTAL_SIZE=2048
>> MV2_NUM_RDMA_BUFFER=4
>> MV2_IBA_EAGER_THRESHOLD=2044
>> Please watch for any performance degradation and let us know.
>>
>> Thank you
>> Sreeram Potluri
>>
>>
>> On Thu, Jan 7, 2010 at 12:55 AM, Rajeev Thakur <thakur at mcs.anl.gov> wrote:
>>
>>    I am forwarding your note to the mvapich-discuss mailing list.
>>
>>    Rajeev
>>
>>    -----Original Message-----
>>    From: mpich-discuss-bounces at mcs.anl.gov
>>    [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of
>>    Jean-Christophe Ducom
>>    Sent: Wednesday, January 06, 2010 3:09 PM
>>    To: mpich-discuss at mcs.anl.gov
>>    Subject: [mpich-discuss] Abort: fail to register rdma memory
>>
>>    All-
>>    The system is a cluster of 8-core Nehalem nodes (E5520 @ 2.27GHz) with
>>    24GB of memory and InfiniPath_QLE7240 cards.
>>    The nodes are running RHEL5.4 with mvapich2/1.4 compiled with Intel 9.0.
>>
>>    When I run a medium-size (16 nodes/128 cores) CFD simulation, the run
>>    stops with the following error message (it runs fine with
>>    64 cores):
>>    [...]
>>    [49] Abort: fail to register rdma memory, size 32768
>>     at line 105 in file ibv_priv.c
>>    [51] Abort: fail to register rdma memory, size 32768
>>     at line 105 in file ibv_priv.c
>>    [47] Abort: fail to register rdma memory, size 32768
>>     at line 105 in file ibv_priv.c
>>    [50] Abort: fail to register rdma memory, size 32768
>>     at line 105 in file ibv_priv.c
>>    send desc error
>>    [58] Abort: send desc error
>>    [60] Abort: [] Got completion with error 12, vendor code=0, dest rank=52
>>     at line 581 in file ibv_channel_manager.c
>>    send desc error
>>    [62] Abort: [] Got completion with error 12, vendor code=0, dest rank=52
>>     at line 581 in file ibv_channel_manager.c
>>    send desc error
>>    [118] Abort: [] Got completion with error 12, vendor code=0, dest rank=52
>>     at line 581 in file ibv_channel_manager.c
>>    send desc error
>>    [65] Abort: [] Got completion with error 12, vendor code=0, dest rank=52
>>     at line 581 in file ibv_channel_manager.c
>>    send desc error
>>    [116] Abort: [] Got completion with error 12, vendor code=0, dest rank=52
>>     at line 581 in file ibv_channel_manager.c
>>    send desc error
>>    [64] Abort: [] Got completion with error 12, vendor code=0, dest rank=52
>>     at line 581 in file ibv_channel_manager.c
>>    send desc error
>>    [76] Abort: [] Got completion with error 12, vendor code=0, dest rank=20
>>     at line 581 in file ibv_channel_manager.c
>>    [] Got completion with error 12, vendor code=0, dest rank=52
>>     at line 581 in file ibv_channel_manager.c
>>    send desc error
>>    [63] Abort: [] Got completion with error 12, vendor code=0, dest rank=52
>>     at line 581 in file ibv_channel_manager.c
>>    send desc error
>>    [84] Abort: [] Got completion with error 12, vendor code=0, dest rank=65
>>     at line 581 in file ibv_channel_manager.c
>>    send desc error
>>    [85] Abort: [] Got completion with error 12, vendor code=0, dest rank=20
>>     at line 581 in file ibv_channel_manager.c
>>    [...]
>>
>>    Looking at ibv_priv.c:
>>    mem_handle[i] = register_memory(vbuf_rdma_buf,
>>                                    rdma_vbuf_total_size * num_rdma_buffer, i);
>>    I believe I need to change the runtime parameters
>>    MV2_VBUF_TOTAL_SIZE (and then MV2_IBA_EAGER_THRESHOLD)
>>    MV2_NUM_RDMA_BUFFER
>>    MV2_RDMA_VBUF_POOL_SIZE
>>
>>    Could anyone confirm it and suggest values for them?
>>    Thank you
>>    JC
>>    _______________________________________________
>>    mpich-discuss mailing list
>>    mpich-discuss at mcs.anl.gov
>>
>>    https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
>>
>>    _______________________________________________
>>    mvapich-discuss mailing list
>>    mvapich-discuss at cse.ohio-state.edu
>>
>>    http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>
>>
>>
>