[mvapich-discuss] FW: Abort: fail to register rdma memory
sreeram potluri
potluri at cse.ohio-state.edu
Thu Jan 7 14:26:48 EST 2010
Hi,
Can you please try setting MV2_USE_RDMA_FAST_PATH=0?
Thank you
Sreeram Potluri
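
A sketch of how this setting could be passed on the command line, assuming the same mpiexec launcher and "-env NAME VALUE" syntax as the command quoted below. It only builds and prints the command. Whether to keep the three earlier tuning variables alongside it is not stated in this thread, so they are left out here.

```shell
# Assumption: mpiexec accepts "-env NAME VALUE", as in the quoted command.
# The machinefile path, rank count, and binary name are copied from that command.
cmd="mpiexec -machinefile ./machinefile -n 128 -env MV2_USE_RDMA_FAST_PATH 0 cdp_if"

# Print the command for review instead of launching it.
echo "$cmd"
```

If the variable is exported from .bashrc instead, the equivalent line would be "export MV2_USE_RDMA_FAST_PATH=0", provided the remote shells actually source that file.
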
On Thu, Jan 7, 2010 at 1:57 PM, Jean-Christophe Ducom <jcducom at gmail.com> wrote:
> Sreeram-
> Thanks for the quick reply.
> After adding the variables in the .bashrc:
> mpiexec -machinefile ./machinefile -n 128 -env MV2_VBUF_TOTAL_SIZE 2048
> -env MV2_NUM_RDMA_BUFFER 4 -env MV2_IBA_EAGER_THRESHOLD 2044 cdp_if
> returns the error message:
>
> [96] Abort: fail to register rdma memory, size 8192
>
> at line 105 in file ibv_priv.c
> [98] Abort: fail to register rdma memory, size 8192
>
> at line 105 in file ibv_priv.c
> [97] Abort: fail to register rdma memory, size 8192
>
> at line 105 in file ibv_priv.c
> [92] Abort: fail to register rdma memory, size 8192
>
> at line 105 in file ibv_priv.c
> send desc error
> [89] Abort: [] Got completion with error 12, vendor code=0, dest rank=92
>
> at line 581 in file ibv_channel_manager.c
> send desc error
> [67] Abort: [] Got completion with error 12, vendor code=0, dest rank=97
>
> at line 581 in file ibv_channel_manager.c
> send desc error
> [66] Abort: [] Got completion with error 12, vendor code=0, dest rank=97
>
> at line 581 in file ibv_channel_manager.c
> send desc error
> [106] Abort: [] Got completion with error 12, vendor code=0, dest rank=98
>
> at line 581 in file ibv_channel_manager.c
> send desc error
> [24] Abort: [] Got completion with error 12, vendor code=0, dest rank=92
>
> at line 581 in file ibv_channel_manager.c
> send desc error
> [104] Abort: [] Got completion with error 12, vendor code=0, dest rank=98
>
> at line 581 in file ibv_channel_manager.c
> send desc error
> [76] Abort: [] Got completion with error 12, vendor code=0, dest rank=83
>
> at line 581 in file ibv_channel_manager.c
> send desc error
> [80] Abort: [] Got completion with error 12, vendor code=0, dest rank=86
>
> at line 581 in file ibv_channel_manager.c
> send desc error
> [73] Abort: [] Got completion with error 12, vendor code=0, dest rank=93
>
> at line 581 in file ibv_channel_manager.c
>
>
> Just in case:
> # ulimit -a
> core file size (blocks, -c) 0
> data seg size (kbytes, -d) unlimited
> scheduling priority (-e) 0
> file size (blocks, -f) unlimited
> pending signals (-i) 204800
> max locked memory (kbytes, -l) unlimited
> max memory size (kbytes, -m) unlimited
> open files (-n) 1024
> pipe size (512 bytes, -p) 8
> POSIX message queues (bytes, -q) 819200
> real-time priority (-r) 0
> stack size (kbytes, -s) unlimited
> cpu time (seconds, -t) unlimited
> max user processes (-u) 204800
> virtual memory (kbytes, -v) unlimited
> file locks (-x) unlimited
>
>
>
> JC
>
>
> sreeram potluri wrote:
>
>> Hi,
>>
>> Please try these parameters and values:
>>
>> MV2_VBUF_TOTAL_SIZE=2048
>> MV2_NUM_RDMA_BUFFER=4
>> MV2_IBA_EAGER_THRESHOLD=2044
>> Please watch for any performance degradation and let us know.
>>
>> Thank you
>> Sreeram Potluri
>>
>>
>> On Thu, Jan 7, 2010 at 12:55 AM, Rajeev Thakur <thakur at mcs.anl.gov> wrote:
>>
>> I am forwarding your note to the mvapich-discuss mailing list.
>>
>> Rajeev
>>
>> -----Original Message-----
>> From: mpich-discuss-bounces at mcs.anl.gov
>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of
>> Jean-Christophe Ducom
>> Sent: Wednesday, January 06, 2010 3:09 PM
>> To: mpich-discuss at mcs.anl.gov
>> Subject: [mpich-discuss] Abort: fail to register rdma memory
>>
>> All-
>> The system is a cluster of 8-core Nehalem nodes (E5520 @ 2.27GHz) with
>> 24GB of memory and InfiniPath_QLE7240 cards.
>> The nodes are running RHEL5.4 with mvapich2/1.4 compiled with Intel9.0.
>>
>> When I run a medium size (16nodes/128cores) CFD simulation, the run
>> stops with the following error message (it runs fine with
>> 64cores):
>> [...]
>> [49] Abort: fail to register rdma memory, size 32768
>> at line 105 in file ibv_priv.c
>> [51] Abort: fail to register rdma memory, size 32768
>> at line 105 in file ibv_priv.c
>> [47] Abort: fail to register rdma memory, size 32768
>> at line 105 in file ibv_priv.c
>> [50] Abort: fail to register rdma memory, size 32768
>> at line 105 in file ibv_priv.c
>> send desc error
>> [58] Abort: send desc error
>> [60] Abort: [] Got completion with error 12, vendor code=0, dest rank=52
>> at line 581 in file ibv_channel_manager.c
>> send desc error
>> [62] Abort: [] Got completion with error 12, vendor code=0, dest rank=52
>> at line 581 in file ibv_channel_manager.c
>> send desc error
>> [118] Abort: [] Got completion with error 12, vendor code=0, dest rank=52
>> at line 581 in file ibv_channel_manager.c
>> send desc error
>> [65] Abort: [] Got completion with error 12, vendor code=0, dest rank=52
>> at line 581 in file ibv_channel_manager.c
>> send desc error
>> [116] Abort: [] Got completion with error 12, vendor code=0, dest rank=52
>> at line 581 in file ibv_channel_manager.c
>> send desc error
>> [64] Abort: [] Got completion with error 12, vendor code=0, dest rank=52
>> at line 581 in file ibv_channel_manager.c
>> send desc error
>> [76] Abort: [] Got completion with error 12, vendor code=0, dest rank=20
>> at line 581 in file ibv_channel_manager.c
>> [] Got completion with error 12, vendor code=0, dest rank=52
>> at line 581 in file ibv_channel_manager.c
>> send desc error
>> [63] Abort: [] Got completion with error 12, vendor code=0, dest rank=52
>> at line 581 in file ibv_channel_manager.c
>> send desc error
>> [84] Abort: [] Got completion with error 12, vendor code=0, dest rank=65
>> at line 581 in file ibv_channel_manager.c
>> send desc error
>> [85] Abort: [] Got completion with error 12, vendor code=0, dest rank=20
>> at line 581 in file ibv_channel_manager.c
>> [...]
>>
>> Looking at the ibv_priv.c:
>> mem_handle[i] = register_memory(vbuf_rdma_buf,
>>     rdma_vbuf_total_size * num_rdma_buffer, i);
>> I believe I need to change the runtime parameters
>> MV2_VBUF_TOTAL_SIZE (and then MV2_IBA_EAGER_THRESHOLD)
>> MV2_NUM_RDMA_BUFFER
>> MV2_RDMA_VBUF_POOL_SIZE
>>
>> Could anyone confirm it and suggest values for them?
>> Thank you
>> JC
>> _______________________________________________
>> mpich-discuss mailing list
>> mpich-discuss at mcs.anl.gov
>>
>> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
>>
>> _______________________________________________
>> mvapich-discuss mailing list
>> mvapich-discuss at cse.ohio-state.edu
>>
>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>
>>
>>
>
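
As a quick check of the register_memory call quoted above, where the registered size is rdma_vbuf_total_size * num_rdma_buffer: the suggested values give 2048 * 4 = 8192, which matches the "size 8192" reported in the second set of errors. A minimal sketch of that arithmetic, assuming MV2_VBUF_TOTAL_SIZE and MV2_NUM_RDMA_BUFFER map directly onto those two variables (the shell variable names here are illustrative only):

```shell
# Values taken from the suggested settings in this thread.
vbuf_total_size=2048   # MV2_VBUF_TOTAL_SIZE
num_rdma_buffer=4      # MV2_NUM_RDMA_BUFFER

# Size passed to register_memory per the quoted ibv_priv.c line
# (assumption: no other factor enters the product).
reg_size=$((vbuf_total_size * num_rdma_buffer))

echo "$reg_size"   # prints 8192
```

So the new parameters did take effect (the per-registration size dropped from 32768 to 8192 bytes), yet registration still failed, which suggests the failure is not driven by the size of a single buffer region alone.
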