[mvapich-discuss] FW: Abort: fail to register rdma memory

sreeram potluri potluri at cse.ohio-state.edu
Thu Jan 7 13:21:50 EST 2010


Hi,

Please try these parameters and values:

MV2_VBUF_TOTAL_SIZE=2048
MV2_NUM_RDMA_BUFFER=4
MV2_IBA_EAGER_THRESHOLD=2044

Please watch for any performance degradation with these settings and let us know.
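
For reference, here is a minimal sketch of how these values might be passed at launch time. It assumes the job is started with mpirun_rsh, which accepts NAME=VALUE pairs before the executable. The host file name (hosts) and the executable name (./cfd_solver) are hypothetical placeholders, and the process count of 128 is taken from the original report. The size calculation assumes each registration covers rdma_vbuf_total_size * num_rdma_buffer bytes, as in the ibv_priv.c snippet quoted below, and that the two environment variables map onto those two internal variables.

```shell
#!/bin/sh
# Values suggested in this reply.
MV2_VBUF_TOTAL_SIZE=2048
MV2_NUM_RDMA_BUFFER=4
MV2_IBA_EAGER_THRESHOLD=2044

# Assumption: one registration spans rdma_vbuf_total_size * num_rdma_buffer
# bytes, following the register_memory() call quoted from ibv_priv.c.
reg_bytes=$((MV2_VBUF_TOTAL_SIZE * MV2_NUM_RDMA_BUFFER))
echo "bytes per registration: $reg_bytes"

# Hypothetical launch line: "hosts" and "./cfd_solver" are placeholders.
# The command is only printed here; run it directly on a real cluster.
launch_cmd="mpirun_rsh -np 128 -hostfile hosts \
MV2_VBUF_TOTAL_SIZE=$MV2_VBUF_TOTAL_SIZE \
MV2_NUM_RDMA_BUFFER=$MV2_NUM_RDMA_BUFFER \
MV2_IBA_EAGER_THRESHOLD=$MV2_IBA_EAGER_THRESHOLD \
./cfd_solver"
echo "$launch_cmd"
```

Under those assumptions the product is 8192 bytes, a quarter of the 32768 bytes reported in the failing registration.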

Thank you
Sreeram Potluri


On Thu, Jan 7, 2010 at 12:55 AM, Rajeev Thakur <thakur at mcs.anl.gov> wrote:

> I am forwarding your note to the mvapich-discuss mailing list.
>
> Rajeev
>
> -----Original Message-----
> From: mpich-discuss-bounces at mcs.anl.gov [mailto:
> mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Jean-Christophe Ducom
> Sent: Wednesday, January 06, 2010 3:09 PM
> To: mpich-discuss at mcs.anl.gov
> Subject: [mpich-discuss] Abort: fail to register rdma memory
>
> All-
> The system is a cluster of 8-core Nehalem nodes (E5520 @ 2.27GHz) with 24GB
> of memory and InfiniPath_QLE7240 cards.
> The nodes are running RHEL5.4 with mvapich2/1.4 compiled with Intel9.0.
>
> When I run a medium size (16nodes/128cores) CFD simulation, the run stops
> with the following error message (it runs fine with 64cores)
> [...]
> [49] Abort: fail to register rdma memory, size 32768
>  at line 105 in file ibv_priv.c
> [51] Abort: fail to register rdma memory, size 32768
>  at line 105 in file ibv_priv.c
> [47] Abort: fail to register rdma memory, size 32768
>  at line 105 in file ibv_priv.c
> [50] Abort: fail to register rdma memory, size 32768
>  at line 105 in file ibv_priv.c
> send desc error
> [58] Abort: send desc error
> [60] Abort: [] Got completion with error 12, vendor code=0, dest rank=52
>  at line 581 in file ibv_channel_manager.c
> send desc error
> [62] Abort: [] Got completion with error 12, vendor code=0, dest rank=52
>  at line 581 in file ibv_channel_manager.c
> send desc error
> [118] Abort: [] Got completion with error 12, vendor code=0, dest rank=52
>  at line 581 in file ibv_channel_manager.c
> send desc error
> [65] Abort: [] Got completion with error 12, vendor code=0, dest rank=52
>  at line 581 in file ibv_channel_manager.c
> send desc error
> [116] Abort: [] Got completion with error 12, vendor code=0, dest rank=52
>  at line 581 in file ibv_channel_manager.c
> send desc error
> [64] Abort: [] Got completion with error 12, vendor code=0, dest rank=52
>  at line 581 in file ibv_channel_manager.c
> send desc error
> [76] Abort: [] Got completion with error 12, vendor code=0, dest rank=20
>  at line 581 in file ibv_channel_manager.c
> [] Got completion with error 12, vendor code=0, dest rank=52
>  at line 581 in file ibv_channel_manager.c
> send desc error
> [63] Abort: [] Got completion with error 12, vendor code=0, dest rank=52
>  at line 581 in file ibv_channel_manager.c
> send desc error
> [84] Abort: [] Got completion with error 12, vendor code=0, dest rank=65
>  at line 581 in file ibv_channel_manager.c
> send desc error
> [85] Abort: [] Got completion with error 12, vendor code=0, dest rank=20
>  at line 581 in file ibv_channel_manager.c
> [...]
>
> Looking at the ibv_priv.c:
> mem_handle[i] =  register_memory(vbuf_rdma_buf,
>                                  rdma_vbuf_total_size * num_rdma_buffer,
>                                  i);
> I believe I need to change the runtime parameters
> MV2_VBUF_TOTAL_SIZE (and then MV2_IBA_EAGER_THRESHOLD)
> MV2_NUM_RDMA_BUFFER
> MV2_RDMA_VBUF_POOL_SIZE
>
> Could anyone confirm it and suggest values for them?
> Thank you
> JC
> _______________________________________________
> mpich-discuss mailing list
> mpich-discuss at mcs.anl.gov
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>