[mvapich-discuss] FW: Abort: fail to register rdma memory

Rajeev Thakur thakur at mcs.anl.gov
Thu Jan 7 00:55:17 EST 2010


I am forwarding your note to the mvapich-discuss mailing list.

Rajeev

-----Original Message-----
From: mpich-discuss-bounces at mcs.anl.gov [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Jean-Christophe Ducom
Sent: Wednesday, January 06, 2010 3:09 PM
To: mpich-discuss at mcs.anl.gov
Subject: [mpich-discuss] Abort: fail to register rdma memory

All-
The system is a cluster of 8-core Nehalem nodes (E5520 @ 2.27GHz) with 24GB of memory and InfiniPath_QLE7240 cards.
The nodes are running RHEL5.4 with mvapich2/1.4 compiled with Intel 9.0.

When I run a medium-size (16 nodes/128 cores) CFD simulation, the run stops with the following error message (it runs fine with
64 cores):
[...]
[49] Abort: fail to register rdma memory, size 32768
  at line 105 in file ibv_priv.c
[51] Abort: fail to register rdma memory, size 32768
  at line 105 in file ibv_priv.c
[47] Abort: fail to register rdma memory, size 32768
  at line 105 in file ibv_priv.c
[50] Abort: fail to register rdma memory, size 32768
  at line 105 in file ibv_priv.c
send desc error
[58] Abort: send desc error
[60] Abort: [] Got completion with error 12, vendor code=0, dest rank=52
  at line 581 in file ibv_channel_manager.c
send desc error
[62] Abort: [] Got completion with error 12, vendor code=0, dest rank=52
  at line 581 in file ibv_channel_manager.c
send desc error
[118] Abort: [] Got completion with error 12, vendor code=0, dest rank=52
  at line 581 in file ibv_channel_manager.c
send desc error
[65] Abort: [] Got completion with error 12, vendor code=0, dest rank=52
  at line 581 in file ibv_channel_manager.c
send desc error
[116] Abort: [] Got completion with error 12, vendor code=0, dest rank=52
  at line 581 in file ibv_channel_manager.c
send desc error
[64] Abort: [] Got completion with error 12, vendor code=0, dest rank=52
  at line 581 in file ibv_channel_manager.c
send desc error
[76] Abort: [] Got completion with error 12, vendor code=0, dest rank=20
  at line 581 in file ibv_channel_manager.c
[] Got completion with error 12, vendor code=0, dest rank=52
  at line 581 in file ibv_channel_manager.c
send desc error
[63] Abort: [] Got completion with error 12, vendor code=0, dest rank=52
  at line 581 in file ibv_channel_manager.c
send desc error
[84] Abort: [] Got completion with error 12, vendor code=0, dest rank=65
  at line 581 in file ibv_channel_manager.c
send desc error
[85] Abort: [] Got completion with error 12, vendor code=0, dest rank=20
  at line 581 in file ibv_channel_manager.c
[...]

Looking at ibv_priv.c:
mem_handle[i] = register_memory(vbuf_rdma_buf,
                                rdma_vbuf_total_size * num_rdma_buffer, i);
I believe I need to change the runtime parameters
MV2_VBUF_TOTAL_SIZE (and then MV2_IBA_EAGER_THRESHOLD)
MV2_NUM_RDMA_BUFFER
MV2_RDMA_VBUF_POOL_SIZE

Could anyone confirm this and suggest suitable values for them?
Thank you
JC


