[mvapich-discuss] FW: Abort: fail to register rdma memory

Jean-Christophe Ducom jcducom at gmail.com
Thu Jan 7 16:42:15 EST 2010


Sreeram-
I ran pair-wise tests using osu_latency.
# OSU MPI Latency Test v3.1.2
# Size            Latency (us)
0                         0.78
1                         0.99
2                         0.98
4                         0.99
8                         0.99
16                        0.99
32                        1.02
64                        1.05
128                       1.14
256                       1.29
512                       1.42
1024                      1.68
2048                      1.89
4096                      2.56
8192                      3.95
16384                     6.63
32768                    14.08
65536                    27.54
131072                   45.23
262144                   78.25
524288                  140.83
1048576                 267.52
2097152                 529.97
4194304                1197.03


The results for each pair fall within 10% of these numbers at large message 
sizes (I can do a more thorough statistical analysis if needed).
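
A rough sketch of how the 10% comparison between node pairs could be scripted, in case it helps. The function names and the second output block are made up for illustration (the "other pair" numbers are NOT measurements from this cluster); only the reference rows come from the table above.

```python
def parse_osu_latency(text):
    """Return {message_size: latency_us} parsed from osu_latency output."""
    results = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and "#" header lines such as "# Size  Latency (us)".
        if len(fields) != 2 or line.lstrip().startswith("#"):
            continue
        results[int(fields[0])] = float(fields[1])
    return results


def outliers(reference, candidate, tolerance=0.10):
    """List (size, ref, got) where got deviates from ref by more than tolerance."""
    bad = []
    for size, ref in sorted(reference.items()):
        if size not in candidate:
            continue
        got = candidate[size]
        if abs(got - ref) > tolerance * ref:
            bad.append((size, ref, got))
    return bad


# Large-message rows copied from the table above.
REFERENCE = """\
# OSU MPI Latency Test v3.1.2
# Size            Latency (us)
1048576                 267.52
2097152                 529.97
4194304                1197.03
"""

# Hypothetical output from another node pair, invented for this example.
OTHER_PAIR = """\
# OSU MPI Latency Test v3.1.2
# Size            Latency (us)
1048576                 270.00
2097152                 600.00
4194304                1200.00
"""

flagged = outliers(parse_osu_latency(REFERENCE), parse_osu_latency(OTHER_PAIR))
for size, ref, got in flagged:
    print(f"size {size}: {got:.2f} us vs reference {ref:.2f} us")
```

Running one such comparison per node pair would flag any pair whose latency drifts more than 10% from the reference run.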
JC
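
A quick arithmetic check on the abort sizes, as a sketch only. The ibv_priv.c line quoted further down registers rdma_vbuf_total_size * num_rdma_buffer bytes per call. Assuming MV2_VBUF_TOTAL_SIZE and MV2_NUM_RDMA_BUFFER map directly onto those two variables (an assumption based on the names, not checked against the source), the suggested values 2048 and 4 give 8192, which matches the "size 8192" in the second set of abort messages. The helper name is made up for illustration.

```python
def rdma_region_bytes(vbuf_total_size, num_rdma_buffer):
    # Mirrors the quoted expression: rdma_vbuf_total_size * num_rdma_buffer.
    return vbuf_total_size * num_rdma_buffer


# Values suggested in the thread (assumed to map 1:1 onto the C variables).
tuned = rdma_region_bytes(2048, 4)
print(tuned)  # 8192, the size reported in the second abort message

# The first abort reported 32768 bytes. The defaults that produced it are
# not shown in the thread, only their product.
original = 32768
print(original // tuned)  # 4: the tuned request is a quarter of the original
```

So the registration still failed after the requested region shrank fourfold.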



sreeram potluri wrote:
> Hi,
> 
> To rule out any system issues, can you please try running pair-wise 
> tests using osu_latency test between the nodes you are using for the 128 
> process run?
> 
> Thank you
> Sreeram Potluri
> 
> FYI, when you use MV2_USE_RDMA_FAST_PATH=0, you need not specify the 
> other parameters, as this parameter disables the RDMA message path.
> 
> On Thu, Jan 7, 2010 at 2:58 PM, Jean-Christophe Ducom <jcducom at gmail.com> wrote:
> 
>     Sreeram-
>     After setting the new variable in .bashrc, I get:
>     mpiexec -machinefile ./machinefile -n 128 -env MV2_VBUF_TOTAL_SIZE
>     2048 -env MV2_NUM_RDMA_BUFFER 4 -env MV2_IBA_EAGER_THRESHOLD 2044
>     -env MV2_USE_RDMA_FAST_PATH 0 cdp_if
>     send desc error
>     [7] Abort: [] Got completion with error 12, vendor code=0, dest rank=47
> 
>      at line 581 in file ibv_channel_manager.c
>     send desc error
>     [12] Abort: [] Got completion with error 12, vendor code=0, dest rank=47
> 
>      at line 581 in file ibv_channel_manager.c
>     send desc error
>     [21] Abort: [] Got completion with error 12, vendor code=0, dest rank=12
> 
>      at line 581 in file ibv_channel_manager.c
>     send desc error
>     [45] Abort: [] Got completion with error 12, vendor code=0, dest rank=12
> 
>      at line 581 in file ibv_channel_manager.c
>     send desc error
>     [27] Abort: [] Got completion with error 12, vendor code=0, dest rank=21
> 
>      at line 581 in file ibv_channel_manager.c
>     send desc error
>     [9] Abort: [] Got completion with error 12, vendor code=0, dest rank=8
> 
>      at line 581 in file ibv_channel_manager.c
>     send desc error
>     [40] Abort: [] Got completion with error 12, vendor code=0, dest rank=45
> 
>      at line 581 in file ibv_channel_manager.c
>     send desc error
>     [39] Abort: [] Got completion with error 12, vendor code=0, dest rank=14
> 
>      at line 581 in file ibv_channel_manager.c
> 
>     JC
> 
>     sreeram potluri wrote:
> 
>         Hi,
> 
>         Can you please try using MV2_USE_RDMA_FAST_PATH=0 ?
> 
>         Thank you
>         Sreeram Potluri
> 
>         On Thu, Jan 7, 2010 at 1:57 PM, Jean-Christophe Ducom
>         <jcducom at gmail.com> wrote:
> 
>            Sreeram-
>            Thanks for the quick reply.
>            After adding the variables in the .bashrc:
>            mpiexec -machinefile ./machinefile -n 128 -env MV2_VBUF_TOTAL_SIZE
>            2048 -env MV2_NUM_RDMA_BUFFER 4 -env MV2_IBA_EAGER_THRESHOLD 2044
>            cdp_if
>            returns the error message:
> 
>            [96] Abort: fail to register rdma memory, size 8192
> 
>             at line 105 in file ibv_priv.c
>            [98] Abort: fail to register rdma memory, size 8192
> 
>             at line 105 in file ibv_priv.c
>            [97] Abort: fail to register rdma memory, size 8192
> 
>             at line 105 in file ibv_priv.c
>            [92] Abort: fail to register rdma memory, size 8192
> 
>             at line 105 in file ibv_priv.c
>            send desc error
>            [89] Abort: [] Got completion with error 12, vendor code=0, dest rank=92
> 
>             at line 581 in file ibv_channel_manager.c
>            send desc error
>            [67] Abort: [] Got completion with error 12, vendor code=0, dest rank=97
> 
>             at line 581 in file ibv_channel_manager.c
>            send desc error
>            [66] Abort: [] Got completion with error 12, vendor code=0, dest rank=97
> 
>             at line 581 in file ibv_channel_manager.c
>            send desc error
>            [106] Abort: [] Got completion with error 12, vendor code=0, dest rank=98
> 
>             at line 581 in file ibv_channel_manager.c
>            send desc error
>            [24] Abort: [] Got completion with error 12, vendor code=0, dest rank=92
> 
>             at line 581 in file ibv_channel_manager.c
>            send desc error
>            [104] Abort: [] Got completion with error 12, vendor code=0, dest rank=98
> 
>             at line 581 in file ibv_channel_manager.c
>            send desc error
>            [76] Abort: [] Got completion with error 12, vendor code=0, dest rank=83
> 
>             at line 581 in file ibv_channel_manager.c
>            send desc error
>            [80] Abort: [] Got completion with error 12, vendor code=0, dest rank=86
> 
>             at line 581 in file ibv_channel_manager.c
>            send desc error
>            [73] Abort: [] Got completion with error 12, vendor code=0, dest rank=93
> 
>             at line 581 in file ibv_channel_manager.c
> 
> 
>            Just in case:
>            # ulimit -a
>            core file size          (blocks, -c) 0
>            data seg size           (kbytes, -d) unlimited
>            scheduling priority             (-e) 0
>            file size               (blocks, -f) unlimited
>            pending signals                 (-i) 204800
>            max locked memory       (kbytes, -l) unlimited
>            max memory size         (kbytes, -m) unlimited
>            open files                      (-n) 1024
>            pipe size            (512 bytes, -p) 8
>            POSIX message queues     (bytes, -q) 819200
>            real-time priority              (-r) 0
>            stack size              (kbytes, -s) unlimited
>            cpu time               (seconds, -t) unlimited
>            max user processes              (-u) 204800
>            virtual memory          (kbytes, -v) unlimited
>            file locks                      (-x) unlimited
> 
> 
> 
>            JC
> 
> 
>            sreeram potluri wrote:
> 
>                Hi,
> 
>                Please try these parameters and values:
> 
>                MV2_VBUF_TOTAL_SIZE=2048
>                MV2_NUM_RDMA_BUFFER=4
>                MV2_IBA_EAGER_THRESHOLD=2044
>                Please note any performance degradation and let us know.
> 
>                Thank you
>                Sreeram Potluri
> 
> 
> 
>                On Thu, Jan 7, 2010 at 12:55 AM, Rajeev Thakur
>                <thakur at mcs.anl.gov> wrote:
> 
>                   I am forwarding your note to the mvapich-discuss mailing list.
> 
>                   Rajeev
> 
>                   -----Original Message-----
>                   From: mpich-discuss-bounces at mcs.anl.gov
>                   [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of
>                   Jean-Christophe Ducom
>                   Sent: Wednesday, January 06, 2010 3:09 PM
>                   To: mpich-discuss at mcs.anl.gov
>                   Subject: [mpich-discuss] Abort: fail to register rdma memory
> 
>                   All-
>                   The system is a cluster of Nehalem 8-core nodes (E5520 @ 2.27GHz)
>                   with 24GB of memory and InfiniPath_QLE7240 cards.
>                   The nodes are running RHEL5.4 with mvapich2/1.4 compiled with
>                   Intel9.0.
> 
>                   When I run a medium size (16nodes/128cores) CFD simulation,
>                   the run stops with the following error message (it runs fine
>                   with 64cores):
>                   [...]
>                   [49] Abort: fail to register rdma memory, size 32768
>                    at line 105 in file ibv_priv.c
>                   [51] Abort: fail to register rdma memory, size 32768
>                    at line 105 in file ibv_priv.c
>                   [47] Abort: fail to register rdma memory, size 32768
>                    at line 105 in file ibv_priv.c
>                   [50] Abort: fail to register rdma memory, size 32768
>                    at line 105 in file ibv_priv.c
>                   send desc error
>                   [58] Abort: send desc error
>                   [60] Abort: [] Got completion with error 12, vendor code=0, dest rank=52
>                    at line 581 in file ibv_channel_manager.c
>                   send desc error
>                   [62] Abort: [] Got completion with error 12, vendor code=0, dest rank=52
>                    at line 581 in file ibv_channel_manager.c
>                   send desc error
>                   [118] Abort: [] Got completion with error 12, vendor code=0, dest rank=52
>                    at line 581 in file ibv_channel_manager.c
>                   send desc error
>                   [65] Abort: [] Got completion with error 12, vendor code=0, dest rank=52
>                    at line 581 in file ibv_channel_manager.c
>                   send desc error
>                   [116] Abort: [] Got completion with error 12, vendor code=0, dest rank=52
>                    at line 581 in file ibv_channel_manager.c
>                   send desc error
>                   [64] Abort: [] Got completion with error 12, vendor code=0, dest rank=52
>                    at line 581 in file ibv_channel_manager.c
>                   send desc error
>                   [76] Abort: [] Got completion with error 12, vendor code=0, dest rank=20
>                    at line 581 in file ibv_channel_manager.c
>                   [] Got completion with error 12, vendor code=0, dest rank=52
>                    at line 581 in file ibv_channel_manager.c
>                   send desc error
>                   [63] Abort: [] Got completion with error 12, vendor code=0, dest rank=52
>                    at line 581 in file ibv_channel_manager.c
>                   send desc error
>                   [84] Abort: [] Got completion with error 12, vendor code=0, dest rank=65
>                    at line 581 in file ibv_channel_manager.c
>                   send desc error
>                   [85] Abort: [] Got completion with error 12, vendor code=0, dest rank=20
>                    at line 581 in file ibv_channel_manager.c
>                   [...]
> 
>                   Looking at ibv_priv.c:
>                   mem_handle[i] =  register_memory(vbuf_rdma_buf,
>                                                    rdma_vbuf_total_size *
>                                                    num_rdma_buffer, i);
>                   I believe I need to change the runtime parameters
>                   MV2_VBUF_TOTAL_SIZE (and then MV2_IBA_EAGER_THRESHOLD)
>                   MV2_NUM_RDMA_BUFFER
>                   MV2_RDMA_VBUF_POOL_SIZE
> 
>                   Could anyone confirm it and suggest values for them?
>                   Thank you
>                   JC
>                   _______________________________________________
>                   mpich-discuss mailing list
>                   mpich-discuss at mcs.anl.gov
>                   https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
> 
>                   _______________________________________________
>                   mvapich-discuss mailing list
>                   mvapich-discuss at cse.ohio-state.edu
>                   http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> 
> 
> 
> 
> 
> 


