[mvapich-discuss] SIGSEGV in MPIDI_CH3I_MRAIL_Parse_header

Johannes Ziegenbalg johannes.ziegenbalg at zih.tu-dresden.de
Wed Jun 24 16:58:46 EDT 2015


Hello.

I'm a student at Dresden University, working on my diploma thesis. I
have an MPI client-server application (MPI_Comm_connect/MPI_Comm_accept),
and the server receives a SIGSEGV in the function
MPIDI_CH3I_MRAIL_Parse_header when the client connects.
After a little research I found that it occurs in the macro SET_CREDIT,
at mpid/ch3/channels/mrail/src/gen2/ibv_recv.c:382.
Further investigation showed that the array vc->mrail.srp.credits is
still unallocated at that point.
A later instance of an MPIDI_CH3_VC, created on another thread, is
initialized correctly (with MRAILI_Init_vc), but it is not the one
being used (see attachment).
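
To illustrate the failure mode described above, here is a minimal
sketch in C. It does not use the real MVAPICH2 types: toy_vc_t,
toy_srp_t and all toy_* functions are hypothetical stand-ins that I
made up for this example. The only things taken from the report are
that a per-rail credits array hangs off the VC, that it is written
while a packet header is parsed, and that one VC never had it
allocated. The guard in toy_set_credit is an assumption about how such
a write could be made safe, not a description of the actual macro.

```c
#include <stdlib.h>

/* Hypothetical stand-ins; NOT the real MVAPICH2 definitions. */
typedef struct {
    int *credits;    /* per-rail credit counters; NULL until allocated */
    int  num_rails;  /* number of entries in credits */
} toy_srp_t;

typedef struct {
    toy_srp_t srp;
} toy_vc_t;

/* Plays the role of the early VC setup: credits stays unallocated. */
void toy_vc_init(toy_vc_t *vc)
{
    vc->srp.credits = NULL;
    vc->srp.num_rails = 0;
}

/* Plays the role of the later, full initialization.
 * Returns 0 on success, -1 on bad input or allocation failure. */
int toy_vc_alloc_credits(toy_vc_t *vc, int num_rails)
{
    int *credits;

    if (num_rails <= 0)
        return -1;
    credits = calloc((size_t)num_rails, sizeof *credits);
    if (credits == NULL)
        return -1;
    vc->srp.credits = credits;
    vc->srp.num_rails = num_rails;
    return 0;
}

/* Guarded credit update: reports an error instead of writing through
 * a NULL pointer or past the end of the array. */
int toy_set_credit(toy_vc_t *vc, int rail, int remote_credit)
{
    if (vc->srp.credits == NULL || rail < 0 || rail >= vc->srp.num_rails)
        return -1;
    vc->srp.credits[rail] += remote_credit;
    return 0;
}

/* Releases the credits array and resets the VC to its early state. */
void toy_vc_free(toy_vc_t *vc)
{
    free(vc->srp.credits);
    vc->srp.credits = NULL;
    vc->srp.num_rails = 0;
}
```

With two VCs where only the second one goes through
toy_vc_alloc_credits, a toy_set_credit call on the first returns -1,
while the same call on the second succeeds. Without the NULL check,
the first call would write through a NULL pointer, which is the kind
of access that typically ends in a SIGSEGV.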

The debug output and the back-traces from GDB are attached.

My configure call is: ./configure --enable-threads=multiple
--enable-fortran=all --with-pm=slurm --with-pmi=pmi1 --with-rdma=gen2
--enable-hybrid --with-device=ch3:mrail --enable-g=dbg,log,most -q
--enable-debuginfo --disable-fast

Thanks in advance for your help!

Regards,
Johannes Ziegenbalg
-------------- next part --------------
==============================================================================
Results from GDB 
==============================================================================

Thread 1/4
#0  MPIDI_CH3_VC_Init (vc=0x613350) at ../src/mpid/ch3/channels/mrail/src/rdma/ch3_init.c:529
#1  0x00007ffff7611a88 in MPIDI_VC_Init (vc=0x613350, pg=0x6131e0, rank=0) at ../src/mpid/ch3/src/mpid_vc.c:863
#2  0x00007ffff7617dd1 in MPIDI_PG_Create (vct_sz=1, pg_id=0x60db40, pg_ptr=0x7fffffffd150) at ../src/mpid/ch3/src/mpidi_pg.c:217
#3  0x00007ffff76031ee in init_pg (argc=0x7fffffffd2fc, argv=0x7fffffffd2f0, has_args=0x7fffffffd27c, has_env=0x7fffffffd278, has_parent=0x7fffffffd208, pg_rank_p=0x7fffffffd1fc, pg_p=0x7fffffffd200) at ../src/mpid/ch3/src/mpid_init.c:724
#4  0x00007ffff760251e in MPID_Init (argc=0x7fffffffd2fc, argv=0x7fffffffd2f0, requested=0, provided=0x7fffffffd274, has_args=0x7fffffffd27c, has_env=0x7fffffffd278) at ../src/mpid/ch3/src/mpid_init.c:311
#5  0x00007ffff74dbf5f in MPIR_Init_thread (argc=0x7fffffffd2fc, argv=0x7fffffffd2f0, required=0, provided=0x7fffffffd2c0) at ../src/mpi/init/initthread.c:512
#6  0x00007ffff74da9f7 in PMPI_Init (argc=0x7fffffffd2fc, argv=0x7fffffffd2f0) at ../src/mpi/init/init.c:195
#7  0x00000000004027c3 in daemon_initalize(int, char**, daemon_data_struct*) ()
#8  0x00000000004029b7 in main ()


Backtraces taken when the client connects to the server:


Thread 3/4
#0  MPIDI_CH3_VC_Init (vc=0x8b9800) at ../src/mpid/ch3/channels/mrail/src/rdma/ch3_init.c:529
#1  0x00007ffff7611a88 in MPIDI_VC_Init (vc=0x8b9800, pg=0x0, rank=0) at ../src/mpid/ch3/src/mpid_vc.c:863
#2  0x00007ffff76a556f in cm_handle_msg (msg=0x62d3a0) at ../src/mpid/ch3/channels/common/src/cm/cm.c:1689
#3  0x00007ffff76a63f0 in cm_completion_handler (arg=0x0) at ../src/mpid/ch3/channels/common/src/cm/cm.c:1972
#4  0x000000399d8079d1 in start_thread () from /lib64/libpthread.so.0
#5  0x000000399cce88fd in clone () from /lib64/libc.so.6

#0  MRAILI_Init_vc (vc=0x8b9800) at ../src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:1455
#1  0x00007ffff76a33ae in cm_accept_nopg (vc=0x8b9800, msg=0x62d3a0) at ../src/mpid/ch3/channels/common/src/cm/cm.c:1228
#2  0x00007ffff76a55f3 in cm_handle_msg (msg=0x62d3a0) at ../src/mpid/ch3/channels/common/src/cm/cm.c:1695
#3  0x00007ffff76a63f0 in cm_completion_handler (arg=0x0) at ../src/mpid/ch3/channels/common/src/cm/cm.c:1972
#4  0x000000399d8079d1 in start_thread () from /lib64/libpthread.so.0
#5  0x000000399cce88fd in clone () from /lib64/libc.so.6

#0  MPIDI_CH3_VC_Init (vc=0x7403e0) at ../src/mpid/ch3/channels/mrail/src/rdma/ch3_init.c:529
#1  0x00007ffff7611a88 in MPIDI_VC_Init (vc=0x7403e0, pg=0x0, rank=0) at ../src/mpid/ch3/src/mpid_vc.c:863
#2  0x00007ffff76a556f in cm_handle_msg (msg=0x62d718) at ../src/mpid/ch3/channels/common/src/cm/cm.c:1689
#3  0x00007ffff76a63f0 in cm_completion_handler (arg=0x0) at ../src/mpid/ch3/channels/common/src/cm/cm.c:1972
#4  0x000000399d8079d1 in start_thread () from /lib64/libpthread.so.0
#5  0x000000399cce88fd in clone () from /lib64/libc.so.6


Backtrace taken when the server gets the SIGSEGV:

Thread 1/4
#0  0x00007ffff76512e7 in MPIDI_CH3I_MRAIL_Parse_header (vc=0x613350, v=0x9fb040, pkt=0x7fffffffcf30, header_size=0x7fffffffcf3c) at ../src/mpid/ch3/channels/mrail/src/gen2/ibv_recv.c:382
#1  0x00007ffff7626ac8 in handle_read_individual (vc=0x613350, buffer=0x9fb040, header_type=0x7fffffffcf94) at ../src/mpid/ch3/channels/mrail/src/rdma/ch3_progress.c:1190
#2  0x00007ffff7626997 in handle_read (vc=0x613350, buffer=0x9fb040) at ../src/mpid/ch3/channels/mrail/src/rdma/ch3_progress.c:1132
#3  0x00007ffff76246c2 in MPIDI_CH3I_Progress (is_blocking=1, state=0x7fffffffd080) at ../src/mpid/ch3/channels/mrail/src/rdma/ch3_progress.c:290
#4  0x00007ffff75c74af in MPIDI_Create_inter_root_communicator_accept (port_name=0x6070c4 "tag#0$description#\"#RANK:00000000(00000024:00200055:00000001:00000000)#\"$", comm_pptr=0x7fffffffd158, vc_pptr=0x7fffffffd150)
    at ../src/mpid/ch3/src/ch3u_port.c:211
#5  0x00007ffff75c9664 in MPIDI_Comm_accept (port_name=0x6070c4 "tag#0$description#\"#RANK:00000000(00000024:00200055:00000001:00000000)#\"$", info=0x0, root=0, comm_ptr=0x7ffff7a948a0, newcomm=0x7fffffffd2c0)
    at ../src/mpid/ch3/src/ch3u_port.c:985
#6  0x00007ffff7609be7 in MPID_Comm_accept (port_name=0x6070c4 "tag#0$description#\"#RANK:00000000(00000024:00200055:00000001:00000000)#\"$", info=0x0, root=0, comm=0x7ffff7a948a0, newcomm_ptr=0x7fffffffd2c0)
    at ../src/mpid/ch3/src/mpid_port.c:150
#7  0x00007ffff752f592 in MPIR_Comm_accept_impl (port_name=0x6070c4 "tag#0$description#\"#RANK:00000000(00000024:00200055:00000001:00000000)#\"$", info_ptr=0x0, root=0, comm_ptr=0x7ffff7a948a0, newcomm_ptr=0x7fffffffd2c0)
    at ../src/mpi/spawn/comm_accept.c:36
#8  0x00007ffff752f91d in PMPI_Comm_accept (port_name=0x6070c4 "tag#0$description#\"#RANK:00000000(00000024:00200055:00000001:00000000)#\"$", info=469762048, root=0, comm=1140850688, newcomm=0x7fffffffd308)
    at ../src/mpi/spawn/comm_accept.c:112
#9  0x0000000000402937 in daemon_wait_for_connection(daemon_data_struct*) ()
#10 0x0000000000402a09 in main ()


==============================================================================
Results from Debug-Output (-DDEBUG)
==============================================================================

MV2_DEBUG_CHM_VERBOSE=9 MV2_SUPPORT_DPM=1 PORT_FILE="Server_port.txt" mpi-daemon
[0][../src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:258] Active port number = 1, state = Active, lid = 36
[0][../src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:258] Active port number = 1, state = Active, lid = 36
[0][../src/mpid/ch3/channels/mrail/src/gen2/vbuf.c:377] Allocating a new vbuf region.
[0][../src/mpid/ch3/channels/mrail/src/gen2/vbuf.c:446] VBUF REGION ALLOCATION SZ 80 TOT 80 FREE 80 NF 0 NG 0
[0][../src/mpid/ch3/channels/mrail/src/gen2/vbuf.c:377] Allocating a new vbuf region.
[0][../src/mpid/ch3/channels/mrail/src/gen2/vbuf.c:446] VBUF REGION ALLOCATION SZ 80 TOT 80 FREE 80 NF 0 NG 0
[0][../src/mpid/ch3/channels/mrail/src/gen2/vbuf.c:377] Allocating a new vbuf region.
[0][../src/mpid/ch3/channels/mrail/src/gen2/vbuf.c:446] VBUF REGION ALLOCATION SZ 80 TOT 80 FREE 80 NF 0 NG 0
[0][../src/mpid/ch3/channels/mrail/src/gen2/vbuf.c:377] Allocating a new vbuf region.
[0][../src/mpid/ch3/channels/mrail/src/gen2/vbuf.c:446] VBUF REGION ALLOCATION SZ 80 TOT 80 FREE 80 NF 0 NG 0
[0][../src/mpid/ch3/channels/mrail/src/gen2/ibv_send.c:618] Posted 80 buffers to SRQ
[0][../src/mpid/ch3/channels/mrail/src/gen2/ibv_priv.c:43] register return mr 0x104b620, buf 0x1029000, len 4096
[0][../src/mpid/ch3/channels/mrail/src/gen2/ibv_priv.c:43] register return mr 0x1160fa0, buf 0x145a000, len 4096
[0][../src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_1sc.c:1293] rank[0] : post flag start before exchange is 0x145b0e8
[0][../src/mpid/ch3/channels/mrail/src/gen2/ibv_priv.c:43] register return mr 0x12e76d0, buf 0x145b000, len 4096
[0][../src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_1sc.c:1313] the rank [0] post_flag rkey before exchange is 3801483b
[0][../src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_1sc.c:1457]  rank is 0 remote rank 0,  post flag addr is 0x145b0e8
================================================================================
== Ready for Connection ========================================================

[0][../src/mpid/ch3/channels/mrail/src/rdma/ch3_progress.c:193] Entering ch3 progress
[0][../src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:1555] Cmanager total channel 1, local polling 0
[tauruslogin1:mpi_rank_0][handle_cqe] Received from rank:0 seqnum :0 ack:65535 size:32 type:26 trasport :2 
[tauruslogin1:mpi_rank_0][handle_cqe] [channel manager] get one with exact seqnum
[0][../src/mpid/ch3/channels/mrail/src/rdma/ch3_read_progress.c:98] Get one packet with exact seq num
[0][../src/mpid/ch3/channels/mrail/src/rdma/ch3_progress.c:1120] [handle read] buffer 0x1429040
[0][../src/mpid/ch3/channels/mrail/src/rdma/ch3_progress.c:1188] [handle read] pheader: 0x12e9000
[0][../src/mpid/ch3/channels/mrail/src/gen2/ibv_recv.c:65] [parse header] vbuf address 0x1429040
[0][../src/mpid/ch3/channels/mrail/src/gen2/ibv_recv.c:68] [parse header] header type 26
[0][../src/mpid/ch3/channels/mrail/src/gen2/ibv_recv.c:380] Before set credit, vc: 0x1041ac8, v->rail: 0, pkt: 0x7fffd5fe7090, pheader: 0x12e9000
[tauruslogin1:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11)
[3]    24115 segmentation fault (core dumped)  MV2_DEBUG_CHM_VERBOSE=9 MV2_SUPPORT_DPM=1 PORT_FILE="Server_port.txt" 


