[mvapich-discuss] Eager Sending Problem Bugreport

Sayantan Sur surs at cse.ohio-state.edu
Sun Apr 23 12:03:28 EDT 2006


Hello Balint,

Thanks for the detailed report! The test you attached (eager.c) works on
our cluster. However, we understand why you see this behavior on your
system. We are working on a fix; in the meantime, could you replace
-DMEMORY_SCALE in your make.mvapich.gen2 with -DADAPTIVE_RDMA_FAST_PATH?
You shouldn't see any performance impact, since this will also use the
eager path.
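
For concreteness, the CFLAGS line in your attached make.mvapich.gen2 would
then look something like this (only -DMEMORY_SCALE changes; everything else
stays as you have it):

    export CFLAGS="-D${ARCH} -DADAPTIVE_RDMA_FAST_PATH -DMEMORY_RELIABLE\
                   -DVIADEV_RPUT_SUPPORT -DCH_GEN2 -D_SMP_ -D_SMP_RNDV_ \
                   $SUPPRESS -D${IO_BUS} -D${LINKS} \
                   ${HAVE_MPD_RING} -I${IBHOME}/include $OPT_FLAG"

followed by a rebuild and reinstall with the same script.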

Please let us know if this works for you for the time being. We will get
back to you as soon as possible with the fix.

Thanks,
Sayantan. 

* On Apr 1, Balint Joo <bjoo at jlab.org> wrote:
> 
> Dear MVAPICH User Community and Maintainers,
> 	Recently we encountered a problem in MVAPICH version 0.9.7,
> downloaded about 2 weeks ago. We report on this difficulty below. Apologies
> for it turning into such a long message, but we wanted to give as full
> and complete a characterisation as possible.
> 
> 	While we have a workaround for our problem, we would prefer it
> if the underlying issue were solved. Hopefully we have provided enough
> information below so that the developers/maintainers can suggest a fix.
> 
> 	I would be happy to interact with folks to help fix this problem
> if I can.
> 
> 	With best wishes and thanks,
> 		Balint
> 
> 
> Bugreport:
> ==========
> 
> Our setup is compiled for single-rail IB Gen2. We are using MVAPICH on a
> cluster of dual-core Pentiums (Dell machines) with SDR PCI Express HCAs
> that have no on-card memory (if I understand correctly). Our customised
> make.mvapich.gen2 script is attached.
> 
> 
> Our problem:
> --------------------------
> 
> We have a routine that reads data from disk and scatters it amongst our
> nodes. Our code does this by initiating many point-to-point sends:
> essentially, a small chunk of data is read from disk and dispatched to
> its receiving node, and in this way we distribute a lattice worth of data.
> Each communicated chunk corresponds to one lattice site. Our lattice
> typically has dimensions 20x20x20x64, resulting in a large number of
> very short sends. We find that, depending on the number of nodes used
> and the overall lattice size, we can get segmentation faults from this
> procedure. (It is debatable whether this method of scattering is wise;
> nonetheless, our code calls a library routine to do its I/O which has
> this behaviour, so rewriting the code for better scatters is not the
> most immediate option.)
> 
> Why we believe the problem may be in mvapich:
> ------------------------------------------------
> 
> Since we operate a complex code of many layers, it took us some time to
> convince ourselves that the problem is in MVAPICH. We believe this to be
> the case because the same source code did not produce errors when compiled
> on other architectures, including with an older version of mvapich that
> shipped with the Mellanox IBGD-1.8.0 distribution.
> Further, we found that on a single node, running the job with mvapich-0.9.7
> as two separate local processes with SMP communications only produced no
> segmentation faults. In other words, we hit the bug only when running with
> mvapich-0.9.7 in a mode where there were inter-box (actual InfiniBand)
> communications.
> 
> Workaround
> -----------
> 
> We find that this problem can be avoided by either:
> 	i) commenting out the eager sending code in mpid_send.c, or
> 	ii) throttling down the eager send threshold so that
> 	no actual eager sends are done, by setting the environment
> 	variable VIADEV_RENDEZVOUS_THRESHOLD to a size smaller
> 	than our smallest expected packet (see the example below).
> 
> However, we worry that disabling eager sends may have a
> big performance impact on global sums and other collective
> operations, which do use very short messages.
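> 
> To illustrate workaround (ii), here is roughly how we launch with the
> threshold lowered (the launcher invocation and host names are only
> placeholders for our actual job script):
> 
>   # every message of 256 bytes or more now goes via rendezvous,
>   # so our 288-byte per-site messages are never sent eagerly
>   mpirun_rsh -np 2 node01 node02 VIADEV_RENDEZVOUS_THRESHOLD=256 ./my_app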
> 
> 
> More details of our bug-track:
> ------------------------------
> 
> We have compiled the mvapich distribution with debugging flags
> enabled (in CFLAGS) and could trap the segmentation fault in a debugger.
> The debugger output from node 0 of our job is below:
> 
> ------------------Begin debug output ---------------------------------
> Ooops ibv_reg_mr failed
> Unable to get memory
> [New Thread -1242162256 (LWP 18214)]
> 
> Program received signal SIGSEGV, Segmentation fault.
> [Switching to Thread -1208658240 (LWP 18211)]
> 0x08473cb3 in vbuf_init_send (v=0x8afb7c0, len=236) at vbuf.c:396
> 396         v->desc.sg_entry.lkey = v->region->mem_handle->lkey;
> Current language:  auto; currently c
> (gdb) print v->region->mem_handle
> $1 = (struct ibv_mr *) 0x0
> (gdb) up
> #1  0x0847ffe3 in MPID_VIA_eager_send (buf=0x85a3fa0, len=192, src_lrank=0,
>      tag=11, context_id=4, dest_grank=1, s=0x8594e28) at viasend.c:520
> 520         vbuf_init_send(v, (sizeof(viadev_packet_eager_start) +
> (gdb) up
> #2  0x084742a3 in MPID_IsendContig (comm_ptr=0x85a7448, buf=0x85a3fa0,
>      len=192, src_lrank=0, tag=11, context_id=4, dest_grank=1,
>      msgrep=MPID_MSGREP_RECEIVER, request=0x8594e28, error_code=0xbfa9fc60)
>      at mpid_send.c:148
> 148                 rc = MPID_VIA_eager_send(buf, len, src_lrank,
> (gdb)
> ------------------------End Debug output --------------------------------
> 
> Our segmentation fault arises from trying to dereference the null pointer
> v->region->mem_handle
> 
> The first of the two preceding messages,
> 
> Ooops ibv_reg_mr failed
> 
> comes from viapriv.c, where I have inserted debugging code into the
> register_memory() routine:
> 
> struct ibv_mr* register_memory(void* buf, int len)
> {
>      struct ibv_mr* ret_val = (struct ibv_mr*)NULL;
>      ret_val = ibv_reg_mr(viadev.ptag, buf, len,
>              IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE |
>              IBV_ACCESS_REMOTE_READ);
>      if( ret_val == NULL ) {
>          printf("Ooops ibv_reg_mr failed\n");
>      }
> 
>      return ret_val;
> #if 0
>      /* Original code */
>      return (ibv_reg_mr(viadev.ptag, buf, len,
>              IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE |
>              IBV_ACCESS_REMOTE_READ));
> #endif
> }
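> 
> Since ibv_reg_mr() is returning NULL here, one thing we still intend to
> check (just a guess on our part at this stage) is whether the processes
> are hitting the per-process locked-memory limit once this many buffers
> have been registered, e.g. on each node:
> 
>   # maximum locked (pinned) memory per process, in kbytes;
>   # InfiniBand memory registration counts against this limit
>   ulimit -l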
> 
> This particular problem was only triggered when we tried to scatter very
> many short pieces of data, so we have written a small piece of code to
> recreate the segmentation fault directly with MPI calls (rather than going
> through layers and layers of our own software). I attach this code to the
> email (file eager.c).
> 
> On our system the attached code reproduces the segmentation fault when the
> number of short messages becomes large (e.g. 512000). We know from the
> debug information above that the problem is with sending eager packets.
> Also, if we set VIADEV_EAGER_THRESHOLD to be less than or equal to the
> message size (in this case <= 288), then only rendezvous sends are done
> and the segmentation fault does not occur. However, setting the threshold
> to 289 (just one byte larger than our message) trips the segmentation
> fault on our system.
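> 
> A sketch of how we build and run the two cases (the compiler wrapper and
> launcher arguments below are placeholders for our actual invocation):
> 
>   mpicc -o eager eager.c
> 
>   # threshold <= message size (288 bytes): only rendezvous sends, no crash
>   mpirun_rsh -np 2 node01 node02 VIADEV_EAGER_THRESHOLD=288 ./eager
> 
>   # threshold one byte larger than the message: eager sends, segfault
>   mpirun_rsh -np 2 node01 node02 VIADEV_EAGER_THRESHOLD=289 ./eager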
> 
> This is currently all our diagnostic information. We suspect that when
> sending so many eager packets we may run out of memory somewhere, though
> we don't know exactly where; we were wondering whether it is because our
> cards don't have any memory on them. Also, we find that for PCI Express
> it seems it is always considered OK to send eager packets (from viasend.c):
> 
> int viadev_eager_ok(int len, viadev_connection_t * c)
> {
> #ifdef SRQ
>      return 1; /* Always OK to send with SRQ */
> #else
>      int nvbufs;
>      nvbufs = viadev_calculate_vbufs_expected(len, VIADEV_PROTOCOL_EAGER);
> 
>      if (c->remote_credit - nvbufs >= viadev_credit_preserve) {
>          return 1;
>      }
> 
>      return 0;
> #endif
> }
> 
> In our case SRQ should be defined, since we are using the MEMORY_SCALE
> option. I attach our customised make.mvapich.gen2 file to show exactly
> our build customisations.
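> 
> To double-check that SRQ really did end up defined in our build, we plan
> to grep the compile lines in the build log that the script tees out
> (make-mine.log); a sketch, assuming the flag is passed on the compile
> lines rather than defined in a header:
> 
>   # count lines in the build log that mention -DSRQ
>   grep -c -- '-DSRQ' make-mine.log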
> 
> ------------ End bugreport ----------------------------------------------
> 
> -- 
> -------------------------------------------------------------------
> Dr Balint Joo              High Performance Computational Scientist
> Jefferson Lab
> 12000 Jefferson Ave, Mail Stop 12B2, Room F217,
> Newport News, VA 23606, USA
> Tel: +1-757-269-5339,      Fax: +1-757-269-5427
> email: bjoo at jlab.org       (old email: bj at ph.ed.ac.uk)
> -------------------------------------------------------------------

> #include <mpi.h>
> #include <stdio.h>
> #include <stdlib.h>   /* for malloc() */
> 
> /* A large local lattice */
> #define NUM_SITES (512000)
> /* 4 SU3 matrices in single precision 4*3*3*2*4 */
> #define SITE_SIZE  (288)
> 
> typedef struct { 
>   char site[SITE_SIZE];
> } Site_t;
> 
> int main(int argc, char *argv[])
> {
>   int rc;
>   /* Site to simulate what "I just read from disk" */
>   Site_t sendbuf;
> 
>   /* My local lattice */
>   Site_t *recvbuf = NULL;
>   int size=0;
>   int rank=0;
>   int rank_neighbour=0;
>   int loop_counter=0;
> 
>   /* Datatype handle */
>   MPI_Datatype site_type;
> 
>   /* Sent request */
>   /* Recv request */
>   MPI_Request send_req;
>   MPI_Request recv_req;
>   MPI_Status status;
> 
>   rc = MPI_Init(&argc, &argv);
>   if( rc != MPI_SUCCESS ) { 
>     fprintf(stderr, "MPI_Init  failed\n");
>     MPI_Abort(MPI_COMM_WORLD, 1);
>   }
> 
>   rc = MPI_Comm_size(MPI_COMM_WORLD, &size);
>   if( rc != MPI_SUCCESS ) { 
>     fprintf(stderr, "MPI_Comm_size failed\n");
>     MPI_Abort(MPI_COMM_WORLD, 1);
>   }
> 
>   rc = MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>   if( rc != MPI_SUCCESS ) { 
>     fprintf(stderr, "MPI_Comm_rank failed\n");
>     MPI_Abort(MPI_COMM_WORLD, 1);
>   }
> 
>   recvbuf = (Site_t *)malloc(NUM_SITES*sizeof(Site_t));
>   if( recvbuf == (Site_t*) NULL) {
>     fprintf(stderr,"Couldnt get local lattice for sending\n");
>     MPI_Abort(MPI_COMM_WORLD, 1);
>   }
> 
>   if (size != 2) {    
>     fprintf(stderr, "This must be run only on 2 nodes\n");
>     MPI_Abort(MPI_COMM_WORLD, 1);
>   }
> 
>   /* Work out the rank of the neighbour */
>   rank_neighbour = rank == 0 ? 1 : 0;
>   printf("My Rank: %d\t \tMy Rank Neighbour's rank: %d\n", rank, rank_neighbour);
> 
> 
>   rc = MPI_Type_contiguous(sizeof(Site_t), MPI_BYTE, &site_type);
>   if( rc != MPI_SUCCESS ) { 
>     fprintf(stderr, "MPI_Type_contiguous failed\n");
>     MPI_Abort(MPI_COMM_WORLD, 1);
>   }
> 
>   /* The datatype must be committed before it is used in communication */
>   rc = MPI_Type_commit(&site_type);
>   if( rc != MPI_SUCCESS ) { 
>     fprintf(stderr, "MPI_Type_commit failed\n");
>     MPI_Abort(MPI_COMM_WORLD, 1);
>   }
> 
>   /* Now I want a loop simulating QMP sends - site by site to my neighbour */
>   for(loop_counter=0; loop_counter < NUM_SITES; loop_counter++) {
> 
>     if( rank == 0 ) { 
>     /* Node 0 sets up send to Node 1 */
>       rc = MPI_Send_init((void*) &sendbuf, 
> 			 1, 
> 			 site_type, 
> 			 1, 
> 			 loop_counter, 
> 			 MPI_COMM_WORLD, 
> 			 &send_req);
>     }
>     else { 
>       rc = MPI_Recv_init((void *) &(recvbuf[loop_counter]), 
> 			 1, 
> 			 site_type,
> 			 0, 
> 			 loop_counter, 
> 			 MPI_COMM_WORLD, 
> 			 &recv_req);
>     }
>     if( rc != MPI_SUCCESS ) { 
>       fprintf(stderr, "MPI comm req init failed: %d\n", rank);
>       MPI_Abort(MPI_COMM_WORLD, 1);
>     }
>     
>     if( rank == 1 ) { 
>       rc = MPI_Start(&recv_req);
>     }
>     else {
>       rc = MPI_Start(&send_req);
>     }
>     if( rc != MPI_SUCCESS ) { 
>       fprintf(stderr, "MPI_Start failed: %d\n",rank);
>       MPI_Abort(MPI_COMM_WORLD, 1);
>     }
> 
>     if( rank == 1 ) { 
>            rc = MPI_Wait(&recv_req,&status);
>     }
>     else {
>       rc = MPI_Wait(&send_req,&status);
>     }
> 
>     if( rc != MPI_SUCCESS ) { 
>       fprintf(stderr, "MPI_Start failed: %d\n",rank);
>       MPI_Abort(MPI_COMM_WORLD, 1);
>     }
>   }
> 
>   free(recvbuf);
>   MPI_Finalize();
> 
>   return 0;
> }
> 
> 

> #!/bin/bash
> 
> source ./make.mvapich.def
> arch
> 
> # Mandatory variables.  All are checked except CXX and F90.
> IBHOME=/usr/local/ibg2
> IBHOME_LIB=/usr/local/ibg2/lib
> PREFIX=/home/bjoo/qdp++/install/mvapich-0.9.7
> export CC=gcc
> export CXX=g++
> export F77=g77
> export F90=
> 
> if [ $ARCH = "SOLARIS" ]; then
>     die_setup "MVAPICH GEN2 is not supported on Solaris."
> elif [ $ARCH = "MAC_OSX" ]; then
>     die_setup "MVAPICH GEN2 is not supported on MacOS."
> fi
> 
> # Check mandatory variable settings.
> if [ -z $IBHOME ] || [ -z $PREFIX ] || [ -z $CC ] || [ -z $F77 ]; then
>     die_setup "Please set mandatory variables in this script."
> elif [ ! -d $IBHOME ]; then
>     die_setup "IBHOME directory $IBHOME does not exist."
> fi
> 
> # Optional variables.  Most of these are prompted for if not set.
> #
> 
> # I/O Bus type.
> # Supported: "_PCI_X_" and "_PCI_EX_"
> 
> if [ $ARCH = "_PPC64_" ]; then
> IO_BUS="_PCI_X_"
> else
> IO_BUS=
> fi
> 
> if [ -z "$IO_BUS" ]; then
>     prompt_io_bus
> fi
> 
> # Link speed rate.
> # Supported: "_DDR_" and "_SDR_" (PCI-Express)
> #            "_SDR_" (PCI-X)
> if [ $ARCH = "_PPC64_" ]; then
> 		LINKS="_SDR_"
> else
> 		LINKS=
> fi
> 
> 
> if [ -z "$LINKS" ]; then
>     prompt_link
> fi
> 
> # Whether to use an optimized queue pair exchange scheme.  This is not
> # checked for a setting in the script.  It must be set here explicitly.
> # Supported: "-DUSE_MPD_RING", "-DUSE_MPD_BASIC" and "" (to disable)
> HAVE_MPD_RING=""
> 
> # Set this to override automatic optimization setting (-O3).
> OPT_FLAG=
> 
> if [ -z $OPT_FLAG ]; then
>     OPT_FLAG="-g "
> fi
> 
> export LIBS="-L${IBHOME_LIB} -libverbs -lpthread"
> export FFLAGS="-L${IBHOME_LIB}"
> export CFLAGS="-D${ARCH} -DMEMORY_SCALE -DMEMORY_RELIABLE\
>                -DVIADEV_RPUT_SUPPORT -DCH_GEN2 -D_SMP_ -D_SMP_RNDV_ \
>                $SUPPRESS -D${IO_BUS} -D${LINKS} \
>                ${HAVE_MPD_RING} -I${IBHOME}/include $OPT_FLAG"
> 
> # Prelogue
> make distclean &>/dev/null
> 
> # Configure MVAPICH
> 
> echo "Configuring MVAPICH..."
> 
> ./configure --with-device=ch_gen2 --with-arch=LINUX -prefix=${PREFIX} \
> 	--without-romio --without-mpe \
>    	-lib="-L${IBHOME_LIB} -Wl,-rpath=${IBHOME_LIB} -libverbs -lpthread" \
>     	2>&1 |tee config-mine.log
> ret=$?
> test $ret = 0 ||  die "configuration."
> 
> # Build MVAPICH
> echo "Building MVAPICH..."
> make 2>&1 |tee make-mine.log 
> ret=$?
> test $ret = 0 ||  die "building MVAPICH."
> 
> # Install MVAPICH
> echo "MVAPICH installation..."
> rm -f install-mine.log 
> make install 2>&1 |tee install-mine.log
> ret=$?
> test $ret = 0 ||  die "installing MVAPICH."
> 


-- 
http://www.cse.ohio-state.edu/~surs

