[mvapich-discuss] Eager Sending Problem Bugreport

Balint Joo bjoo at jlab.org
Sun Apr 23 11:20:14 EDT 2006


Dear MVAPICH User Community and Maintainers,
	Recently we have encountered a problem in MVAPICH version 0.9.7
downloaded about 2 weeks ago. We report on this difficulty below. Apologies
for it turning into such a long message, but we wanted to give as full
and complete a characterisation as possible.
	
	While we have a workaround for our problem, we would prefer it
if it were solved properly. Hopefully we have provided enough information
below so that the developers/maintainers may suggest a fix.

	I would be happy to interact with folks to help fix this problem
if I can.

	With best wishes and thanks,
		Balint


Bugreport:
==========

Our setup is compiled for a single-rail IB Gen2. We are running MVAPICH on a
cluster of dual-core Pentium (Dell) machines with SDR PCI Express HCAs that
have no on-card memory (if I understand correctly). Our customised
make.mvapich.gen2 script is attached.


Our problem:
--------------------------

We have a routine where we need to read data from disk and scatter it amongst
our nodes. Our code does this by initiating multiple point-to-point sends:
essentially, a small chunk of data is read from the disk and dispatched
to its receiving node. This way we distribute a lattice's worth of data.
Each communicated chunk corresponds to one lattice site. Our lattice
typically has dimensions 20x20x20x64, resulting in a large number of
very short sends. We find that, depending on the number of nodes used
and the overall lattice size, we can get segmentation faults from
this procedure. (It is debatable whether or not this method of scattering
is wise; nonetheless, our code calls a library routine to do its
I/O which has this behaviour, so rewriting the code for better
scatters is not the most immediate option.)

Why we believe the problem may be in mvapich:
------------------------------------------------

Since we operate a complex code of many layers, it took us some time
to convince ourselves that the problem is in MVAPICH.
We believe this to be the case because the same source code did not
produce errors when compiled on other architectures, including an
older version of MVAPICH that shipped with the Mellanox IBGD-1.8.0
distribution. Further, we found that on a single node, running the job
with mvapich-0.9.7 as two separate local processes with SMP communications
only resulted in no segmentation faults. In other words, we hit the bug
only when running mvapich-0.9.7 in a mode where there were
inter-box (actual InfiniBand) communications.

Workaround
-----------

We find that this problem can be worked around by either:
	i) commenting out the eager sending code in mpid_send.c, or
	ii) throttling down the eager send threshold so that
	no actual eager sends are done, by setting the environment
	variable VIADEV_RENDEZVOUS_THRESHOLD to a size smaller
	than our smallest expected packet.

However, we worry that disabling eager sends may have a
big performance impact on global sums and other collective
operations which do use very short messages.


More details of our bug-track:
------------------------------

We have compiled the mvapich distribution with debugging flags
enabled (in CFLAGS) and could trap the segmentation fault in a debugger.
The debugger output from node 0 of our job is below:

------------------Begin debug output ---------------------------------
Ooops ibv_reg_mr failed
Unable to get memory
[New Thread -1242162256 (LWP 18214)]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread -1208658240 (LWP 18211)]
0x08473cb3 in vbuf_init_send (v=0x8afb7c0, len=236) at vbuf.c:396
396         v->desc.sg_entry.lkey = v->region->mem_handle->lkey;
Current language:  auto; currently c
(gdb) print v->region->mem_handle
$1 = (struct ibv_mr *) 0x0
(gdb) up
#1  0x0847ffe3 in MPID_VIA_eager_send (buf=0x85a3fa0, len=192, src_lrank=0,
      tag=11, context_id=4, dest_grank=1, s=0x8594e28) at viasend.c:520
520         vbuf_init_send(v, (sizeof(viadev_packet_eager_start) +
(gdb) up
#2  0x084742a3 in MPID_IsendContig (comm_ptr=0x85a7448, buf=0x85a3fa0,
      len=192, src_lrank=0, tag=11, context_id=4, dest_grank=1,
      msgrep=MPID_MSGREP_RECEIVER, request=0x8594e28, error_code=0xbfa9fc60)
      at mpid_send.c:148
148                 rc = MPID_VIA_eager_send(buf, len, src_lrank,
(gdb)
------------------------End Debug output --------------------------------

Our segmentation fault arises from trying to dereference the null pointer
v->region->mem_handle

The first of the two messages printed before the crash:
Ooops ibv_reg_mr failed

comes from viapriv.c where I have inserted debugging code into the
register_memory() routine:

struct ibv_mr* register_memory(void* buf, int len)
{
      struct ibv_mr* ret_val = (struct ibv_mr*)NULL;
      ret_val = ibv_reg_mr(viadev.ptag, buf, len,
              IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE |
              IBV_ACCESS_REMOTE_READ);
      if( ret_val == NULL ) {
          printf("Ooops ibv_reg_mr failed\n");
      }

      return ret_val;
#if 0
      /* Original code */
      return (ibv_reg_mr(viadev.ptag, buf, len,
              IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE |
              IBV_ACCESS_REMOTE_READ));
#endif
}

This particular problem was only triggered in the case where we tried
to scatter several very short pieces of data, and we have written
a piece of code to try to recreate the segmentation fault directly
with MPI calls (rather than going through layers and layers of our own
software). I attach this code to the email (file eager.c).

On our system the attached code reproduces the segmentation fault when the
number of short messages becomes large (e.g. 512000). We know from the
debug information above that the problem is with sending eager packets.
Also, if we set VIADEV_EAGER_THRESHOLD to be less than or equal to the
message size (in this case <= 288), then only rendezvous sends are done
and the segmentation fault does not occur. However, setting the threshold
to 289 (just one byte larger than our message) trips the segmentation
fault on our system.

This is currently all our diagnostic information. We suspect that
when sending so many eager packets we may run out of memory
somewhere, though we don't know exactly where. We were wondering whether
it is because our cards don't have any memory on them. Also,
we find that for PCI Express it seems always OK to send eager
packets (from viasend.c):

int viadev_eager_ok(int len, viadev_connection_t * c)
{
#ifdef SRQ
      return 1; /* Always OK to send with SRQ */
#else
      int nvbufs;
      nvbufs = viadev_calculate_vbufs_expected(len, VIADEV_PROTOCOL_EAGER);

      if (c->remote_credit - nvbufs >= viadev_credit_preserve) {
          return 1;
      }

      return 0;
#endif
}

In our case SRQ should be defined since we are using the MEMORY_SCALE option.
I attach our customised make.mvapich.gen2 file to show exactly our build
customisations.

------------ End bugreport --------------------------------------------------

-- 
-------------------------------------------------------------------
Dr Balint Joo              High Performance Computational Scientist
Jefferson Lab
12000 Jefferson Ave, Mail Stop 12B2, Room F217,
Newport News, VA 23606, USA
Tel: +1-757-269-5339,      Fax: +1-757-269-5427
email: bjoo at jlab.org       (old email: bj at ph.ed.ac.uk)
-------------------------------------------------------------------
-------------- next part --------------
A non-text attachment was scrubbed...
Name: eager.c
Type: text/x-csrc
Size: 2956 bytes
Desc: not available
Url : http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20060423/5f6e7a83/eager.bin
-------------- next part --------------
#!/bin/bash

source ./make.mvapich.def
arch

# Mandatory variables.  All are checked except CXX and F90.
IBHOME=/usr/local/ibg2
IBHOME_LIB=/usr/local/ibg2/lib
PREFIX=/home/bjoo/qdp++/install/mvapich-0.9.7
export CC=gcc
export CXX=g++
export F77=g77
export F90=

if [ $ARCH = "SOLARIS" ]; then
    die_setup "MVAPICH GEN2 is not supported on Solaris."
elif [ $ARCH = "MAC_OSX" ]; then
    die_setup "MVAPICH GEN2 is not supported on MacOS."
fi

# Check mandatory variable settings.
if [ -z $IBHOME ] || [ -z $PREFIX ] || [ -z $CC ] || [ -z $F77 ]; then
    die_setup "Please set mandatory variables in this script."
elif [ ! -d $IBHOME ]; then
    die_setup "IBHOME directory $IBHOME does not exist."
fi

# Optional variables.  Most of these are prompted for if not set.
#

# I/O Bus type.
# Supported: "_PCI_X_" and "_PCI_EX_"

if [ $ARCH = "_PPC64_" ]; then
IO_BUS="_PCI_X_"
else
IO_BUS=
fi

if [ -z "$IO_BUS" ]; then
    prompt_io_bus
fi

# Link speed rate.
# Supported: "_DDR_" and "_SDR_" (PCI-Express)
#            "_SDR_" (PCI-X)
if [ $ARCH = "_PPC64_" ]; then
		LINKS="_SDR_"
else
		LINKS=
fi


if [ -z "$LINKS" ]; then
    prompt_link
fi

# Whether to use an optimized queue pair exchange scheme.  This is not
# checked for a setting in the script.  It must be set here explicitly.
# Supported: "-DUSE_MPD_RING", "-DUSE_MPD_BASIC" and "" (to disable)
HAVE_MPD_RING=""

# Set this to override the automatic optimization setting (-O3).
OPT_FLAG=

if [ -z $OPT_FLAG ]; then
    OPT_FLAG="-g "
fi

export LIBS="-L${IBHOME_LIB} -libverbs -lpthread"
export FFLAGS="-L${IBHOME_LIB}"
export CFLAGS="-D${ARCH} -DMEMORY_SCALE -DMEMORY_RELIABLE\
               -DVIADEV_RPUT_SUPPORT -DCH_GEN2 -D_SMP_ -D_SMP_RNDV_ \
               $SUPPRESS -D${IO_BUS} -D${LINKS} \
               ${HAVE_MPD_RING} -I${IBHOME}/include $OPT_FLAG"

# Prelogue
make distclean &>/dev/null

# Configure MVAPICH

echo "Configuring MVAPICH..."

./configure --with-device=ch_gen2 --with-arch=LINUX -prefix=${PREFIX} \
	--without-romio --without-mpe \
   	-lib="-L${IBHOME_LIB} -Wl,-rpath=${IBHOME_LIB} -libverbs -lpthread" \
    	2>&1 |tee config-mine.log
ret=$?
test $ret = 0 ||  die "configuration."

# Build MVAPICH
echo "Building MVAPICH..."
make 2>&1 |tee make-mine.log 
ret=$?
test $ret = 0 ||  die "building MVAPICH."

# Install MVAPICH
echo "MVAPICH installation..."
rm -f install-mine.log 
make install 2>&1 |tee install-mine.log
ret=$?
test $ret = 0 ||  die "installing MVAPICH."


