[mvapich-discuss] ictest.c hangs with np > 2

Mark Debbage markdebbage at pathscale.com
Thu Jun 22 18:32:33 EDT 2006


I get an MPI hang when using MVAPICH-0.9.7 and OpenIB on PathScale
adapters when I run the following test with more than 2 processes:

  examples/test/context/ictest.c

The MVAPICH configuration comes from make.mvapich.gen2 with no
significant changes using the settings for SDR and PCI-Express. I think
that the underlying problem comes from the MPICH source code base and is
exposed by the extra code in MVAPICH in the function comm_exch_addr. I
believe that this will always happen when either VIADEV_RPUT_SUPPORT or
VIADEV_RGET_SUPPORT is defined (and multi-rail is not enabled).

My question is: do you see a failure on ictest when run with more than
2 processes?

I debugged the hang quite extensively and came to the following
understanding. This looks like an underlying MPICH bug that is exposed
by MVAPICH. The hang occurs at this line in ictest.c:

    MPI_Intercomm_merge ( mySecondComm, 1, &merge4 );

The value of high is 1 (true) for all the callers. The
inter/intra-communicator mechanisms go awry in MPIR_Intercomm_high,
leading to its MPI_Bcast failing silently because it has been given an
inter-comm, and broadcast is not implemented for inter-comms. I think
that the wrong communicator is being used here, or more likely that the
communicator has the wrong value for collops, and that this is a
long-standing coding error in MPICH.

The failure of this broadcast means that the value of high is not
uniform across the intra-communicator. Because of this, later in
MPIR_Comm_make_coll one of the 4 processes (the non-leader in the
"lower" intra-comm) ends up with a bogus idea of the rank labeling. I
end up with the following values:

  world rank        high       local rank   lrank_to_grank array
  0                 0          0            [0, 2, 1, 3]
  1                 1          2            [0, 2, 1, 3]
  2                 1          3            [1, 3, 0, 2]
  3                 1          3            [0, 2, 1, 3]

The row for world rank 2 is wrong because that process did not receive
the broadcast value for high of 0 from its intra-comm partner with
world rank 0 (its local leader). This leads to the wrong local rank and
the wrong lrank_to_grank mapping. I can see this behavior both with
MVAPICH and with regular MPICH.

However, this doesn't matter in regular MPICH, as that information
isn't then used (or even checked, it would appear). In MVAPICH, by
contrast, there is extra code that calls comm_exch_addr, which relies
on the labeling being consistent across the inter-comm, and the
mismatch leads to a hang in MPI_Sendrecv. Essentially this process is
trying to send/recv with the wrong partner.

I'm not convinced that this test case is particularly important, but I'd
like to understand the failure mode. If I am right, you should see the
same failure too.

Thanks for your time,

Mark.


