[mvapich-discuss] unrecognized protocol for send/recv over 8KB (fwd)

Brian Budge brian.budge at gmail.com
Tue Jan 8 13:41:27 EST 2008


Hmmm, this is a PCI-Express setup.  Are there some variables I should be
tweaking?

Thanks,
  Brian

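A minimal sketch of the kind of runtime tuning that gets tried here, assuming MVAPICH2's documented MV2_* environment variables (the names below come from the MVAPICH2 user guide, not from this thread, so verify them against the guide for your release):

```shell
# Sketch, not from the thread: run the OSU bandwidth test with a larger
# eager-protocol threshold and vbuf size. MV2_IBA_EAGER_THRESHOLD and
# MV2_VBUF_TOTAL_SIZE are documented MVAPICH2 runtime parameters; exact
# names and defaults vary by release.
MV2_IBA_EAGER_THRESHOLD=16384 \
MV2_VBUF_TOTAL_SIZE=16384 \
mpirun -np 2 ./osu_bw
```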
On Jan 8, 2008 10:36 AM, Matthew Koop <koop at cse.ohio-state.edu> wrote:

> Brian,
>
> Good to hear that the microbenchmarks are working now. Whether the numbers
> you have are good or not is dependent on the platform. Is this a PCI-X or
> PCI-Express card? You can expect 900 MB/sec for SDR PCI-Express.
>
> Matt
>
> On Tue, 8 Jan 2008, Brian Budge wrote:
>
> > Hi Matt -
> >
> > ibv_rc_pingpong worked, and I decided to try a new clean install, and it
> > seems to be working quite a bit better now.  I must have somehow added some
> > nasty stuff to the Makefile during my previous attempts.
> >
> > Here is the output:
> >
> > # OSU MPI Bandwidth Test v3.0
> > # Size        Bandwidth (MB/s)
> > 1                         1.18
> > 2                         2.59
> > 4                         4.92
> > 8                        10.38
> > 16                       20.31
> > 32                       40.12
> > 64                       77.14
> > 128                     144.37
> > 256                     241.72
> > 512                     362.12
> > 1024                    471.01
> > 2048                    546.45
> > 4096                    581.47
> > 8192                    600.65
> > 16384                   611.52
> > 32768                   632.87
> > 65536                   642.27
> > 131072                  646.30
> > 262144                  644.22
> > 524288                  644.15
> > 1048576                 649.36
> > 2097152                 662.55
> > 4194304                 672.55
> >
> > How do these numbers look for a 10 Gb SDR HCA?
> >
> > Thanks for your help!
> >   Brian
> >
> > On Jan 7, 2008 5:12 PM, Matthew Koop <koop at cse.ohio-state.edu> wrote:
> >
> > > Brian,
> > >
> > > Can you try the ibv_rc_pingpong program, which is a low-level (non-MPI)
> > > test that ships with OFED? This will make sure that your basic InfiniBand
> > > setup is working properly.
> > >
> > > Did any other error message print out other than the one you gave?
> > >
> > > Matt
> > >
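The low-level check Matt suggests is typically a two-node invocation (a sketch; "burn" and "burn-3" are the hostnames appearing elsewhere in this thread, and the available options vary by OFED release):

```shell
# On one node (e.g. burn), start the server side; it waits for a peer:
ibv_rc_pingpong

# On the other node (e.g. burn-3), connect to the server by hostname:
ibv_rc_pingpong burn
```

If the reliable-connection path works, both sides print their local and remote address details (LID, QPN, PSN) followed by a bandwidth and latency summary.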
> > > On Mon, 7 Jan 2008, Brian Budge wrote:
> > >
> > > > Hi Matt -
> > > >
> > > > I have now done the install from the ofa build file, and I can boot and
> > > > run the ring test, but now when I run the osu_bw.c benchmark, the
> > > > executable dies in MPI_Init().
> > > >
> > > > The things I altered in make.mvapich2.ofa were:
> > > >
> > > > OPEN_IB_HOME=${OPEN_IB_HOME:-/usr}
> > > > SHARED_LIBS=${SHARED_LIBS:-yes}
> > > >
> > > > and on the configure line I added:
> > > >  --disable-f77 --disable-f90
> > > >
> > > > Here is the error message that I am getting:
> > > >
> > > > rank 1 in job 1  burn_60139   caused collective abort of all ranks
> > > >   exit status of rank 1: killed by signal 9
> > > >
> > > > Thanks,
> > > >   Brian
> > > >
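Pulled together, the build Brian describes amounts to the following recipe (a sketch assembled from this thread; the OPEN_IB_HOME value assumes the OFED libraries are installed under /usr):

```shell
# Inside make.mvapich2.ofa, the defaults Brian reports overriding:
OPEN_IB_HOME=${OPEN_IB_HOME:-/usr}   # where libibverbs etc. live
SHARED_LIBS=${SHARED_LIBS:-yes}      # build shared libraries

# ...and the flags appended to the script's configure line
# (Fortran bindings disabled):
#   --disable-f77 --disable-f90

# The script is then run from the top of the MVAPICH2 source tree and
# calls configure and make itself:
./make.mvapich2.ofa
```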
> > > > On Jan 7, 2008 1:21 PM, Matthew Koop <koop at cse.ohio-state.edu> wrote:
> > > >
> > > > > Brian,
> > > > >
> > > > > The make.mvapich2.detect script is just a helper script (not meant
> > > > > to be executed directly). You need to use the make.mvapich2.ofa
> > > > > script, which will call configure and make for you with the correct
> > > > > arguments.
> > > > >
> > > > > More information can be found in our MVAPICH2 user guide under
> > > > > "4.4.1 Build MVAPICH2 with OpenFabrics Gen2-IB and iWARP"
> > > > >
> > > > > https://mvapich.cse.ohio-state.edu/support/
> > > > >
> > > > > Let us know if you have any other problems.
> > > > >
> > > > > Matt
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Mon, 7 Jan 2008, Brian Budge wrote:
> > > > >
> > > > > > Hi Wei -
> > > > > >
> > > > > > I changed from SMALL_CLUSTER to MEDIUM_CLUSTER, but it made no
> > > > > difference.
> > > > > >
> > > > > > When I build with rdma, this adds the following:
> > > > > >         export LIBS="${LIBS} -lrdmacm"
> > > > > >         export CFLAGS="${CFLAGS} -DADAPTIVE_RDMA_FAST_PATH
> > > -DRDMA_CM"
> > > > > >
> > > > > > It seems that I am using the make.mvapich2.detect script to
> > > > > > build.  It asks me for my interface, and gives me the option for
> > > > > > the Mellanox interface, which I choose.
> > > > > >
> > > > > > I just tried a fresh install directly from the tarball instead of
> > > > > > using the gentoo package.  Now the program completes (goes beyond
> > > > > > the 8K message size), but my bandwidth isn't very good.  Running
> > > > > > the osu_bw.c test, I get about 250 MB/s maximum.  It seems like IB
> > > > > > isn't being used.
> > > > > >
> > > > > > I did the following:
> > > > > > ./make.mvapich2.detect #, and chose the mellanox option
> > > > > > ./configure --enable-threads=multiple
> > > > > > make
> > > > > > make install
> > > > > >
> > > > > > So it seems that the package is doing something to enable
> > > > > > InfiniBand that I am not doing with the tarball.  Conversely, the
> > > > > > tarball can run without crashing.
> > > > > >
> > > > > > Advice?
> > > > > >
> > > > > > Thanks,
> > > > > >   Brian
> > > > > >
> > > > > > On Jan 6, 2008 6:38 AM, wei huang <huanwei at cse.ohio-state.edu> wrote:
> > > > > >
> > > > > > > Hi Brian,
> > > > > > >
> > > > > > > > I am using the openib-mvapich2-1.0.1 package in the
> > > > > > > > gentoo-science overlay in addition to the standard gentoo
> > > > > > > > packages.  I have also tried 1.0 with the same results.
> > > > > > > >
> > > > > > > > I compiled with multithreading turned on (I haven't tried
> > > > > > > > without this, but the sample codes I am initially testing are
> > > > > > > > not multithreaded, although my application is).  I also tried
> > > > > > > > with and without rdma, with no change.  The script seems to be
> > > > > > > > setting the build for SMALL_CLUSTER.
> > > > > > >
> > > > > > > So you are using make.mvapich2.ofa to compile the package? I am
> > > > > > > a bit confused about ''I also tried with and without rdma, with
> > > > > > > no change''. What exact change did you make here? Also,
> > > > > > > SMALL_CLUSTER is obsolete for the ofa stack...
> > > > > > >
> > > > > > > -- Wei
> > > > > > >
> > > > > > > >
> > > > > > > > Let me know what other information would be useful.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > >   Brian
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Jan 4, 2008 6:12 PM, wei huang <huanwei at cse.ohio-state.edu> wrote:
> > > > > > > >
> > > > > > > > > Hi Brian,
> > > > > > > > >
> > > > > > > > > Thanks for letting us know about this problem. Would you
> > > > > > > > > please let us know some more details to help us locate the
> > > > > > > > > issue.
> > > > > > > > >
> > > > > > > > > 1) More details on your platform.
> > > > > > > > >
> > > > > > > > > 2) Exact version of mvapich2 you are using. Is it from the
> > > > > > > > > OFED package, or some version from our website?
> > > > > > > > >
> > > > > > > > > 3) If it is from our website, did you change anything from
> > > > > > > > > the default compiling scripts?
> > > > > > > > >
> > > > > > > > > Thanks.
> > > > > > > > >
> > > > > > > > > -- Wei
> > > > > > > > > > I'm new to the list here... hi!  I have been using OpenMPI
> > > > > > > > > > for a while, and LAM before that, but new requirements keep
> > > > > > > > > > pushing me to new implementations.  In particular, I was
> > > > > > > > > > interested in using InfiniBand (using OFED 1.2.5.1) in a
> > > > > > > > > > multi-threaded environment.  It seems that MVAPICH is the
> > > > > > > > > > library for that particular combination :)
> > > > > > > > > >
> > > > > > > > > > In any case, I installed MVAPICH, and I can boot the
> > > > > > > > > > daemons and run the ring speed test with no problems.  When
> > > > > > > > > > I run any programs with mpirun, however, I get an error
> > > > > > > > > > when sending or receiving more than 8192 bytes.
> > > > > > > > > >
> > > > > > > > > > For example, if I run the bandwidth test from the
> > > > > > > > > > benchmarks page (osu_bw.c), I get the following:
> > > > > > > > > > ---------------------------------------------------------------
> > > > > > > > > > budge at burn:~/tests/testMvapich2> mpirun -np 2 ./a.out
> > > > > > > > > > Thursday 06:16:00
> > > > > > > > > > burn
> > > > > > > > > > burn-3
> > > > > > > > > > # OSU MPI Bandwidth Test v3.0
> > > > > > > > > > # Size        Bandwidth (MB/s)
> > > > > > > > > > 1                         1.24
> > > > > > > > > > 2                         2.72
> > > > > > > > > > 4                         5.44
> > > > > > > > > > 8                        10.18
> > > > > > > > > > 16                       19.09
> > > > > > > > > > 32                       29.69
> > > > > > > > > > 64                       65.01
> > > > > > > > > > 128                     147.31
> > > > > > > > > > 256                     244.61
> > > > > > > > > > 512                     354.32
> > > > > > > > > > 1024                    367.91
> > > > > > > > > > 2048                    451.96
> > > > > > > > > > 4096                    550.66
> > > > > > > > > > 8192                    598.35
> > > > > > > > > > [1][ch3_rndvtransfer.c:112] Unknown protocol 0 type from
> > > > > > > > > > rndv req to send
> > > > > > > > > > Internal Error: invalid error code ffffffff (Ring Index
> > > > > > > > > > out of range) in MPIDI_CH3_RndvSend:263
> > > > > > > > > > Fatal error in MPI_Waitall:
> > > > > > > > > > Other MPI error, error stack:
> > > > > > > > > > MPI_Waitall(242): MPI_Waitall(count=64, req_array=0xdb21a0,
> > > > > > > > > > status_array=0xdb3140) failed
> > > > > > > > > > (unknown)(): Other MPI error
> > > > > > > > > > rank 1 in job 4  burn_37156   caused collective abort of
> > > > > > > > > > all ranks
> > > > > > > > > >   exit status of rank 1: killed by signal 9
> > > > > > > > > > ---------------------------------------------------------------
> > > > > > > > > >
> > > > > > > > > > I get a similar problem with the latency test; however,
> > > > > > > > > > the protocol complained about is different:
> > > > > > > > > >
> > > > > > > > > > --------------------------------------------------------------------
> > > > > > > > > > budge at burn:~/tests/testMvapich2> mpirun -np 2 ./a.out
> > > > > > > > > > Thursday 09:21:20
> > > > > > > > > > # OSU MPI Latency Test v3.0
> > > > > > > > > > # Size            Latency (us)
> > > > > > > > > > 0                         3.93
> > > > > > > > > > 1                         4.07
> > > > > > > > > > 2                         4.06
> > > > > > > > > > 4                         3.82
> > > > > > > > > > 8                         3.98
> > > > > > > > > > 16                        4.03
> > > > > > > > > > 32                        4.00
> > > > > > > > > > 64                        4.28
> > > > > > > > > > 128                       5.22
> > > > > > > > > > 256                       5.88
> > > > > > > > > > 512                       8.65
> > > > > > > > > > 1024                      9.11
> > > > > > > > > > 2048                     11.53
> > > > > > > > > > 4096                     16.17
> > > > > > > > > > 8192                     25.67
> > > > > > > > > > [1][ch3_rndvtransfer.c:112] Unknown protocol 8126589 type
> > > > > > > > > > from rndv req to send
> > > > > > > > > > Internal Error: invalid error code ffffffff (Ring Index
> > > > > > > > > > out of range) in MPIDI_CH3_RndvSend:263
> > > > > > > > > > Fatal error in MPI_Recv:
> > > > > > > > > > Other MPI error, error stack:
> > > > > > > > > > MPI_Recv(186): MPI_Recv(buf=0xa8ff80, count=16384, MPI_CHAR,
> > > > > > > > > > src=0, tag=1, MPI_COMM_WORLD, status=0x7fff14c7bde0) failed
> > > > > > > > > > (unknown)(): Other MPI error
> > > > > > > > > > rank 1 in job 5  burn_37156   caused collective abort of
> > > > > > > > > > all ranks
> > > > > > > > > >
> > > > > > > > > > --------------------------------------------------------------------
> > > > > > > > > >
> > > > > > > > > > The protocols (0 and 8126589) are consistent if I run the
> > > > > > > > > > program multiple times.
> > > > > > > > > >
> > > > > > > > > > Anyone have any ideas?  If you need more info, please let
> > > > > > > > > > me know.
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > >   Brian
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > >
> > >
> > >
> >
>
>
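For context on the figures in this thread (the ~900 MB/sec Matt cites and the ~650 MB/s Brian measures), the SDR ceiling follows from link arithmetic. This is a sketch; the 8b/10b encoding factor is general InfiniBand background rather than something stated above:

```shell
# SDR InfiniBand 4x: 4 lanes x 2.5 Gbit/s signalling = 10 Gbit/s on the wire.
signal_mbit=10000
# 8b/10b encoding carries 8 data bits per 10 signal bits:
data_mbit=$(( signal_mbit * 8 / 10 ))   # 8000 Mbit/s of data
# Convert to megabytes per second (8 bits per byte):
ceiling_mb=$(( data_mbit / 8 ))         # 1000 MB/s theoretical ceiling
echo "${ceiling_mb} MB/s"
```

Protocol and PCIe overhead bring the practical figure down to roughly the 900 MB/sec Matt mentions, which is why a steady ~650 MB/s points at a bottleneck elsewhere (PCIe link width or slot, BIOS settings, firmware) rather than the HCA's limit.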