[mvapich-discuss] OSU bw test hangs

Shan-ho Tsai shtsai at uga.edu
Thu Jan 26 15:46:23 EST 2012


Hello,
I compiled and installed mvapich2 1.7 using gcc 4.1.2,
gcc 4.4.4 and PGI 11.8 on a 64-bit Linux RHEL5.7 node.
Our InfiniBand hardware is from QLogic, and we use the default
OpenFabrics software distributed with RHEL5.7.

The steps used to build it were, e.g.

./configure --prefix=/usr/local/mvapich2/1.7/gcc444 --with-rdma=gen2 \
    --enable-f77 --enable-fc --enable-cxx --enable-shared \
    --enable-sharedlibs=gcc CC=gcc44 F77=gfortran44 FC=gfortran44 \
    CXX=g++44

(or with the compilers replaced by pgcc, pgCC, pgf77 and pgf90, etc)

make
make install

In each case there were no errors in the build. 

The OSU benchmark tests (osu_latency and osu_bw) work
fine within a node. Across 2 nodes, osu_latency still
works, but osu_bw hangs after printing

# OSU MPI Bandwidth Test v3.4
# Size        Bandwidth (MB/s)

The command used was

/usr/local/mvapich2/1.7-r5140/gcc444-gen2/bin/mpiexec -n 2 -f host /usr/local/mvapich2/1.7-r5140/gcc444-gen2/libexec/osu-micro-benchmarks/osu_bw 

where 'host' has two lines with the node names

nodeA
nodeB
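
To rule out a hostfile problem, the file can be checked before launching. This is only a minimal sketch: check_hostfile is a hypothetical helper (not part of MVAPICH2), and it assumes the file holds one plain node name per line, as above.

```shell
# Check that a hostfile names exactly two distinct nodes.
# check_hostfile is a hypothetical helper, not part of MVAPICH2.
check_hostfile() {
    # Drop blank lines, then count the distinct names that remain.
    n=$(grep -v '^[[:space:]]*$' "$1" | sort -u | wc -l)
    n=$((n))    # normalize any padding that wc may add
    if [ "$n" -eq 2 ]; then
        echo "ok: 2 distinct hosts"
    else
        echo "unexpected: $n distinct hosts"
        return 1
    fi
}
```

Calling check_hostfile host before mpiexec would catch a duplicated or missing node name, which would otherwise make a two-node run silently fall back to a single node.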

Running the above with strace stops at

read(6, "# OSU MPI Bandwidth Test v3.4\n# "..., 61) = 61
write(1, "# OSU MPI Bandwidth Test v3.4\n# "..., 61# OSU MPI Bandwidth Test v3.4# Size        Bandwidth (MB/s)
) = 61
poll([{fd=3, events=POLLIN}, {fd=5, events=POLLIN}, {fd=8, events=POLLIN}, {fd=10, events=POLLIN}, {fd=11, events=POLLIN}, {fd=13, events=POLLIN}, {fd=6, events=POLLIN}, {fd=0, events=POLLIN}, {fd=7, events=POLLIN}], 9, -1

'top' on the nodes shows osu_bw consuming CPU time,
but the test never progresses.
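
Since the hung run never exits by itself, a time limit makes it easier to compare builds in a script. The following is a rough sketch, assuming GNU coreutils `timeout` is installed (older distributions may not ship it); run_with_limit is a hypothetical wrapper name.

```shell
# Run a command under a time limit and report whether it finished.
# Assumes GNU coreutils `timeout`, which exits with status 124
# when the limit expires. run_with_limit is a hypothetical name.
run_with_limit() {
    limit=$1
    shift
    timeout "$limit" "$@"
    status=$?
    if [ "$status" -eq 124 ]; then
        echo "HUNG: no completion within ${limit}s: $*"
    else
        echo "finished with exit status $status"
    fi
    return "$status"
}
```

For example, prefixing the mpiexec command shown earlier with run_with_limit 120 would report HUNG after two minutes instead of blocking the terminal, so the 1.6 and 1.7 builds can be tested back to back.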

I also tried building without the --with-rdma=gen2 option
to configure, but osu_bw hangs in the same way. The hang
also occurs on an older cluster (64-bit RHEL4
Linux, with OFED 1.4), and with
mvapich2 1.7-r5140 (downloaded on 1/25/12).

Interestingly, mvapich2 1.6, built as above, appears to
work fine (osu_bw gave reasonable results) on this
cluster.

Any ideas on what I might be doing wrong in the installation
or testing? Any suggestions on how to troubleshoot this?
I would appreciate any help.

Thank you very much!
Shan-Ho

----------------------------------------------------
Shan-Ho Tsai
University of Georgia, Athens GA



