[mvapich-discuss] Crashes with mpirun_rsh in MVAPICH2 1.5

Thompson, Matthew A. (GSFC-610.1)[SCIENCE APPLICATIONS INTL CORP] matthew.thompson at nasa.gov
Tue Aug 10 10:24:55 EDT 2010


In order to better compare a test box I build on with a larger cluster
that uses MVAPICH2 and mpirun_rsh, I recently downloaded version 1.5
and built it on the test node. This is a single, dual-socket Nehalem
node running RHEL 5.5 with no specialized networking hardware.

The compiler I'm using is PGI 10.6 and, following assistance from PGI, I
configured and built the MVAPICH2 tarball with no errors using:

./configure --with-device=ch3:sock --prefix=$HOME/mvapich2 \
    CC=pgcc FC=pgfortran F77=pgfortran CXX=pgcpp
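
For completeness, the resulting install can be sanity-checked with
MVAPICH2's mpiname utility (assuming it is present in the 1.5 build):

$ ~/mvapich2/bin/mpiname -a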

The issue arises when I try to run something as simple as hellow.c
from the examples directory (sketched below) using mpirun_rsh. First,
let us check that the build itself works with mpd:

$ ~/mvapich2/bin/mpicc -o hellow hellow.c
$ ~/mvapich2/bin/mpdboot
$ ~/mvapich2/bin/mpirun -np 2 ./hellow
Hello world from process 0 of 2
Hello world from process 1 of 2
$ ~/mvapich2/bin/mpdallexit
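
For reference, hellow.c is essentially the canonical MPI hello world;
a minimal sketch of what it does (paraphrased, not the verbatim
example file):

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);                /* where the crashes below occur */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* rank of this process */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total process count */
    printf("Hello world from process %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}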

Okay, mpd seems to be working. But the errors abound when I try to
use mpirun_rsh. First, if I just run it with mpirun_rsh (where
host_file_name contains only the hostname of the machine):
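
To be explicit, the hostfile is nothing exotic, just one line with the
node's hostname (the node is called janus, as seen in the output
further down):

$ cat host_file_name
janus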

$ ~/mvapich2/bin/mpirun_rsh -np 2 -hostfile host_file_name ./hellow
*** glibc detected *** ./hellow: double free or corruption (fasttop): 0x000000001655e090 ***
*** glibc detected *** ./hellow: double free or corruption (fasttop): 0x000000000c772090 ***
======= Backtrace: =========
/lib64/libc.so.6[0x3984a7230f]
/lib64/libc.so.6(cfree+0x4b)[0x3984a7276b]
./hellow[0x429ddd]
======= Memory map: ========
<snip>
(I can provide the full memory map and core files if wanted.)

Looking around the internet, this seems to be "fixable" with
MALLOC_CHECK_ (see the note after the next transcript), though I'm
concerned that I'm seeing the error at all. Still, I try that and
get:

$ env MALLOC_CHECK_=0 ~/mvapich2/bin/mpirun_rsh -np 2 \
    -hostfile host_file_name ./hellow
Fatal error in MPI_Init: Invalid buffer pointer, error stack:
MPIR_Init_thread(411): Initialization failed
(unknown)(): Invalid buffer pointer
Fatal error in MPI_Init: Invalid buffer pointer, error stack:
MPIR_Init_thread(411): Initialization failed
(unknown)(): Invalid buffer pointer
[cli_1]: aborting job:
Fatal error in MPI_Init: Invalid buffer pointer, error stack:
MPIR_Init_thread(411): Initialization failed
(unknown)(): Invalid buffer pointer
[cli_0]: aborting job:
Fatal error in MPI_Init: Invalid buffer pointer, error stack:
MPIR_Init_thread(411): Initialization failed
(unknown)(): Invalid buffer pointer
MPI process (rank: 1) terminated unexpectedly on janus
Exit code -5 signaled from janus
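
(For context, MALLOC_CHECK_ is plain glibc behavior, nothing
MVAPICH2-specific; roughly:

MALLOC_CHECK_=0   # silently ignore detected heap corruption
MALLOC_CHECK_=1   # print a diagnostic on stderr and continue
MALLOC_CHECK_=2   # call abort() immediately

so setting it to 0 just papers over whatever is actually corrupting
the heap, which presumably surfaces later as the "Invalid buffer
pointer" failure in MPI_Init.)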

As said above, I have been trying to figure this out with PGI; the
forum thread for that discussion is here:

http://www.pgroup.com/userforum/viewtopic.php?t=2068

I even tried "unlimiting memorylocked" as well, but there is no joy.
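
For the record, by "unlimiting memorylocked" I mean raising the
max-locked-memory limit before launching, along these lines:

$ ulimit -l               # show the current limit, in KB
$ ulimit -l unlimited     # bash; in csh: limit memorylocked unlimited
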
But the very confusing thing is that PGI built MVAPICH2 on a RHEL 5.5
machine and got *no errors*:

http://www.pgroup.com/userforum/viewtopic.php?p=7845#7845

Have you ever seen this behavior with mpirun_rsh?

Thank you for any help,
Matt Thompson
-- 
Matthew Thompson, SAIC, Sr Scientific Software Engr
NASA GSFC,  Global Modeling and Assimilation Office
Code 610.1, 8800 Greenbelt Rd, Greenbelt, MD  20771
Phone: 301-614-6712               Fax: 301-614-6246


