[mvapich-discuss] Errors spawning processes with mpirun_rsh

Rafael Arco Arredondo rafaarco at ugr.es
Mon Mar 16 10:13:16 EDT 2009


Hi Jaidev,

Sorry for the delay. I had some other business to deal with :). I just
found out the problem goes away when MVAPICH/MVAPICH2 are compiled with
GCC instead of PathScale. I also tried compiling with PathScale and -O2
instead of the default -O3, but it also crashes.

Here is the backtrace for MVAPICH (PathScale and -O3):
#0  0x00002b6b1da52094 in _int_free (av=0x0, mem=0x2b6b1d8c8e40) at
ptmalloc2/malloc.c:4346
#1  0x00002b6b1da509d7 in free (mem=0x50f950) at ptmalloc2/malloc.c:3473
#2  0x00002b6b1e05f942 in fclose@@GLIBC_2.2.5 () from /lib64/libc.so.6
#3  0x00002b6b1e0ced7e in __res_vinit () from /lib64/libc.so.6
#4  0x00002b6b1e0d0325 in __res_maybe_init () from /lib64/libc.so.6
#5  0x00002b6b1e0d1ace in __nss_hostname_digits_dots ()
from /lib64/libc.so.6
#6  0x00002b6b1e0d6530 in gethostbyname () from /lib64/libc.so.6
#7  0x00002b6b1da622a9 in pmgr_open ()
at /tmp/mvapich-1.1/mpid/ch_gen2/process/pmgr_collective_client.c:859
#8  0x0000000049be4cf0 in ?? ()
#9  0x00000000000a2f1d in ?? ()
#10 0x0000000000000002 in ?? ()
#11 0x00002b6b1da92bd0 in ?? ()
from /usr/local/apps/mpi/mvapich-1.1_psc_dbg/lib/shared/libmpich.so.1.0
#12 0x0000000000000002 in ?? ()
#13 0x0000000000000002 in ?? ()
#14 0x1999999999999999 in ?? ()
#15 0x00007fff8d1fac70 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt
stack?)

And here the one for MVAPICH2:
#0  0x00002ac12a18610e in _int_free (av=0x2ac12a3e6050, mem=0x501010)

at /tmp/mvapich2-1.2p1/src/mpid/ch3/channels/mrail/src/memory/ptmalloc2/mvapich_malloc.c:4387
#1  0x00002ac12a1842de in free (mem=0x501010)
at /tmp/mvapich2-1.2p1/src/mpid/ch3/channels/mrail/src/memory/ptmalloc2/mvapich_malloc.c:3476
#2  0x00002ac12a111814 in DLOOP_Dataloop_create_basic_all_bytes_struct
(count=2, blklens=0x7fff80b90690, disps=0x7fff80b90680,
oldtypes=0x7fff80b90670, 
    dlp_p=0x2ac12a3d4ce0, dlsz_p=0x2ac12a3d4ce8,
dldepth_p=0x2ac12a3d4cec, flag=0)

at /tmp/mvapich2-1.2p1/src/mpid/common/datatype/dataloop/dataloop_create_struct.c:527
#3  0x00002ac12a11094e in MPID_Dataloop_create_struct (count=2,
blklens=0x7fff80b90690, disps=0x7fff80b90680, oldtypes=0x7fff80b90670, 
    dlp_p=0x2ac12a3d4ce0, dlsz_p=0x2ac12a3d4ce8,
dldepth_p=0x2ac12a3d4cec, flag=0)

at /tmp/mvapich2-1.2p1/src/mpid/common/datatype/dataloop/dataloop_create_struct.c:225
#4  0x00002ac12a11039e in MPID_Dataloop_create_pairtype
(type=-1946157056, dlp_p=0x2ac12a3d4ce0, dlsz_p=0x2ac12a3d4ce8,
dldepth_p=0x2ac12a3d4cec, flag=0)

at /tmp/mvapich2-1.2p1/src/mpid/common/datatype/dataloop/dataloop_create_pairtype.c:74
#5  0x00002ac12a176b8f in MPID_Type_create_pairtype (type=-1946157056,
new_dtp=0x2ac12a3d4c70)

at /tmp/mvapich2-1.2p1/src/mpid/common/datatype/mpid_type_create_pairtype.c:177
#6  0x00002ac12a216c3a in MPIR_Datatype_init ()
at /tmp/mvapich2-1.2p1/src/mpi/datatype/typeutil.c:133
#7  0x00002ac12a156cdb in MPIR_Init_thread (argc=0x7fff80b908c0,
argv=0x7fff80b908c8, required=0, provided=0x0)
    at /tmp/mvapich2-1.2p1/src/mpi/init/initthread.c:287
#8  0x00002ac12a156216 in PMPI_Init (argc=0x7fff80b908c0,
argv=0x7fff80b908c8) at /tmp/mvapich2-1.2p1/src/mpi/init/init.c:135
#9  0x00000000004007be in main (argc=1, argv=0x7fff80b909b8)
at /SCRATCH/rafaarco/mpi/mpihello.c:10

Anyway, the version compiled with GCC seems the work fine.

Thanks again for your help and best regards,

Rafa

El lun, 23-02-2009 a las 12:25 -0500, Jaidev Sridhar escribió:
> Hi Rafael,
> 
> On Mon, 2009-02-23 at 18:08 +0100, Rafael Arco Arredondo wrote:
> > Hi Jaidev,
> > 
> > Thank you for your prompt reply.
> > 
> > > The message indicates that the application terminated  with a non zero 
> > > error code or crashed after launching. Can you check if it leaves any 
> > > core files? You may need to set ulimit to  unlimited. For example, add 
> > > ulimit -c unlimited in your ~/.bashrc.
> > 
> > Yes, a core file is generated after adding 'ulimit -c unlimited' to
> > $HOME/.bashrc.
> 
> Can you send us the backtrace from this core file -
> 	$ gdb ./mpihello core.xyz
> 	(gdb) bt
> 
> If you have core files from both mvapich and mvapich2 runs, we'd like to
> see them. This will provide more insights.
> 
> It'll be more useful if you can compile the libraries and your
> application with debug symbols:
>   * For mvapich2, configure the libraries with --enable-g=dbg and
>     compile your application with mpicc -g
>   * For mvapich, edit make.mvapich.gen2, add -g to CFLAGS and compile
>     your application with mpicc -g
> 
> -Jaidev
> 
> > 
> > > Can you also give us details of the cluster and any options you've 
> > > enabled with MVAPICH / MVAPICH2?
> > 
> > It is a cluster of servers with AMD64 Opteron processors, an Infiniband
> > network and Sun Grid Engine 6.2 as batch scheduler (anyway this error is
> > reported both when SGE controls the jobs and when it doesn't, when
> > mpirun_rsh is directly executed from the command line).
> > 
> > In order to compile MVAPICH, the PathScale compiler was used (for which
> > the make.mvapich.gen2 script was accordingly edited), shared library
> > support was enabled and the flag -DXRC was removed. The rest of the
> > options, including the configuration files in $MVAPICH_HOME/etc, wasn't
> > modified (i.e., default values are used).
> > 
> > As for MVAPICH2, it was compiled by invoking the configure script this
> > way:
> > 
> > ./configure --enable-sharedlibs=gcc CC=pathcc F77=pathf90 F90=pathf90
> > CXX=pathCC
> > 
> > And then plain 'make' and 'make install'. Again, the other options
> > weren't changed.
> > 
> > MVAPICH and MVAPICH2 compile with no problems, so do programs compiled
> > with mpicc. However, programs crash on the initialization stage after
> > launching as you said.
> > 
> > Any ideas?
> > 
> > Thanks again,
> > 
> > Rafa
> > 
> > > On 02/23/2009 04:45 AM, Rafael Arco Arredondo wrote:
> > > > Hello,
> > > > 
> > > > I'm having some issues with mpirun_rsh within both MVAPICH 1.1 and
> > > > MVAPICH2 1.2p1. As I commented in another email to the list some time
> > > > ago, mpirun_rsh is the only mechanism we can use to create MPI processes
> > > > in our configuration.
> > > > 
> > > > The command issued is:
> > > > mpirun_rsh -ssh -np 2 -hostfile ./machines ./mpihello
> > > > 
> > > > And the error reported by mpirun_rsh is:
> > > > 
> > > > Exit code -5 signaled from localhost
> > > > MPI process terminated unexpectedly
> > > > Killing remote processes...DONE
> > > > 
> > > > We also got this on some of our machines:
> > > > 
> > > > Child exited abnormally!
> > > > Killing remote processes...DONE
> > > > 
> > > > mpihello is a simple hello world and this happens even when the
> > > > processes are launched on localhost only.
> > > > 
> > > > OFED 1.2 is used as the underlying Infiniband libraries, and both
> > > > MVAPICH and MVAPICH2 were compiled with the OpenFabrics/Gen2 single-rail
> > > > option, without XRC as indicated in the user's guide for OFED libraries
> > > > prior to version 1.3.
> > > > 
> > > > Any help will be kindly appreciated.
> > > > 
> > > > Thank you in advance,
> > > > 
> > > > Rafa
> > > >
> > > > 
> > 
> > 




More information about the mvapich-discuss mailing list