[mvapich-discuss] mpiexec.hydra hangs

Igor Podladtchikov igor.podladtchikov at spectraseis.com
Mon Jul 8 18:10:34 EDT 2013


Hi Jonathan,

thanks for your reply.

which mpiexec.hydra
/usr/local/bin/mpiexec.hydra

which mpirun_rsh
/usr/local/bin/mpirun_rsh

ldd /usr/local/bin/mpiexec.hydra
        linux-vdso.so.1 =>  (0x00007fff93d2e000)
        libnsl.so.1 => /lib64/libnsl.so.1 (0x0000003a2ce00000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003a1cc00000)
        libm.so.6 => /lib64/libm.so.6 (0x0000003a1d000000)
        libc.so.6 => /lib64/libc.so.6 (0x0000003a1c800000)
        /lib64/ld-linux-x86-64.so.2 (0x0000003a1c000000)

ldd /usr/local/bin/mpirun_rsh
        linux-vdso.so.1 =>  (0x00007fff6179d000)
        libcudart.so.5.5 => /usr/local/cuda/lib64/libcudart.so.5.5 (0x00007f415c7b2000)
        libcuda.so.1 => /usr/lib64/libcuda.so.1 (0x00007f415bb34000)
        libibmad.so.5 => /usr/lib64/libibmad.so.5 (0x0000003a1d000000)
        libibumad.so.3 => /usr/lib64/libibumad.so.3 (0x0000003a1d800000)
        libibverbs.so.1 => /usr/lib64/libibverbs.so.1 (0x0000003a1d400000)
        libdl.so.2 => /lib64/libdl.so.2 (0x0000003a1c400000)
        librt.so.1 => /lib64/librt.so.1 (0x00007f415b92b000)
        libm.so.6 => /lib64/libm.so.6 (0x00007f415b6a6000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003a1cc00000)
        libc.so.6 => /lib64/libc.so.6 (0x0000003a1c800000)
        libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x0000003a23400000)
        libz.so.1 => /lib64/libz.so.1 (0x00007f415b48f000)
        /lib64/ld-linux-x86-64.so.2 (0x0000003a1c000000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000003a20c00000)

On the working system, mpiexec.hydra gives the same output, but mpirun_rsh doesn't seem to link against libstdc++.so.6 and libgcc_s.so.1. Also, if I run ldd on mpirun_rsh as root, I don't get stdc++ and gcc_s either.

The other thing I noticed is that if I tell my program not to create a context on the GPUs, mpiexec.hydra still hangs, but it doesn't produce all those messages when I cancel it with Ctrl+C. I am not setting MV2_USE_CUDA, so that shouldn't matter, should it?
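(In case it's useful, this is roughly how I could force it one way or the other at run time, just a sketch, assuming MV2_USE_CUDA is an ordinary environment variable with 0/1 values that mpiexec.hydra passes through, and with ./my_prog standing in for my actual binary:)

    # current situation: MV2_USE_CUDA unset
    mpiexec.hydra -n 7 ./my_prog

    # explicitly disable / enable the CUDA path for comparison
    MV2_USE_CUDA=0 mpiexec.hydra -n 7 ./my_prog
    MV2_USE_CUDA=1 mpiexec.hydra -n 7 ./my_prog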

And the last thing you may need to know: for some reason unknown to me, root on the compute node can't modify files created by Active Directory users. So I copied the tarball from the shared location (owned by an Active Directory user) to root's home on the compute node, where I ran configure; make; make install as root. It hangs for both root and the AD user, though. On the node where everything is OK, I first tried to configure and make as the AD user, then failed at make install because I can't modify /usr/... as the AD user. I also failed at make install as root, because I couldn't modify some files in the build directory. So I ended up copying everything to root's home on the compute node and doing everything from there. Would that be a problem, do you think?
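(To double-check which installation actually gets picked up after all that copying, I can run checks like the ones below, assuming mpiname is installed along with everything else and reports the build/configure info:)

    which mpicc mpiexec.hydra mpirun_rsh
    mpiname -a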

I never seem to have quite understood how mpirun_rsh works. It gives me "incorrect number of arguments" when I do mpirun_rsh -n 7 <prog + args>.
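(My guess is that it also wants the hosts, either on the command line or via a hostfile, something like the lines below, where "hosts" is a file listing the node once per process and ./my_prog is a placeholder; please correct me if the flags are off:)

    mpirun_rsh -np 7 -hostfile hosts ./my_prog

    # or with the hosts given directly
    mpirun_rsh -np 7 n1 n1 n1 n1 n1 n1 n1 ./my_prog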

Igor Podladtchikov
Spectraseis
1899 Wynkoop St, Suite 350
Denver, CO 80202
Tel. +1 303 658 9172 (direct)
Tel. +1 303 330 8296 (cell)
www.spectraseis.com

________________________________________
From: Jonathan Perkins
Sent: Monday, July 08, 2013 3:42 PM
To: Igor Podladtchikov
Cc: mvapich-discuss at cse.ohio-state.edu
Subject: Re: [mvapich-discuss] mpiexec.hydra hangs

Hello Igor.  I believe that this could be an installation issue.  To
help narrow down the cause, can you provide the output of the following
commands:

    which mpiexec.hydra
    which mpirun_rsh
    ldd /path/to/mpi/program

You can also try running the program using mpirun_rsh instead of
mpiexec.hydra.

Regarding the cuda path stuff, I've grown accustomed to just using
CPPFLAGS and LDFLAGS directly for more complicated situations.
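For example, something along these lines, using the paths you mentioned (just a sketch; adjust the directories to match your install):

    ./configure --enable-cuda \
        CPPFLAGS="-I/usr/local/cuda/include" \
        LDFLAGS="-L/usr/local/cuda/lib64 -L/usr/lib64/nvidia"
    make
    make install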

On Mon, Jul 08, 2013 at 08:03:18PM +0000, Igor Podladtchikov wrote:
> Hello,
>
> I installed mvapich2-1.9 (05/06/13) on one of our compute nodes, and mpiexec.hydra hangs sometime after MPI_Init. If I press Ctrl+C, I get this:
>
> ^C[mpiexec at glowfish1.spectraseis.biz] Sending Ctrl-C to processes as requested
> [mpiexec at glowfish1.spectraseis.biz] Press Ctrl-C again to force abort
> ^CCtrl-C caught... cleaning up processes
> [proxy:0:0 at glowfish1.spectraseis.biz] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:913): assert (!closed) failed
> [proxy:0:0 at glowfish1.spectraseis.biz] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
> [proxy:0:0 at glowfish1.spectraseis.biz] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
> ^C[mpiexec at glowfish1.spectraseis.biz] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
> [mpiexec at glowfish1.spectraseis.biz] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
> [mpiexec at glowfish1.spectraseis.biz] ui_cmd_cb (./pm/pmiserv/pmiserv_pmci.c:71): launcher returned error waiting for completion
> [mpiexec at glowfish1.spectraseis.biz] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
> [mpiexec at glowfish1.spectraseis.biz] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:197): error waiting for event
> [mpiexec at glowfish1.spectraseis.biz] main (./ui/mpich/mpiexec.c:331): process manager error waiting for completion
>
> If I run the exact same command using Intel MPI, the program runs through without problems.
>
> The weird thing is that I installed the same tarball on another node before this one, and there everything works. The nodes are "identical" in that they both received the same CentOS "yum update", NVIDIA driver 319.32, and "yum groupinstall 'Infiniband Support'". But they are not perfectly identical; I don't know exactly why. Anyway, I wonder if we can figure out why one works but not the other?
>
> On BOTH nodes:
>
> uname -r
> 2.6.32-358.11.1.el6.x86_64
>
> 7 GPUs (Nvidia Tesla M2070) with driver 319.32
>
> Please let me know if you need any other system info.
>
> I compared config.log on both systems, and the one that's hanging had "opa_primitives.h: No such file or directory", which the working one didn't.
> Also, I could "locate opa_primitives.h" on the working system, but not on the hanging system.
> The other difference was that the working system had
>   "/usr/local/include/primitives/opa_gcc_intel_32_64_ops.h:38: note: expected 'struct OPA_int_t *' but argument is of type 'OPA_int_t'"
> in its config.log, while the hanging one did not.
> I noticed that the hanging one also didn't have /usr/local/include/primitives/opa_gcc_intel_32_64_ops.h.
> I did yum provides "*opa_gcc_intel_32_64_ops.h" and found "mpich2-devel-1.2.1-2.3.el6.x86_64".
> Even though the working system didn't have mpich2-devel installed, I installed it on the hanging system.
> I now had the opa_gcc_intel...ops.h file, but the same differences remained in config.log.
> I figured out that running configure; make; make install TWICE makes that error go away for the second configure, so I must have done that on the working system.
> I reproduced the gcc error from config.log: opa_primitives.h: No such file or directory
> If I do configure; make; make install, the error goes away.
> After all this, mpiexec.hydra still hangs, so I'm writing you in the hope that you can shed some light on the matter.
> I'm using Intel MPI for now, but I really want to use MVAPICH; I keep hearing so many good things about it but haven't managed to get it installed so far, unfortunately.
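> (In case it matters, the clean-rebuild sequence I can repeat on the hanging node would look roughly like this, starting from a freshly unpacked tarball so nothing is left over from the earlier attempts; the tarball name is from memory:)
>
>     tar xzf mvapich2-1.9.tgz
>     cd mvapich2-1.9
>     ./configure --enable-cuda <other options>
>     make
>     make install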
>
> Cheers
>
> P.S. I also had some problems with --enable-cuda. CUDA 5.5 puts its includes and libraries into /usr/local/cuda-5.5 and makes a softlink /usr/local/cuda (which I believe has always been the standard location). The 319 driver installs itself in /usr/lib64/nvidia. If I simply do --enable-cuda, it doesn't find anything. If I do --with-cuda=/usr/local/cuda, it doesn't find -lcuda. If I do --with-cuda-libpath=/usr/lib64/nvidia, it doesn't find -lcudart. I guess the problem is that the runtime doesn't have the driver, and the driver doesn't have the runtime, and I wasn't able to give configure several paths separated by colons. Anyway, what I ended up doing is making a softlink to /usr/lib64/nvidia/libcuda.so in /usr/local/cuda/lib64, which is surely not how things were meant to be. This stuff is kind of annoying, and I guess it's new with CUDA 5.5, where you can get it via "yum install cuda". yum will do /usr/local/cuda -> /usr/local/cuda-5.5 and /usr/lib64/nvidia/libcuda.so, while the .run scripts will do /usr/lib64/libcuda.so... bless their hearts. I don't know how much you guys care about this stuff, but some coordination with NVIDIA on a standard location for CUDA would be nice. I lost some time figuring this out.
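> (Concretely, the workaround was just this symlink, after which configuring against /usr/local/cuda could presumably find both -lcuda and -lcudart; this just restates what I did above, with the exact target path from memory:)
>
>     ln -s /usr/lib64/nvidia/libcuda.so /usr/local/cuda/lib64/libcuda.so
>     ./configure --enable-cuda --with-cuda=/usr/local/cuda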
>
> Igor Podladtchikov
> Spectraseis
> 1899 Wynkoop St, Suite 350
> Denver, CO 80202
> Tel. +1 303 658 9172 (direct)
> Tel. +1 303 330 8296 (cell)
> www.spectraseis.com

> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss


--
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo



