[mvapich-discuss] mpiexec.hydra hangs

Jonathan Perkins perkinjo at cse.ohio-state.edu
Mon Jul 8 17:42:31 EDT 2013


Hello Igor.  I believe that this could be an installation issue.  To
help narrow down the cause can you provide the output of the following
commands:

    which mpiexec.hydra
    which mpirun_rsh
    ldd /path/to/mpi/program

You can also try running the program using mpirun_rsh instead of
mpiexec.hydra.
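
For example, you could try something like the following for a
single-node run (process count, host name, and program path are just
placeholders for your setup):

    mpirun_rsh -np 2 node1 node1 /path/to/mpi/program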

Regarding the cuda path stuff, I've grown accustomed to just using
CPPFLAGS and LDFLAGS directly for more complicated situations.
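
For example, with the toolkit under /usr/local/cuda and the driver
library under /usr/lib64/nvidia as on your nodes, a configure line
roughly like this should let both -lcudart and -lcuda be found (a
sketch; adjust paths as needed):

    ./configure --enable-cuda \
        CPPFLAGS="-I/usr/local/cuda/include" \
        LDFLAGS="-L/usr/local/cuda/lib64 -L/usr/lib64/nvidia"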

On Mon, Jul 08, 2013 at 08:03:18PM +0000, Igor Podladtchikov wrote:
> Hello,
> 
> I installed mvapich2-1.9 (05/06/13) on one of our compute nodes, and mpiexec.hydra hangs at some point after MPI_Init. If I press Ctrl+C, I get this:
> 
> ^C[mpiexec at glowfish1.spectraseis.biz] Sending Ctrl-C to processes as requested
> [mpiexec at glowfish1.spectraseis.biz] Press Ctrl-C again to force abort
> ^CCtrl-C caught... cleaning up processes
> [proxy:0:0 at glowfish1.spectraseis.biz] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:913): assert (!closed) failed
> [proxy:0:0 at glowfish1.spectraseis.biz] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
> [proxy:0:0 at glowfish1.spectraseis.biz] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
> ^C[mpiexec at glowfish1.spectraseis.biz] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
> [mpiexec at glowfish1.spectraseis.biz] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
> [mpiexec at glowfish1.spectraseis.biz] ui_cmd_cb (./pm/pmiserv/pmiserv_pmci.c:71): launcher returned error waiting for completion
> [mpiexec at glowfish1.spectraseis.biz] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
> [mpiexec at glowfish1.spectraseis.biz] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:197): error waiting for event
> [mpiexec at glowfish1.spectraseis.biz] main (./ui/mpich/mpiexec.c:331): process manager error waiting for completion
> 
> If I run the exact same command using Intel MPI, the program runs through without problems.
> 
> The weird thing is, I installed the same tarball on another node before this one, and there everything works. The nodes are "identical" in that they both received the same CentOS "yum update", NVIDIA driver 319.32, and "yum groupinstall 'Infiniband Support'". But they are not perfectly identical, and I don't know exactly how they differ. Anyway, I wonder if we can figure out why one works but not the other?
> 
> On BOTH nodes:
> 
> uname -r
> 2.6.32-358.11.1.el6.x86_64
> 
> 7 GPUs (Nvidia Tesla M2070) with driver 319.32
> 
> Please let me know if you need any other system info.
> 
> I compared config.log on both systems, and the one that's hanging had "opa_primitives.h: No such file or directory", which the working one didn't.
> Also, I could "locate opa_primitives.h" on the working system, but not on the hanging system.
> The other difference was that the working system had
>   "/usr/local/include/primitives/opa_gcc_intel_32_64_ops.h:38: note: expected 'struct OPA_int_t *' but argument is of type 'OPA_int_t'"
> in its config.log, while the hanging one did not.
> I noticed that the hanging one also didn't have /usr/local/include/primitives/opa_gcc_intel_32_64_ops.h.
> I did yum provides "*opa_gcc_intel_32_64_ops.h" and found "mpich2-devel-1.2.1-2.3.el6.x86_64".
> Even though the working system didn't have mpich2-devel installed, I installed it on the hanging system.
> I now had the opa_gcc_intel...ops.h file, but still the same differences in the config.log.
> I figured out that running configure; make; make install TWICE makes that error go away for the second configure, so I must have done that on the working system...
> I reproduced the gcc error from config.log: opa_primitives.h: No such file or directory
> If I do configure; make; make install, the error goes away.
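> 
> In other words, what makes that config.log error disappear is
> effectively this sequence (my guess is that the second configure just
> picks up the headers the first "make install" put under
> /usr/local/include):
> 
>     ./configure ...        # first pass: config.log complains about opa_primitives.h
>     make && make install
>     ./configure ...        # second pass: the opa_primitives.h check now passes
>     make && make install
> 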
> After all this, mpiexec.hydra still hangs, so I'm writing you in the hope that you can shed some light on the matter...
> I'm using Intel MPI for now, but I really want to use mvapich; I keep hearing so many good things about it, but haven't managed to get it installed so far, unfortunately.
> 
> Cheers
> 
> P.S. I also had some problems with --enable-cuda. CUDA 5.5 puts its includes and libraries into /usr/local/cuda-5.5 and makes a softlink /usr/local/cuda (which I believe has always been the standard location). The 319 driver installs itself in /usr/lib64/nvidia. If I simply do --enable-cuda, it doesn't find anything. If I do --with-cuda=/usr/local/cuda, it doesn't find -lcuda. If I do --with-cuda-libpath=/usr/lib64/nvidia, it doesn't find -lcudart. I guess the problem is that the runtime directory doesn't have the driver library, the driver directory doesn't have the runtime, and I wasn't able to give configure several paths separated by a colon.
> 
> Anyway, what I ended up doing was making a softlink to /usr/lib64/nvidia/libcuda.so in /usr/local/cuda/lib64 (see the command below), which is surely not how things were meant to be. This stuff is kind of annoying, and I guess it's new with CUDA 5.5, where you can get it via "yum install cuda". yum will do /usr/local/cuda -> /usr/local/cuda-5.5 and /usr/lib64/nvidia/libcuda.so, while the .run scripts will do /usr/lib64/libcuda.so... bless their hearts. I don't know how much you guys care about this stuff, but some coordination with NVIDIA on a standard location for CUDA would be nice. I lost some time figuring this out.
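> 
> In case it helps anyone else, the workaround boiled down to:
> 
>     ln -s /usr/lib64/nvidia/libcuda.so /usr/local/cuda/lib64/libcuda.so
> 
> so that a single --with-cuda=/usr/local/cuda sees both -lcudart and -lcuda.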
> 
> Igor Podladtchikov
> Spectraseis
> 1899 Wynkoop St, Suite 350
> Denver, CO 80202
> Tel. +1 303 658 9172 (direct)
> Tel. +1 303 330 8296 (cell)
> www.spectraseis.com

> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss


-- 
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo



