[mvapich-discuss] Multi-nodes runtime error

Jonathan Perkins perkinjo at cse.ohio-state.edu
Sat Apr 16 08:06:55 EDT 2016


Hello.  Sorry that you're experiencing some trouble.  Can you try a debug
build to see if any more information is printed?

http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.2rc1-userguide.html#x1-1270009.1.14
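
For reference, a debug build is usually just a reconfigure with debugging
enabled.  A minimal sketch, assuming the Intel compilers you mentioned and a
separate install prefix (both are assumptions; adjust for your site):

    ./configure CC=icc CXX=icpc F77=ifort FC=ifort \
        --enable-g=dbg --disable-fast \
        --prefix=$HOME/mvapich2-2.2b-debug
    make -j 8 && make install

After rebuilding, relink ./Job against this install and rerun the failing
two-node case so any backtrace carries file and line information.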

You may also want to try using mpirun_rsh to see if it helps.

http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.2rc1-userguide.html#x1-260005.2.1
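
For example, something like the following should be roughly equivalent to
your mpirun command (hostnames and process count are taken from it; ranks
should be spread across the two hosts listed in the hostfile):

    echo server1 >  hosts
    echo server2 >> hosts
    mpirun_rsh -np 16 -hostfile hosts ./Job

mpirun_rsh uses a different startup path than the Hydra-based mpirun, which
can sometimes give a clearer error when a remote process dies.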

Please let us know what output you see with this new build and mpirun_rsh.

On Fri, Apr 15, 2016 at 8:32 PM Kai Yang <white_yk at utexas.edu> wrote:

> I recently built a multi-node cluster; each node has 16 cores and 256 GB
> of memory in total. I installed Intel Composer 2016, mvapich2-2.2b, and
> FFTW-2.1.5 on the cluster.
>
> I tested some codes on the new cluster that had previously run
> successfully on TACC. When I ran a large-scale problem on two nodes of the
> new cluster, I got the following errors, but I was able to run the same
> case on a single node successfully. When I tried small-scale problems,
> all the simulations ran well on either a single node or two nodes.
> I used mpirun -hosts server1,server2 -np 16 ./Job to run the
> simulation without a job scheduler.
>
> It seems to me there is some problem with large-scale communication
> between nodes. I was wondering if you have any ideas about these
> errors. Or are there any special settings needed for communication among
> multiple nodes over the IB switch, given that the same code can run on
> multiple nodes on TACC?
>
> FYI, I set the stack size to be unlimited. The IB switch is Mellanox
> InfiniScale IV QDR. When I configured mvapich2 2.2b, I used the default
> settings.
>
> Thanks!
> Kai
>
> [proxy:0:0 at server01] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:912): assert (!closed) failed
> [proxy:0:0 at server01] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
> [proxy:0:0 at server01] main (pm/pmiserv/pmip.c:206): demux engine error waiting for event
> [mpiexec at server01] HYDT_bscu_wait_for_completion (tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
> [mpiexec at server01] HYDT_bsci_wait_for_completion (tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
> [mpiexec at server01] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for completion
> [mpiexec at server01] main (ui/mpich/mpiexec.c:344): process manager error waiting for completion
>