[mvapich-discuss] Multi-nodes runtime error

Kai Yang white_yk at utexas.edu
Thu Apr 14 23:27:46 EDT 2016


I recently built a multi-node cluster in which each node has 16 cores and
256 GB of memory. I installed Intel Composer 2016, mvapich2-2.2b, and
FFTW-2.1.5 on the cluster.

On the new cluster I tested some codes that had previously run successfully
on TACC. When I ran a large-scale problem on two nodes of the new cluster, I
got the errors shown at the end of this message, although the same case ran
successfully on a single node. Small-scale problems ran fine on both a
single node and two nodes. I launched the simulation with
mpirun -hosts server1,server2 -np 16 ./Job, without a job scheduler.
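
For reference, the same launch can be written with an explicit host file, which makes the per-node process count visible. This is only a sketch: the file name hosts.txt and the 8-processes-per-node split are assumptions (the post does not say how the 16 ranks should be placed), and the command is printed rather than executed.

```shell
#!/bin/bash
# Sketch only: hosts.txt and the 8-per-node split are assumptions,
# not settings taken from the original post.
hostfile=hosts.txt

# Hydra host-file format: one "hostname:process_count" entry per line.
write_hostfile() {
    printf '%s\n' 'server1:8' 'server2:8' > "$1"
}

# Build the launch command as text so it can be reviewed before use.
launch_cmd() {
    echo "mpirun -f $1 -np 16 ./Job"
}

write_hostfile "$hostfile"
launch_cmd "$hostfile"
```

Changing the counts in the host file (for example 16 on one host and 0 ranks on the other versus 8 and 8) is one way to check whether the failure depends on how many ranks communicate across nodes.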

It seems to me that the problem lies in large-scale communication between
nodes. Do you have any ideas about these errors? Are there any special
settings needed for communication among multiple nodes over the IB switch,
given that the same code runs on multiple nodes on TACC?

FYI, I set the stack size to unlimited. The IB switch is a Mellanox
InfiniScale IV QDR. I configured mvapich2 2.2b with the default
settings.
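
A limit set in an interactive shell is not necessarily the limit that remotely launched processes inherit, so it can help to print the limits from inside a launched process. The following is a minimal sketch, not a diagnosis: report_limits is a hypothetical helper name, and the locked-memory limit is included only on the assumption that it matters here, as it often does for InfiniBand.

```shell
#!/bin/bash
# Hypothetical helper: print this host's name together with the soft
# stack-size and locked-memory limits inherited by the current process.
report_limits() {
    printf '%s stack=%s memlock=%s\n' "$(uname -n)" "$(ulimit -s)" "$(ulimit -l)"
}

report_limits
```

Saved under a name such as limits.sh (also hypothetical), it could be run through the same launcher, e.g. mpirun -hosts server1,server2 -ppn 1 -np 2 bash ./limits.sh, to compare what each node reports.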

Thanks!
Kai

[proxy:0:0 at server01] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:912): assert (!closed) failed
[proxy:0:0 at server01] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:0 at server01] main (pm/pmiserv/pmip.c:206): demux engine error waiting for event
[mpiexec at server01] HYDT_bscu_wait_for_completion (tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
[mpiexec at server01] HYDT_bsci_wait_for_completion (tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec at server01] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for completion
[mpiexec at server01] main (ui/mpich/mpiexec.c:344): process manager error waiting for completion