[mvapich-discuss] process manager problem

teng ma xiaok1981 at gmail.com
Sun Jul 31 12:41:43 EDT 2011


I used mvapich2-1.7rc.

I ran IMB's Bcast test on 20 nodes (24 cores per node, 480 processes in all).
It fails as follows:

mpiexec -n 480 -f ~/rankfile ./IMB-MPI1 Bcast -npmin 480
[parapluie-33.rennes.grid5000.fr:mpi_rank_1][error_sighandler] Caught error:
Bus error (signal 7)
[parapluie-33.rennes.grid5000.fr:mpi_rank_2][error_sighandler] Caught error:
Bus error (signal 7)
[parapluie-33.rennes.grid5000.fr:mpi_rank_10][error_sighandler] Caught
error: Bus error (signal 7)
[parapluie-33.rennes.grid5000.fr:mpi_rank_18][error_sighandler] Caught
error: Bus error (signal 7)
[parapluie-33.rennes.grid5000.fr:mpi_rank_21][error_sighandler] Caught
error: Bus error (signal 7)
[parapluie-33.rennes.grid5000.fr:mpi_rank_23][error_sighandler] Caught
error: Bus error (signal 7)

=====================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 7
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
=====================================================================================
[proxy:0:1 at parapluie-22.rennes.grid5000.fr] HYD_pmcd_pmip_control_cmd_cb
(./pm/pmiserv/pmip_cb.c:906): assert (!closed) failed
[proxy:0:1 at parapluie-22.rennes.grid5000.fr] HYDT_dmxu_poll_wait_for_event
(./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:1 at parapluie-22.rennes.grid5000.fr] main (./pm/pmiserv/pmip.c:226):
demux engine error waiting for event
[proxy:0:2 at parapluie-7.rennes.grid5000.fr] HYD_pmcd_pmip_control_cmd_cb
(./pm/pmiserv/pmip_cb.c:906): assert (!closed) failed
[proxy:0:2 at parapluie-7.rennes.grid5000.fr] HYDT_dmxu_poll_wait_for_event
(./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:2 at parapluie-7.rennes.grid5000.fr] main (./pm/pmiserv/pmip.c:226):
demux engine error waiting for event
[proxy:0:3 at parapluie-31.rennes.grid5000.fr] HYD_pmcd_pmip_control_cmd_cb
(./pm/pmiserv/pmip_cb.c:906): assert (!closed) failed
[proxy:0:3 at parapluie-31.rennes.grid5000.fr] HYDT_dmxu_poll_wait_for_event
(./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:3 at parapluie-31.rennes.grid5000.fr] main (./pm/pmiserv/pmip.c:226):
demux engine error waiting for event
[proxy:0:4 at parapluie-20.rennes.grid5000.fr] HYD_pmcd_pmip_control_cmd_cb
(./pm/pmiserv/pmip_cb.c:906): assert (!closed) failed
[proxy:0:4 at parapluie-20.rennes.grid5000.fr] HYDT_dmxu_poll_wait_for_event
(./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:4 at parapluie-20.rennes.grid5000.fr] main (./pm/pmiserv/pmip.c:226):
demux engine error waiting for event
[proxy:0:5 at parapluie-5.rennes.grid5000.fr] HYD_pmcd_pmip_control_cmd_cb
(./pm/pmiserv/pmip_cb.c:906): assert (!closed) failed
[proxy:0:5 at parapluie-5.rennes.grid5000.fr] HYDT_dmxu_poll_wait_for_event
(./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:5 at parapluie-5.rennes.grid5000.fr] main (./pm/pmiserv/pmip.c:226):
demux engine error waiting for event
[proxy:0:6 at parapluie-3.rennes.grid5000.fr] HYD_pmcd_pmip_control_cmd_cb
(./pm/pmiserv/pmip_cb.c:906): assert (!closed) failed
[proxy:0:6 at parapluie-3.rennes.grid5000.fr] HYDT_dmxu_poll_wait_for_event
(./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:6 at parapluie-3.rennes.grid5000.fr] main (./pm/pmiserv/pmip.c:226):
demux engine error waiting for event
[proxy:0:7 at parapluie-19.rennes.grid5000.fr] HYD_pmcd_pmip_control_cmd_cb
(./pm/pmiserv/pmip_cb.c:906): assert (!closed) failed
[proxy:0:7 at parapluie-19.rennes.grid5000.fr] HYDT_dmxu_poll_wait_for_event
(./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:7 at parapluie-19.rennes.grid5000.fr] main (./pm/pmiserv/pmip.c:226):
demux engine error waiting for event
[proxy:0:8 at parapluie-38.rennes.grid5000.fr] HYD_pmcd_pmip_control_cmd_cb
(./pm/pmiserv/pmip_cb.c:906): assert (!closed) failed
[proxy:0:8 at parapluie-38.rennes.grid5000.fr] HYDT_dmxu_poll_wait_for_event
(./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:8 at parapluie-38.rennes.grid5000.fr] main (./pm/pmiserv/pmip.c:226):
demux engine error waiting for event
[proxy:0:9 at parapluie-28.rennes.grid5000.fr] HYD_pmcd_pmip_control_cmd_cb
(./pm/pmiserv/pmip_cb.c:906): assert (!closed) failed
[proxy:0:9 at parapluie-28.rennes.grid5000.fr] HYDT_dmxu_poll_wait_for_event
(./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:9 at parapluie-28.rennes.grid5000.fr] main (./pm/pmiserv/pmip.c:226):
demux engine error waiting for event
[proxy:0:10 at parapluie-16.rennes.grid5000.fr] HYD_pmcd_pmip_control_cmd_cb
(./pm/pmiserv/pmip_cb.c:906): assert (!closed) failed
[proxy:0:10 at parapluie-16.rennes.grid5000.fr] HYDT_dmxu_poll_wait_for_event
(./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:10 at parapluie-16.rennes.grid5000.fr] main (./pm/pmiserv/pmip.c:226):
demux engine error waiting for event
[proxy:0:11 at parapluie-25.rennes.grid5000.fr] HYD_pmcd_pmip_control_cmd_cb
(./pm/pmiserv/pmip_cb.c:906): assert (!closed) failed
[proxy:0:11 at parapluie-25.rennes.grid5000.fr] HYDT_dmxu_poll_wait_for_event
(./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:11 at parapluie-25.rennes.grid5000.fr] main (./pm/pmiserv/pmip.c:226):
demux engine error waiting for event
[proxy:0:12 at parapluie-14.rennes.grid5000.fr] HYD_pmcd_pmip_control_cmd_cb
(./pm/pmiserv/pmip_cb.c:906): assert (!closed) failed
[proxy:0:12 at parapluie-14.rennes.grid5000.fr] HYDT_dmxu_poll_wait_for_event
(./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:12 at parapluie-14.rennes.grid5000.fr] main (./pm/pmiserv/pmip.c:226):
demux engine error waiting for event
[proxy:0:13 at parapluie-34.rennes.grid5000.fr] HYD_pmcd_pmip_control_cmd_cb
(./pm/pmiserv/pmip_cb.c:906): assert (!closed) failed
[proxy:0:13 at parapluie-34.rennes.grid5000.fr] HYDT_dmxu_poll_wait_for_event
(./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:13 at parapluie-34.rennes.grid5000.fr] main (./pm/pmiserv/pmip.c:226):
demux engine error waiting for event
[proxy:0:14 at parapluie-23.rennes.grid5000.fr] HYD_pmcd_pmip_control_cmd_cb
(./pm/pmiserv/pmip_cb.c:906): assert (!closed) failed
[proxy:0:14 at parapluie-23.rennes.grid5000.fr] HYDT_dmxu_poll_wait_for_event
(./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:14 at parapluie-23.rennes.grid5000.fr] main (./pm/pmiserv/pmip.c:226):
demux engine error waiting for event
[proxy:0:15 at parapluie-8.rennes.grid5000.fr] HYD_pmcd_pmip_control_cmd_cb
(./pm/pmiserv/pmip_cb.c:906): assert (!closed) failed
[proxy:0:15 at parapluie-8.rennes.grid5000.fr] HYDT_dmxu_poll_wait_for_event
(./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:15 at parapluie-8.rennes.grid5000.fr] main (./pm/pmiserv/pmip.c:226):
demux engine error waiting for event
[proxy:0:16 at parapluie-32.rennes.grid5000.fr] HYD_pmcd_pmip_control_cmd_cb
(./pm/pmiserv/pmip_cb.c:906): assert (!closed) failed
[proxy:0:16 at parapluie-32.rennes.grid5000.fr] HYDT_dmxu_poll_wait_for_event
(./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:16 at parapluie-32.rennes.grid5000.fr] main (./pm/pmiserv/pmip.c:226):
demux engine error waiting for event
[proxy:0:17 at parapluie-21.rennes.grid5000.fr] HYD_pmcd_pmip_control_cmd_cb
(./pm/pmiserv/pmip_cb.c:906): assert (!closed) failed
[proxy:0:17 at parapluie-21.rennes.grid5000.fr] HYDT_dmxu_poll_wait_for_event
(./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:17 at parapluie-21.rennes.grid5000.fr] main (./pm/pmiserv/pmip.c:226):
demux engine error waiting for event
[proxy:0:18 at parapluie-6.rennes.grid5000.fr] HYD_pmcd_pmip_control_cmd_cb
(./pm/pmiserv/pmip_cb.c:906): assert (!closed) failed
[proxy:0:18 at parapluie-6.rennes.grid5000.fr] HYDT_dmxu_poll_wait_for_event
(./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:18 at parapluie-6.rennes.grid5000.fr] main (./pm/pmiserv/pmip.c:226):
demux engine error waiting for event
[proxy:0:19 at parapluie-30.rennes.grid5000.fr] HYD_pmcd_pmip_control_cmd_cb
(./pm/pmiserv/pmip_cb.c:906): assert (!closed) failed
[proxy:0:19 at parapluie-30.rennes.grid5000.fr] HYDT_dmxu_poll_wait_for_event
(./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:19 at parapluie-30.rennes.grid5000.fr] main (./pm/pmiserv/pmip.c:226):
demux engine error waiting for event
[mpiexec at parapluie-2.rennes.grid5000.fr] HYDT_bscu_wait_for_completion
(./tools/bootstrap/utils/bscu_wait.c:70): one of the processes terminated
badly; aborting
[mpiexec at parapluie-2.rennes.grid5000.fr] HYDT_bsci_wait_for_completion
(./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for
completion
[mpiexec at parapluie-2.rennes.grid5000.fr] HYD_pmci_wait_for_completion
(./pm/pmiserv/pmiserv_pmci.c:189): launcher returned error waiting for
completion
[mpiexec at parapluie-2.rennes.grid5000.fr] main (./ui/mpich/mpiexec.c:397):
process manager error waiting for completion


This is the rankfile:
parapluie-33.rennes.grid5000.fr:24
parapluie-22.rennes.grid5000.fr:24
parapluie-7.rennes.grid5000.fr:24
parapluie-31.rennes.grid5000.fr:24
parapluie-20.rennes.grid5000.fr:24
parapluie-5.rennes.grid5000.fr:24
parapluie-3.rennes.grid5000.fr:24
parapluie-19.rennes.grid5000.fr:24
parapluie-38.rennes.grid5000.fr:24
parapluie-28.rennes.grid5000.fr:24
parapluie-16.rennes.grid5000.fr:24
parapluie-25.rennes.grid5000.fr:24
parapluie-14.rennes.grid5000.fr:24
parapluie-34.rennes.grid5000.fr:24
parapluie-23.rennes.grid5000.fr:24
parapluie-8.rennes.grid5000.fr:24
parapluie-32.rennes.grid5000.fr:24
parapluie-21.rennes.grid5000.fr:24
parapluie-6.rennes.grid5000.fr:24
parapluie-30.rennes.grid5000.fr:24
parapluie-2.rennes.grid5000.fr:24
parapluie-40.rennes.grid5000.fr:24
parapluie-29.rennes.grid5000.fr:24
parapluie-37.rennes.grid5000.fr:24
parapluie-27.rennes.grid5000.fr:24
parapluie-15.rennes.grid5000.fr:24
parapluie-35.rennes.grid5000.fr:24
parapluie-24.rennes.grid5000.fr:24
parapluie-13.rennes.grid5000.fr:24
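
As a rough cross-check of the layout, here is a minimal Python sketch that maps ranks to hosts. It assumes the launcher fills the hosts in file order and gives each host as many consecutive ranks as the number after the colon. That placement rule is my assumption, and the helper names (parse_hostfile, host_of_rank, hosts_needed) are only illustrative; they are not part of MVAPICH2 or Hydra.

```python
def parse_hostfile(text):
    """Parse 'host:slots' lines into a list of (host, slots) pairs."""
    hosts = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        host, _, slots = line.partition(":")
        hosts.append((host, int(slots) if slots else 1))
    return hosts


def host_of_rank(hosts, rank):
    """Return the host for a rank, ASSUMING block placement in file order."""
    for host, slots in hosts:
        if rank < slots:
            return host
        rank -= slots
    raise ValueError("rank exceeds total slots in hostfile")


def hosts_needed(hosts, nprocs):
    """Count how many hosts are touched when launching nprocs processes."""
    used = 0
    for _, slots in hosts:
        if nprocs <= 0:
            break
        nprocs -= slots
        used += 1
    return used


# First three entries of the rankfile above, used as sample input.
SAMPLE = """\
parapluie-33.rennes.grid5000.fr:24
parapluie-22.rennes.grid5000.fr:24
parapluie-7.rennes.grid5000.fr:24
"""

hosts = parse_hostfile(SAMPLE)
print(host_of_rank(hosts, 23))   # last rank placed on the first host
print(host_of_rank(hosts, 24))   # first rank placed on the second host
print(hosts_needed(hosts, 48))   # number of hosts used by 48 processes
```

Under that assumption, 480 processes at 24 per host occupy the first 20 entries of the rankfile, and ranks 1, 2, 10, 18, 21 and 23 all land on parapluie-33, which matches the hostname in the bus error messages.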

If the test uses fewer than 19 nodes, it works perfectly. With 20 or more
nodes (480 or more processes), it fails as shown above.
I ran the same test with mpich2-1.4 on another cluster (42 nodes,
24 cores per node). It does not show any binding problem, no matter how many
nodes are in the communicator.


Teng Ma