[mvapich-discuss] process manager problem

Krishna Kandalla kandalla at cse.ohio-state.edu
Sun Jul 31 16:58:38 EDT 2011


Hi Teng,
         Thank you for reporting this issue. We just tried IMB Bcast on a
cluster with 32 cores per node, across 20 nodes, and we were unable to
reproduce this problem. Could you please indicate the configuration flags
you used to build and install MVAPICH2? It would also help if you could
capture the backtrace/core-dump information leading up to this failure.
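
For the configure flags, the output of "mpiname -a" from your
installation's bin directory should be sufficient. If core dumps are
disabled on your compute nodes, one alternative for the backtrace is to
install a small SIGBUS handler in the benchmark itself, after MPI_Init
(MVAPICH2 installs its own error handler during initialization, so a
handler installed earlier would be replaced). Below is a minimal sketch;
install_bt() and the 64-frame cap are only illustration, not an MVAPICH2
facility:

/* bt_handler.c -- illustrative SIGBUS backtrace dumper.
 * Compile into the benchmark with -g so frames resolve to symbols,
 * and call install_bt() once, right after MPI_Init(). */
#include <execinfo.h>
#include <signal.h>
#include <unistd.h>

static void bt_handler(int sig)
{
    void *frames[64];
    int n = backtrace(frames, 64);
    /* backtrace_symbols_fd() writes straight to the fd without calling
     * malloc(), so it is reasonably safe inside a signal handler. */
    backtrace_symbols_fd(frames, n, STDERR_FILENO);
    signal(sig, SIG_DFL);   /* restore the default action and re-raise */
    raise(sig);             /* so the process still dies with SIGBUS   */
}

void install_bt(void)
{
    signal(SIGBUS, bt_handler);
}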

Regards,
Krishna

On Sun, Jul 31, 2011 at 9:41 AM, teng ma <xiaok1981 at gmail.com> wrote:

> I used mvapich2-1.7rc.
>
> I ran IMB's Bcast test on 20 nodes (24 cores/node, 480 processes in all).
> It fails as follows:
>
> mpiexec -n 480 -f ~/rankfile ./IMB-MPI1 Bcast -npmin 480
> [parapluie-33.rennes.grid5000.fr:mpi_rank_1][error_sighandler] Caught
> error: Bus error (signal 7)
> [parapluie-33.rennes.grid5000.fr:mpi_rank_2][error_sighandler] Caught
> error: Bus error (signal 7)
> [parapluie-33.rennes.grid5000.fr:mpi_rank_10][error_sighandler] Caught
> error: Bus error (signal 7)
> [parapluie-33.rennes.grid5000.fr:mpi_rank_18][error_sighandler] Caught
> error: Bus error (signal 7)
> [parapluie-33.rennes.grid5000.fr:mpi_rank_21][error_sighandler] Caught
> error: Bus error (signal 7)
> [parapluie-33.rennes.grid5000.fr:mpi_rank_23][error_sighandler] Caught
> error: Bus error (signal 7)
>
>
> =====================================================================================
> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> =   EXIT CODE: 7
> =   CLEANING UP REMAINING PROCESSES
> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>
> =====================================================================================
> [proxy:0:1 at parapluie-22.rennes.grid5000.fr] HYD_pmcd_pmip_control_cmd_cb
> (./pm/pmiserv/pmip_cb.c:906): assert (!closed) failed
> [proxy:0:1 at parapluie-22.rennes.grid5000.fr] HYDT_dmxu_poll_wait_for_event
> (./tools/demux/demux_poll.c:77): callback returned error status
> [proxy:0:1 at parapluie-22.rennes.grid5000.fr] main
> (./pm/pmiserv/pmip.c:226): demux engine error waiting for event
> [proxy:0:2 at parapluie-7.rennes.grid5000.fr] HYD_pmcd_pmip_control_cmd_cb
> (./pm/pmiserv/pmip_cb.c:906): assert (!closed) failed
> [proxy:0:2 at parapluie-7.rennes.grid5000.fr] HYDT_dmxu_poll_wait_for_event
> (./tools/demux/demux_poll.c:77): callback returned error status
> [proxy:0:2 at parapluie-7.rennes.grid5000.fr] main (./pm/pmiserv/pmip.c:226):
> demux engine error waiting for event
> [proxy:0:3 at parapluie-31.rennes.grid5000.fr] HYD_pmcd_pmip_control_cmd_cb
> (./pm/pmiserv/pmip_cb.c:906): assert (!closed) failed
> [proxy:0:3 at parapluie-31.rennes.grid5000.fr] HYDT_dmxu_poll_wait_for_event
> (./tools/demux/demux_poll.c:77): callback returned error status
> [proxy:0:3 at parapluie-31.rennes.grid5000.fr] main
> (./pm/pmiserv/pmip.c:226): demux engine error waiting for event
> [proxy:0:4 at parapluie-20.rennes.grid5000.fr] HYD_pmcd_pmip_control_cmd_cb
> (./pm/pmiserv/pmip_cb.c:906): assert (!closed) failed
> [proxy:0:4 at parapluie-20.rennes.grid5000.fr] HYDT_dmxu_poll_wait_for_event
> (./tools/demux/demux_poll.c:77): callback returned error status
> [proxy:0:4 at parapluie-20.rennes.grid5000.fr] main
> (./pm/pmiserv/pmip.c:226): demux engine error waiting for event
> [proxy:0:5 at parapluie-5.rennes.grid5000.fr] HYD_pmcd_pmip_control_cmd_cb
> (./pm/pmiserv/pmip_cb.c:906): assert (!closed) failed
> [proxy:0:5 at parapluie-5.rennes.grid5000.fr] HYDT_dmxu_poll_wait_for_event
> (./tools/demux/demux_poll.c:77): callback returned error status
> [proxy:0:5 at parapluie-5.rennes.grid5000.fr] main (./pm/pmiserv/pmip.c:226):
> demux engine error waiting for event
> [proxy:0:6 at parapluie-3.rennes.grid5000.fr] HYD_pmcd_pmip_control_cmd_cb
> (./pm/pmiserv/pmip_cb.c:906): assert (!closed) failed
> [proxy:0:6 at parapluie-3.rennes.grid5000.fr] HYDT_dmxu_poll_wait_for_event
> (./tools/demux/demux_poll.c:77): callback returned error status
> [proxy:0:6 at parapluie-3.rennes.grid5000.fr] main (./pm/pmiserv/pmip.c:226):
> demux engine error waiting for event
> [proxy:0:7 at parapluie-19.rennes.grid5000.fr] HYD_pmcd_pmip_control_cmd_cb
> (./pm/pmiserv/pmip_cb.c:906): assert (!closed) failed
> [proxy:0:7 at parapluie-19.rennes.grid5000.fr] HYDT_dmxu_poll_wait_for_event
> (./tools/demux/demux_poll.c:77): callback returned error status
> [proxy:0:7 at parapluie-19.rennes.grid5000.fr] main
> (./pm/pmiserv/pmip.c:226): demux engine error waiting for event
> [proxy:0:8 at parapluie-38.rennes.grid5000.fr] HYD_pmcd_pmip_control_cmd_cb
> (./pm/pmiserv/pmip_cb.c:906): assert (!closed) failed
> [proxy:0:8 at parapluie-38.rennes.grid5000.fr] HYDT_dmxu_poll_wait_for_event
> (./tools/demux/demux_poll.c:77): callback returned error status
> [proxy:0:8 at parapluie-38.rennes.grid5000.fr] main
> (./pm/pmiserv/pmip.c:226): demux engine error waiting for event
> [proxy:0:9 at parapluie-28.rennes.grid5000.fr] HYD_pmcd_pmip_control_cmd_cb
> (./pm/pmiserv/pmip_cb.c:906): assert (!closed) failed
> [proxy:0:9 at parapluie-28.rennes.grid5000.fr] HYDT_dmxu_poll_wait_for_event
> (./tools/demux/demux_poll.c:77): callback returned error status
> [proxy:0:9 at parapluie-28.rennes.grid5000.fr] main
> (./pm/pmiserv/pmip.c:226): demux engine error waiting for event
> [proxy:0:10 at parapluie-16.rennes.grid5000.fr] HYD_pmcd_pmip_control_cmd_cb
> (./pm/pmiserv/pmip_cb.c:906): assert (!closed) failed
> [proxy:0:10 at parapluie-16.rennes.grid5000.fr] HYDT_dmxu_poll_wait_for_event
> (./tools/demux/demux_poll.c:77): callback returned error status
> [proxy:0:10 at parapluie-16.rennes.grid5000.fr] main
> (./pm/pmiserv/pmip.c:226): demux engine error waiting for event
> [proxy:0:11 at parapluie-25.rennes.grid5000.fr] HYD_pmcd_pmip_control_cmd_cb
> (./pm/pmiserv/pmip_cb.c:906): assert (!closed) failed
> [proxy:0:11 at parapluie-25.rennes.grid5000.fr] HYDT_dmxu_poll_wait_for_event
> (./tools/demux/demux_poll.c:77): callback returned error status
> [proxy:0:11 at parapluie-25.rennes.grid5000.fr] main
> (./pm/pmiserv/pmip.c:226): demux engine error waiting for event
> [proxy:0:12 at parapluie-14.rennes.grid5000.fr] HYD_pmcd_pmip_control_cmd_cb
> (./pm/pmiserv/pmip_cb.c:906): assert (!closed) failed
> [proxy:0:12 at parapluie-14.rennes.grid5000.fr] HYDT_dmxu_poll_wait_for_event
> (./tools/demux/demux_poll.c:77): callback returned error status
> [proxy:0:12 at parapluie-14.rennes.grid5000.fr] main
> (./pm/pmiserv/pmip.c:226): demux engine error waiting for event
> [proxy:0:13 at parapluie-34.rennes.grid5000.fr] HYD_pmcd_pmip_control_cmd_cb
> (./pm/pmiserv/pmip_cb.c:906): assert (!closed) failed
> [proxy:0:13 at parapluie-34.rennes.grid5000.fr] HYDT_dmxu_poll_wait_for_event
> (./tools/demux/demux_poll.c:77): callback returned error status
> [proxy:0:13 at parapluie-34.rennes.grid5000.fr] main
> (./pm/pmiserv/pmip.c:226): demux engine error waiting for event
> [proxy:0:14 at parapluie-23.rennes.grid5000.fr] HYD_pmcd_pmip_control_cmd_cb
> (./pm/pmiserv/pmip_cb.c:906): assert (!closed) failed
> [proxy:0:14 at parapluie-23.rennes.grid5000.fr] HYDT_dmxu_poll_wait_for_event
> (./tools/demux/demux_poll.c:77): callback returned error status
> [proxy:0:14 at parapluie-23.rennes.grid5000.fr] main
> (./pm/pmiserv/pmip.c:226): demux engine error waiting for event
> [proxy:0:15 at parapluie-8.rennes.grid5000.fr] HYD_pmcd_pmip_control_cmd_cb
> (./pm/pmiserv/pmip_cb.c:906): assert (!closed) failed
> [proxy:0:15 at parapluie-8.rennes.grid5000.fr] HYDT_dmxu_poll_wait_for_event
> (./tools/demux/demux_poll.c:77): callback returned error status
> [proxy:0:15 at parapluie-8.rennes.grid5000.fr] main
> (./pm/pmiserv/pmip.c:226): demux engine error waiting for event
> [proxy:0:16 at parapluie-32.rennes.grid5000.fr] HYD_pmcd_pmip_control_cmd_cb
> (./pm/pmiserv/pmip_cb.c:906): assert (!closed) failed
> [proxy:0:16 at parapluie-32.rennes.grid5000.fr] HYDT_dmxu_poll_wait_for_event
> (./tools/demux/demux_poll.c:77): callback returned error status
> [proxy:0:16 at parapluie-32.rennes.grid5000.fr] main
> (./pm/pmiserv/pmip.c:226): demux engine error waiting for event
> [proxy:0:17 at parapluie-21.rennes.grid5000.fr] HYD_pmcd_pmip_control_cmd_cb
> (./pm/pmiserv/pmip_cb.c:906): assert (!closed) failed
> [proxy:0:17 at parapluie-21.rennes.grid5000.fr] HYDT_dmxu_poll_wait_for_event
> (./tools/demux/demux_poll.c:77): callback returned error status
> [proxy:0:17 at parapluie-21.rennes.grid5000.fr] main
> (./pm/pmiserv/pmip.c:226): demux engine error waiting for event
> [proxy:0:18 at parapluie-6.rennes.grid5000.fr] HYD_pmcd_pmip_control_cmd_cb
> (./pm/pmiserv/pmip_cb.c:906): assert (!closed) failed
> [proxy:0:18 at parapluie-6.rennes.grid5000.fr] HYDT_dmxu_poll_wait_for_event
> (./tools/demux/demux_poll.c:77): callback returned error status
> [proxy:0:18 at parapluie-6.rennes.grid5000.fr] main
> (./pm/pmiserv/pmip.c:226): demux engine error waiting for event
> [proxy:0:19 at parapluie-30.rennes.grid5000.fr] HYD_pmcd_pmip_control_cmd_cb
> (./pm/pmiserv/pmip_cb.c:906): assert (!closed) failed
> [proxy:0:19 at parapluie-30.rennes.grid5000.fr] HYDT_dmxu_poll_wait_for_event
> (./tools/demux/demux_poll.c:77): callback returned error status
> [proxy:0:19 at parapluie-30.rennes.grid5000.fr] main
> (./pm/pmiserv/pmip.c:226): demux engine error waiting for event
> [mpiexec at parapluie-2.rennes.grid5000.fr] HYDT_bscu_wait_for_completion
> (./tools/bootstrap/utils/bscu_wait.c:70): one of the processes terminated
> badly; aborting
> [mpiexec at parapluie-2.rennes.grid5000.fr] HYDT_bsci_wait_for_completion
> (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for
> completion
> [mpiexec at parapluie-2.rennes.grid5000.fr] HYD_pmci_wait_for_completion
> (./pm/pmiserv/pmiserv_pmci.c:189): launcher returned error waiting for
> completion
> [mpiexec at parapluie-2.rennes.grid5000.fr] main (./ui/mpich/mpiexec.c:397):
> process manager error waiting for completion
>
>
> this is the rankfile:
> parapluie-33.rennes.grid5000.fr:24
> parapluie-22.rennes.grid5000.fr:24
> parapluie-7.rennes.grid5000.fr:24
> parapluie-31.rennes.grid5000.fr:24
> parapluie-20.rennes.grid5000.fr:24
> parapluie-5.rennes.grid5000.fr:24
> parapluie-3.rennes.grid5000.fr:24
> parapluie-19.rennes.grid5000.fr:24
> parapluie-38.rennes.grid5000.fr:24
> parapluie-28.rennes.grid5000.fr:24
> parapluie-16.rennes.grid5000.fr:24
> parapluie-25.rennes.grid5000.fr:24
> parapluie-14.rennes.grid5000.fr:24
> parapluie-34.rennes.grid5000.fr:24
> parapluie-23.rennes.grid5000.fr:24
> parapluie-8.rennes.grid5000.fr:24
> parapluie-32.rennes.grid5000.fr:24
> parapluie-21.rennes.grid5000.fr:24
> parapluie-6.rennes.grid5000.fr:24
> parapluie-30.rennes.grid5000.fr:24
> parapluie-2.rennes.grid5000.fr:24
> parapluie-40.rennes.grid5000.fr:24
> parapluie-29.rennes.grid5000.fr:24
> parapluie-37.rennes.grid5000.fr:24
> parapluie-27.rennes.grid5000.fr:24
> parapluie-15.rennes.grid5000.fr:24
> parapluie-35.rennes.grid5000.fr:24
> parapluie-24.rennes.grid5000.fr:24
> parapluie-13.rennes.grid5000.fr:24
>
> With fewer than 19 nodes the test works perfectly; with 20 or more nodes
> (480 processes) it fails as shown above.
> I ran the same test with MPICH2 1.4 on another cluster (42 nodes,
> 24 cores/node) and saw no such problem, no matter how many nodes were in
> the communicator. A minimal stand-alone reproducer is sketched below.
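>
> To rule IMB out, a minimal stand-alone broadcast test could show whether
> plain MPI_Bcast at 480 processes hits the same bus error. This is only a
> sketch: the file name, the 4 MB upper bound, and the 100 iterations are
> arbitrary choices, and the size sweep only roughly approximates what IMB
> Bcast exercises.
>
> /* bcast_min.c -- minimal MPI_Bcast stress test (illustrative, not IMB).
>  * Build: mpicc -g -o bcast_min bcast_min.c
>  * Run:   mpiexec -n 480 -f ~/rankfile ./bcast_min
>  */
> #include <mpi.h>
> #include <stdio.h>
> #include <stdlib.h>
>
> int main(int argc, char **argv)
> {
>     int rank, iter;
>     long bytes;
>
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>
>     /* Sweep message sizes from 1 byte to 4 MB, doubling each step. */
>     for (bytes = 1; bytes <= (4L << 20); bytes <<= 1) {
>         char *buf = malloc(bytes);
>         if (!buf) MPI_Abort(MPI_COMM_WORLD, 1);
>         for (iter = 0; iter < 100; iter++)
>             MPI_Bcast(buf, (int)bytes, MPI_CHAR, 0, MPI_COMM_WORLD);
>         free(buf);
>         if (rank == 0) printf("%ld bytes OK\n", bytes);
>     }
>
>     MPI_Finalize();
>     return 0;
> }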
>
>
> Teng Ma
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>
>