[mvapich-discuss] Cyclic ranking error in mvapich2-2.3.4

Mohsen Gavahi gavahi.hw at gmail.com
Wed Jul 15 15:05:23 EDT 2020


Hello,

I am a grad student in Computer Science at FSU.
I tested mvapich2-2.3.2 and mvapich2-2.3.4 with the following host
file, which places ranks cyclically across four nodes:

inv38:1
inv34:1
inv35:1
inv36:1
inv38:1
inv34:1
inv35:1
inv36:1
inv38:1
inv34:1
inv35:1
inv36:1
inv38:1
inv34:1
inv35:1
inv36:1
inv38:1
inv34:1
inv35:1
inv36:1
inv38:1
inv34:1
inv35:1
inv36:1
inv38:1
inv34:1
inv35:1
inv36:1
inv38:1
inv34:1
inv35:1
inv36:1
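With each hostfile line carrying ":1" (one slot per entry), the launcher fills entries in order, so ranks end up cycling over the four nodes. The following short Python sketch (my own illustration, not MVAPICH2 source code) shows the rank-to-host mapping this hostfile should produce:

```python
# Sketch (assumption, not taken from Hydra's implementation): with one
# slot per hostfile line, rank i lands on the host named on line i.
hostfile = ["inv38", "inv34", "inv35", "inv36"] * 8  # the 32-line file above

def rank_to_host(rank, hosts=hostfile):
    """Return the host that should run the given MPI rank."""
    return hosts[rank % len(hosts)]

# Ranks cycle across the nodes: 0 -> inv38, 1 -> inv34, 2 -> inv35,
# 3 -> inv36, 4 -> inv38, and so on.
print([rank_to_host(r) for r in range(8)])
```

So each node should host 8 of the 32 ranks, interleaved rather than blocked.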

In both versions, I used the hellow.c file located in the
mvapich/examples directory.
With *mvapich2-2.3.2* it works successfully, but *mvapich2-2.3.4*
sometimes runs with unrealistically high latency, or it prints a
warning message followed by errors.
The output for both versions is pasted at the end of this message.


gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-36)

Linux inv32 3.10.0-957.21.3.el7.x86_64 #1 SMP Tue Jun 18 16:35:19 UTC 2019
x86_64 x86_64 x86_64 GNU/Linux

hca_id: mlx4_0
        transport:                      InfiniBand (0)
        fw_ver:                         2.9.1000
        node_guid:                      0002:c903:000e:8bd8
        sys_image_guid:                 0002:c903:000e:8bdb
        vendor_id:                      0x02c9
        vendor_part_id:                 26428
        hw_ver:                         0xB0
        board_id:                       MT_0FC0110009
        phys_port_cnt:                  2
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 13
                        port_lid:               26
                        port_lmc:               0x00
                        link_layer:             InfiniBand

                port:   2
                        state:                  PORT_DOWN (1)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             InfiniBand



/***********************************************************************/

/mvapich2-2.3.2/install/bin/mpiexec -n 32 -f 32-nole ./hellow

Hello world from process 11 of 32
Hello world from process 0 of 32
Hello world from process 12 of 32
Hello world from process 5 of 32
Hello world from process 14 of 32
Hello world from process 20 of 32
Hello world from process 18 of 32
Hello world from process 24 of 32
Hello world from process 28 of 32
Hello world from process 6 of 32
Hello world from process 2 of 32
Hello world from process 13 of 32
Hello world from process 9 of 32
Hello world from process 3 of 32
Hello world from process 7 of 32
Hello world from process 15 of 32
Hello world from process 8 of 32
Hello world from process 4 of 32
Hello world from process 1 of 32
Hello world from process 10 of 32
Hello world from process 21 of 32
Hello world from process 30 of 32
Hello world from process 25 of 32
Hello world from process 17 of 32
Hello world from process 16 of 32
Hello world from process 29 of 32
Hello world from process 22 of 32
Hello world from process 31 of 32
Hello world from process 26 of 32
Hello world from process 27 of 32
Hello world from process 23 of 32
Hello world from process 19 of 32

/***********************************************************************/

/mvapich2-2.3.4/install/bin/mpiexec -n 32 -f 32-nole ./hellow

[ns01:mpi_rank_0][rdma_open_hca] [Warning] Setting the  multirail policy to
MV2_MRAIL_SHARING since RDMA_CM based multicast  is enabled.
[ns02:mpi_rank_5][error_sighandler] Caught error: Bus error (signal 7)
[ns01:mpi_rank_0][error_sighandler] Caught error: Bus error (signal 7)
[ns01:mpi_rank_16][error_sighandler] Caught error: Bus error (signal 7)

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 454834 RUNNING AT ns02
=   EXIT CODE: 7
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:0 at ns01] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:911):
assert (!closed) failed
[proxy:0:0 at ns01] HYDT_dmxu_poll_wait_for_event
(tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:0 at ns01] main (pm/pmiserv/pmip.c:202): demux engine error waiting
for event
[proxy:0:1 at ns02] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:911):
assert (!closed) failed
[proxy:0:1 at ns02] HYDT_dmxu_poll_wait_for_event
(tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:1 at ns02] main (pm/pmiserv/pmip.c:202): demux engine error waiting
for event
[proxy:0:2 at ns03] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:911):
assert (!closed) failed
[proxy:0:2 at ns03] HYDT_dmxu_poll_wait_for_event
(tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:2 at ns03] main (pm/pmiserv/pmip.c:202): demux engine error waiting
for event
[proxy:0:3 at ns04] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:911):
assert (!closed) failed
[proxy:0:3 at ns04] HYDT_dmxu_poll_wait_for_event
(tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:3 at ns04] main (pm/pmiserv/pmip.c:202): demux engine error waiting
for event
[proxy:0:4 at ns01] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:911):
assert (!closed) failed
[proxy:0:4 at ns01] HYDT_dmxu_poll_wait_for_event
(tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:4 at ns01] main (pm/pmiserv/pmip.c:202): demux engine error waiting
for event
[proxy:0:6 at ns03] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:911):
assert (!closed) failed
[proxy:0:6 at ns03] HYDT_dmxu_poll_wait_for_event
(tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:6 at ns03] main (pm/pmiserv/pmip.c:202): demux engine error waiting
for event
[proxy:0:7 at ns04] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:911):
assert (!closed) failed
[proxy:0:7 at ns04] HYDT_dmxu_poll_wait_for_event
(tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:7 at ns04] main (pm/pmiserv/pmip.c:202): demux engine error waiting
for event
[proxy:0:8 at ns01] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:911):
assert (!closed) failed
[proxy:0:8 at ns01] HYDT_dmxu_poll_wait_for_event
(tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:8 at ns01] main (pm/pmiserv/pmip.c:202): demux engine error waiting
for event
[proxy:0:9 at ns02] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:911):
assert (!closed) failed
[proxy:0:9 at ns02] HYDT_dmxu_poll_wait_for_event
(tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:9 at ns02] main (pm/pmiserv/pmip.c:202): demux engine error waiting
for event
[proxy:0:10 at ns03] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:911):
assert (!closed) failed
[proxy:0:10 at ns03] HYDT_dmxu_poll_wait_for_event
(tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:10 at ns03] main (pm/pmiserv/pmip.c:202): demux engine error waiting
for event
[proxy:0:11 at ns04] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:911):
assert (!closed) failed
[proxy:0:11 at ns04] HYDT_dmxu_poll_wait_for_event
(tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:11 at ns04] main (pm/pmiserv/pmip.c:202): demux engine error waiting
for event
[proxy:0:12 at ns01] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:911):
assert (!closed) failed
[proxy:0:12 at ns01] HYDT_dmxu_poll_wait_for_event
(tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:12 at ns01] main (pm/pmiserv/pmip.c:202): demux engine error waiting
for event
[proxy:0:13 at ns02] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:911):
assert (!closed) failed
[proxy:0:13 at ns02] HYDT_dmxu_poll_wait_for_event
(tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:13 at ns02] main (pm/pmiserv/pmip.c:202): demux engine error waiting
for event
[proxy:0:14 at ns03] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:911):
assert (!closed) failed
[proxy:0:14 at ns03] HYDT_dmxu_poll_wait_for_event
(tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:14 at ns03] main (pm/pmiserv/pmip.c:202): demux engine error waiting
for event
[proxy:0:15 at ns04] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:911):
assert (!closed) failed
[proxy:0:15 at ns04] HYDT_dmxu_poll_wait_for_event
(tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:15 at ns04] main (pm/pmiserv/pmip.c:202): demux engine error waiting
for event
[proxy:0:16 at ns01] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:911):
assert (!closed) failed
[proxy:0:16 at ns01] HYDT_dmxu_poll_wait_for_event
(tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:16 at ns01] main (pm/pmiserv/pmip.c:202): demux engine error waiting
for event
[proxy:0:17 at ns02] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:911):
assert (!closed) failed
[proxy:0:17 at ns02] HYDT_dmxu_poll_wait_for_event
(tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:17 at ns02] main (pm/pmiserv/pmip.c:202): demux engine error waiting
for event
[proxy:0:18 at ns03] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:911):
assert (!closed) failed
[proxy:0:18 at ns03] HYDT_dmxu_poll_wait_for_event
(tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:18 at ns03] main (pm/pmiserv/pmip.c:202): demux engine error waiting
for event
[proxy:0:19 at ns04] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:911):
assert (!closed) failed
[proxy:0:19 at ns04] HYDT_dmxu_poll_wait_for_event
(tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:19 at ns04] main (pm/pmiserv/pmip.c:202): demux engine error waiting
for event
[proxy:0:20 at ns01] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:911):
assert (!closed) failed
[proxy:0:20 at ns01] HYDT_dmxu_poll_wait_for_event
(tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:20 at ns01] main (pm/pmiserv/pmip.c:202): demux engine error waiting
for event
[proxy:0:21 at ns02] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:911):
assert (!closed) failed
[proxy:0:21 at ns02] HYDT_dmxu_poll_wait_for_event
(tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:21 at ns02] main (pm/pmiserv/pmip.c:202): demux engine error waiting
for event
[proxy:0:22 at ns03] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:911):
assert (!closed) failed
[proxy:0:22 at ns03] HYDT_dmxu_poll_wait_for_event
(tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:22 at ns03] main (pm/pmiserv/pmip.c:202): demux engine error waiting
for event
[proxy:0:23 at ns04] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:911):
assert (!closed) failed
[proxy:0:23 at ns04] HYDT_dmxu_poll_wait_for_event
(tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:23 at ns04] main (pm/pmiserv/pmip.c:202): demux engine error waiting
for event
[proxy:0:24 at ns01] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:911):
assert (!closed) failed
[proxy:0:24 at ns01] HYDT_dmxu_poll_wait_for_event
(tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:24 at ns01] main (pm/pmiserv/pmip.c:202): demux engine error waiting
for event
[proxy:0:25 at ns02] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:911):
assert (!closed) failed
[proxy:0:25 at ns02] HYDT_dmxu_poll_wait_for_event
(tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:25 at ns02] main (pm/pmiserv/pmip.c:202): demux engine error waiting
for event
[proxy:0:26 at ns03] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:911):
assert (!closed) failed
[proxy:0:26 at ns03] HYDT_dmxu_poll_wait_for_event
(tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:26 at ns03] main (pm/pmiserv/pmip.c:202): demux engine error waiting
for event
[proxy:0:27 at ns04] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:911):
assert (!closed) failed
[proxy:0:27 at ns04] HYDT_dmxu_poll_wait_for_event
(tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:27 at ns04] main (pm/pmiserv/pmip.c:202): demux engine error waiting
for event
[proxy:0:28 at ns01] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:911):
assert (!closed) failed
[proxy:0:28 at ns01] HYDT_dmxu_poll_wait_for_event
(tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:28 at ns01] main (pm/pmiserv/pmip.c:202): demux engine error waiting
for event
[proxy:0:29 at ns02] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:911):
assert (!closed) failed
[proxy:0:29 at ns02] HYDT_dmxu_poll_wait_for_event
(tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:29 at ns02] main (pm/pmiserv/pmip.c:202): demux engine error waiting
for event
[proxy:0:30 at ns03] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:911):
assert (!closed) failed
[proxy:0:30 at ns03] HYDT_dmxu_poll_wait_for_event
(tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:30 at ns03] main (pm/pmiserv/pmip.c:202): demux engine error waiting
for event
[proxy:0:31 at ns04] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:911):
assert (!closed) failed
[proxy:0:31 at ns04] HYDT_dmxu_poll_wait_for_event
(tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:31 at ns04] main (pm/pmiserv/pmip.c:202): demux engine error waiting
for event
[mpiexec at inv32] HYDT_bscu_wait_for_completion
(tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated
badly; aborting
[mpiexec at inv32] HYDT_bsci_wait_for_completion
(tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for
completion
[mpiexec at inv32] HYD_pmci_wait_for_completion
(pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for
completion

Thank You!
Mohsen