[mvapich-discuss] Cyclic ranking error in mvapich2-2.3.4
Mohsen Gavahi
gavahi.hw at gmail.com
Wed Jul 15 15:05:23 EDT 2020
Hello,
I am a grad student in Computer Science at FSU.
I tested mvapich2-2.3.2 and mvapich2-2.3.4 with the following host file,
which cycles through four nodes, one process slot per line:
inv38:1
inv34:1
inv35:1
inv36:1
inv38:1
inv34:1
inv35:1
inv36:1
inv38:1
inv34:1
inv35:1
inv36:1
inv38:1
inv34:1
inv35:1
inv36:1
inv38:1
inv34:1
inv35:1
inv36:1
inv38:1
inv34:1
inv35:1
inv36:1
inv38:1
inv34:1
inv35:1
inv36:1
inv38:1
inv34:1
inv35:1
inv36:1
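For reference, here is a minimal Python sketch that regenerates the host file above and shows which ranks would land on each node. It assumes the launcher assigns rank i to the i-th host file line; that is an assumption about the launcher's behaviour, not something confirmed by the output below. The function names are illustrative only.

```python
from collections import defaultdict

# Node names copied from the host file above; 8 rounds of 4 nodes = 32 lines.
NODES = ["inv38", "inv34", "inv35", "inv36"]
ROUNDS = 8


def cyclic_hostfile(nodes, rounds):
    """Return host file lines of the form 'host:1', cycling through nodes."""
    return [f"{node}:1" for _ in range(rounds) for node in nodes]


def ranks_by_host(lines):
    """Group rank numbers by host.

    ASSUMPTION: rank i is placed on the host named on line i of the
    host file. This mirrors the intent of a cyclic layout but has not
    been verified against either MVAPICH2 version.
    """
    mapping = defaultdict(list)
    for rank, line in enumerate(lines):
        host = line.split(":")[0]
        mapping[host].append(rank)
    return dict(mapping)


if __name__ == "__main__":
    lines = cyclic_hostfile(NODES, ROUNDS)
    print("\n".join(lines))
    for host, ranks in ranks_by_host(lines).items():
        print(host, ranks)
```

Under that assumption, each node receives every fourth rank (for example, the first node in the file gets ranks 0, 4, 8, and so on), which is the cyclic ranking referred to in the subject line.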
With both versions, I ran the hellow.c example from the mvapich/examples
directory.
With *mvapich2-2.3.2*, it runs successfully.
With *mvapich2-2.3.4*, however, it sometimes runs with unrealistically high
latency, or it prints a warning message followed by errors.
The output from both versions is pasted at the end of this message.
My environment (compiler, kernel, and HCA information) is as follows:
gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-36)
Linux inv32 3.10.0-957.21.3.el7.x86_64 #1 SMP Tue Jun 18 16:35:19 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
hca_id: mlx4_0
    transport: InfiniBand (0)
    fw_ver: 2.9.1000
    node_guid: 0002:c903:000e:8bd8
    sys_image_guid: 0002:c903:000e:8bdb
    vendor_id: 0x02c9
    vendor_part_id: 26428
    hw_ver: 0xB0
    board_id: MT_0FC0110009
    phys_port_cnt: 2
        port: 1
            state: PORT_ACTIVE (4)
            max_mtu: 4096 (5)
            active_mtu: 4096 (5)
            sm_lid: 13
            port_lid: 26
            port_lmc: 0x00
            link_layer: InfiniBand
        port: 2
            state: PORT_DOWN (1)
            max_mtu: 4096 (5)
            active_mtu: 4096 (5)
            sm_lid: 0
            port_lid: 0
            port_lmc: 0x00
            link_layer: InfiniBand
/***********************************************************************/
/mvapich2-2.3.2/install/bin/mpiexec -n 32 -f 32-nole ./hellow
Hello world from process 11 of 32
Hello world from process 0 of 32
Hello world from process 12 of 32
Hello world from process 5 of 32
Hello world from process 14 of 32
Hello world from process 20 of 32
Hello world from process 18 of 32
Hello world from process 24 of 32
Hello world from process 28 of 32
Hello world from process 6 of 32
Hello world from process 2 of 32
Hello world from process 13 of 32
Hello world from process 9 of 32
Hello world from process 3 of 32
Hello world from process 7 of 32
Hello world from process 15 of 32
Hello world from process 8 of 32
Hello world from process 4 of 32
Hello world from process 1 of 32
Hello world from process 10 of 32
Hello world from process 21 of 32
Hello world from process 30 of 32
Hello world from process 25 of 32
Hello world from process 17 of 32
Hello world from process 16 of 32
Hello world from process 29 of 32
Hello world from process 22 of 32
Hello world from process 31 of 32
Hello world from process 26 of 32
Hello world from process 27 of 32
Hello world from process 23 of 32
Hello world from process 19 of 32
/***********************************************************************/
/mvapich2-2.3.4/install/bin/mpiexec -n 32 -f 32-nole ./hellow
[ns01:mpi_rank_0][rdma_open_hca] [Warning] Setting the multirail policy to
MV2_MRAIL_SHARING since RDMA_CM based multicast is enabled.
[ns02:mpi_rank_5][error_sighandler] Caught error: Bus error (signal 7)
[ns01:mpi_rank_0][error_sighandler] Caught error: Bus error (signal 7)
[ns01:mpi_rank_16][error_sighandler] Caught error: Bus error (signal 7)
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 454834 RUNNING AT ns02
= EXIT CODE: 7
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:0 at ns01] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:911):
assert (!closed) failed
[proxy:0:0 at ns01] HYDT_dmxu_poll_wait_for_event
(tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:0 at ns01] main (pm/pmiserv/pmip.c:202): demux engine error waiting
for event
[proxy:0:1 at ns02] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:911):
assert (!closed) failed
[proxy:0:1 at ns02] HYDT_dmxu_poll_wait_for_event
(tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:1 at ns02] main (pm/pmiserv/pmip.c:202): demux engine error waiting
for event
[proxy:0:2 at ns03] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:911):
assert (!closed) failed
[proxy:0:2 at ns03] HYDT_dmxu_poll_wait_for_event
(tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:2 at ns03] main (pm/pmiserv/pmip.c:202): demux engine error waiting
for event
[proxy:0:3 at ns04] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:911):
assert (!closed) failed
[proxy:0:3 at ns04] HYDT_dmxu_poll_wait_for_event
(tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:3 at ns04] main (pm/pmiserv/pmip.c:202): demux engine error waiting
for event
[proxy:0:4 at ns01] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:911):
assert (!closed) failed
[proxy:0:4 at ns01] HYDT_dmxu_poll_wait_for_event
(tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:4 at ns01] main (pm/pmiserv/pmip.c:202): demux engine error waiting
for event
[proxy:0:6 at ns03] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:911):
assert (!closed) failed
[proxy:0:6 at ns03] HYDT_dmxu_poll_wait_for_event
(tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:6 at ns03] main (pm/pmiserv/pmip.c:202): demux engine error waiting
for event
[proxy:0:7 at ns04] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:911):
assert (!closed) failed
[proxy:0:7 at ns04] HYDT_dmxu_poll_wait_for_event
(tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:7 at ns04] main (pm/pmiserv/pmip.c:202): demux engine error waiting
for event
[proxy:0:8 at ns01] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:911):
assert (!closed) failed
[proxy:0:8 at ns01] HYDT_dmxu_poll_wait_for_event
(tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:8 at ns01] main (pm/pmiserv/pmip.c:202): demux engine error waiting
for event
[proxy:0:9 at ns02] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:911):
assert (!closed) failed
[proxy:0:9 at ns02] HYDT_dmxu_poll_wait_for_event
(tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:9 at ns02] main (pm/pmiserv/pmip.c:202): demux engine error waiting
for event
[proxy:0:10 at ns03] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:911):
assert (!closed) failed
[proxy:0:10 at ns03] HYDT_dmxu_poll_wait_for_event
(tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:10 at ns03] main (pm/pmiserv/pmip.c:202): demux engine error waiting
for event
[proxy:0:11 at ns04] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:911):
assert (!closed) failed
[proxy:0:11 at ns04] HYDT_dmxu_poll_wait_for_event
(tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:11 at ns04] main (pm/pmiserv/pmip.c:202): demux engine error waiting
for event
[proxy:0:12 at ns01] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:911):
assert (!closed) failed
[proxy:0:12 at ns01] HYDT_dmxu_poll_wait_for_event
(tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:12 at ns01] main (pm/pmiserv/pmip.c:202): demux engine error waiting
for event
[proxy:0:13 at ns02] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:911):
assert (!closed) failed
[proxy:0:13 at ns02] HYDT_dmxu_poll_wait_for_event
(tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:13 at ns02] main (pm/pmiserv/pmip.c:202): demux engine error waiting
for event
[proxy:0:14 at ns03] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:911):
assert (!closed) failed
[proxy:0:14 at ns03] HYDT_dmxu_poll_wait_for_event
(tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:14 at ns03] main (pm/pmiserv/pmip.c:202): demux engine error waiting
for event
[proxy:0:15 at ns04] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:911):
assert (!closed) failed
[proxy:0:15 at ns04] HYDT_dmxu_poll_wait_for_event
(tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:15 at ns04] main (pm/pmiserv/pmip.c:202): demux engine error waiting
for event
[proxy:0:16 at ns01] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:911):
assert (!closed) failed
[proxy:0:16 at ns01] HYDT_dmxu_poll_wait_for_event
(tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:16 at ns01] main (pm/pmiserv/pmip.c:202): demux engine error waiting
for event
[proxy:0:17 at ns02] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:911):
assert (!closed) failed
[proxy:0:17 at ns02] HYDT_dmxu_poll_wait_for_event
(tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:17 at ns02] main (pm/pmiserv/pmip.c:202): demux engine error waiting
for event
[proxy:0:18 at ns03] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:911):
assert (!closed) failed
[proxy:0:18 at ns03] HYDT_dmxu_poll_wait_for_event
(tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:18 at ns03] main (pm/pmiserv/pmip.c:202): demux engine error waiting
for event
[proxy:0:19 at ns04] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:911):
assert (!closed) failed
[proxy:0:19 at ns04] HYDT_dmxu_poll_wait_for_event
(tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:19 at ns04] main (pm/pmiserv/pmip.c:202): demux engine error waiting
for event
[proxy:0:20 at ns01] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:911):
assert (!closed) failed
[proxy:0:20 at ns01] HYDT_dmxu_poll_wait_for_event
(tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:20 at ns01] main (pm/pmiserv/pmip.c:202): demux engine error waiting
for event
[proxy:0:21 at ns02] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:911):
assert (!closed) failed
[proxy:0:21 at ns02] HYDT_dmxu_poll_wait_for_event
(tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:21 at ns02] main (pm/pmiserv/pmip.c:202): demux engine error waiting
for event
[proxy:0:22 at ns03] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:911):
assert (!closed) failed
[proxy:0:22 at ns03] HYDT_dmxu_poll_wait_for_event
(tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:22 at ns03] main (pm/pmiserv/pmip.c:202): demux engine error waiting
for event
[proxy:0:23 at ns04] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:911):
assert (!closed) failed
[proxy:0:23 at ns04] HYDT_dmxu_poll_wait_for_event
(tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:23 at ns04] main (pm/pmiserv/pmip.c:202): demux engine error waiting
for event
[proxy:0:24 at ns01] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:911):
assert (!closed) failed
[proxy:0:24 at ns01] HYDT_dmxu_poll_wait_for_event
(tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:24 at ns01] main (pm/pmiserv/pmip.c:202): demux engine error waiting
for event
[proxy:0:25 at ns02] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:911):
assert (!closed) failed
[proxy:0:25 at ns02] HYDT_dmxu_poll_wait_for_event
(tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:25 at ns02] main (pm/pmiserv/pmip.c:202): demux engine error waiting
for event
[proxy:0:26 at ns03] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:911):
assert (!closed) failed
[proxy:0:26 at ns03] HYDT_dmxu_poll_wait_for_event
(tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:26 at ns03] main (pm/pmiserv/pmip.c:202): demux engine error waiting
for event
[proxy:0:27 at ns04] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:911):
assert (!closed) failed
[proxy:0:27 at ns04] HYDT_dmxu_poll_wait_for_event
(tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:27 at ns04] main (pm/pmiserv/pmip.c:202): demux engine error waiting
for event
[proxy:0:28 at ns01] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:911):
assert (!closed) failed
[proxy:0:28 at ns01] HYDT_dmxu_poll_wait_for_event
(tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:28 at ns01] main (pm/pmiserv/pmip.c:202): demux engine error waiting
for event
[proxy:0:29 at ns02] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:911):
assert (!closed) failed
[proxy:0:29 at ns02] HYDT_dmxu_poll_wait_for_event
(tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:29 at ns02] main (pm/pmiserv/pmip.c:202): demux engine error waiting
for event
[proxy:0:30 at ns03] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:911):
assert (!closed) failed
[proxy:0:30 at ns03] HYDT_dmxu_poll_wait_for_event
(tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:30 at ns03] main (pm/pmiserv/pmip.c:202): demux engine error waiting
for event
[proxy:0:31 at ns04] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:911):
assert (!closed) failed
[proxy:0:31 at ns04] HYDT_dmxu_poll_wait_for_event
(tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:31 at ns04] main (pm/pmiserv/pmip.c:202): demux engine error waiting
for event
[mpiexec at inv32] HYDT_bscu_wait_for_completion
(tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated
badly; aborting
[mpiexec at inv32] HYDT_bsci_wait_for_completion
(tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for
completion
[mpiexec at inv32] HYD_pmci_wait_for_completion
(pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for
completion
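Since most of the output above is cleanup noise from the launcher, here is a small Python sketch for pulling out only the lines that report a crashed rank. The regular expression is written against the three `error_sighandler` lines shown in this log; the function name and the assumption that other signals are reported in the same format are mine.

```python
import re

# Pattern modelled on the crash lines in the log above, e.g.
# [ns02:mpi_rank_5][error_sighandler] Caught error: Bus error (signal 7)
CRASH_RE = re.compile(
    r"\[(?P<host>[^:\]]+):mpi_rank_(?P<rank>\d+)\]\[error_sighandler\] "
    r"Caught error: (?P<error>.+) \(signal (?P<signal>\d+)\)"
)

# The three crash lines copied verbatim from the mvapich2-2.3.4 output.
SAMPLE = """\
[ns02:mpi_rank_5][error_sighandler] Caught error: Bus error (signal 7)
[ns01:mpi_rank_0][error_sighandler] Caught error: Bus error (signal 7)
[ns01:mpi_rank_16][error_sighandler] Caught error: Bus error (signal 7)
"""


def crashed_ranks(log_text):
    """Return (host, rank, error, signal) tuples for each crash line found."""
    return [
        (m["host"], int(m["rank"]), m["error"], int(m["signal"]))
        for m in CRASH_RE.finditer(log_text)
    ]


if __name__ == "__main__":
    for host, rank, error, signal in crashed_ranks(SAMPLE):
        print(f"rank {rank} on {host}: {error} (signal {signal})")
```

Running this over a saved copy of the full output would list only the ranks that caught a signal, which makes it easier to compare failing runs against each other.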
Thank You!
Mohsen