[Mvapich-discuss] ERROR: running osu_benchmark on multiple nodes

Subramoni, Hari subramoni.1 at osu.edu
Sat Sep 24 11:27:40 EDT 2022


Hello.

Could you please let us know what MPI library you are using here?
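
For reference, the following commands typically report the library name and version on MVAPICH2/MPICH-style installs (adjust the path to match the ./mpirun used below):

mpiname -a        # MVAPICH2 utility: prints library name, version, and configure options
mpichversion      # MPICH-derived builds: prints version and configuration details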

Thx,
Hari.

From: Mvapich-discuss <mvapich-discuss-bounces at lists.osu.edu> On Behalf Of s2633806413 at 126.com via Mvapich-discuss
Sent: Friday, September 23, 2022 3:23 AM
To: mvapich-discuss at lists.osu.edu
Subject: [Mvapich-discuss] ERROR: running osu_benchmark on multiple nodes

Hello, when I run osu_benchmark (osu_reduce) on eight nodes, the command used is:
./mpirun -np 8 -ppn 1 -host 10.3.1.1,10.3.1.2,10.3.1.3,10.3.1.4,10.3.1.5,10.3.1.6,10.3.1.7,10.3.1.9 ./osu_reduce
I get the following error:

Caught error:Segmentation fault(signal 11)

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 725097 RUNNING AT yuanhe
=   EXIT CODE: 139
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================

[proxy:0:0 at swat3-01] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:911): assert (!closed) failed
[proxy:0:0 at swat3-01] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:0 at swat3-01] main (pm/pmiserv/pmip.c:202): demux engine error waiting for event
[proxy:0:0 at swat3-03] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:911): assert (!closed) failed
[proxy:0:0 at swat3-03] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:0 at swat3-03] main (pm/pmiserv/pmip.c:202): demux engine error waiting for event
[proxy:0:0 at swat3-05] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:911): assert (!closed) failed
[proxy:0:0 at swat3-05] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:0 at swat3-05] main (pm/pmiserv/pmip.c:202): demux engine error waiting for event
[mpiexec at swat3-01] HYDT_bscu_wait_for_completion (tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
[mpiexec at swat3-01] HYDT_bsci_wait_for_completion (tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec at swat3-01] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for completion


[mpiexec at swat3-01] main (ui/mpich/mpiexec.c:344): process manager error waiting for completion

The above is the full error log.
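
A possible next step (a sketch, assuming core dumps can be enabled on the node where the rank crashes and that osu_reduce was built with debug symbols) is to capture a core file and look at the backtrace of the segfaulting process:

ulimit -c unlimited        # allow core dumps in this shell; remote ranks launched via ssh may need this set in their login environment
./mpirun -np 8 -ppn 1 -host 10.3.1.1,10.3.1.2,10.3.1.3,10.3.1.4,10.3.1.5,10.3.1.6,10.3.1.7,10.3.1.9 ./osu_reduce
gdb ./osu_reduce core      # core file name may differ (e.g. core.<pid>); type "bt" at the gdb prompt for the backtrace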

I also tested:

1.

./mpirun -np 7 -ppn 1 -host 10.3.1.1,10.3.1.2,10.3.1.3,10.3.1.4,10.3.1.5,10.3.1.6,10.3.1.7 ./osu_reduce

----> gives a correct result

2.

./mpirun -np 7 -ppn 1 -host 10.3.1.1,10.3.1.2,10.3.1.3,10.3.1.4,10.3.1.5,10.3.1.6,10.3.1.9 ./osu_reduce

----> gives a correct result

3.

./mpirun -np 8 -ppn 1 -host 10.3.1.1,10.3.1.2,10.3.1.3,10.3.1.4,10.3.1.5,10.3.1.6,10.3.1.7,10.3.1.9 ./osu_reduce

----> fails with the error above
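
One more isolation test that might help narrow this down (a sketch using the same launcher syntax as above): run only the two hosts that appear together solely in the failing case, 10.3.1.7 and 10.3.1.9, as a two-node job and see whether that pair alone reproduces the crash.

./mpirun -np 2 -ppn 1 -host 10.3.1.7,10.3.1.9 ./osu_reduce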

Thanks to anyone who can shed light on this.

-sirui

