[Mvapich-discuss] FW: OSU_alltoall fail to complete when servers have extra ConnectX-5

Nicolas Gagnon ngagnon at rockportnetworks.com
Wed Jan 26 10:40:18 EST 2022


Good day OSU team,

I’ve been debugging an issue with our system that came up after installing extra ConnectX-5 cards in some of the servers (12 out of 48). As soon as a single rank from a server hosting an extra CX-5 is added to the job, we start seeing the problem shown below:
In this case I have disabled the PCIe slot hosting the extra CX-5 on every server except one and I still get the error. If I remove the offending server from the “all_cards.cfg” host file, I can use all remaining hosts and the maximum number of ranks. The CX-5 cards were added a month ago, and I initially suspected I had made a mistake in the way I built the latest code, but I have tried multiple versions released to Rockport and still end up in this state. Depending on the version used, I get different errors (which is strange). The problem only started showing up once the cables were connected between the CX-5 cards and the switches. Unless I disable the PCIe slot hosting the extra card, I cannot run the simple test below. Removing “MV2_HOMOGENEOUS_CLUSTER=1” makes no difference, and explicitly specifying “MV2_IBA_HCA=mlx5_0” doesn’t help either.
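
For context, this is roughly how I have been checking which HCAs each node enumerates (just a sketch; it assumes the ibverbs utilities are installed on every host and one hostname per line in all_cards.cfg):

# List the RDMA device names visible on each host in the host file
while read -r host; do
  echo "== $host =="
  ssh "$host" ibv_devinfo -l
done < all_cards.cfg

I would expect the 12 servers with the extra CX-5 to report one more mlx5 device than the others, which is why I also tried pinning MV2_IBA_HCA=mlx5_0.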

Note: the interfaces were not configured at this stage, and I have not used the new cards at all. They are CX-5 VPI cards and are still set to IB mode. I couldn’t find anything in the User’s Guide related to the problem I’m seeing. It is likely a configuration issue on my end, but I don’t know what I’m missing here. Unfortunately, I had to disable all the extra cards to keep running tests, but if required I can reconfigure some of the servers to reproduce the problem and capture extra information.
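
If it helps, this is the kind of per-card check I can capture once I re-enable one of those slots (mlx5_1 is only my assumption for the device name the extra CX-5 gets; it may well be different on our servers):

# Port state and link layer of the extra card (device name assumed)
ibstat mlx5_1
# Firmware level, port attributes and PCI details for the same card
ibv_devinfo -d mlx5_1 -v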

I tested with the official mvapich2-2.3.6 release and with the three drops we received for Rockport.
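
For each build I can confirm the exact version and configure options with mpiname (a sketch, reusing the path of the nov15 drop from the command below):

# Print the MVAPICH2 version and the configure line it was built with
/opt/bm/hpc/mvapich2-2.3.7pre-rockportqos-nov15/bin/mpiname -a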

/opt/bm/hpc/mvapich2-2.3.7pre-rockportqos-nov15/bin/mpiexec -np $((1*47)) -f all_cards.cfg -env MV2_USE_RDMA_CM=1 -env MV2_HOMOGENEOUS_CLUSTER=1 -env MV2_HYBRID_ENABLE_THRESHOLD=102400 -env MV2_NDREG_ENTRIES_MAX=100000 -env MV2_NDREG_ENTRIES=50000 -env MV2_IBA_HCA=mlx5_0  /opt/bm/hpc/mvapich2-latest/libexec/osu-micro-benchmarks/mpi/collective/osu_alltoall -f -i 100
ssh: connect to host 172.20.141.148 port 22: No route to host
^C[mpiexec at dell-s13-h1] Sending Ctrl-C to processes as requested
[mpiexec at dell-s13-h1] Press Ctrl-C again to force abort
[mpiexec at dell-s13-h1] HYDU_sock_write (../../../../src/pm/hydra/utils/sock/sock.c:303): write error (Bad file descriptor)
[mpiexec at dell-s13-h1] HYD_pmcd_pmiserv_send_signal (../../../../src/pm/hydra/pm/pmiserv/pmiserv_cb.c:177): unable to write data to proxy
[mpiexec at dell-s13-h1] ui_cmd_cb (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:79): unable to send signal downstream
[mpiexec at dell-s13-h1] HYDT_dmxu_poll_wait_for_event (../../../../src/pm/hydra/tools/demux/demux_poll.c:76): callback returned error status
[mpiexec at dell-s13-h1] HYD_pmci_wait_for_completion (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:198): error waiting for event
[mpiexec at dell-s13-h1] main (../../../../src/pm/hydra/ui/mpich/mpiexec.c:340): process manager error waiting for completion
[user at dell-s13-h1


Regards,
Nicolas Gagnon
Principal Designer/Architect, Engineering
ngagnon at rockportnetworks.com
Rockport | Simplify the Network



