[mvapich-discuss] mvapich2-2.3b on heterogeneous cluster
Bob Soliday
soliday at anl.gov
Thu Oct 5 15:10:13 EDT 2017
We recently added 4 nodes to our cluster. The older nodes all have 1 IB
device:
device node GUID
------ ----------------
mlx4_0 0002c90300043e94
The new nodes have 2 IB devices:
device node GUID
------ ----------------
mlx4_0 248a070300fc15d0
mlx5_0 a4bf01030018c34c
On the new nodes, ibv_devinfo lists the mlx4_0 device with an Ethernet
link layer; the mlx5_0 device, with an InfiniBand link layer, is the one
we are using. Setting MV2_NUM_HCAS=2 seemed to solve the problem of
finding the active device, and it does not appear to cause problems when
using the older nodes.
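In case it helps anyone reproduce or suggest alternatives: the MVAPICH2
user guide also documents MV2_IBA_HCA for naming the HCA(s) explicitly.
A sketch of what that launch line might look like — untried here, and
the behavior of the colon-separated list on nodes that lack one of the
named devices is an assumption on my part:

```shell
# Sketch only, assuming MV2_IBA_HCA accepts a colon-separated list of
# HCA names as the MVAPICH2 user guide describes; whether each node
# then binds to the member it actually has is untested here.
/lustre/3rdPartySoftware/mvapich2-2.3b/bin/mpirun_rsh -rsh \
  -hostfile /lustre/soliday/ElegantTests/Pelegant_ringTracking1/machines \
  -np 12 MV2_ENABLE_AFFINITY=0 MV2_ON_DEMAND_THRESHOLD=5000 \
  MV2_IBA_HCA=mlx4_0:mlx5_0 \
  /home/soliday/oag/apps/src/elegant/O.linux-x86_64/Pelegant \
  manyParticles_p.ele
```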
However, if I add enough nodes to the job, it eventually crashes with:
[weed5.cluster:mpi_rank_5][async_thread]
src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:1112: Got
FATAL event 3
[weed7.cluster:mpi_rank_7][handle_cqe] Send desc error in msg to 5,
wc_opcode=0
[weed7.cluster:mpi_rank_7][handle_cqe] Msg from 5: wc.status=10,
wc.wr_id=0x2ffa040, wc.opcode=0, vbuf->phead->type=24 =
MPIDI_CH3_PKT_ADDRESS
[weed7.cluster:mpi_rank_7][handle_cqe]
src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:548: [] Got
completion with error 10, vendor code=0x88, dest rank=5
: Numerical argument out of domain (33)
[weed7.cluster:mpispawn_6][readline] Unexpected End-Of-File on file
descriptor 6. MPI process died?
[weed7.cluster:mpispawn_6][mtpmi_processops] Error while reading PMI
socket. MPI process died?
[weed11.cluster:mpi_rank_11][handle_cqe] Send desc error in msg to 9,
wc_opcode=0
[weed11.cluster:mpi_rank_11][handle_cqe] Msg from 9: wc.status=10,
wc.wr_id=0x2f80040, wc.opcode=0, vbuf->phead->type=24 =
MPIDI_CH3_PKT_ADDRESS
[weed9.cluster:mpi_rank_9][async_thread]
src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:1112: Got
FATAL event 3
[weed11.cluster:mpi_rank_11][handle_cqe]
src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:548: [] Got
completion with error 10, vendor code=0x88, dest rank=9
: Numerical argument out of domain (33)
[weed7.cluster:mpispawn_6][child_handler] MPI process (rank: 7, pid:
4508) exited with status 252
[weed9.cluster:mpispawn_8][readline] Unexpected End-Of-File on file
descriptor 6. MPI process died?
[weed9.cluster:mpispawn_8][mtpmi_processops] Error while reading PMI
socket. MPI process died?
[weed11.cluster:mpispawn_10][readline] Unexpected End-Of-File on file
descriptor 6. MPI process died?
[weed11.cluster:mpispawn_10][mtpmi_processops] Error while reading PMI
socket. MPI process died?
[weed5.cluster:mpispawn_4][readline] Unexpected End-Of-File on file
descriptor 6. MPI process died?
[weed5.cluster:mpispawn_4][mtpmi_processops] Error while reading PMI
socket. MPI process died?
[weed9.cluster:mpispawn_8][child_handler] MPI process (rank: 9, pid:
3645) exited with status 255
[weed11.cluster:mpispawn_10][child_handler] MPI process (rank: 11, pid:
16975) exited with status 252
[weed5.cluster:mpispawn_4][child_handler] MPI process (rank: 5, pid:
4656) exited with status 255
[soliday@weed124 Pelegant_ringTracking1]$
[weed124.cluster:mpispawn_0][read_size] Unexpected End-Of-File on file
descriptor 11. MPI process died?
[weed124.cluster:mpispawn_0][read_size] Unexpected End-Of-File on file
descriptor 11. MPI process died?
[weed124.cluster:mpispawn_0][handle_mt_peer] Error while reading PMI
socket. MPI process died?
[weed2.cluster:mpispawn_1][read_size] Unexpected End-Of-File on file
descriptor 7. MPI process died?
[weed2.cluster:mpispawn_1][read_size] Unexpected End-Of-File on file
descriptor 7. MPI process died?
[weed2.cluster:mpispawn_1][handle_mt_peer] Error while reading PMI
socket. MPI process died?
[weed6.cluster:mpispawn_5][read_size] Unexpected End-Of-File on file
descriptor 5. MPI process died?
[weed6.cluster:mpispawn_5][read_size] Unexpected End-Of-File on file
descriptor 5. MPI process died?
[weed6.cluster:mpispawn_5][handle_mt_peer] Error while reading PMI
socket. MPI process died?
[weed3.cluster:mpispawn_2][read_size] Unexpected End-Of-File on file
descriptor 5. MPI process died?
[weed3.cluster:mpispawn_2][read_size] Unexpected End-Of-File on file
descriptor 5. MPI process died?
[weed3.cluster:mpispawn_2][handle_mt_peer] Error while reading PMI
socket. MPI process died?
[weed8.cluster:mpispawn_7][read_size] Unexpected End-Of-File on file
descriptor 5. MPI process died?
[weed8.cluster:mpispawn_7][read_size] Unexpected End-Of-File on file
descriptor 5. MPI process died?
[weed8.cluster:mpispawn_7][handle_mt_peer] Error while reading PMI
socket. MPI process died?
[weed4.cluster:mpispawn_3][read_size] Unexpected End-Of-File on file
descriptor 5. MPI process died?
[weed4.cluster:mpispawn_3][read_size] Unexpected End-Of-File on file
descriptor 5. MPI process died?
[weed4.cluster:mpispawn_3][handle_mt_peer] Error while reading PMI
socket. MPI process died?
The launch command:
/lustre/3rdPartySoftware/mvapich2-2.3b/bin/mpirun_rsh -rsh \
-hostfile /lustre/soliday/ElegantTests/Pelegant_ringTracking1/machines \
-np 12 MV2_ENABLE_AFFINITY=0 MV2_ON_DEMAND_THRESHOLD=5000 \
MV2_SHOW_HCA_BINDING=2 MV2_NUM_HCAS=2 \
/home/soliday/oag/apps/src/elegant/O.linux-x86_64/Pelegant \
manyParticles_p.ele
Machine file (weed124 is the only new node in the list):
weed124
weed124
weed2
weed3
weed4
weed5
weed6
weed7
weed8
weed9
weed10
weed11
mpichversion:
MVAPICH2 Version: 2.3b
MVAPICH2 Release date: Thu Aug 10 22:00:00 EST 2017
MVAPICH2 Device: ch3:mrail
MVAPICH2 configure: --prefix=/lustre/3rdPartySoftware/mvapich2-2.3b
--with-device=ch3:mrail --with-rdma=gen2 --disable-shared --enable-romio
--with-file-system=lustre+nfs
MVAPICH2 CC: gcc -DNDEBUG -DNVALGRIND -O2
MVAPICH2 CXX: g++ -DNDEBUG -DNVALGRIND -O2
MVAPICH2 F77: gfortran -L/lib -L/lib -O2
MVAPICH2 FC: gfortran -O2
Hopefully someone knows what I am doing wrong.
--Bob Soliday