[mvapich-discuss] Collectives deadlock for >8192 processes on QLogic/PSM since MVAPICH2 2.0.1
Marcin Rogowski
marcin.rogowski at gmail.com
Wed Apr 8 07:27:43 EDT 2015
Hello,
We have been using MVAPICH for quite some time and recently I have been
trying to move to MVAPICH2 2.1, however I noticed some issues on our
QLogic-based cluster. It seems everything is fine up to 8192 processes, but
for 8193 or more some collectives seem to cause a deadlock.
In particular, I wrote a very short C application that only calls
MPI_File_read_all for 1 INT and compiled it with various MVAPICH versions.
Different versions of both MVAPICH and the test application are all
compiled with the same flags, same environment, same Intel 15 compilers. I
got the following results:
· 1.8.1, np=8100 - runs ok
· 1.8.1, np=8200 - runs ok
· 1.9, np=8100 - runs ok
· 1.9, np=8200 - runs ok
· 2.0.1, np=8100 - runs ok
· 2.0.1, np=8200 - ipath_update_tid_err: failed: Bad address,
Failed to update 33 tids (err=23)
· 2.1, np=8100 - runs ok
· 2.1, np=8200 - deadlock (seems to be PMPI_File_read_all ->
MPIOI_File_read_all -> ADIOI_GEN_ReadStridedColl -> ADIOI_Calc_others_req
-> PMPI_Alltoall -> MPIR_Alltoall_impl -> MPIR_Alltoall_MV2 ->
MPIR_Alltoall_RD_MV2 -> MPIC_Sendrecv -> MPIC_Wait -> psm_progress_wait)
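For reference, a minimal reproducer along these lines looks roughly as follows (a sketch of the test described above; the file name "input.dat" and the lack of error checking are illustrative, not my exact code):

```c
/* Minimal collective-read reproducer: every rank collectively
 * reads one int from a shared file. With 8193+ ranks, the hang
 * appears inside the Alltoall that ADIOI_Calc_others_req issues. */
#include <mpi.h>

int main(int argc, char **argv)
{
    int value = 0;
    MPI_File fh;

    MPI_Init(&argc, &argv);

    /* Collective open followed by a collective read of a single int. */
    MPI_File_open(MPI_COMM_WORLD, "input.dat",
                  MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
    MPI_File_read_all(fh, &value, 1, MPI_INT, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    MPI_Finalize();
    return 0;
}
```

Compiled with mpicc from each MVAPICH2 install and run at np=8100 and np=8200 as listed above.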
I tried to adjust various environment variables (MV2_CM_RECV_BUFFERS,
MV2_USE_SHMEM_COLL, MV2_SRQ_MAX_SIZE, etc.) that seemed related to my
problem, but they did not make a difference. I suspect something in the PSM
interface has changed, or that some collective optimizations for np>8192
do not work correctly with PSM.
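For example, the runs I tried looked like this (launcher invocation and variable values are illustrative, not my exact command lines):

```shell
# Disable shared-memory collectives for the whole job
MV2_USE_SHMEM_COLL=0 mpirun_rsh -np 8200 -hostfile hosts ./read_all_test

# Enlarge connection-manager and SRQ buffers
MV2_CM_RECV_BUFFERS=4096 MV2_SRQ_MAX_SIZE=8192 \
    mpirun_rsh -np 8200 -hostfile hosts ./read_all_test
```

In every variant the np=8200 case still hung in the same place.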
Please let me know if you have any ideas on what I can try to resolve this
issue, or perhaps someone could test a recent version on QLogic/PSM equipment?
Thank you.
Regards,
Marcin Rogowski
Saudi Aramco