[mvapich-discuss] mpirun hangs only for multiple nodes job (infiniband)

Zhiwei Liu z.liu at usciences.edu
Thu Jun 22 11:50:19 EDT 2017


Dear all,

I recently upgraded my cluster from Ubuntu 12.04 LTS to 16.04 LTS. The system uses Mellanox InfiniBand hardware with the mlx4 driver, along with rdma (InfiniBand/iWARP), ib_umad, ib_ipoib, etc.

I downloaded mvapich2-2.3a and it compiled fine.

However, when I test it, mpirun works for a single-node job but hangs for a multi-node job. I tried the latency test, a simple hello world, and pmemd.MPI from Amber; the result is the same regardless of the program.

I think the launcher itself works, since I do see processes being started on the nodes. However, top shows them in state S instead of R, and they stay that way forever, with CPU% very low (1 to 5) and MEM% at 0.0.

I tried adding MV2_SHOW_ENV_INFO=2, but nothing was written out.

I suspect the issue is with InfiniBand/memory allocation, so I tried adjusting the modprobe.d/mlx.conf setting as suggested by the user guide. Nothing changed after I added

options mlx4_core log_num_mtt=24

to the mlx.conf file and restarted the compute nodes.
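
One way to confirm that the option above was actually picked up after the reboot is to read the value back from sysfs. The following is a minimal sketch, not something taken from the user guide: check_param is a hypothetical helper name, and the path in the usage line assumes that this version of mlx4_core exposes log_num_mtt under /sys/module, which may not hold for every driver release.

```shell
#!/bin/sh
# check_param FILE EXPECTED
# Hypothetical helper: compare the contents of a module parameter file
# (for example one under /sys/module/<module>/parameters/) with the
# value that was set in modprobe.d.
# Returns 0 on match, 1 on mismatch, 2 if the file cannot be read.
check_param() {
    file="$1"
    expected="$2"
    if [ ! -r "$file" ]; then
        echo "missing: $file"
        return 2
    fi
    actual=$(cat "$file")
    if [ "$actual" = "$expected" ]; then
        echo "ok: $file = $actual"
        return 0
    fi
    echo "mismatch: $file = $actual (expected $expected)"
    return 1
}
```

On a compute node this could be called as `check_param /sys/module/mlx4_core/parameters/log_num_mtt 24` (assumed path). A "mismatch" or "missing" result would suggest the setting in mlx.conf is not reaching the driver at load time.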

I am running out of ideas and would appreciate any help or a pointer in the right direction.

My InfiniBand link itself is working fine: a performance test with iperf reports a healthy 29 Gbits/s.

Please help.

Zhiwei
At the University of the Sciences in Philadelphia


