[mvapich-discuss] mpirun hangs only for multiple nodes job (infiniband)

Panda, Dhabaleswar panda at cse.ohio-state.edu
Fri Jun 23 16:10:04 EDT 2017


Hi,

Sorry to hear that you are seeing this issue with 2.3a. Several MVAPICH team members were at the ISC '17 conference in Frankfurt this entire week and are returning over the weekend. We will take a look at this issue and get back to you soon.

Thanks in advance for your patience. 

DK

> On Jun 23, 2017, at 3:54 PM, Zhiwei Liu <z.liu at usciences.edu> wrote:
> 
> Update: mvapich2-2.2 works fine.
> 
> zhiwei
> 
> From: Zhiwei Liu <z.liu at usciences.edu>
> Date: Thursday, June 22, 2017 at 11:50 AM
> To: mvapich-discuss at cse.ohio-state.edu
> Subject: mpirun hangs only for multiple nodes job (infiniband)
> 
> Dear all,
> 
> I recently upgraded my cluster from Ubuntu 12.04 LTS to 16.04 LTS. My system has InfiniBand on Mellanox hardware using the mlx4 driver, along with rdma (InfiniBand/iWARP), ib_umad, ib_ipoib, etc.
> 
> I downloaded mvapich2-2.3a and it compiled fine.
> 
> However, when I test it, mpirun runs fine for a single-node job but hangs for multi-node jobs. I tried the latency test, a simple hello world, and pmemd.MPI from Amber; the result is the same regardless of the program. I think the launcher itself works, since I do see processes being started on the nodes, but top shows them in state S instead of R, and they stay that way forever, with very low CPU usage (1 to 5%) and a %MEM of 0.0.
> 
> I tried adding MV2_SHOW_ENV_INFO=2, but nothing was written out.
> 
> I suspect the issue is with InfiniBand/memory allocation, so I tried adjusting the modprobe.d/mlx.conf setting as suggested by the user guide, but nothing changed when I added
> 
> options mlx4_core log_num_mtt=24
> 
> to the mlx.conf file and restarted the compute nodes.
> 
> I am running out of ideas and would appreciate any help or a pointer in the right direction.
> 
> My InfiniBand is working fine, and a performance test with iperf returns a healthy 29 Gbits/s.
> 
> Please help.
> 
> Zhiwei
> At the University of the Sciences in Philadelphia
> 
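One way to check whether the log_num_mtt setting quoted above actually reached the driver is to read the value back from sysfs after the reboot. The Python sketch below is only an illustration and rests on several assumptions: that the loaded mlx4_core build exposes log_num_mtt and log_mtts_per_seg under /sys/module/mlx4_core/parameters (some driver versions may not provide them at all), and that the commonly cited relation max registerable memory = 2^log_num_mtt * 2^log_mtts_per_seg * page size applies to the driver in use. The helper names read_mlx4_param and max_registerable_bytes are hypothetical and not part of MVAPICH2 or the driver.

```python
import os
from pathlib import Path


def read_mlx4_param(name, sysfs_dir="/sys/module/mlx4_core/parameters"):
    """Return a loaded mlx4_core parameter as an int, or None if unreadable.

    The sysfs location is an assumption; the file is absent when the module
    is not loaded or when the driver build does not expose the parameter.
    """
    try:
        return int(Path(sysfs_dir, name).read_text().strip())
    except (OSError, ValueError):
        return None


def max_registerable_bytes(log_num_mtt, log_mtts_per_seg, page_size):
    """Commonly cited estimate: 2^log_num_mtt * 2^log_mtts_per_seg * page size."""
    return (1 << log_num_mtt) * (1 << log_mtts_per_seg) * page_size


if __name__ == "__main__":
    page_size = os.sysconf("SC_PAGE_SIZE")
    log_num_mtt = read_mlx4_param("log_num_mtt")
    log_mtts_per_seg = read_mlx4_param("log_mtts_per_seg")

    if log_num_mtt is None or log_mtts_per_seg is None:
        # Nothing to compute: the option in mlx.conf cannot be confirmed.
        print("mlx4_core parameters not readable on this node")
    else:
        limit = max_registerable_bytes(log_num_mtt, log_mtts_per_seg, page_size)
        print(f"log_num_mtt={log_num_mtt} log_mtts_per_seg={log_mtts_per_seg}")
        print(f"estimated registerable memory: {limit / 2**30:.1f} GiB")
```

If the value read back is not 24 on a compute node, the option in mlx.conf probably did not take effect there, which would be consistent with the setting making no visible difference. Under the relation assumed above, log_num_mtt=24 with log_mtts_per_seg=3 and 4 KiB pages corresponds to 512 GiB.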