[mvapich-discuss] mpirun hangs only for multiple nodes job (infiniband)
Zhiwei Liu
z.liu at usciences.edu
Fri Jun 23 16:17:28 EDT 2017
Thanks for the reply.
I actually just sent an update. It seems that this only happens with 2.3a;
mvapich2-2.2 seems to be working fine on my system. In the meantime, I will
be fine with version 2.2.
zhiwei
On 6/23/17, 4:10 PM, "Panda, Dhabaleswar" <panda at cse.ohio-state.edu> wrote:
>Hi,
>
>Sorry to hear that you are seeing this issue with 2.3a. Several of the
>MVAPICH team members were at the ISC '17 conference in Frankfurt this
>entire week and are returning during the weekend. We will take a look at
>this issue and get back to you soon.
>
>Thanks in advance for your patience.
>
>DK
>
>Sent from my iPhone
>
>> On Jun 23, 2017, at 3:54 PM, Zhiwei Liu <z.liu at usciences.edu> wrote:
>>
>> Update: mvapich2-2.2 works fine.
>>
>> zhiwei
>>
>> From: Zhiwei Liu <z.liu at usciences.edu>
>> Date: Thursday, June 22, 2017 at 11:50 AM
>> To: "mvapich-discuss at cse.ohio-state.edu"
>><mvapich-discuss at cse.ohio-state.edu>
>> Subject: mpirun hangs only for multiple nodes job (infiniband)
>>
>> Dear all,
>>
>> I recently upgraded my cluster from Ubuntu 12.04 LTS to 16.04 LTS. My
>>system uses Mellanox InfiniBand with the mlx4 driver, along with rdma
>>(InfiniBand/iWARP), ib_umad, ib_ipoib, etc.
>>
>> I downloaded mvapich2-2.3a and it compiled fine.
>>
>> However, when I test it, mpirun runs fine for a single-node job but
>>hangs for multi-node jobs (I tried the latency test, a simple hello
>>world, and pmemd.MPI from Amber; the result is the same regardless of
>>the program). I think the launcher itself works, because I do see
>>processes being started on the nodes, but top shows them with a status
>>of S instead of R, and they stay that way forever, with CPU% very low
>>(1 to 5) and MEM% at 0.0.
>>
>> I tried adding MV2_SHOW_ENV_INFO=2, but nothing was written out.
>>
>> I suspect an issue with InfiniBand memory allocation and tried
>>adjusting the modprobe.d/mlx.conf setting as suggested by the user
>>guide, but nothing changed when I added
>>
>> options mlx4_core log_num_mtt=24
>>
>>to the mlx.conf file and restarted the compute nodes.
>>
>> I am running out of ideas and would appreciate any help or a pointer
>>in the right direction.
>>
>> My InfiniBand link is working fine, and a performance test with iperf
>>returns a healthy 29 Gbits/s.
>>
>> Please help.
>>
>> Zhiwei
>> At the University of the Sciences in Philadelphia
>>
>> _______________________________________________
>> mvapich-discuss mailing list
>> mvapich-discuss at cse.ohio-state.edu
>> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
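
For readers wondering what the log_num_mtt setting quoted above controls: on mlx4 hardware, the amount of memory that can be registered is commonly estimated as (2^log_num_mtt) * (2^log_mtts_per_seg) * PAGE_SIZE. The sketch below just evaluates that estimate. The function name, the log_mtts_per_seg value of 3, and the 4096-byte page size are assumptions for illustration and are not taken from the original message; the actual values depend on the driver version and kernel, so check them on your own nodes.

```python
def max_registrable_bytes(log_num_mtt, log_mtts_per_seg, page_size=4096):
    """Estimate registrable memory for mlx4_core (hypothetical helper).

    Uses the commonly cited estimate:
        (2 ** log_num_mtt) * (2 ** log_mtts_per_seg) * page_size
    """
    return (1 << log_num_mtt) * (1 << log_mtts_per_seg) * page_size


if __name__ == "__main__":
    # log_num_mtt=24 is the value from the message above.
    # log_mtts_per_seg=3 and page_size=4096 are assumed values.
    limit = max_registrable_bytes(log_num_mtt=24, log_mtts_per_seg=3)
    print(f"Estimated registrable memory: {limit / 2**30:.0f} GiB")
```

If the estimate comes out well above the physical RAM of a node, the registration limit is probably not what is causing the hang, which fits the later finding in this thread that 2.2 works on the same system.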