[mvapich-discuss] mpirun hangs only for multiple nodes job (infiniband)

Zhiwei Liu z.liu at usciences.edu
Fri Jun 23 16:17:28 EDT 2017


Thanks for the reply.

I actually just sent an update. It seems that this only happens with 2.3a;
mvapich2-2.2 seems to work fine on my system. In the meantime, I will
stay with version 2.2.

zhiwei

On 6/23/17, 4:10 PM, "Panda, Dhabaleswar" <panda at cse.ohio-state.edu> wrote:

>Hi,
>
>Sorry to hear that you are seeing this issue with 2.3a. Several of the
>MVAPICH team members were at the ISC '17 conference in Frankfurt this entire
>week and are returning over the weekend. We will take a look at this
>issue and get back to you soon.
>
>Thanks in advance for your patience.
>
>DK
>
>Sent from my iPhone
>
>> On Jun 23, 2017, at 3:54 PM, Zhiwei Liu <z.liu at usciences.edu> wrote:
>> 
>> Update: mvapich2-2.2 works fine.
>> 
>> zhiwei
>> 
>> From: Zhiwei Liu <z.liu at usciences.edu>
>> Date: Thursday, June 22, 2017 at 11:50 AM
>> To: "mvapich-discuss at cse.ohio-state.edu" <mvapich-discuss at cse.ohio-state.edu>
>> Subject: mpirun hangs only for multiple nodes job (infiniband)
>> 
>> Dear all,
>> 
>> I recently upgraded my cluster from Ubuntu 12.04 LTS to 16.04 LTS. My
>> system has InfiniBand on Mellanox hardware using the mlx4 driver, rdma
>> (InfiniBand/iWARP), ib_umad, ib_ipoib, etc.
>> 
>> I downloaded mvapich2-2.3a and it compiled fine.
>> 
>> However, when I test it, mpirun runs fine for a single-node job but
>> hangs for multi-node jobs. I tried the latency test, a simple hello
>> world, and pmemd.MPI from Amber; the result is the same regardless of
>> the program. I think the launcher itself works, since I do see processes
>> being started on the nodes, but top shows them in state S instead of R.
>> They stay that way forever, with very low CPU% (1 to 5) and MEM% of 0.0.
>> 
>> I tried adding MV2_SHOW_ENV_INFO=2, but nothing was written out.
>> 
>> I suspect an issue with InfiniBand/memory allocation and tried
>> adjusting the modprobe.d/mlx.conf setting as suggested by the user
>> guide, but nothing changed when I added
>> 
>> options mlx4_core log_num_mtt=24
>> 
>> to the mlx.conf file and restarted the compute nodes.
>> 
>> I am running out of ideas and would appreciate any help or a pointer
>> in the right direction.
>> 
>> My InfiniBand link is working fine, and a performance test with iperf
>> reports a healthy 29 Gbits/s.
>> 
>> Please help.
>> 
>> Zhiwei
>> At the University of the Sciences in Philadelphia
>> 
>> _______________________________________________
>> mvapich-discuss mailing list
>> mvapich-discuss at cse.ohio-state.edu
>> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
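
A note on the log_num_mtt setting quoted in the original message: for the
mlx4 driver, the amount of memory that can be registered is commonly
estimated as 2^log_num_mtt * 2^log_mtts_per_seg * page size. The sketch
below only illustrates that arithmetic and is not part of the thread. The
function names are hypothetical, and the log_mtts_per_seg default of 3 and
the 4 KiB page size are assumptions that vary with driver version and
platform.

```python
def max_registerable_bytes(log_num_mtt, log_mtts_per_seg=3, page_size=4096):
    """Estimate the upper bound on registerable memory for mlx4_core.

    log_mtts_per_seg=3 and page_size=4096 are assumed defaults; check the
    actual values on the node (module parameters and `getconf PAGE_SIZE`).
    """
    return (2 ** log_num_mtt) * (2 ** log_mtts_per_seg) * page_size


def to_gib(num_bytes):
    """Convert a byte count to GiB."""
    return num_bytes / 2 ** 30


if __name__ == "__main__":
    # log_num_mtt=24 is the value from the message; compare two possible
    # log_mtts_per_seg values since the driver default is an assumption here.
    for seg in (0, 3):
        limit = max_registerable_bytes(24, log_mtts_per_seg=seg)
        print(f"log_num_mtt=24, log_mtts_per_seg={seg}: {to_gib(limit):.0f} GiB")
```

A common rule of thumb is that this limit should be at least twice the
physical RAM of the node. Under the assumptions above, log_num_mtt=24
already allows far more than that on typical nodes, which is consistent
with the setting making no difference to the hang.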
