[mvapich-discuss] Re: Re: Re: benchmark osu_bw run failed on mvapich2-2.0rc1: gethostbyname: Unknown server error

Jonathan Perkins perkinjo at cse.ohio-state.edu
Mon Mar 31 11:41:23 EDT 2014


Forgot to mention.  Before rebuilding your library, please follow
Hari's advice, as these may be two separate issues that you're
facing.

On Mon, Mar 31, 2014 at 11:38 AM, Jonathan Perkins
<perkinjo at cse.ohio-state.edu> wrote:
> You may consult our user guide
> (http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-2.0rc1.html).
> You may be interested in the troubleshooting section, as the error
> message you're receiving is not very descriptive at this point.
> You'll probably want to rebuild with the --disable-fast and
> --enable-g=dbg options.
>
> As a guess, it seems that gethostbyname is not working for you.
> mpirun_rsh uses this whereas it looks like mpiexec does not.  If this
> is the case, some usage of gethostbyname inside our library *might* be
> failing and causing the Other MPI error.
>
> Please let us know the output from your debug build when you get a chance.
>
> On Mon, Mar 31, 2014 at 11:28 AM, Wang,Yanfei(SYS)
> <wangyanfei01 at baidu.com> wrote:
>> Hi,
>> It seems that the previous error is gone and a new one has appeared. Are there any online materials about this? I would like to consult them as well and try to fix this issue myself.
>>
>> Previously, iptables rules were blocking the connection for mpirun_rsh; those rules have now been removed.
>>
>> [root at bb-nsi-ib04 pt2pt]# mpiexec -n 2 -f hosts_mvapich ./osu_bw
>> [cli_1]: aborting job:
>> Fatal error in MPI_Init:
>> Other MPI error
>>
>>
>> ===================================================================================
>> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
>> =   EXIT CODE: 1
>> =   CLEANING UP REMAINING PROCESSES
>> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>> ===================================================================================
>> [cli_0]: aborting job:
>> Fatal error in MPI_Init:
>> Other MPI error
>>
>>
>> ===================================================================================
>> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
>> =   EXIT CODE: 1
>> =   CLEANING UP REMAINING PROCESSES
>> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>> ===================================================================================
>> [root at bb-nsi-ib04 pt2pt]# mpirun_rsh -np 2 --hostfile hosts_mvapich  ./osu_latency
>> gethostbyname: Unknown server error
>> [bb-nsi-ib04.#com:mpirun_rsh][child_handler] Error in init phase, aborting! (0/2 mpispawn connections)
>> gethostbyname: Unknown server error
>> [root at bb-nsi-ib04 pt2pt]#
>>
>>
>> Thanks
>> Yanfei
>>
>> -----Original Message-----
>> From: mvapich-discuss [mailto:mvapich-discuss-bounces at cse.ohio-state.edu] On Behalf Of Wang,Yanfei(SYS)
>> Sent: March 31, 2014 23:06
>> To: Jonathan Perkins
>> Cc: mvapich-discuss
>> Subject: [mvapich-discuss] Re: Re: benchmark osu_bw run failed on mvapich2-2.0rc1: gethostbyname: Unknown server error
>>
>> Hi,
>>
>> Result:
>>
>> The mpiexec run fails.
>> 1. mpiexec
>> [root at bb-nsi-ib04 pt2pt]# mpiexec -n 2 -f hosts_mvapich osu_bw
>> [proxy:0:1 at bb-nsi-ib04.*com] HYDU_create_process (./utils/launch/launch.c:75): execvp error on file osu_bw (No such file or directory)
>>
>> ===================================================================================
>> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
>> =   EXIT CODE: 255
>> =   CLEANING UP REMAINING PROCESSES
>> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>> ===================================================================================
>> [proxy:0:0 at bb-nsi-ib03*.com] HYDU_create_process (./utils/launch/launch.c:75): execvp error on file osu_bw (No such file or directory)
>>
>> ===================================================================================
>> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
>> =   EXIT CODE: 255
>> =   CLEANING UP REMAINING PROCESSES
>> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>> ===================================================================================
>>
>> 2. mpirun_rsh with RoCE parameter
>> [root at bb-nsi-ib04 pt2pt]# mpirun_rsh -np 2 --hostfile hosts_mvapich MV2_USE_RoCE=1 osu_latency
>> gethostbyname: Unknown server error
>> [bb-nsi-ib04.*com:mpirun_rsh][child_handler] Error in init phase, aborting! (0/2 mpispawn connections)
>> gethostbyname: Unknown server error
>> [root at bb-nsi-ib04 pt2pt]#
>>
>> 3. mpirun_rsh
>> [root at bb-nsi-ib04 pt2pt]# mpirun_rsh -np 2 --hostfile hosts_mvapich  osu_latency
>> gethostbyname: Unknown server error
>> [bb-nsi-ib04.*com:mpirun_rsh][child_handler] Error in init phase, aborting! (0/2 mpispawn connections)
>> gethostbyname: Unknown server error
>> [root at bb-nsi-ib04 pt2pt]#
>>
>> BR
>>
>> Thanks
>> Yanfei
>>
>> -----Original Message-----
>> From: Jonathan Perkins [mailto:perkinjo at cse.ohio-state.edu]
>> Sent: March 31, 2014 22:53
>> To: Wang,Yanfei(SYS)
>> Cc: Jonathan Perkins; mvapich-discuss
>> Subject: Re: Re: [mvapich-discuss] benchmark osu_bw run failed on mvapich2-2.0rc1: gethostbyname: Unknown server error
>>
>> Before debugging further, I would like to know whether the following works for you...
>>
>> mpiexec -n 2 -f hosts_mvapich osu_bw
>>
>>
>> On Mon, Mar 31, 2014 at 10:12 AM, Wang,Yanfei(SYS) <wangyanfei01 at baidu.com> wrote:
>>> Hi,
>>>
>>>
>>>
>>> Each node in cluster has same /etc/hosts, which is like:
>>>
>>> [root at bb-nsi-ib04 pt2pt]# cat /etc/hosts
>>>
>>> 192.168.71.3 ib03
>>>
>>> 192.168.71.4 ib04
>>>
>>> Currently, we have only 2 nodes available in RoCE cluster, IB03 and IB04.
>>>
>>>
>>>
>>> BR
>>>
>>>
>>>
>>> Thanks
>>>
>>> Yanfei
>>>
>>>
>>>
>>>
>>>
>>> From: Jonathan Perkins [mailto:perkinjo at cse.ohio-state.edu]
>>> Sent: March 31, 2014 21:40
>>> To: Wang,Yanfei(SYS)
>>> Cc: mvapich-discuss
>>> Subject: Re: [mvapich-discuss] benchmark osu_bw run failed on mvapich2-2.0rc1:
>>> gethostbyname: Unknown server error
>>>
>>>
>>>
>>> Can you share the contents of the /etc/hosts file from each machine
>>> including the machine that you launch from?
>>>
>>> On Mar 31, 2014 9:33 AM, "Wang,Yanfei(SYS)" <wangyanfei01 at baidu.com> wrote:
>>>
>>> Hi,
>>>
>>>
>>>
>>> I am new to MPI and am trying to verify the MVAPICH2 library over
>>> RoCE, using mvapich2-2.0rc1 on
>>> MLNX_OFED_LINUX-2.1-1.0.6-rhel6.3-x86_64.
>>>
>>>
>>>
>>> Could you give me some tips on how to fix the following issue?
>>>
>>>
>>>
>>> Configuration:
>>>
>>> [root at bb-nsi-ib04 pt2pt]# cat hosts_mvapich
>>>
>>> ib03:1
>>>
>>> ib04:1
>>>
>>> [root at bb-nsi-ib04 pt2pt]# cat /etc/hosts
>>>
>>> 192.168.71.3 ib03
>>>
>>> 192.168.71.4 ib04
>>>
>>>
>>>
>>> ERROR:
>>>
>>> [root at bb-nsi-ib04 pt2pt]# mpirun_rsh -np 2 --hostfile hosts_mvapich
>>> osu_bw
>>>
>>> gethostbyname: Unknown server error
>>>
>>> [bb-nsi-ib04.*.com:mpirun_rsh][child_handler] Error in init phase, aborting!
>>> (0/2 mpispawn connections)
>>>
>>> gethostbyname: Unknown server error
>>>
>>> [root at bb-nsi-ib04 pt2pt]#
>>>
>>>
>>>
>>> It could be caused by a configuration error, although I have
>>> previously verified OpenMPI on the same platform with the same RoCE
>>> setup and similar host configuration.
>>>
>>>
>>>
>>> Thanks.
>>>
>>> -Yanfei
>>>
>>>
>>> _______________________________________________
>>> mvapich-discuss mailing list
>>> mvapich-discuss at cse.ohio-state.edu
>>> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>
>>
>>
>> --
>> Jonathan Perkins
>> http://www.cse.ohio-state.edu/~perkinjo
>>
>>
>
>
>
> --
> Jonathan Perkins
> http://www.cse.ohio-state.edu/~perkinjo



-- 
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo


