[mvapich-discuss] Re: Re: Re: benchmark osu_bws run failed, on mvapich2-2.0rc1: gethostbyname: Unknown server error

Jonathan Perkins perkinjo at cse.ohio-state.edu
Mon Mar 31 11:38:52 EDT 2014


You may want to consult our user guide
(http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-2.0rc1.html),
in particular the troubleshooting section, since the error message
you're receiving is not very descriptive at this point. You'll probably
want to rebuild with the --disable-fast and --enable-g=dbg configure
options to get more detailed output.

As a guess, it seems that gethostbyname is not working for you.
mpirun_rsh uses this whereas it looks like mpiexec does not.  If this
is the case, some usage of gethostbyname inside our library *might* be
failing and causing the Other MPI error.
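
One quick way to test this guess outside of MPI is to query the resolver
directly on each node. The following is a minimal Python sketch, not part
of MVAPICH2: the helper names parse_hostfile and check_resolution are
hypothetical, and it assumes the hostfile uses the host:count lines shown
in the quoted hosts_mvapich. Python's socket.gethostbyname consults the
system resolver (including /etc/hosts), so it is a reasonable, though not
identical, stand-in for the C gethostbyname call.

```python
import socket


def parse_hostfile(text):
    """Return the host names from hostfile text with lines like 'ib03:1'."""
    hosts = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Keep only the host name; drop the ':<process count>' suffix if present.
        hosts.append(line.split(":", 1)[0])
    return hosts


def check_resolution(names, resolver=socket.gethostbyname):
    """Map each name to its resolved address, or to a 'FAILED: ...' string.

    The resolver argument is injectable (an assumption made for testing);
    by default the system resolver is used.
    """
    results = {}
    for name in names:
        try:
            results[name] = resolver(name)
        except OSError as exc:  # socket.gaierror and socket.herror derive from OSError
            results[name] = "FAILED: {}".format(exc)
    return results
```

Calling check_resolution([socket.gethostname()] + parse_hostfile(text)),
with text read from hosts_mvapich, on each node would show whether the
node's own hostname and the hostfile entries resolve. The quoted
/etc/hosts lists only ib03 and ib04, so the node's own full hostname may
be worth checking first.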

Please let us know the output from your debug build when you get a chance.

On Mon, Mar 31, 2014 at 11:28 AM, Wang,Yanfei(SYS)
<wangyanfei01 at baidu.com> wrote:
> Hi,
> It seems that the run now gets further; the old error is gone. Are there any online materials about this? I would like to consult them as well and try to fix this issue myself.
>
> Previously, iptables rules were blocking the connection for mpirun_rsh; those rules have been removed.
>
> [root at bb-nsi-ib04 pt2pt]# mpiexec -n 2 -f hosts_mvapich ./osu_bw
> [cli_1]: aborting job:
> Fatal error in MPI_Init:
> Other MPI error
>
>
> ===================================================================================
> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> =   EXIT CODE: 1
> =   CLEANING UP REMAINING PROCESSES
> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> ===================================================================================
> [cli_0]: aborting job:
> Fatal error in MPI_Init:
> Other MPI error
>
>
> ===================================================================================
> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> =   EXIT CODE: 1
> =   CLEANING UP REMAINING PROCESSES
> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> ===================================================================================
> [root at bb-nsi-ib04 pt2pt]# mpirun_rsh -np 2 --hostfile hosts_mvapich  ./osu_latency
> gethostbyname: Unknown server error
> [bb-nsi-ib04.#com:mpirun_rsh][child_handler] Error in init phase, aborting! (0/2 mpispawn connections)
> gethostbyname: Unknown server error
> [root at bb-nsi-ib04 pt2pt]#
>
>
> Thanks
> Yanfei
>
> -----Original Message-----
> From: mvapich-discuss [mailto:mvapich-discuss-bounces at cse.ohio-state.edu] On Behalf Of Wang,Yanfei(SYS)
> Sent: March 31, 2014 23:06
> To: Jonathan Perkins
> Cc: mvapich-discuss
> Subject: [mvapich-discuss] Re: Re: benchmark osu_bws run failed, on mvapich2-2.0rc1: gethostbyname: Unknown server error
>
> Hi,
>
> Result:
>
> The mpiexec run fails.
> 1. mpiexec
> [root at bb-nsi-ib04 pt2pt]# mpiexec -n 2 -f hosts_mvapich osu_bw
> [proxy:0:1 at bb-nsi-ib04.*com] HYDU_create_process (./utils/launch/launch.c:75): execvp error on file osu_bw (No such file or directory)
>
> ===================================================================================
> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> =   EXIT CODE: 255
> =   CLEANING UP REMAINING PROCESSES
> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> ===================================================================================
> [proxy:0:0 at bb-nsi-ib03*.com] HYDU_create_process (./utils/launch/launch.c:75): execvp error on file osu_bw (No such file or directory)
>
> ===================================================================================
> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> =   EXIT CODE: 255
> =   CLEANING UP REMAINING PROCESSES
> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> ===================================================================================
>
> 2. mpirun_rsh with RoCE parameter
> [root at bb-nsi-ib04 pt2pt]# mpirun_rsh -np 2 --hostfile hosts_mvapich MV2_USE_RoCE=1 osu_latency
> gethostbyname: Unknown server error
> [bb-nsi-ib04.*com:mpirun_rsh][child_handler] Error in init phase, aborting! (0/2 mpispawn connections)
> gethostbyname: Unknown server error
> [root at bb-nsi-ib04 pt2pt]#
>
> 3. mpirun_rsh
> [root at bb-nsi-ib04 pt2pt]# mpirun_rsh -np 2 --hostfile hosts_mvapich  osu_latency
> gethostbyname: Unknown server error
> [bb-nsi-ib04.*com:mpirun_rsh][child_handler] Error in init phase, aborting! (0/2 mpispawn connections)
> gethostbyname: Unknown server error
> [root at bb-nsi-ib04 pt2pt]#
>
> BR
>
> Thanks
> Yanfei
>
> -----Original Message-----
> From: Jonathan Perkins [mailto:perkinjo at cse.ohio-state.edu]
> Sent: March 31, 2014 22:53
> To: Wang,Yanfei(SYS)
> Cc: Jonathan Perkins; mvapich-discuss
> Subject: Re: Re: [mvapich-discuss] benchmark osu_bws run failed, on mvapich2-2.0rc1: gethostbyname: Unknown server error
>
> Before debugging further, I would like to know whether the following works for you...
>
> mpiexec -n 2 -f hosts_mvapich osu_bw
>
>
> On Mon, Mar 31, 2014 at 10:12 AM, Wang,Yanfei(SYS) <wangyanfei01 at baidu.com> wrote:
>> Hi,
>>
>>
>>
>> Each node in the cluster has the same /etc/hosts, which looks like:
>>
>> [root at bb-nsi-ib04 pt2pt]# cat /etc/hosts
>>
>> 192.168.71.3 ib03
>>
>> 192.168.71.4 ib04
>>
>> Currently, we have only 2 nodes available in the RoCE cluster, IB03 and IB04.
>>
>>
>>
>> BR
>>
>>
>>
>> Thanks
>>
>> Yanfei
>>
>>
>>
>>
>>
>> From: Jonathan Perkins [mailto:perkinjo at cse.ohio-state.edu]
>> Sent: March 31, 2014 21:40
>> To: Wang,Yanfei(SYS)
>> Cc: mvapich-discuss
>> Subject: Re: [mvapich-discuss] benchmark osu_bws run failed, on mvapich2-2.0rc1:
>> gethostbyname: Unknown server error
>>
>>
>>
>> Can you share the contents of the /etc/hosts file from each machine
>> including the machine that you launch from?
>>
>> On Mar 31, 2014 9:33 AM, "Wang,Yanfei(SYS)" <wangyanfei01 at baidu.com> wrote:
>>
>> Hi,
>>
>>
>>
>> I am new to MPI and am trying to verify the MVAPICH2 library on
>> RoCE, using mvapich2-2.0rc1 on
>> MLNX_OFED_LINUX-2.1-1.0.6-rhel6.3-x86_64.
>>
>>
>>
>> Could you give me some tips for fixing the following issue?
>>
>>
>>
>> Configuration:
>>
>> [root at bb-nsi-ib04 pt2pt]# cat hosts_mvapich
>>
>> ib03:1
>>
>> ib04:1
>>
>> [root at bb-nsi-ib04 pt2pt]# cat /etc/hosts
>>
>> 192.168.71.3 ib03
>>
>> 192.168.71.4 ib04
>>
>>
>>
>> ERROR:
>>
>> [root at bb-nsi-ib04 pt2pt]# mpirun_rsh -np 2 --hostfile hosts_mvapich osu_bw
>>
>> gethostbyname: Unknown server error
>>
>> [bb-nsi-ib04.*.com:mpirun_rsh][child_handler] Error in init phase, aborting! (0/2 mpispawn connections)
>>
>> gethostbyname: Unknown server error
>>
>> [root at bb-nsi-ib04 pt2pt]#
>>
>>
>>
>> It could be caused by a wrong configuration. Previously, on the same
>> platform, I verified OpenMPI with the same RoCE configuration and a
>> similar host configuration.
>>
>>
>>
>> Thanks.
>>
>> -Yanfei
>>
>>
>> _______________________________________________
>> mvapich-discuss mailing list
>> mvapich-discuss at cse.ohio-state.edu
>> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>
>
>
> --
> Jonathan Perkins
> http://www.cse.ohio-state.edu/~perkinjo
>
>



-- 
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo


