[mvapich-discuss] Problems with hostname resolution and MPI_INIT()

Mike Heinz michael.heinz at qlogic.com
Tue Jan 13 11:17:30 EST 2009


Jonathan,

Records like these appear to be common on many distros. The example before was from a stock RHEL4 installation. Here's another example, from another machine:

# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1      homer.dev.silverstorm.com  homer localhost.localdomain localhost

Homer is the name of the box - note that, as per the comments, these lines were generated by the distro, not by a user. For comparison, I found these lines in a completely fresh RHEL5 install:

# Do not remove the following line, or various programs
# that require network functionality will fail.
::1     localhost.localdomain   localhost       mheinz-linux

"mheinz-linux" is the name of the box.

Meanwhile, SLES10 does something similar:

127.0.0.2       moe.dev.silverstorm.com moe

Moe is the name of the box. Again, this appears to be done by the distro, not by any user.

We can get around the problem by manually editing all the host files, but I'm concerned because I don't understand why the distros seem to feel this is necessary.
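
Before editing every host file by hand, it may help to find out which machines are affected. The following is a minimal sketch, assuming Python 3 is available on the nodes. The function name find_loopback_aliases and the LOCAL_NAMES set are hypothetical helpers for illustration, and the sample text simply reuses the three records quoted above.

```python
import ipaddress

# Names that are expected to map to a loopback address (an assumption;
# extend this set if your distro adds names such as localhost6).
LOCAL_NAMES = {"localhost", "localhost.localdomain"}


def find_loopback_aliases(hosts_text):
    """Return (address, name) pairs where a non-localhost name maps to loopback."""
    hits = []
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        fields = line.split()
        try:
            addr = ipaddress.ip_address(fields[0])
        except ValueError:
            continue  # first field is not a plain IP address, skip the line
        if addr.is_loopback:  # true for 127.0.0.0/8 and ::1
            for name in fields[1:]:
                if name not in LOCAL_NAMES:
                    hits.append((fields[0], name))
    return hits


# The three records quoted in this message, used as sample input.
sample = """\
# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1      homer.dev.silverstorm.com  homer localhost.localdomain localhost
::1     localhost.localdomain   localhost       mheinz-linux
127.0.0.2       moe.dev.silverstorm.com moe
"""

for addr, name in find_loopback_aliases(sample):
    print(addr, name)
```

To audit a real machine, the same function could be fed the contents of /etc/hosts, for example find_loopback_aliases(open("/etc/hosts").read()).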

--
Michael Heinz
Principal Engineer, Qlogic Corporation
King of Prussia, Pennsylvania

-----Original Message-----
From: Jonathan Perkins [mailto:perkinjo at cse.ohio-state.edu] 
Sent: Tuesday, January 13, 2009 11:08 AM
To: Mike Heinz
Cc: mvapich-discuss at cse.ohio-state.edu
Subject: Re: [mvapich-discuss] Problems with hostname resolution and MPI_INIT()

Michael:
Hi, my comments are inline.

On Tue, Jan 13, 2009 at 09:29:33AM -0600, Mike Heinz wrote:
> I keep running into this problem at random, and each time it brings someone down for a couple of hours before we figure it out... again.
> 
> Basically, some distros add lines like this to their host file:
> 
> # Do not remove the following line, or various programs
> # that require network functionality will fail.
> 127.0.0.1      node01 localhost.localdomain localhost

My impression is that node01 should not be listed as an entry for the
localhost IP address.  I believe node01 should be listed by its unique
IP address on its subnet.

> 
> The problem is that this causes "node01" to tell the other MPI ranks that its IP address is 127.0.0.1, which causes MPI jobs to hang in MPI_INIT().
> 
> I've seen a similar issue with distros that define 127.0.0.2.
> 
> So, I don't mind digging into the code and changing how mvapich does IP address resolution, but I can't help but think that this problem must happen all the time - before I start patching the code, is there a bug in my network configs that I should be fixing?

I think it is an issue with your network configs.  Which distro(s) do
you see this problem on?
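
A quick way to check whether a given node is affected is to ask the system resolver what the node's own hostname maps to. The following is a minimal sketch, assuming Python 3 on the node. The helper names resolved_addresses and resolves_only_to_loopback are hypothetical, and the sketch only inspects the system resolver; this thread does not confirm which lookup mvapich itself performs.

```python
import ipaddress
import socket


def resolved_addresses(hostname):
    """Return the set of addresses the system resolver gives for hostname."""
    infos = socket.getaddrinfo(hostname, None)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string for both IPv4 and IPv6.
    return {info[4][0] for info in infos}


def resolves_only_to_loopback(addresses):
    """True if there is at least one address and all of them are loopback."""
    # Strip any IPv6 zone suffix such as "%eth0" before parsing.
    parsed = [ipaddress.ip_address(a.split("%", 1)[0]) for a in addresses]
    return bool(parsed) and all(a.is_loopback for a in parsed)


if __name__ == "__main__":
    name = socket.gethostname()
    try:
        addrs = resolved_addresses(name)
    except socket.gaierror:
        addrs = set()  # the hostname does not resolve at all
    print(name, sorted(addrs))
    if resolves_only_to_loopback(addrs):
        print("hostname resolves only to loopback; other nodes cannot reach it")
```

Run on each node, this would flag hosts such as node01 in the example above, where the node's own name is attached to the 127.0.0.1 record.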

> 
> 
> --
> Michael Heinz
> Principal Engineer, Qlogic Corporation
> King of Prussia, Pennsylvania

> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss


-- 
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo


