[mvapich-discuss] mpirun_rsh: Unable to get host entry

Jonathan Perkins perkinjo at cse.ohio-state.edu
Mon Jan 9 11:07:30 EST 2012


Hello Mark.  I just wanted to let you know that the fix for this issue
is now in our 1.7 branch.  Here is a link to the latest nightly
tarball:
http://mvapich.cse.ohio-state.edu/nightly/mvapich2/branches/1.7/mvapich2-latest.tar.gz

On Thu, Jan 5, 2012 at 5:30 PM, Mark Debbage <mark.debbage at qlogic.com> wrote:
> Yes, you are right about OFED 1.5.4. Thanks,
>
> Mark.
> ________________________________________
> From: Jonathan Perkins [perkinjo at cse.ohio-state.edu]
> Sent: Thursday, January 05, 2012 2:11 PM
> To: Mark Debbage
> Cc: mvapich-discuss at cse.ohio-state.edu
> Subject: Re: [mvapich-discuss] mpirun_rsh: Unable to get host entry
>
> I believe OFED 1.5.4 was already released but I'll update our SRPM so
> that it will be available for any future OFED releases.
>
> On Thu, Jan 5, 2012 at 5:05 PM, Mark Debbage <mark.debbage at qlogic.com> wrote:
>> That's great. Is it possible to get the fixed 1.7 into OFED 1.5.4?
>>
>> Mark.
>> ________________________________________
>> From: Jonathan Perkins [perkinjo at cse.ohio-state.edu]
>> Sent: Thursday, January 05, 2012 2:03 PM
>> To: Mark Debbage
>> Cc: mvapich-discuss at cse.ohio-state.edu
>> Subject: Re: [mvapich-discuss] mpirun_rsh: Unable to get host entry
>>
>> Let me add that this nightly tarball should be available tomorrow
>> night after we finish with some QA testing.  Thanks again for the
>> report.
>>
>> On Thu, Jan 5, 2012 at 4:47 PM, Jonathan Perkins
>> <perkinjo at cse.ohio-state.edu> wrote:
>>> Thanks for the confirmation.  We plan on making this available on the
>>> 1.7 branch as well as our next release in the 1.8 series (likely
>>> 1.8a2).  Once our nightly tarball for the 1.7 branch is created I'll
>>> provide you the link to it.
>>>
>>> On Thu, Jan 5, 2012 at 4:42 PM, Mark Debbage <mark.debbage at qlogic.com> wrote:
>>>> Yes, that works for me!
>>>>
>>>> I tested using the MVAPICH2 1.7 tarball from your site, with these
>>>> configuration options:
>>>>
>>>> export CFLAGS="-O3 -Wp,-D_FORTIFY_SOURCE=2"
>>>> ./configure --prefix=/home/markdebbage/mvapich2/mvapich2-1.7-install --with-device=ch3:psm
>>>>
>>>> Without the patch it fails:
>>>>
>>>> [markdebbage at nperf-33 mvapich2]$ mpirun_rsh -hostfile hosts -np 1 ./mpiworld
>>>> [unset]: Unable to get host entry for '': Unknown host (1)
>>>> [unset]: Unable to connect to  on 59757
>>>> Fatal error in MPI_Init: Other MPI error
>>>> [nperf-33:mpispawn_0][child_handler] MPI process (rank: 0, pid: 16840) exited with status 1
>>>>
>>>> With the patch it succeeds:
>>>>
>>>> [markdebbage at nperf-33 mvapich2]$ mpirun_rsh -hostfile hosts -np 1 ./mpiworld
>>>> nperf-33: hello from rank 0 of 1 processes
>>>>
>>>> Can you let me know which version of MVAPICH2 this will go into so that
>>>> I can keep track of it? We'll be adding this patch to the QLogic build of
>>>> MVAPICH2 1.7.
>>>>
>>>> Thanks!
>>>>
>>>> Mark.
>>>> ________________________________________
>>>> From: Jonathan Perkins [perkinjo at cse.ohio-state.edu]
>>>> Sent: Thursday, January 05, 2012 11:51 AM
>>>> To: Mark Debbage
>>>> Cc: mvapich-discuss at cse.ohio-state.edu
>>>> Subject: Re: [mvapich-discuss] mpirun_rsh: Unable to get host entry
>>>>
>>>> Mark, thank you for your report and debugging effort.  Can you try
>>>> applying the following patch (attached as well) and let us know if it
>>>> resolves the problem?  Thanks in advance.
>>>>
>>>> Index: src/pm/mpirun/mpispawn.c
>>>> ===================================================================
>>>> --- src/pm/mpirun/mpispawn.c    (revision 5128)
>>>> +++ src/pm/mpirun/mpispawn.c    (working copy)
>>>> @@ -181,6 +181,7 @@
>>>>  int setup_global_environment()
>>>>  {
>>>>     char my_host_name[MAX_HOST_LEN + MAX_PORT_LEN];
>>>> +    char tmp[MAX_HOST_LEN + 1];
>>>>
>>>>     int i = env2int("MPISPAWN_GENERIC_ENV_COUNT");
>>>>
>>>> @@ -190,13 +191,15 @@
>>>>     setenv("MV2_NUM_NODES_IN_JOB", getenv("MPISPAWN_NNODES"), 1);
>>>>
>>>>     /* Ranks now connect to mpispawn */
>>>> -    int rv = gethostname(my_host_name, MAX_HOST_LEN);
>>>> +    int rv = gethostname(tmp, MAX_HOST_LEN);
>>>> +    tmp[MAX_HOST_LEN] = '\0';
>>>> +
>>>>     if ( rv == -1 ) {
>>>>         PRINT_ERROR_ERRNO("gethostname() failed", errno);
>>>>         return -1;
>>>>     }
>>>>
>>>> -    sprintf(my_host_name, "%s:%d", my_host_name, c_port);
>>>> +    sprintf(my_host_name, "%s:%d", tmp, c_port);
>>>>
>>>>     setenv("PMI_PORT", my_host_name, 2);
>>>>
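The pattern in the patch above (read the hostname into one buffer, format into a different one) can be tried outside mpispawn.c. The following is only a sketch under assumptions: `format_host_port` and `EXAMPLE_HOST_LEN` are hypothetical names that do not exist in MVAPICH2, and the buffer size is a stand-in for the real MAX_HOST_LEN. It uses snprintf() instead of sprintf() so that a too-small destination is reported rather than overrun.

```c
#define _POSIX_C_SOURCE 200112L  /* exposes gethostname() under strict -std= modes */

#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* Assumed stand-in for MVAPICH2's MAX_HOST_LEN; the real value lives in its headers. */
#define EXAMPLE_HOST_LEN 256

/*
 * Hypothetical helper, not MVAPICH2 code: write "<hostname>:<port>" into out.
 * Returns 0 on success, -1 if gethostname() fails or out is too small.
 */
int format_host_port(char *out, size_t out_len, int port)
{
    char host[EXAMPLE_HOST_LEN + 1];   /* separate source buffer, as in the patch */

    assert(out != NULL && out_len > 0);

    if (gethostname(host, EXAMPLE_HOST_LEN) == -1) {
        return -1;
    }
    host[EXAMPLE_HOST_LEN] = '\0';     /* a truncated name may lack a terminator */

    /* Source (host) and destination (out) never overlap, so this is well defined. */
    int n = snprintf(out, out_len, "%s:%d", host, port);
    if (n < 0 || (size_t)n >= out_len) {
        return -1;                     /* encoding error or truncated output */
    }
    return 0;
}
```

A caller would pass a destination sized for both parts, for example `char pmi_port[EXAMPLE_HOST_LEN + 16]; format_host_port(pmi_port, sizeof pmi_port, c_port);`, and treat a return of -1 as a setup failure.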
>>>>
>>>>
>>>> On Thu, Jan 5, 2012 at 2:16 PM, Mark Debbage <mark.debbage at qlogic.com> wrote:
>>>>> I hit the same problem as described here:
>>>>>
>>>>>  http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/2011-July/003452.html
>>>>>
>>>>> This appears to be due to the hostname being set to the empty string
>>>>> in the PMI_PORT environment variable. I tracked this down using strace,
>>>>> and I think this is an MVAPICH2 bug. In this code in ./src/pm/mpirun/mpispawn.c:
>>>>>
>>>>> void setup_global_environment()
>>>>> {
>>>>>    char my_host_name[MAX_HOST_LEN + MAX_PORT_LEN];
>>>>>
>>>>>    int i = env2int("MPISPAWN_GENERIC_ENV_COUNT");
>>>>>
>>>>>    setenv("MPIRUN_MPD", "0", 1);
>>>>>    setenv("MPIRUN_NPROCS", getenv("MPISPAWN_GLOBAL_NPROCS"), 1);
>>>>>    setenv("MPIRUN_ID", getenv("MPISPAWN_MPIRUN_ID"), 1);
>>>>>    setenv("MV2_NUM_NODES_IN_JOB", getenv("MPISPAWN_NNODES"), 1);
>>>>>
>>>>>    /* Ranks now connect to mpispawn */
>>>>>    gethostname(my_host_name, MAX_HOST_LEN);
>>>>>
>>>>>    sprintf(my_host_name, "%s:%d", my_host_name, c_port);
>>>>>
>>>>> The sprintf() writes its result into my_host_name and also reads its %s argument
>>>>> from my_host_name. A sprintf() implementation may well write a NUL character into
>>>>> its destination before processing its arguments, leading to an empty hostname. This
>>>>> practice is specifically outlawed in the man page for the glibc sprintf():
>>>>>
>>>>> DESCRIPTION
>>>>>       C99 and POSIX.1-2001 specify that the results are undefined if a
>>>>>       call to sprintf(), snprintf(), vsprintf(), or vsnprintf() would
>>>>>       cause copying to take place between objects that overlap (e.g.,
>>>>>       if the target string array and one of the supplied input
>>>>>       arguments refer to the same buffer).  See NOTES.
>>>>>
>>>>> NOTES
>>>>>       Some programs imprudently rely on code such as the following
>>>>>
>>>>>           sprintf(buf, "%s some further text", buf);
>>>>>
>>>>>       to append text to buf.  However, the standards explicitly note
>>>>>       that the results are undefined if source and destination buffers
>>>>>       overlap when calling sprintf(), snprintf(), vsprintf(), and
>>>>>       vsnprintf().  Depending on the version of gcc(1) used, and the
>>>>>       compiler options employed, calls such as the above will not
>>>>>       produce the expected results.
>>>>>
>>>>>       The glibc implementation of the functions snprintf() and
>>>>>       vsnprintf() conforms to the C99 standard, that is, behaves as
>>>>>       described above, since glibc version 2.1.  Until glibc 2.0.6
>>>>>       they would return -1 when the output was truncated.
>>>>>
>>>>> Mark.
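For the append idiom that the man page NOTES warn against, one well-defined alternative is to format directly at the end of the existing string instead of passing the buffer back in as a %s argument. The sketch below is an illustration only: `append_fmt` is a hypothetical helper, not part of glibc or MVAPICH2, and it assumes that none of the variadic arguments point into `buf`.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: append formatted text to the NUL-terminated string
 * already held in buf, whose total capacity is cap bytes.
 * Returns 0 on success, -1 if the result would not fit or formatting fails;
 * on failure buf keeps its original contents.
 * The variadic arguments must not point into buf.
 */
int append_fmt(char *buf, size_t cap, const char *fmt, ...)
{
    assert(buf != NULL && fmt != NULL);

    size_t used = strlen(buf);
    if (used >= cap) {
        return -1;                 /* buf is not a valid string within cap */
    }

    va_list ap;
    va_start(ap, fmt);
    /* Write only into the unused tail, so the existing text is never a source. */
    int n = vsnprintf(buf + used, cap - used, fmt, ap);
    va_end(ap);

    if (n < 0 || (size_t)n >= cap - used) {
        buf[used] = '\0';          /* drop any partial output */
        return -1;
    }
    return 0;
}
```

With a helper like this, the hostname case would read `append_fmt(my_host_name, sizeof my_host_name, ":%d", c_port);`, which never reads from the bytes it is writing.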
>>>>>
>>>>> This message and any attached documents contain information from QLogic Corporation or its wholly-owned subsidiaries that may be confidential. If you are not the intended recipient, you may not read, copy, distribute, or use this information. If you have received this transmission in error, please notify the sender immediately by reply e-mail and then delete this message.
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Jonathan Perkins
>>>> http://www.cse.ohio-state.edu/~perkinjo
>>>>
>>>
>>>
>>>
>>> --
>>> Jonathan Perkins
>>> http://www.cse.ohio-state.edu/~perkinjo
>>
>>
>>
>> --
>> Jonathan Perkins
>> http://www.cse.ohio-state.edu/~perkinjo
>>
>>
>
>
>
> --
> Jonathan Perkins
> http://www.cse.ohio-state.edu/~perkinjo
>
>



-- 
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo


