[mvapich-discuss] dynamic process management (DPM) questions

Krishna Chaitanya Kandalla kandalla at cse.ohio-state.edu
Mon Apr 26 12:22:32 EDT 2010


Bryan,
            Thank you for your suggestions. We will incorporate these in 
our upcoming release.

Regards,
Krishna


On 04/15/2010 03:57 PM, Bryan D. Green wrote:
> On Thu, Apr 15, 2010 at 12:48:58PM -0500, Bryan D. Green wrote:
>    
>> On Wed, Apr 14, 2010 at 11:49:53AM -0500, Krishna Chaitanya wrote:
>>      
>>> Bryan,
>>>             I think the problem you are seeing here is because the client process is not getting the right port information, even though you are passing it as a command-line argument. I just took a look at the MPI-2.2 document, and it has a simple example demonstrating how these functions are meant to be used. It recommends using gets()/fgets() at the client process to grab the port information that the user types in, instead of taking it from a command-line argument. I just tried out a simple client/server application in this manner and it seems to work fine.
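>>>              A minimal client along those lines would look something like this (just a sketch modeled on the MPI-2.2 example, not the exact code I ran):
>>>
>>>     #include <stdio.h>
>>>     #include <string.h>
>>>     #include "mpi.h"
>>>
>>>     int main(int argc, char **argv)
>>>     {
>>>         MPI_Comm server;
>>>         char port_name[MPI_MAX_PORT_NAME];
>>>
>>>         MPI_Init(&argc, &argv);
>>>         /* read the port string printed by the server from stdin,
>>>            instead of taking it as a command-line argument */
>>>         if (fgets(port_name, MPI_MAX_PORT_NAME, stdin) == NULL)
>>>             MPI_Abort(MPI_COMM_WORLD, 1);
>>>         port_name[strcspn(port_name, "\n")] = '\0'; /* strip newline */
>>>         MPI_Comm_connect(port_name, MPI_INFO_NULL, 0, MPI_COMM_WORLD,
>>>                          &server);
>>>         /* ... */
>>>         MPI_Finalize();
>>>         return 0;
>>>     }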
>>>              Regarding the MPI_Comm_spawn error, it appears as though the "child" executable is not available. Could you try setting the PATH variable appropriately and running it again?
>>>              Please let us know if you encounter any further problems.
>>>        
>> You are absolutely right on the first count.  My eyes deceived me in
>> thinking the string in the error message was the same as the one I
>> provided on the command line.  I'm mystified as to why shell variable
>> substitution apparently occurred within single quotes, however.  In any
>> case, it's working now.  Thank you for the help!
>>
>> Regarding MPI_Comm_spawn, I think you are right that I didn't have the
>> path right, but the problem seems to be more than that.  I've gotten
>> used to assuming my MPI processes start in the same directory that I
>> launch the MPI job from, because I usually use the PBS-aware version of
>> mpiexec.  I'd like to know how to make mpirun_rsh do the same thing, but
>> I don't see it in the manual.  However, I get the same error when
>> specifying the full path or setting the PATH environment variable on the
>> command line.  I looked at the mvapich2 source code, and I wonder if the
>> problem is that mpirun_rsh is not being found.  The problem might be
>> related to the fact that we use modules here to select which MPI is in
>> our environment, but the environment is not propagated by mpirun_rsh.
>> Any thoughts or suggestions on what I can do about this?
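>>
>> (Presumably I can work around it by passing variables explicitly on the
>> mpirun_rsh command line, the same way MV2_SUPPORT_DPM is passed, e.g.
>>
>> $ mpirun_rsh -np 1 n000 MV2_SUPPORT_DPM=1 PATH=$PATH ./parent2
>>
>> ...but that gets unwieldy with a module-managed environment.)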
>>
>> By the way, how do I actually specify which host the child process
>> should run on?  I'm not sure how to set up the MPI_UNIVERSE properly.
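>>
>> My guess would be the reserved "host" info key from the MPI-2 spec,
>> something like the following (with "n001" standing in for another node),
>> but I don't know whether mvapich2 honors it:
>>
>>     MPI_Info info;
>>     MPI_Info_create(&info);
>>     MPI_Info_set(info, "host", "n001");  /* request children on n001 */
>>     MPI_Comm_spawn("./child", MPI_ARGV_NULL, numToSpawn, info,
>>                    0, parentComm, &interComm, errCodes);
>>     MPI_Info_free(&info);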
>>      
> Aha!  Found the problem.  I think you have a bug in mvapich2-1.4.1.
>
> Incidentally, having investigated a little more, it's clear that the
> current working directory is correctly being set to the directory from
> which I run mpirun_rsh.  I had assumed incorrectly that it wasn't.
>
> The 'execl' error message appears to be emitted on line 3052 of
> src/pm/mpirun/mpirun_rsh.c.
> The program being execl'd is a concatenation of 'binary_dirname' and
> "/mpirun_rsh".  This is suspicious.  'binary_dirname' is set with the
> following code (lines 676-679):
>
> binary_dirname = dirname (strdup (argv[0]));
> if (strlen (binary_dirname) == 1 && argv[0][0] != '.') {
>      use_dirname = 0;
> }
>
> So, I see two bugs here.
> Number 1, shouldn't "argv[0][0] != '.'" be "argv[0][0] == '.'"?
> Number 2, shouldn't the concatenation of binary_dirname and
> "/mpirun_rsh" on lines 3038 and 3039 be conditional on the value of
> 'use_dirname'?
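>
> For bug number 2, I would expect something along these lines (untested,
> and the variable names are invented since I don't have the exact code in
> front of me):
>
> char spawn_path[PATH_MAX];
> if (use_dirname) {
>      snprintf (spawn_path, sizeof (spawn_path), "%s/mpirun_rsh",
>                binary_dirname);
> } else {
>      /* no usable directory in argv[0]; fall back to a PATH lookup,
>         which would mean execlp() rather than execl() */
>      snprintf (spawn_path, sizeof (spawn_path), "mpirun_rsh");
> }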
>
> Sure enough, if I run my test this way, with a full path given for
> mpirun_rsh...
>
> /nasa/mvapich2/1.4.1/intel/bin/mpirun_rsh -np 1 n000 MV2_SUPPORT_DPM=1 ./parent2
>
> ... it works!
>
> -bryan
>
>
>    
>>> On Tue, Apr 13, 2010 at 3:00 PM, Bryan Green <bryan.d.green at nasa.gov> wrote:
>>> Hello,
>>>
>>> I have some questions about using the dynamic process management features of
>>> mvapich2 1.4.1.  I'm new to this area and have not been able to find much
>>> specific information about it online.
>>>
>>> My tests of the MPI_Comm_connect/accept mechanism have not worked and I'm
>>> wondering what I am missing.
>>>
>>> I have a simple server which does the basic setup:
>>>
>>>     MPI_Comm client;
>>>     char port_name[MPI_MAX_PORT_NAME];
>>>     MPI_Open_port(MPI_INFO_NULL, port_name);
>>>     printf("server available at %s\n", port_name);
>>>     MPI_Comm_accept(port_name, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &client);
>>>     ...
>>>
>>> and a simple client:
>>>     MPI_Comm server;
>>>     char port_name[MPI_MAX_PORT_NAME];
>>>
>>>     MPI_Init(&argc, &argv);
>>>     strcpy(port_name, argv[1]);  /* assume server's port name is a cmd-line arg */
>>>
>>>     MPI_Comm_connect(port_name, MPI_INFO_NULL, 0, MPI_COMM_WORLD,
>>>                      &server);
>>>
>>> I run the server:
>>> $ mpirun_rsh -np 1 n000 MV2_SUPPORT_DPM=1 ./serv
>>> server available at tag#0$description#"#RANK:00000000(00000035:0074004b:00000001)#"$
>>>
>>> And I run the client and it fails:
>>>
>>> $ mpirun_rsh -np 1 n000 MV2_SUPPORT_DPM=1 ./cli 'tag#0$description#"#RANK:00000000(00000035:0074004b:00000001)#"$'
>>> Fatal error in MPI_Comm_connect:
>>> Other MPI error, error stack:
>>> MPI_Comm_connect(119)............................:
>>> MPI_Comm_connect(port="tag#0##RANK:00000000(00000035:0074004b:00000001)#$",
>>> MPI_INFO_NULL, root=0, MPI_COMM_WORLD, newcomm=0x7fff720186c8) failed
>>> MPID_Comm_connect(187)...........................:
>>> MPIDI_Comm_connect(388)..........................:
>>> MPIDI_Create_inter_root_communicator_connect(149):
>>> MPIDI_CH3_Connect_to_root(354)...................: Missing hostname or
>>> invalid host/port description in business card
>>> MPI process (rank: 0) terminated unexpectedly on n000
>>> Exit code -5 signaled from n000
>>>
>>>
>>> Can someone tell me what I am doing wrong?  Is there documentation on
>>> using these features with mvapich that I missed?
>>>
>>> I have also been testing spawning, but my simple test fails with the
>>> message:
>>> execl failed
>>> : No such file or directory
>>>
>>> It fails inside this call to MPI_Comm_spawn:
>>> MPI_Comm_spawn("./child", MPI_ARGV_NULL, numToSpawn,
>>>                    MPI_INFO_NULL, 0, parentComm, &interComm, errCodes);
>>>
>>> I'm probably missing something obvious.  I appreciate any help I can get.
>>>
>>> Thanks,
>>> -bryan
>>>
>>>
>>> _______________________________________________
>>> mvapich-discuss mailing list
>>> mvapich-discuss at cse.ohio-state.edu
>>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>>
>> -- 
>> ---------------------------------------
>> Bryan Green
>> Visualization Group
>> NASA Advanced Supercomputing Division
>> NASA Ames Research Center
>> email: bryan.d.green at nasa.gov
>> ---------------------------------------
>>      
>    


