[mvapich-discuss] problem with jobstartup with mvapich 0.96/0.97 and mpiexec

Sayantan Sur surs at cse.ohio-state.edu
Thu Mar 23 17:19:45 EST 2006


Hello Bobb,

Thanks for taking the time to verify and report results for everybody's
benefit. We are waiting for some feedback from David Skinner and we will
take appropriate action.

Thanks,
Sayantan.

* On Mar,5 bobb<bobb at tchpc.tcd.ie> wrote :
> Sayantan Sur hath declared on Wednesday the 22 day of March 2006  :-:
> > Hi Jimmy,
> 
> > > I'd like to ask if it's possible to remove the block of code from the
> > > source, or at least put an ifdef in to disable that code by default?
> 
> > After some discussion with David Skinner, and as you indicated, we found
> > out that the code had some unnecessary repetition of an if condition.
> > 
> > We've made the corresponding change, i.e. the if condition is listed only
> > once. Can you give our latest code (in the SVN trunk) a shot?
> > 
> > It'll be great if you could give your feedback.
> 
> Hi Sayantan,
> 
> I have been looking at the latest version (r30) from trunk.
> MPI applications are still failing when launched from mpiexec
> (mpiexec from OSC, version 0.80 & svn trunk) with the following error:
> 
> 	connect: Connection refused
> 	mpiexec: Error: poll_or_block_event: tm_poll remote 15010: System error.
> 
> 
> The one remaining block of the three duplicated blocks is still causing the
> problem.  If I remove that block (as per the attached patch), mpiexec runs fine.
> (The patch only corrects the vapi version.  I haven't looked at ch_gen2.)
> 
> 
> I'm a little confused about this block of code.  From the comment, I am
> presuming that it is for use with osc-mpiexec, yet it in fact breaks osc-mpiexec.
> 
> 
> The code is failing when it attempts to connect to the ports specified by
> the MPIEXEC_*_PORT environment variables.  I added a sleep just before
> the code in question and ran cpi through strace with mpiexec on 2 dual nodes.
> 
> Running `` netstat -tanp | egrep "Foreign|mpiexec|cpi|strace" '' on the two nodes
> during the sleep gives:
> 
> 
> 	iitac223: Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name   
> 	iitac223: tcp        0      0 192.168.115.90:45606    0.0.0.0:*               LISTEN     20360/mpiexec       
> 	iitac223: tcp        0      0 127.0.0.1:45610         127.0.0.1:15003         ESTABLISHED20360/mpiexec       
> 	iitac223: tcp        0      0 127.0.0.1:45608         127.0.0.1:45618         ESTABLISHED20362/mpiexec       
> 	iitac223: tcp        0      0 127.0.0.1:45609         127.0.0.1:45619         ESTABLISHED20362/mpiexec       
> 	iitac223: tcp        0      0 127.0.0.1:45613         127.0.0.1:45607         ESTABLISHED20363/strace        
> 	iitac223: tcp        0      0 127.0.0.1:45614         127.0.0.1:45608         ESTABLISHED20363/strace        
> 	iitac223: tcp        0      0 127.0.0.1:45618         127.0.0.1:45608         ESTABLISHED20365/strace        
> 	iitac223: tcp        0      0 127.0.0.1:45615         127.0.0.1:45609         ESTABLISHED20363/strace        
> 	iitac223: tcp        0      0 127.0.0.1:45619         127.0.0.1:45609         ESTABLISHED20365/strace        
> 	iitac223: tcp        0      0 127.0.0.1:45608         127.0.0.1:45614         ESTABLISHED20362/mpiexec       
> 	iitac223: tcp        0      0 127.0.0.1:45609         127.0.0.1:45615         ESTABLISHED20362/mpiexec       
> 	iitac223: tcp        0      0 127.0.0.1:45607         127.0.0.1:45613         ESTABLISHED20362/mpiexec       
> 	iitac223: tcp        0      0 192.168.115.90:45608    192.168.115.89:37401    ESTABLISHED20362/mpiexec       
> 	iitac223: tcp        0      0 192.168.115.90:45609    192.168.115.89:37402    ESTABLISHED20362/mpiexec       
> 	iitac223: tcp        0      0 192.168.115.90:45608    192.168.115.89:37405    ESTABLISHED20362/mpiexec       
> 	iitac223: tcp        0      0 192.168.115.90:45609    192.168.115.89:37406    ESTABLISHED20362/mpiexec       
> 	
> 
> 	iitac222: Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name   
> 	iitac222: tcp        0      0 192.168.115.89:37402    192.168.115.90:45609    ESTABLISHED4512/strace         
> 	iitac222: tcp        0      0 192.168.115.89:37406    192.168.115.90:45609    ESTABLISHED4514/strace         
> 	iitac222: tcp        0      0 192.168.115.89:37401    192.168.115.90:45608    ESTABLISHED4512/strace         
> 	iitac222: tcp        0      0 192.168.115.89:37405    192.168.115.90:45608    ESTABLISHED4514/strace         
> 
> 
> 
> After the sleep, the strace shows the call to connect() failing with a "Connection refused".
> 
> 
> 	connect(4, {sa_family=AF_INET, sin_port=htons(45691), sin_addr=inet_addr("192.168.115.90")}, 16) = -1 ECONNREFUSED (Connection refused)
> 	write(1, " => FAILED !!\n\n", 15)       = 15
> 	rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
> 	rt_sigaction(SIGCHLD, NULL, {SIG_DFL}, 8) = 0
> 	rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
> 
> 	
> As I said, I'm not sure what this chunk of code is supposed to achieve. 
> STDIN, STDOUT and STDERR seem to be routed through mpiexec fine without it.
> 
> 
> We (and a few users) have been running tests on 0.9.7 with this block removed.
> 
> 
> Cheers,
> 
> 
> - bobb
> 
> 
> -- 
> Bob B. Crosbie.
> Trinity Centre for High Performance Computing,
> Room 208, Lloyd Building, Trinity College Dublin.
> Tel: +353 1 608 3725                                    http://www.tchpc.tcd.ie

> Index: mpid/vapi/process/pmgr_client_mpirun_rsh.c
> ===================================================================
> --- mpid/vapi/process/pmgr_client_mpirun_rsh.c	(revision 30)
> +++ mpid/vapi/process/pmgr_client_mpirun_rsh.c	(working copy)
> @@ -167,108 +167,6 @@
>      *id_p = pmgr_id;
>      *processes_p = pmgr_processes;
>  
> -     /*
> -      *  Route stdout and stderr to mpiexec if applicable  - dskinner at nersc.gov
> -      *  if MPIEXEC_STDOUT_PORT and MPIEXEC_STDERR_PORT not detected
> -      *  do nothing.  What's on the other side is described at
> -      *  Route stdin stdout and stderr to mpiexec  - dskinner at nersc.gov
> -      *  if MPIEXEC_STDOUT_PORT MPIEXEC_STDOUT_PORT and MPIEXEC_STDERR_PORT are
> -      *  not detected no sockets are created and stdout/stderr are left as is.
> -      *  no conditional recompilation should be required due to the above fact.
> -      *  What's on the other side of these sockets is described at
> -      *  http://www.osc.edu/~pw/mpiexec/
> -      *
> -      */
> -
> -   str = getenv("MPIEXEC_STDOUT_PORT");
> -   if(str) {
> -     mpirun_stdout_port = atoi(str);
> -     if (mpirun_port <= 0) {
> -         fprintf(stderr, "Invalid MPIEXEC_STDOUT_PORT port %s\n", str);
> -         exit(1);
> -     }
> -    mpirun_stdout_socket = socket(AF_INET, SOCK_STREAM, 0);
> -    if (mpirun_stdout_socket < 0) {
> -        perror("socket");
> -        exit(1);
> -    }
> -
> -    sockaddr.sin_family = AF_INET;
> -    sockaddr.sin_addr = *(struct in_addr *) (*mpirun_hostent->h_addr_list);
> -    sockaddr.sin_port = htons(mpirun_stdout_port);
> -
> -    if (connect(mpirun_stdout_socket, (struct sockaddr *) &sockaddr,
> -                sizeof(sockaddr)) < 0) {
> -        perror("connect");
> -        exit(1);
> -    }
> -
> -    fflush(stdout);
> -    dup2(mpirun_stdout_socket,1);
> -    close(mpirun_stdout_socket);
> -
> -    /* we have now connected stdout to the mpiexec program */
> -   }
> -
> -   str = getenv("MPIEXEC_STDERR_PORT");
> -   if(str) {
> -     mpirun_stderr_port = atoi(str);
> -     if (mpirun_port <= 0) {
> -         fprintf(stderr, "Invalid MPIEXEC_STDERR_PORT port %s\n", str);
> -         exit(1);
> -     }
> -    mpirun_stderr_socket = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
> -    if (mpirun_stderr_socket < 0) {
> -        perror("socket");
> -        exit(1);
> -    }
> -
> -    sockaddr.sin_family = AF_INET;
> -    sockaddr.sin_addr = *(struct in_addr *) (*mpirun_hostent->h_addr_list);
> -    sockaddr.sin_port = htons(mpirun_stderr_port);
> -
> -    if (connect(mpirun_stderr_socket, (struct sockaddr *) &sockaddr,
> -                sizeof(sockaddr)) < 0) {
> -        perror("connect");
> -        exit(1);
> -    }
> -
> -    fflush(stderr);
> -    dup2(mpirun_stderr_socket,2);
> -    close(mpirun_stderr_socket);
> -
> -    /* we have now connected stderr to the mpiexec program */
> -   }
> -   str = getenv("MPIEXEC_STDIN_PORT");
> -   if(str) {
> -     mpirun_stdin_port = atoi(str);
> -     if (mpirun_port <= 0) {
> -         fprintf(stderr, "Invalid MPIEXEC_STDIN_PORT port %s\n", str);
> -         exit(1);
> -     }
> -    mpirun_stdin_socket = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
> -    if (mpirun_stdin_socket < 0) {
> -        perror("socket");
> -        exit(1);
> -    }
> -
> -    sockaddr.sin_family = AF_INET;
> -    sockaddr.sin_addr = *(struct in_addr *) (*mpirun_hostent->h_addr_list);
> -    sockaddr.sin_port = htons(mpirun_stdin_port);
> -
> -    if (connect(mpirun_stdin_socket, (struct sockaddr *) &sockaddr,
> -                sizeof(sockaddr)) < 0) {
> -        perror("connect");
> -        exit(1);
> -    }
> -
> -    fflush(stderr);
> -    dup2(mpirun_stdin_socket,0);
> -    close(mpirun_stdin_socket);
> -
> -    /* we have now connected stdin to the mpiexec program */
> -   }
> -
>     return 1;
>  }
>  
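
For illustration only, here is a minimal sketch of how a redirect like the
one removed above could fail soft instead of calling exit(1) when mpiexec is
not listening on the advertised port. It is not the MVAPICH code and not a
proposed patch. The helper names port_from_env and try_redirect_fd are
hypothetical. Note that the removed block stores the parsed value in
mpirun_stdout_port (and the stderr/stdin equivalents) but then range-checks
mpirun_port, so the sketch validates the value it actually parsed.

```c
#define _POSIX_C_SOURCE 200112L
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: read a TCP port from the environment variable
 * `name`. Returns the port (1..65535), 0 if the variable is unset,
 * or -1 if it is set but does not hold a valid port number. */
int port_from_env(const char *name)
{
    const char *str = getenv(name);
    char *end;
    long val;

    if (str == NULL)
        return 0;

    errno = 0;
    val = strtol(str, &end, 10);
    if (errno != 0 || end == str || *end != '\0' || val <= 0 || val > 65535)
        return -1;
    return (int) val;
}

/* Hypothetical helper: try to point target_fd (0, 1 or 2) at a TCP
 * connection to addr:port. On any failure target_fd is left untouched
 * and -1 is returned, so the caller can keep its existing stdio. */
int try_redirect_fd(struct in_addr addr, int port, int target_fd)
{
    struct sockaddr_in sa;
    int s = socket(AF_INET, SOCK_STREAM, 0);

    if (s < 0)
        return -1;

    memset(&sa, 0, sizeof(sa));
    sa.sin_family = AF_INET;
    sa.sin_addr = addr;
    sa.sin_port = htons((unsigned short) port);

    if (connect(s, (struct sockaddr *) &sa, sizeof(sa)) < 0) {
        close(s);               /* e.g. ECONNREFUSED: fall back quietly */
        return -1;
    }
    if (dup2(s, target_fd) < 0) {
        close(s);
        return -1;
    }
    if (s != target_fd)
        close(s);
    return 0;
}
```

A caller would check port_from_env("MPIEXEC_STDOUT_PORT") for a positive
result and, only then, call try_redirect_fd with the mpirun host address and
file descriptor 1, ignoring a -1 return. Whether a silent fallback is the
right behaviour here is an assumption; it simply mirrors the observation
above that stdio is routed through mpiexec fine without the block.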


-- 
http://www.cse.ohio-state.edu/~surs

