[mvapich-discuss] problem with jobstartup with mvapich 0.96/0.97 and mpiexec

bobb bobb at tchpc.tcd.ie
Thu Mar 23 16:31:27 EST 2006


Sayantan Sur hath declared on Wednesday the 22 day of March 2006  :-:
> Hi Jimmy,

> > I'd like to ask if it's possible to remove the block of code from the
> > source, or at least put an ifdef in to disable that code by default?

> After some discussion with David Skinner and as you indicated, we found
> out that the code had some unnecessary repetition of an if condition.
> 
> We've made the corresponding change, i.e. the if condition is listed only
> once. Can you give our latest code (in the SVN trunk) a shot?
> 
> It'll be great if you could give your feedback.

Hi Sayantan,

I have been looking at the latest version (r30) from trunk.
MPI applications are still failing when launched from mpiexec
(mpiexec from OSC, version 0.80 & svn trunk) with the following error:

	connect: Connection refused
	mpiexec: Error: poll_or_block_event: tm_poll remote 15010: System error.


The one remaining block of the three duplicated blocks is still causing the
problem.  If I remove that block (as per the attached patch), mpiexec runs fine.
(The patch only corrects the vapi version.  I haven't looked at ch_gen2.)


I'm a little confused about this block of code.  From the comment, I am
presuming that it is for use with osc-mpiexec, yet it in fact breaks osc-mpiexec.


The code is failing when it attempts to connect to the ports specified by
the MPIEXEC_*_PORT environment variables.  I added a sleep just before
the code in question and ran cpi through strace with mpiexec on 2 dual nodes.

Running `` netstat -tanp | egrep "Foreign|mpiexec|cpi|strace" '' on the two nodes
during the sleep gives:


	iitac223: Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name   
	iitac223: tcp        0      0 192.168.115.90:45606    0.0.0.0:*               LISTEN     20360/mpiexec       
	iitac223: tcp        0      0 127.0.0.1:45610         127.0.0.1:15003         ESTABLISHED20360/mpiexec       
	iitac223: tcp        0      0 127.0.0.1:45608         127.0.0.1:45618         ESTABLISHED20362/mpiexec       
	iitac223: tcp        0      0 127.0.0.1:45609         127.0.0.1:45619         ESTABLISHED20362/mpiexec       
	iitac223: tcp        0      0 127.0.0.1:45613         127.0.0.1:45607         ESTABLISHED20363/strace        
	iitac223: tcp        0      0 127.0.0.1:45614         127.0.0.1:45608         ESTABLISHED20363/strace        
	iitac223: tcp        0      0 127.0.0.1:45618         127.0.0.1:45608         ESTABLISHED20365/strace        
	iitac223: tcp        0      0 127.0.0.1:45615         127.0.0.1:45609         ESTABLISHED20363/strace        
	iitac223: tcp        0      0 127.0.0.1:45619         127.0.0.1:45609         ESTABLISHED20365/strace        
	iitac223: tcp        0      0 127.0.0.1:45608         127.0.0.1:45614         ESTABLISHED20362/mpiexec       
	iitac223: tcp        0      0 127.0.0.1:45609         127.0.0.1:45615         ESTABLISHED20362/mpiexec       
	iitac223: tcp        0      0 127.0.0.1:45607         127.0.0.1:45613         ESTABLISHED20362/mpiexec       
	iitac223: tcp        0      0 192.168.115.90:45608    192.168.115.89:37401    ESTABLISHED20362/mpiexec       
	iitac223: tcp        0      0 192.168.115.90:45609    192.168.115.89:37402    ESTABLISHED20362/mpiexec       
	iitac223: tcp        0      0 192.168.115.90:45608    192.168.115.89:37405    ESTABLISHED20362/mpiexec       
	iitac223: tcp        0      0 192.168.115.90:45609    192.168.115.89:37406    ESTABLISHED20362/mpiexec       
	

	iitac222: Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name   
	iitac222: tcp        0      0 192.168.115.89:37402    192.168.115.90:45609    ESTABLISHED4512/strace         
	iitac222: tcp        0      0 192.168.115.89:37406    192.168.115.90:45609    ESTABLISHED4514/strace         
	iitac222: tcp        0      0 192.168.115.89:37401    192.168.115.90:45608    ESTABLISHED4512/strace         
	iitac222: tcp        0      0 192.168.115.89:37405    192.168.115.90:45608    ESTABLISHED4514/strace         



After the sleep, strace shows the call to connect() failing with ECONNREFUSED ("Connection refused"):


	connect(4, {sa_family=AF_INET, sin_port=htons(45691), sin_addr=inet_addr("192.168.115.90")}, 16) = -1 ECONNREFUSED (Connection refused)
	write(1, " => FAILED !!\n\n", 15)       = 15
	rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
	rt_sigaction(SIGCHLD, NULL, {SIG_DFL}, 8) = 0
	rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
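
The failing connect() can be reproduced outside of MPI with a small probe.
What follows is only a minimal sketch: probe_tcp_port() is a hypothetical
helper name, not something from MVAPICH or mpiexec, and it handles
dotted-quad IPv4 addresses only.

```c
#define _POSIX_C_SOURCE 200112L
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical probe: try a TCP connect to ip:port.
 * Returns 0 if the connection was accepted, otherwise an errno value
 * (e.g. ECONNREFUSED when nothing is listening on that port).
 */
int probe_tcp_port(const char *ip, int port)
{
    struct sockaddr_in addr;
    int sock, err = 0;

    assert(ip != NULL);
    if (port < 1 || port > 65535)
        return EINVAL;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons((unsigned short) port);
    if (inet_pton(AF_INET, ip, &addr.sin_addr) != 1)
        return EINVAL;          /* not a dotted-quad IPv4 address */

    sock = socket(AF_INET, SOCK_STREAM, 0);
    if (sock < 0)
        return errno;

    /* Save errno before close(), which may overwrite it. */
    if (connect(sock, (struct sockaddr *) &addr, sizeof(addr)) < 0)
        err = errno;
    close(sock);
    return err;
}
```

Called from the compute node during the sleep with the address and port
from the strace output, this would presumably return ECONNREFUSED, matching
the trace above.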

	
As I said, I'm not sure what this chunk of code is supposed to achieve. 
STDIN, STDOUT and STDERR seem to be routed through mpiexec fine without it.
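
If the block has to stay, one alternative to removing it might be to make
the redirection non-fatal, so that a refused connection leaves the existing
descriptor alone instead of calling exit(1).  A rough sketch of that idea
follows.  The helper names parse_port() and redirect_fd_to_port() are
hypothetical (they are not in the MVAPICH source), and this has not been
tried against mpiexec.

```c
#define _POSIX_C_SOURCE 200112L
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: parse a decimal TCP port; -1 unless in 1..65535. */
int parse_port(const char *str)
{
    char *end = NULL;
    long val;

    if (str == NULL || *str == '\0')
        return -1;
    errno = 0;
    val = strtol(str, &end, 10);
    if (errno != 0 || *end != '\0' || val < 1 || val > 65535)
        return -1;
    return (int) val;
}

/*
 * Hypothetical helper: if env_name is set, connect to host:port and dup
 * the socket onto target_fd.  Returns 1 if redirected, 0 if the variable
 * is unset, -1 on any failure.  On failure target_fd is left untouched.
 */
int redirect_fd_to_port(const char *env_name, struct in_addr host,
                        int target_fd)
{
    struct sockaddr_in addr;
    const char *str = getenv(env_name);
    int port, sock;

    assert(target_fd >= 0);
    if (str == NULL)
        return 0;               /* variable not set: nothing to do */

    port = parse_port(str);
    if (port < 0) {
        fprintf(stderr, "Ignoring invalid %s value \"%s\"\n", env_name, str);
        return -1;
    }

    sock = socket(AF_INET, SOCK_STREAM, 0);
    if (sock < 0) {
        perror("socket");
        return -1;
    }

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr = host;
    addr.sin_port = htons((unsigned short) port);

    /* Warn and carry on rather than exit(1) when the connect fails. */
    if (connect(sock, (struct sockaddr *) &addr, sizeof(addr)) < 0) {
        perror("connect");
        close(sock);
        return -1;
    }

    if (dup2(sock, target_fd) < 0) {
        perror("dup2");
        close(sock);
        return -1;
    }
    if (sock != target_fd)
        close(sock);
    return 1;
}
```

Each port is parsed into its own local variable and that same variable is
range-checked, so a bad value in one environment variable cannot slip past
a check made against a different one.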


We (and a few users) have been running tests on 0.9.7 with this block removed.


Cheers,


- bobb


-- 
Bob B. Crosbie.
Trinity Centre for High Performance Computing,
Room 208, Lloyd Building, Trinity College Dublin.
Tel: +353 1 608 3725                                    http://www.tchpc.tcd.ie
-------------- next part --------------
Index: mpid/vapi/process/pmgr_client_mpirun_rsh.c
===================================================================
--- mpid/vapi/process/pmgr_client_mpirun_rsh.c	(revision 30)
+++ mpid/vapi/process/pmgr_client_mpirun_rsh.c	(working copy)
@@ -167,108 +167,6 @@
     *id_p = pmgr_id;
     *processes_p = pmgr_processes;
 
-     /*
-      *  Route stdout and stderr to mpiexec if applicable  - dskinner at nersc.gov
-      *  if MPIEXEC_STDOUT_PORT and MPIEXEC_STDERR_PORT not detected
-      *  do nothing.  What's on the other side is described at
-      *  Route stdin stdout and stderr to mpiexec  - dskinner at nersc.gov
-      *  if MPIEXEC_STDOUT_PORT MPIEXEC_STDOUT_PORT and MPIEXEC_STDERR_PORT are
-      *  not detected no sockets are created and stdout/stderr are left as is.
-      *  no conditional recompilation should be required due to the above fact.
-      *  What's on the other side of these sockets is described at
-      *  http://www.osc.edu/~pw/mpiexec/
-      *
-      */
-
-   str = getenv("MPIEXEC_STDOUT_PORT");
-   if(str) {
-     mpirun_stdout_port = atoi(str);
-     if (mpirun_port <= 0) {
-         fprintf(stderr, "Invalid MPIEXEC_STDOUT_PORT port %s\n", str);
-         exit(1);
-     }
-    mpirun_stdout_socket = socket(AF_INET, SOCK_STREAM, 0);
-    if (mpirun_stdout_socket < 0) {
-        perror("socket");
-        exit(1);
-    }
-
-    sockaddr.sin_family = AF_INET;
-    sockaddr.sin_addr = *(struct in_addr *) (*mpirun_hostent->h_addr_list);
-    sockaddr.sin_port = htons(mpirun_stdout_port);
-
-    if (connect(mpirun_stdout_socket, (struct sockaddr *) &sockaddr,
-                sizeof(sockaddr)) < 0) {
-        perror("connect");
-        exit(1);
-    }
-
-    fflush(stdout);
-    dup2(mpirun_stdout_socket,1);
-    close(mpirun_stdout_socket);
-
-    /* we have now connected stdout to the mpiexec program */
-   }
-
-   str = getenv("MPIEXEC_STDERR_PORT");
-   if(str) {
-     mpirun_stderr_port = atoi(str);
-     if (mpirun_port <= 0) {
-         fprintf(stderr, "Invalid MPIEXEC_STDERR_PORT port %s\n", str);
-         exit(1);
-     }
-    mpirun_stderr_socket = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
-    if (mpirun_stderr_socket < 0) {
-        perror("socket");
-        exit(1);
-    }
-
-    sockaddr.sin_family = AF_INET;
-    sockaddr.sin_addr = *(struct in_addr *) (*mpirun_hostent->h_addr_list);
-    sockaddr.sin_port = htons(mpirun_stderr_port);
-
-    if (connect(mpirun_stderr_socket, (struct sockaddr *) &sockaddr,
-                sizeof(sockaddr)) < 0) {
-        perror("connect");
-        exit(1);
-    }
-
-    fflush(stderr);
-    dup2(mpirun_stderr_socket,2);
-    close(mpirun_stderr_socket);
-
-    /* we have now connected stderr to the mpiexec program */
-   }
-   str = getenv("MPIEXEC_STDIN_PORT");
-   if(str) {
-     mpirun_stdin_port = atoi(str);
-     if (mpirun_port <= 0) {
-         fprintf(stderr, "Invalid MPIEXEC_STDIN_PORT port %s\n", str);
-         exit(1);
-     }
-    mpirun_stdin_socket = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
-    if (mpirun_stdin_socket < 0) {
-        perror("socket");
-        exit(1);
-    }
-
-    sockaddr.sin_family = AF_INET;
-    sockaddr.sin_addr = *(struct in_addr *) (*mpirun_hostent->h_addr_list);
-    sockaddr.sin_port = htons(mpirun_stdin_port);
-
-    if (connect(mpirun_stdin_socket, (struct sockaddr *) &sockaddr,
-                sizeof(sockaddr)) < 0) {
-        perror("connect");
-        exit(1);
-    }
-
-    fflush(stderr);
-    dup2(mpirun_stdin_socket,0);
-    close(mpirun_stdin_socket);
-
-    /* we have now connected stdin to the mpiexec program */
-   }
-
    return 1;
 }
 

