[mvapich-discuss] problem with jobstartup with mvapich 0.96/0.97 and mpiexec
bobb
bobb at tchpc.tcd.ie
Thu Mar 23 16:31:27 EST 2006
Sayantan Sur hath declared on Wednesday the 22 day of March 2006 :-:
> Hi Jimmy,
> > I'd like to ask if it's possible to remove the block of code from the
> > source, or at least put an ifdef in to disable that code by default?
> After some discussion with David Skinner and as you indicated, we found
> out that the code had some unnecessary repetition of an if condition.
>
> We've made the corresponding change, i.e. the if condition is listed only
> once. Can you give our latest code (in the SVN trunk) a shot?
>
> It'll be great if you could give your feedback.
Hi Sayantan,
I have been looking at the latest version (r30) from trunk.
MPI applications are still failing when launched from mpiexec
(mpiexec from OSC, version 0.80 & svn trunk) with the following error:
connect: Connection refused
mpiexec: Error: poll_or_block_event: tm_poll remote 15010: System error.
The one block that remains of the three duplicated blocks is still causing the
problem. If I remove that block (as per the attached patch), mpiexec runs fine.
(The patch only corrects the vapi version. I haven't looked at ch_gen2.)
I'm a little confused about this block of code. From the comment, I presume
that it is meant for use with osc-mpiexec, yet it in fact breaks osc-mpiexec.
The code is failing when it attempts to connect to the ports specified by
the MPIEXEC_*_PORT environment variables. I added a sleep just before
the code in question and ran cpi through strace with mpiexec on 2 dual nodes.
Running `` netstat -tanp | egrep "Foreign|mpiexec|cpi|strace" '' on the two nodes
during the sleep gives:
iitac223: Proto Recv-Q Send-Q Local Address         Foreign Address       State       PID/Program name
iitac223: tcp        0      0 192.168.115.90:45606  0.0.0.0:*             LISTEN      20360/mpiexec
iitac223: tcp        0      0 127.0.0.1:45610       127.0.0.1:15003       ESTABLISHED 20360/mpiexec
iitac223: tcp        0      0 127.0.0.1:45608       127.0.0.1:45618       ESTABLISHED 20362/mpiexec
iitac223: tcp        0      0 127.0.0.1:45609       127.0.0.1:45619       ESTABLISHED 20362/mpiexec
iitac223: tcp        0      0 127.0.0.1:45613       127.0.0.1:45607       ESTABLISHED 20363/strace
iitac223: tcp        0      0 127.0.0.1:45614       127.0.0.1:45608       ESTABLISHED 20363/strace
iitac223: tcp        0      0 127.0.0.1:45618       127.0.0.1:45608       ESTABLISHED 20365/strace
iitac223: tcp        0      0 127.0.0.1:45615       127.0.0.1:45609       ESTABLISHED 20363/strace
iitac223: tcp        0      0 127.0.0.1:45619       127.0.0.1:45609       ESTABLISHED 20365/strace
iitac223: tcp        0      0 127.0.0.1:45608       127.0.0.1:45614       ESTABLISHED 20362/mpiexec
iitac223: tcp        0      0 127.0.0.1:45609       127.0.0.1:45615       ESTABLISHED 20362/mpiexec
iitac223: tcp        0      0 127.0.0.1:45607       127.0.0.1:45613       ESTABLISHED 20362/mpiexec
iitac223: tcp        0      0 192.168.115.90:45608  192.168.115.89:37401  ESTABLISHED 20362/mpiexec
iitac223: tcp        0      0 192.168.115.90:45609  192.168.115.89:37402  ESTABLISHED 20362/mpiexec
iitac223: tcp        0      0 192.168.115.90:45608  192.168.115.89:37405  ESTABLISHED 20362/mpiexec
iitac223: tcp        0      0 192.168.115.90:45609  192.168.115.89:37406  ESTABLISHED 20362/mpiexec
iitac222: Proto Recv-Q Send-Q Local Address         Foreign Address       State       PID/Program name
iitac222: tcp        0      0 192.168.115.89:37402  192.168.115.90:45609  ESTABLISHED 4512/strace
iitac222: tcp        0      0 192.168.115.89:37406  192.168.115.90:45609  ESTABLISHED 4514/strace
iitac222: tcp        0      0 192.168.115.89:37401  192.168.115.90:45608  ESTABLISHED 4512/strace
iitac222: tcp        0      0 192.168.115.89:37405  192.168.115.90:45608  ESTABLISHED 4514/strace
After the sleep, the strace shows the call to connect() failing with a "Connection refused".
connect(4, {sa_family=AF_INET, sin_port=htons(45691), sin_addr=inet_addr("192.168.115.90")}, 16) = -1 ECONNREFUSED (Connection refused)
write(1, " => FAILED !!\n\n", 15) = 15
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGCHLD, NULL, {SIG_DFL}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
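For anyone who wants to reproduce the failing step in isolation, the connect()
call seen in the strace can be sketched as below. This is only an illustration:
try_connect() is a hypothetical helper, not something in the MVAPICH or mpiexec
sources. It opens a TCP socket to a given address and port and hands back the
errno from connect(), which is ECONNREFUSED when nothing is listening on that
port.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of MVAPICH): try to open a TCP connection
 * to addr:port. Returns the connected socket, or -1 with errno preserved
 * from the failing call (e.g. ECONNREFUSED when nothing is listening).
 */
int try_connect(struct in_addr addr, int port)
{
    struct sockaddr_in sa;
    int fd, saved_errno;

    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    memset(&sa, 0, sizeof(sa));
    sa.sin_family = AF_INET;
    sa.sin_addr = addr;
    sa.sin_port = htons((unsigned short) port);

    if (connect(fd, (struct sockaddr *) &sa, sizeof(sa)) < 0) {
        saved_errno = errno;    /* close() may overwrite errno */
        close(fd);
        errno = saved_errno;
        return -1;
    }
    return fd;
}
```

Calling it with the address of the mpiexec node and the value of one of the
MPIEXEC_*_PORT variables should show whether anything is accepting connections
on that port at that moment.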
As I said, I'm not sure what this chunk of code is supposed to achieve.
STDIN, STDOUT and STDERR seem to be routed through mpiexec fine without it.
We (and a few users) have been running tests on 0.9.7 with this block removed.
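If the redirection does need to stay in for some launcher, one possible
alternative to removing it outright is to make it non-fatal. The sketch below
is an assumption on my part, not tested against MVAPICH: redirect_fd_to_port()
is a hypothetical name, and the idea is simply to leave the descriptor alone
when the connect fails instead of calling exit(1). It also validates the port
that was just parsed; the removed block appears to test mpirun_port rather
than mpirun_stdout_port and friends.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical replacement for one of the removed blocks (not MVAPICH code).
 * Reads a port from the environment variable env_name, connects to addr on
 * that port and dup2()s the socket onto target_fd.
 * Returns 1 if target_fd was redirected, 0 if it was left unchanged.
 */
int redirect_fd_to_port(const char *env_name, struct in_addr addr,
                        int target_fd)
{
    const char *str = getenv(env_name);
    struct sockaddr_in sa;
    char *end;
    long port;
    int fd;

    if (str == NULL)
        return 0;               /* variable not set: nothing to do */

    /* Validate the value that was just parsed, not some other variable. */
    errno = 0;
    port = strtol(str, &end, 10);
    if (errno != 0 || end == str || *end != '\0' ||
        port <= 0 || port > 65535) {
        fprintf(stderr, "Ignoring invalid %s value \"%s\"\n", env_name, str);
        return 0;
    }

    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        perror("socket");
        return 0;
    }

    memset(&sa, 0, sizeof(sa));
    sa.sin_family = AF_INET;
    sa.sin_addr = addr;
    sa.sin_port = htons((unsigned short) port);

    if (connect(fd, (struct sockaddr *) &sa, sizeof(sa)) < 0) {
        perror("connect");      /* report the failure, but do not exit(1) */
        close(fd);
        return 0;
    }

    if (dup2(fd, target_fd) < 0) {
        perror("dup2");
        close(fd);
        return 0;
    }
    if (fd != target_fd)
        close(fd);
    return 1;
}
```

The caller would still fflush(stdout) or fflush(stderr) before redirecting
descriptor 1 or 2, as the existing code does. Whether falling back silently is
the right behaviour is a judgment call; it at least keeps jobs launched by
osc-mpiexec from dying at startup.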
Cheers,
- bobb
--
Bob B. Crosbie.
Trinity Centre for High Performance Computing,
Room 208, Lloyd Building, Trinity College Dublin.
Tel: +353 1 608 3725 http://www.tchpc.tcd.ie
-------------- next part --------------
Index: mpid/vapi/process/pmgr_client_mpirun_rsh.c
===================================================================
--- mpid/vapi/process/pmgr_client_mpirun_rsh.c (revision 30)
+++ mpid/vapi/process/pmgr_client_mpirun_rsh.c (working copy)
@@ -167,108 +167,6 @@
*id_p = pmgr_id;
*processes_p = pmgr_processes;
- /*
- * Route stdout and stderr to mpiexec if applicable - dskinner at nersc.gov
- * if MPIEXEC_STDOUT_PORT and MPIEXEC_STDERR_PORT not detected
- * do nothing. What's on the other side is described at
- * Route stdin stdout and stderr to mpiexec - dskinner at nersc.gov
- * if MPIEXEC_STDOUT_PORT MPIEXEC_STDOUT_PORT and MPIEXEC_STDERR_PORT are
- * not detected no sockets are created and stdout/stderr are left as is.
- * no conditional recompilation should be required due to the above fact.
- * What's on the other side of these sockets is described at
- * http://www.osc.edu/~pw/mpiexec/
- *
- */
-
- str = getenv("MPIEXEC_STDOUT_PORT");
- if(str) {
- mpirun_stdout_port = atoi(str);
- if (mpirun_port <= 0) {
- fprintf(stderr, "Invalid MPIEXEC_STDOUT_PORT port %s\n", str);
- exit(1);
- }
- mpirun_stdout_socket = socket(AF_INET, SOCK_STREAM, 0);
- if (mpirun_stdout_socket < 0) {
- perror("socket");
- exit(1);
- }
-
- sockaddr.sin_family = AF_INET;
- sockaddr.sin_addr = *(struct in_addr *) (*mpirun_hostent->h_addr_list);
- sockaddr.sin_port = htons(mpirun_stdout_port);
-
- if (connect(mpirun_stdout_socket, (struct sockaddr *) &sockaddr,
- sizeof(sockaddr)) < 0) {
- perror("connect");
- exit(1);
- }
-
- fflush(stdout);
- dup2(mpirun_stdout_socket,1);
- close(mpirun_stdout_socket);
-
- /* we have now connected stdout to the mpiexec program */
- }
-
- str = getenv("MPIEXEC_STDERR_PORT");
- if(str) {
- mpirun_stderr_port = atoi(str);
- if (mpirun_port <= 0) {
- fprintf(stderr, "Invalid MPIEXEC_STDERR_PORT port %s\n", str);
- exit(1);
- }
- mpirun_stderr_socket = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
- if (mpirun_stderr_socket < 0) {
- perror("socket");
- exit(1);
- }
-
- sockaddr.sin_family = AF_INET;
- sockaddr.sin_addr = *(struct in_addr *) (*mpirun_hostent->h_addr_list);
- sockaddr.sin_port = htons(mpirun_stderr_port);
-
- if (connect(mpirun_stderr_socket, (struct sockaddr *) &sockaddr,
- sizeof(sockaddr)) < 0) {
- perror("connect");
- exit(1);
- }
-
- fflush(stderr);
- dup2(mpirun_stderr_socket,2);
- close(mpirun_stderr_socket);
-
- /* we have now connected stderr to the mpiexec program */
- }
- str = getenv("MPIEXEC_STDIN_PORT");
- if(str) {
- mpirun_stdin_port = atoi(str);
- if (mpirun_port <= 0) {
- fprintf(stderr, "Invalid MPIEXEC_STDIN_PORT port %s\n", str);
- exit(1);
- }
- mpirun_stdin_socket = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
- if (mpirun_stdin_socket < 0) {
- perror("socket");
- exit(1);
- }
-
- sockaddr.sin_family = AF_INET;
- sockaddr.sin_addr = *(struct in_addr *) (*mpirun_hostent->h_addr_list);
- sockaddr.sin_port = htons(mpirun_stdin_port);
-
- if (connect(mpirun_stdin_socket, (struct sockaddr *) &sockaddr,
- sizeof(sockaddr)) < 0) {
- perror("connect");
- exit(1);
- }
-
- fflush(stderr);
- dup2(mpirun_stdin_socket,0);
- close(mpirun_stdin_socket);
-
- /* we have now connected stdin to the mpiexec program */
- }
-
return 1;
}