[mvapich-discuss] MPI_Comm_accept failed when some client connected to the server
马凯
makailove123 at 163.com
Sun Apr 19 23:48:31 EDT 2015
Hi, Hari!
I have rebuilt MVAPICH2 with --enable-debug=dbg --disable-fast.
I ran the server like this:
$ mpirun_rsh -np 1 -hostfile hf MV2_SUPPORT_DPM=1 MV2_DEBUG_SHOW_BACKTRACE=1 ./mpi_port
When the client connected to it, the server still aborted and printed this:
server available at tag#0$description#"#RANK:00000000(00000001:00000422:00000001:00000000)#"$
[gpu-cluster-1:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11)
[gpu-cluster-1:mpi_rank_0][print_backtrace] 0: /usr/local/lib/libmpi.so.12(print_backtrace+0x21) [0x7f053006a731]
[gpu-cluster-1:mpi_rank_0][print_backtrace] 1: /usr/local/lib/libmpi.so.12(error_sighandler+0x5e) [0x7f053006a84e]
[gpu-cluster-1:mpi_rank_0][print_backtrace] 2: /lib/x86_64-linux-gnu/libc.so.6(+0x36c30) [0x7f052f751c30]
[gpu-cluster-1:mpi_rank_0][print_backtrace] 3: /usr/local/lib/libmpi.so.12(MPIDI_CH3I_MRAIL_Parse_header+0x6da) [0x7f052ffdd747]
[gpu-cluster-1:mpi_rank_0][print_backtrace] 4: /usr/local/lib/libmpi.so.12(+0x4c87be) [0x7f052ffa97be]
[gpu-cluster-1:mpi_rank_0][print_backtrace] 5: /usr/local/lib/libmpi.so.12(handle_read+0x81) [0x7f052ffa968d]
[gpu-cluster-1:mpi_rank_0][print_backtrace] 6: /usr/local/lib/libmpi.so.12(MPIDI_CH3I_Progress+0x2a4) [0x7f052ffa72c7]
[gpu-cluster-1:mpi_rank_0][print_backtrace] 7: /usr/local/lib/libmpi.so.12(+0x46222a) [0x7f052ff4322a]
[gpu-cluster-1:mpi_rank_0][print_backtrace] 8: /usr/local/lib/libmpi.so.12(MPIDI_Comm_accept+0x1cc) [0x7f052ff45532]
[gpu-cluster-1:mpi_rank_0][print_backtrace] 9: /usr/local/lib/libmpi.so.12(MPID_Comm_accept+0x6a) [0x7f052ff8ad25]
[gpu-cluster-1:mpi_rank_0][print_backtrace] 10: /usr/local/lib/libmpi.so.12(MPIR_Comm_accept_impl+0x39) [0x7f052fead287]
[gpu-cluster-1:mpi_rank_0][print_backtrace] 11: /usr/local/lib/libmpi.so.12(PMPI_Comm_accept+0x389) [0x7f052fead612]
[gpu-cluster-1:mpi_rank_0][print_backtrace] 12: ./mpi_port() [0x400ae5]
[gpu-cluster-1:mpi_rank_0][print_backtrace] 13: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7f052f73cec5]
[gpu-cluster-1:mpi_rank_0][print_backtrace] 14: ./mpi_port() [0x400959]
[gpu-cluster-1:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 5. MPI process died?
[gpu-cluster-1:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI process died?
[gpu-cluster-1:mpispawn_0][child_handler] MPI process (rank: 0, pid: 22023) terminated with signal 11 -> abort job
[gpu-cluster-1:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node 192.168.2.1 aborted: Error while reading a PMI socket (4)
What is the meaning of this information?
At 2015-04-19 22:58:36, "Hari Subramoni" <subramoni.1 at osu.edu> wrote:
Hello,
Can you please send the following:
1. Output of mpiname -a?
2. Exact command used to run the application
3. Run-time parameters used
Can you re-compile MVAPICH2 with debugging options and run with "MV2_DEBUG_SHOW_BACKTRACE=1"?
Please refer to the following sections of the userguide for more information.
http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.1-userguide.html#x1-1250009.1.14
http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.1-userguide.html#x1-15500010.5
Regards,
Hari.
On Sun, Apr 19, 2015 at 12:22 AM, 马凯 <makailove123 at 163.com> wrote:
I have run into another problem when using port-based communication.
MPI_Open_port seems to work fine, and it gave me this:
server available at tag#0$description#"#RANK:00000000(00000001:0000034a:00000001:00000000)#"$
Then I launched the client to connect to the port using the port_name. But at that moment, the server aborted and printed this:
[gpu-cluster-1:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11)
[gpu-cluster-1:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 5. MPI process died?
[gpu-cluster-1:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI process died?
[gpu-cluster-1:mpispawn_0][child_handler] MPI process (rank: 0, pid: 12810) terminated with signal 11 -> abort job
[gpu-cluster-1:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node 192.168.2.1 aborted: Error while reading a PMI socket (4)
The client did nothing and just hung.
I would appreciate any help!
The following is my server code:
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    int size;
    MPI_Comm client;
    MPI_Status status;
    char port_name[MPI_MAX_PORT_NAME];
    char buf[1024];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size > 1) {
        printf("Server too big\n");
        exit(EXIT_FAILURE);
    }
    MPI_Open_port(MPI_INFO_NULL, port_name);
    printf("server available at %s\n", port_name);
    // MPI_Publish_name("server", MPI_INFO_NULL, port_name);
    MPI_Comm_accept(port_name, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &client);
    printf("Accept successfully\n");
    MPI_Recv(buf, 1024, MPI_CHAR, MPI_ANY_SOURCE, MPI_ANY_TAG, client, &status);
    /* MPI_Comm_disconnect frees the communicator and sets it to
     * MPI_COMM_NULL, so no separate MPI_Comm_free is needed (calling
     * MPI_Comm_free afterwards on MPI_COMM_NULL would be an error). */
    MPI_Comm_disconnect(&client);
    // MPI_Unpublish_name("server", MPI_INFO_NULL, port_name);
    MPI_Close_port(port_name);
    MPI_Finalize();
    return 0;
}
The following is my client code:
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    int size;
    MPI_Comm server;
    char port_name[MPI_MAX_PORT_NAME];
    char buf[1024];

    if (argc < 2) {
        printf("too few arguments\n");
        exit(EXIT_FAILURE);
    }
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size > 1) {
        printf("Client too big\n");
        exit(EXIT_FAILURE);
    }
    // MPI_Lookup_name("server", MPI_INFO_NULL, port_name);
    MPI_Comm_connect(argv[1], MPI_INFO_NULL, 0, MPI_COMM_WORLD, &server);
    printf("Connect successfully\n");
    MPI_Send(buf, 0, MPI_CHAR, 0, 100, server);
    /* As on the server side, MPI_Comm_disconnect frees the communicator;
     * no separate MPI_Comm_free is needed. */
    MPI_Comm_disconnect(&server);
    MPI_Finalize();
    return 0;
}
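For completeness, this is how I build and launch the pair. The binary names below are placeholders for however you compile the two programs, and I am assuming MV2_SUPPORT_DPM=1 is needed on both sides (the userguide only shows it for the spawning side); the quoted port string stands in for whatever MPI_Open_port actually prints:

```shell
# Build both programs with the MVAPICH2 compiler wrapper
mpicc -o mpi_port_server server.c
mpicc -o mpi_port_client client.c

# Terminal 1: start the server with dynamic process support enabled.
# It prints a line like: server available at tag#0$description#"..."$
mpirun_rsh -np 1 -hostfile hf MV2_SUPPORT_DPM=1 ./mpi_port_server

# Terminal 2: pass the printed port name as argv[1], single-quoted,
# since it contains shell metacharacters like # and $.
mpirun_rsh -np 1 -hostfile hf MV2_SUPPORT_DPM=1 ./mpi_port_client 'tag#0$description#"..."$'
```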