[mvapich-discuss] MPI_Comm_accept failed when some client connected to the server

马凯 makailove123 at 163.com
Sun Apr 19 23:48:31 EDT 2015


Hi, Hari!
    I have rebuilt MVAPICH2 with --enable-debug=dbg --disable-fast.
    I run the server like this:
$ mpirun_rsh -np 1 -hostfile hf MV2_SUPPORT_DPM=1 MV2_DEBUG_SHOW_BACKTRACE=1 ./mpi_port
    When the client connected to it, the server still aborted and printed this:
server available at tag#0$description#"#RANK:00000000(00000001:00000422:00000001:00000000)#"$
[gpu-cluster-1:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11)
[gpu-cluster-1:mpi_rank_0][print_backtrace]   0: /usr/local/lib/libmpi.so.12(print_backtrace+0x21) [0x7f053006a731]
[gpu-cluster-1:mpi_rank_0][print_backtrace]   1: /usr/local/lib/libmpi.so.12(error_sighandler+0x5e) [0x7f053006a84e]
[gpu-cluster-1:mpi_rank_0][print_backtrace]   2: /lib/x86_64-linux-gnu/libc.so.6(+0x36c30) [0x7f052f751c30]
[gpu-cluster-1:mpi_rank_0][print_backtrace]   3: /usr/local/lib/libmpi.so.12(MPIDI_CH3I_MRAIL_Parse_header+0x6da) [0x7f052ffdd747]
[gpu-cluster-1:mpi_rank_0][print_backtrace]   4: /usr/local/lib/libmpi.so.12(+0x4c87be) [0x7f052ffa97be]
[gpu-cluster-1:mpi_rank_0][print_backtrace]   5: /usr/local/lib/libmpi.so.12(handle_read+0x81) [0x7f052ffa968d]
[gpu-cluster-1:mpi_rank_0][print_backtrace]   6: /usr/local/lib/libmpi.so.12(MPIDI_CH3I_Progress+0x2a4) [0x7f052ffa72c7]
[gpu-cluster-1:mpi_rank_0][print_backtrace]   7: /usr/local/lib/libmpi.so.12(+0x46222a) [0x7f052ff4322a]
[gpu-cluster-1:mpi_rank_0][print_backtrace]   8: /usr/local/lib/libmpi.so.12(MPIDI_Comm_accept+0x1cc) [0x7f052ff45532]
[gpu-cluster-1:mpi_rank_0][print_backtrace]   9: /usr/local/lib/libmpi.so.12(MPID_Comm_accept+0x6a) [0x7f052ff8ad25]
[gpu-cluster-1:mpi_rank_0][print_backtrace]  10: /usr/local/lib/libmpi.so.12(MPIR_Comm_accept_impl+0x39) [0x7f052fead287]
[gpu-cluster-1:mpi_rank_0][print_backtrace]  11: /usr/local/lib/libmpi.so.12(PMPI_Comm_accept+0x389) [0x7f052fead612]
[gpu-cluster-1:mpi_rank_0][print_backtrace]  12: ./mpi_port() [0x400ae5]
[gpu-cluster-1:mpi_rank_0][print_backtrace]  13: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7f052f73cec5]
[gpu-cluster-1:mpi_rank_0][print_backtrace]  14: ./mpi_port() [0x400959]
[gpu-cluster-1:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 5. MPI process died?
[gpu-cluster-1:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI process died?
[gpu-cluster-1:mpispawn_0][child_handler] MPI process (rank: 0, pid: 22023) terminated with signal 11 -> abort job
[gpu-cluster-1:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node 192.168.2.1 aborted: Error while reading a PMI socket (4)
    What do these error messages mean?






At 2015-04-19 22:58:36, "Hari Subramoni" <subramoni.1 at osu.edu> wrote:

Hello,


Can you please send the following:


1. The output of mpiname -a
2. The exact command used to run the application
3. The run-time parameters used


Can you re-compile MVAPICH2 with debugging options and run with "MV2_DEBUG_SHOW_BACKTRACE=1"?
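
For reference, the debug build and run would look roughly like this (the exact flags are documented in the userguide sections below; the application name and hostfile are placeholders):

$ ./configure --enable-g=dbg --disable-fast && make && make install
$ mpirun_rsh -np 1 -hostfile <hostfile> MV2_SUPPORT_DPM=1 MV2_DEBUG_SHOW_BACKTRACE=1 ./your_app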



Please refer to the following sections of the userguide for more information.


http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.1-userguide.html#x1-1250009.1.14

http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.1-userguide.html#x1-15500010.5



Regards,
Hari.


On Sun, Apr 19, 2015 at 12:22 AM, 马凯 <makailove123 at 163.com> wrote:

    I have run into another problem when using port-based communication.
    MPI_Open_port seems to work fine, and it gave me this:
server available at tag#0$description#"#RANK:00000000(00000001:0000034a:00000001:00000000)#"$


    Then I launched the client to connect to the server using that port_name. At that moment the server aborted and printed this:
 [gpu-cluster-1:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11)
[gpu-cluster-1:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 5. MPI process died?
[gpu-cluster-1:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI process died?
[gpu-cluster-1:mpispawn_0][child_handler] MPI process (rank: 0, pid: 12810) terminated with signal 11 -> abort job
[gpu-cluster-1:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node 192.168.2.1 aborted: Error while reading a PMI socket (4)
    Meanwhile, the client did nothing and just hung.
    I would appreciate any help!




    The following is my server code:
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    int size;
    MPI_Comm client;
    MPI_Status status;
    char port_name[MPI_MAX_PORT_NAME];
    char buf[1024];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if(size > 1) {
        printf("Server too big\n");
        exit(EXIT_FAILURE);
    }

    /* Open a port and print its name so the client can pass it to MPI_Comm_connect. */
    MPI_Open_port(MPI_INFO_NULL, port_name);
    printf("server available at %s\n", port_name);

//  MPI_Publish_name("server", MPI_INFO_NULL, port_name);

    /* Block until a client connects; 'client' becomes an intercommunicator. */
    MPI_Comm_accept(port_name, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &client);
    printf("Accept successfully\n");

    MPI_Recv(buf, 1024, MPI_CHAR, MPI_ANY_SOURCE, MPI_ANY_TAG, client, &status);

    /* MPI_Comm_disconnect frees the communicator and sets it to MPI_COMM_NULL,
       so no additional MPI_Comm_free is needed (freeing MPI_COMM_NULL is an error). */
    MPI_Comm_disconnect(&client);
//  MPI_Unpublish_name("server", MPI_INFO_NULL, port_name);
    MPI_Close_port(port_name);
    MPI_Finalize();
    return 0;
}


    The following is my client code:
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    int size;
    MPI_Comm server;
    char port_name[MPI_MAX_PORT_NAME];  /* only used by the commented-out MPI_Lookup_name path */
    char buf[1024];

    if(argc < 2) {
        printf("too few arguments\n");
        exit(EXIT_FAILURE);
    }

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if(size > 1) {
        printf("Client too big\n");
        exit(EXIT_FAILURE);
    }

    //MPI_Lookup_name("server", MPI_INFO_NULL, port_name);

    /* Connect to the port name given on the command line;
       'server' becomes an intercommunicator to the server process. */
    MPI_Comm_connect(argv[1], MPI_INFO_NULL, 0, MPI_COMM_WORLD, &server);
    printf("Connect successfully\n");

    /* Send an empty message with tag 100 to rank 0 of the server. */
    MPI_Send(buf, 0, MPI_CHAR, 0, 100, server);

    /* MPI_Comm_disconnect frees the communicator and sets it to MPI_COMM_NULL,
       so no additional MPI_Comm_free is needed. */
    MPI_Comm_disconnect(&server);
    MPI_Finalize();
    return 0;
}
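
    For completeness, I launch the two sides roughly like this (the client binary name and the quoting of the port name are just illustrative; the port name printed by the server is passed to the client as its first argument):
$ mpirun_rsh -np 1 -hostfile hf MV2_SUPPORT_DPM=1 ./mpi_port
$ mpirun_rsh -np 1 -hostfile hf MV2_SUPPORT_DPM=1 ./mpi_client '<port_name printed by the server>'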






