[mvapich-discuss] Issue with running MVAPICH

Sourav Chakraborty chakraborty.52 at buckeyemail.osu.edu
Mon May 14 18:15:00 EDT 2018


Hi Michael,

Can you please try setting the following environment variable?

MV2_HCA_AWARE_PROCESS_MAPPING=0

If the issue still persists, you can also try setting MV2_ENABLE_AFFINITY=0.
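
As a rough sketch of how these could be passed: mpirun_rsh accepts NAME=VALUE pairs ahead of the executable, as in the command quoted below. The host names and program are taken from that command. The script only prints the two launch lines (a dry run), and keeping MV2_HCA_AWARE_PROCESS_MAPPING=0 in the second line is an assumption, not a requirement.

```shell
#!/bin/sh
# Dry run: build and print the two candidate launch lines instead of
# executing them, so this is safe on a machine without MVAPICH2.
HOSTS="ubuntu16-gdr-01 ubuntu16-gdr-02"               # hosts from the original report
BASE_ENV="MV2_USE_RoCE=1 MV2_DEBUG_SHOW_BACKTRACE=1"  # settings from the original report

# Step 1: turn off HCA-aware process mapping only.
STEP1="mpirun_rsh -n 2 $HOSTS $BASE_ENV MV2_HCA_AWARE_PROCESS_MAPPING=0 ./mpi_hello_world"

# Step 2: if step 1 still crashes, also turn off CPU affinity
# (assumption: the first variable is kept in place).
STEP2="mpirun_rsh -n 2 $HOSTS $BASE_ENV MV2_HCA_AWARE_PROCESS_MAPPING=0 MV2_ENABLE_AFFINITY=0 ./mpi_hello_world"

echo "$STEP1"
echo "$STEP2"
```

Both variables touch the affinity setup path that appears in the backtrace (smpi_setaffinity calling mv2_get_cpu_core_closest_to_hca), which is why they are worth trying in this order.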

Can you also let us know which adapter you are using so that we can debug
this issue further?
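
One possible way to collect the adapter details, assuming the libibverbs utilities are installed on the nodes, is ibv_devinfo. The helper name report_adapter is made up for this sketch, and the grep only keeps the device name, firmware version, and link layer lines.

```shell
#!/bin/sh
# Print a short summary of the RDMA adapters visible on this node.
# report_adapter is a hypothetical helper name used only in this sketch.
report_adapter() {
    if command -v ibv_devinfo >/dev/null 2>&1; then
        # Keep the lines that identify the device and its link type;
        # fall back to a notice if nothing matches.
        ibv_devinfo 2>/dev/null | grep -E 'hca_id|fw_ver|link_layer' \
            || echo "no RDMA devices reported by ibv_devinfo"
    else
        echo "ibv_devinfo not found (it ships with the libibverbs utilities)"
    fi
}

report_adapter
```

Running it on each node and pasting the output into a reply should be enough to identify the adapter.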

Thanks,
Sourav


On Mon, May 14, 2018 at 5:29 PM Michael Cui <xiaolongc at vmware.com> wrote:

> Hi,
>
>
>
> This is Michael from VMware. I use Open MPI a lot but am a first-time user
> of MVAPICH. I installed MVAPICH 2.3 to run over RoCE across 2 nodes, but I
> am currently getting a segmentation fault when running MPI programs. Here is
> the backtrace for a trivial MPI hello-world program (mpi_hello_world).
>
>
>
> vmware@ubuntu16-gdr-01:~$ mpirun_rsh -n 2 ubuntu16-gdr-01 ubuntu16-gdr-02 MV2_USE_RoCE=1 MV2_DEBUG_SHOW_BACKTRACE=1 mpi_hello_world
>
> [ubuntu16-gdr-01:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11)
> [ubuntu16-gdr-01:mpi_rank_0][print_backtrace]   0: /home/vmware/mvapich_install/lib/libmpi.so.12(print_backtrace+0x2f) [0x7fedf240d44f]
> [ubuntu16-gdr-01:mpi_rank_0][print_backtrace]   1: /home/vmware/mvapich_install/lib/libmpi.so.12(error_sighandler+0x63) [0x7fedf240d593]
> [ubuntu16-gdr-01:mpi_rank_0][print_backtrace]   2: /lib/x86_64-linux-gnu/libc.so.6(+0x354b0) [0x7fedf1c104b0]
> [ubuntu16-gdr-01:mpi_rank_0][print_backtrace]   3: /home/vmware/mvapich_install/lib/libmpi.so.12(_int_malloc+0x1cc) [0x7fedf240688c]
> [ubuntu16-gdr-01:mpi_rank_0][print_backtrace]   4: /home/vmware/mvapich_install/lib/libmpi.so.12(malloc+0x7b) [0x7fedf240752b]
> [ubuntu16-gdr-01:mpi_rank_0][print_backtrace]   5: /lib/x86_64-linux-gnu/libc.so.6(+0x6dcdd) [0x7fedf1c48cdd]
> [ubuntu16-gdr-01:mpi_rank_0][print_backtrace]   6: /home/vmware/mvapich_install/lib/libmpi.so.12(get_ib_socket+0x8b) [0x7fedf2467feb]
> [ubuntu16-gdr-01:mpi_rank_0][print_backtrace]   7: /home/vmware/mvapich_install/lib/libmpi.so.12(mv2_get_cpu_core_closest_to_hca+0x129) [0x7fedf246b2b9]
> [ubuntu16-gdr-01:mpi_rank_0][print_backtrace]   8: /home/vmware/mvapich_install/lib/libmpi.so.12(smpi_setaffinity+0x7d1) [0x7fedf246ccd1]
> [ubuntu16-gdr-01:mpi_rank_0][print_backtrace]   9: /home/vmware/mvapich_install/lib/libmpi.so.12(MPIDI_CH3I_set_affinity+0x200) [0x7fedf246d6d0]
> [ubuntu16-gdr-01:mpi_rank_0][print_backtrace]  10: /home/vmware/mvapich_install/lib/libmpi.so.12(MPID_Init+0x46b) [0x7fedf239672b]
> [ubuntu16-gdr-01:mpi_rank_0][print_backtrace]  11: /home/vmware/mvapich_install/lib/libmpi.so.12(MPIR_Init_thread+0x361) [0x7fedf22ba3b1]
> [ubuntu16-gdr-01:mpi_rank_0][print_backtrace]  12: /home/vmware/mvapich_install/lib/libmpi.so.12(MPI_Init+0xc8) [0x7fedf22b9c38]
> [ubuntu16-gdr-01:mpi_rank_0][print_backtrace]  13: ./mpi_hello_world() [0x4008dc]
> [ubuntu16-gdr-01:mpi_rank_0][print_backtrace]  14: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7fedf1bfb830]
> [ubuntu16-gdr-01:mpi_rank_0][print_backtrace]  15: ./mpi_hello_world() [0x4007d9]
>
> [ubuntu16-gdr-02:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11)
> [ubuntu16-gdr-02:mpi_rank_1][print_backtrace]   0: /home/vmware/mvapich_install/lib/libmpi.so.12(print_backtrace+0x2f) [0x7f01f8e7c44f]
> [ubuntu16-gdr-02:mpi_rank_1][print_backtrace]   1: /home/vmware/mvapich_install/lib/libmpi.so.12(error_sighandler+0x63) [0x7f01f8e7c593]
> [ubuntu16-gdr-02:mpi_rank_1][print_backtrace]   2: /lib/x86_64-linux-gnu/libc.so.6(+0x354b0) [0x7f01f867f4b0]
> [ubuntu16-gdr-02:mpi_rank_1][print_backtrace]   3: /home/vmware/mvapich_install/lib/libmpi.so.12(_int_malloc+0x1cc) [0x7f01f8e7588c]
> [ubuntu16-gdr-02:mpi_rank_1][print_backtrace]   4: /home/vmware/mvapich_install/lib/libmpi.so.12(malloc+0x7b) [0x7f01f8e7652b]
> [ubuntu16-gdr-02:mpi_rank_1][print_backtrace]   5: /lib/x86_64-linux-gnu/libc.so.6(+0x6dcdd) [0x7f01f86b7cdd]
> [ubuntu16-gdr-02:mpi_rank_1][print_backtrace]   6: /home/vmware/mvapich_install/lib/libmpi.so.12(get_ib_socket+0x8b) [0x7f01f8ed6feb]
> [ubuntu16-gdr-02:mpi_rank_1][print_backtrace]   7: /home/vmware/mvapich_install/lib/libmpi.so.12(mv2_get_cpu_core_closest_to_hca+0x129) [0x7f01f8eda2b9]
> [ubuntu16-gdr-02:mpi_rank_1][print_backtrace]   8: /home/vmware/mvapich_install/lib/libmpi.so.12(smpi_setaffinity+0x7d1) [0x7f01f8edbcd1]
> [ubuntu16-gdr-02:mpi_rank_1][print_backtrace]   9: /home/vmware/mvapich_install/lib/libmpi.so.12(MPIDI_CH3I_set_affinity+0x200) [0x7f01f8edc6d0]
> [ubuntu16-gdr-02:mpi_rank_1][print_backtrace]  10: /home/vmware/mvapich_install/lib/libmpi.so.12(MPID_Init+0x46b) [0x7f01f8e0572b]
> [ubuntu16-gdr-02:mpi_rank_1][print_backtrace]  11: /home/vmware/mvapich_install/lib/libmpi.so.12(MPIR_Init_thread+0x361) [0x7f01f8d293b1]
> [ubuntu16-gdr-02:mpi_rank_1][print_backtrace]  12: /home/vmware/mvapich_install/lib/libmpi.so.12(MPI_Init+0xc8) [0x7f01f8d28c38]
> [ubuntu16-gdr-02:mpi_rank_1][print_backtrace]  13: ./mpi_hello_world() [0x4008dc]
> [ubuntu16-gdr-02:mpi_rank_1][print_backtrace]  14: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7f01f866a830]
> [ubuntu16-gdr-02:mpi_rank_1][print_backtrace]  15: ./mpi_hello_world() [0x4007d9]
>
> [ubuntu16-gdr-01:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 6. MPI process died?
> [ubuntu16-gdr-01:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI process died?
> [ubuntu16-gdr-01:mpispawn_0][child_handler] MPI process (rank: 0, pid: 5491) terminated with signal 11 -> abort job
> [ubuntu16-gdr-02:mpispawn_1][readline] Unexpected End-Of-File on file descriptor 6. MPI process died?
> [ubuntu16-gdr-02:mpispawn_1][mtpmi_processops] Error while reading PMI socket. MPI process died?
> [ubuntu16-gdr-02:mpispawn_1][child_handler] MPI process (rank: 1, pid: 18788) terminated with signal 11 -> abort job
> [ubuntu16-gdr-01:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node ubuntu16-gdr-01 aborted: Error while reading a PMI socket (4)
>
>
>
> I am using Ubuntu 16.04, and below is the output from “uname -a”:
>
>     Linux ubuntu16-gdr-01 4.4.0-121-generic #145-Ubuntu SMP Fri Apr 13 13:47:23 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
>
>
>
> The output from the configure, make, and make install steps is attached.
> Thanks for your help!
>
> --
>
> Michael (Xiaolong) Cui
>
> Member of Technical Staff – HPC
>
> Office of the CTO
>
> xiaolongc at vmware.com
>
> 2 Ave de Lafayette, Boston, MA
>
> 617.528.3113 Office
>