[mvapich-discuss] MPI programs not running over multiple hosts

Jonathan Perkins perkinjo at cse.ohio-state.edu
Wed Dec 12 11:10:02 EST 2012


Thanks for your note.  It does seem to be failing during initialization in
both hellow and your application.  We'll follow up with you off-list
until we narrow down the cause and then post the results back to the
discuss list.

On Wed, Dec 12, 2012 at 11:59:22AM +0200, John Gilmore wrote:
> Dear all,
> 
> I again ask for your gracious help!
> 
> I'm having issues running programs over multiple hosts, including the MVAPICH2 example "hellow" program. I'm using the following configuration options:
> ./configure -prefix=/home/john/opt/mvapich CFLAGS=-m64 CXXFLAGS=-m64 FFLAGS=-m64 FCFLAGS=-m64 --disable-f77 --disable-fc --disable-mcast --with-device=ch3:mrail --with-rdma=gen2 --enable-g=dbg --enable-error-messages=all
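> 
> In case it is useful, the installed build can be queried with mpiname -a (assuming ~/opt/mvapich/bin is first on my PATH); it reports the MVAPICH2 version together with the configure options used, so you can confirm that this build, and not another MPI, is being picked up:
> 
>     mpiname -a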
> 
> I have the following simple host file:
> # IP:slots
> 10.221.0.1:3
> 10.221.0.2:3
> 10.221.0.3:3
> 10.221.0.4:3
> 
> These four hosts are all reachable without an interactive login (password-less SSH), and their DNS names have been mapped to their IP addresses in the /etc/hosts file on each machine for the benefit of the MVAPICH2 proxy process. A sketch of the mapping is shown below.
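> 
> For illustration, the /etc/hosts entries look roughly like this (hostnames taken from the error output further down; treat this as a sketch of the idea rather than a verbatim copy of the file):
> 
>     10.221.0.1   next-10-221-0-1.vastech.co.za   next-10-221-0-1
>     10.221.0.2   next-10-221-0-2.vastech.co.za   next-10-221-0-2
>     10.221.0.3   next-10-221-0-3.vastech.co.za   next-10-221-0-3
>     10.221.0.4   next-10-221-0-4.vastech.co.za   next-10-221-0-4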
> 
> The program runs fine when I run mpirun_rsh with:
> mpirun_rsh -np 3 -hostfile /(location)/(hostfile) MV2_DEBUG_CORESIZE=unlimited MV2_DEBUG_SHOW_BACKTRACE=1 ./hellow
> i.e. when the program only executes on a single node.
> 
> But it crashes when I run mpirun_rsh with:
> mpirun_rsh -np 4 -hostfile /(location)/(hostfile) MV2_DEBUG_CORESIZE=unlimited MV2_DEBUG_SHOW_BACKTRACE=1 ./hellow
> i.e. when at least one process has to be scheduled on another machine (with 3 slots on the first host, rank 3 lands on 10.221.0.2).
> 
> According to the output, three segmentation faults occur:
> 
> [next-10-221-0-1.vastech.co.za:mpi_rank_2][error_sighandler] Caught error: Segmentation fault (signal 11)
> [next-10-221-0-1.vastech.co.za:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11)
> [next-10-221-0-2.vastech.co.za:mpi_rank_3][error_sighandler] Caught error: Segmentation fault (signal 11)
> 
> The stack trace for each is as follows:
> 
> [next-10-221-0-1.vastech.co.za:mpi_rank_1][print_backtrace]   0: /home/john/opt/mvapich/lib/libmpich.so.3(print_backtrace+0x1c) [0x7f27dc4c280c]
> [next-10-221-0-1.vastech.co.za:mpi_rank_1][print_backtrace]   1: /home/john/opt/mvapich/lib/libmpich.so.3(error_sighandler+0x59) [0x7f27dc4c2919]
> [next-10-221-0-1.vastech.co.za:mpi_rank_1][print_backtrace]   2: /lib64/libpthread.so.0(+0xf500) [0x7f27db515500]
> [next-10-221-0-1.vastech.co.za:mpi_rank_1][print_backtrace]   3: /home/john/opt/mvapich/lib/libmpich.so.3(MPIDI_CH3_PktHandler_EagerSyncAck+0x20) [0x7f27dc4989c0]
> [next-10-221-0-1.vastech.co.za:mpi_rank_1][print_backtrace]   4: /home/john/opt/mvapich/lib/libmpich.so.3(handle_read+0xdf) [0x7f27dc48ec8f]
> [next-10-221-0-1.vastech.co.za:mpi_rank_1][print_backtrace]   5: /home/john/opt/mvapich/lib/libmpich.so.3(MPIDI_CH3I_Progress+0xce) [0x7f27dc48f23e]
> [next-10-221-0-1.vastech.co.za:mpi_rank_1][print_backtrace]   6: /home/john/opt/mvapich/lib/libmpich.so.3(MPIC_Wait+0x35) [0x7f27dc4d2fe5]
> [next-10-221-0-1.vastech.co.za:mpi_rank_1][print_backtrace]   7: /home/john/opt/mvapich/lib/libmpich.so.3(MPIC_Sendrecv+0x13a) [0x7f27dc4d314a]
> [next-10-221-0-1.vastech.co.za:mpi_rank_1][print_backtrace]   8: /home/john/opt/mvapich/lib/libmpich.so.3(MPIC_Sendrecv_ft+0xff) [0x7f27dc4d37ef]
> [next-10-221-0-1.vastech.co.za:mpi_rank_1][print_backtrace]   9: /home/john/opt/mvapich/lib/libmpich.so.3(MPIR_Allgather_intra+0x60e) [0x7f27dc473b7e]
> [next-10-221-0-1.vastech.co.za:mpi_rank_1][print_backtrace]  10: /home/john/opt/mvapich/lib/libmpich.so.3(MPIR_Allgather_MV2+0x106) [0x7f27dc475206]
> [next-10-221-0-1.vastech.co.za:mpi_rank_1][print_backtrace]  11: /home/john/opt/mvapich/lib/libmpich.so.3(create_2level_comm+0x129) [0x7f27dc4b4619]
> [next-10-221-0-1.vastech.co.za:mpi_rank_1][print_backtrace]  12: /home/john/opt/mvapich/lib/libmpich.so.3(MPIR_Init_thread+0x3fc) [0x7f27dc4e57ac]
> [next-10-221-0-1.vastech.co.za:mpi_rank_1][print_backtrace]  13: /home/john/opt/mvapich/lib/libmpich.so.3(MPI_Init+0x95) [0x7f27dc4e5205]
> [next-10-221-0-1.vastech.co.za:mpi_rank_1][print_backtrace]  14: ./hellow() [0x400792]
> [next-10-221-0-1.vastech.co.za:mpi_rank_1][print_backtrace]  15: /lib64/libc.so.6(__libc_start_main+0xed) [0x7f27db16f69d]
> [next-10-221-0-1.vastech.co.za:mpi_rank_1][print_backtrace]  16: ./hellow() [0x4006b9]
> 
> When I run my own program, MPI_Init also appears to be the problem; its stack trace looks as follows:
> [next-10-221-0-4.vastech.co.za:mpi_rank_9][print_backtrace]   0: /home/john/opt/mvapich/lib/libmpich.so.3(print_backtrace+0x1c) [0x7fdedf91780c]
> [next-10-221-0-4.vastech.co.za:mpi_rank_9][print_backtrace]   1: /home/john/opt/mvapich/lib/libmpich.so.3(MPIDI_CH3_Abort+0x5f) [0x7fdedf8df6cf]
> [next-10-221-0-4.vastech.co.za:mpi_rank_9][print_backtrace]   2: /home/john/opt/mvapich/lib/libmpich.so.3(MPID_Abort+0x49) [0x7fdedf93ec79]
> [next-10-221-0-4.vastech.co.za:mpi_rank_9][print_backtrace]   3: /home/john/opt/mvapich/lib/libmpich.so.3(+0xb49cd) [0x7fdedf9189cd]
> [next-10-221-0-4.vastech.co.za:mpi_rank_9][print_backtrace]   4: /home/john/opt/mvapich/lib/libmpich.so.3(MPIR_Err_return_comm+0xf0) [0x7fdedf918ad0]
> [next-10-221-0-4.vastech.co.za:mpi_rank_9][print_backtrace]   5: /home/john/opt/mvapich/lib/libmpich.so.3(MPI_Init+0x1c0) [0x7fdedf93a330]
> [next-10-221-0-4.vastech.co.za:mpi_rank_9][print_backtrace]   6: ./FlowProc(main+0x4f) [0x400de6]
> [next-10-221-0-4.vastech.co.za:mpi_rank_9][print_backtrace]   7: /lib64/libc.so.6(__libc_start_main+0xed) [0x7fdede5c469d]
> [next-10-221-0-4.vastech.co.za:mpi_rank_9][print_backtrace]   8: ./FlowProc() [0x400af9]
> [cli_9]: aborting job:
> Fatal error in MPI_Init:
> Internal MPI error!
> 
> Can anyone please provide some insights? Are there any additional MVAPICH2 flags that I have to set first? My application worked fine when I ran it with Open MPI, so I don't think there is any issue with my InfiniBand setup. Note also that when I use mpiexec to start the program on a remote host, it also works. So as long as the program is executing on any single host, it works.
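> 
> Since both backtraces end up inside MPI_Init (the hellow one via MPIR_Init_thread and create_2level_comm), even a program that does nothing beyond initializing and finalizing should hit the same crash. A minimal sketch, roughly what the hellow example boils down to, compiled with mpicc and launched with the same mpirun_rsh command as above:
> 
>     #include <stdio.h>
>     #include <mpi.h>
> 
>     /* Minimal reproducer: the crash occurs inside MPI_Init itself,
>      * before any user-level communication is attempted. */
>     int main(int argc, char **argv)
>     {
>         int rank, size;
> 
>         MPI_Init(&argc, &argv);
>         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>         MPI_Comm_size(MPI_COMM_WORLD, &size);
>         printf("Hello world from process %d of %d\n", rank, size);
>         MPI_Finalize();
>         return 0;
>     }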
> 
> I appreciate your time.
> Regards
> John Gilmore



> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss


-- 
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo


