[mvapich-discuss] Scientific Linux 6.5 with Mellanox OFED and BLCR 0.8.5

Hari Subramoni subramoni.1 at osu.edu
Thu Jan 15 10:03:19 EST 2015


Can you please try to rerun the application with MV2_USE_SHMEM_COLL=0?
(See <http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.1rc1-userguide.html#x1-24800011.94> for details on this parameter.)
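
For example, reusing the mpirun_rsh command line from your mail (keep or drop
the debug variables as you prefer), the parameter can be added alongside the
others:

    mpirun_rsh -np 24 -hostfile goat.hosts \
        MV2_USE_SHMEM_COLL=0 \
        MV2_CKPT_FILE=/home/zz_ckpt/yea_ MV2_CKPT_INTERVAL=1 \
        MV2_CKPT_USE_AGGREGATION=0 MV2_DEBUG_FT_VERBOSE=1 \
        ./mvpch221a_cellauto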

Thx,
Hari

On Thu, Jan 15, 2015 at 9:28 AM, Arjun J Rao <rectangle.king at gmail.com>
wrote:

>  Thanks for the information above.  Can you also send the output of
> mpiname -a?  I'm looking for the options used to build MVAPICH2.
>
>  Output of mpiname -a is
>
> [root at goat1 ~]# mpiname -a
> MVAPICH2 2.1a Sun Sep 21 12:00:00 EDT 2014 ch3:mrail
>
> Compilation
> CC: gcc -DNDEBUG -DNVALGRIND -O2
> CXX: g++ -DNDEBUG -DNVALGRIND -O2
> F77: gfortran -L/lib -L/lib -O2
> FC: gfortran -O2
>
> Configuration
> --enable-ckpt
>
>  Trying to run checkpointing with environment variables for mpiexec or
> mpiexec.hydra doesn't seem to work at all.
>
>  Are there any failures or does the program just run normally?
>
>
>  Tried running it both with environment variables set....
>
> [root at goat1 cellauto]# env | grep MV2
> MV2_CKPT_INTERVAL=1
> MV2_CKPT_FILE=/home/zz_ckpt/mpiexec_yea
> MV2_CKPT_MAX_SAVE_CKPTS=10
> [root at goat1 cellauto]# mpiexec -n 24 -f goat.hosts ./mvpch221a_cellauto
>
> ... and also by specifying them in the command itself.
>
> [root at goat1 cellauto]# mpiexec -n 24 -f goat.hosts -env
> MV2_CKPT_INTERVAL=1 -env MV2_CKPT_FILE=/home/zz_ckpt/mpiexec_yea
> ./mvpch221a_cellauto
>
> Using mpiexec or mpiexec.hydra simply doesn't work. The program runs
> normally, but no checkpointing attempts are made.
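>
> (For what it's worth, MPICH-derived Hydra has its own checkpoint interface
> rather than the MV2_CKPT_* variables; whether the mpiexec.hydra in this build
> was configured with BLCR support is an assumption on my part, and
> mpiexec.hydra -h would confirm whether the options below exist. If they do,
> something along these lines might be the intended route:
>
>     mpiexec.hydra -n 24 -f goat.hosts \
>         -ckpointlib blcr \
>         -ckpoint-prefix /home/zz_ckpt \
>         -ckpoint-interval 60 \
>         ./mvpch221a_cellauto
>
> The interval here should be in seconds, if I read the Hydra help correctly.)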
>
>
>  One quick observation, you seem to want aggregation disabled.  Can you
> make the following replacement:
>
>     MV2_USE_AGGREGATION=0 -> MV2_CKPT_USE_AGGREGATION=0.
>
>
> I was hopeful about this option, since I may well have been using the wrong
> variable to disable aggregation, but no luck: changing
> MV2_USE_AGGREGATION=0 to MV2_CKPT_USE_AGGREGATION=0 gave the same result as
> before.
>
>
>
>  It seems one of the MPI processes dies while writing out the *2.auto file
> and then the whole thing just crashes. What could be the reason?
>
> At this point we're not sure but getting a backtrace from the
> segmentation fault(s) would be helpful.
>
> Can you try adding the runtime options mentioned in
> <http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.1rc1-userguide.html#x1-1210009.1.11>?
>
> MV2_DEBUG_CORESIZE=unlimited
> MV2_DEBUG_SHOW_BACKTRACE=1
>
> You may need to make a debug build if these don't give us any more
> information.
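>
> (Side note: with MV2_DEBUG_CORESIZE=unlimited the crashing ranks should also
> leave core files behind; a quick sketch of pulling a backtrace out of one,
> assuming the default core naming on the node and the binary from the
> commands above:
>
>     ls core*                                # locate the core dumped by the failing rank
>     gdb ./mvpch221a_cellauto core.<pid>     # <pid> as left by the kernel
>     (gdb) bt
>
> The exact core file name and location depend on the node's core_pattern
> setting.)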
>
>
>  Then I built a debugging build with the options --enable-g=all
> --enable-error-messages=all
>
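> In case it helps anyone reproducing this, the build boiled down to roughly
> the following, run from the mvapich2-2.1a source directory (I am assuming
> BLCR was picked up from its default install paths; otherwise
> --with-blcr=<path> would also be needed):
>
>     ./configure --enable-ckpt --enable-g=all --enable-error-messages=all
>     make && make install
>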
> mpiname -a shows the following output:
> [root at goat1 ~]# mpiname -a
> MVAPICH2 2.1a Sun Sep 21 12:00:00 EDT 2014 ch3:mrail
>
> Compilation
> CC: gcc -DNDEBUG -DNVALGRIND -g -O2
> CXX: g++ -DNDEBUG -DNVALGRIND -g -O2
> F77: gfortran -L/lib -L/lib -g -O2
> FC: gfortran -g -O2
>
> Configuration
> --enable-ckpt --enable-g=all --enable-error-messages=all
>
> Running mpirun_rsh now gives me the following result (this time with the
> options MV2_DEBUG_CORESIZE=unlimited and MV2_DEBUG_SHOW_BACKTRACE=1)
>
> [root at goat1 cellauto]# mpirun_rsh -np 24 -hostfile goat.hosts
> MV2_CKPT_FILE=/home/zz_ckpt/yea_ MV2_CKPT_INTERVAL=1 MV2_DEBUG_FT_VERBOSE=1
> MV2_CKPT_USE_AGGREGATION=0 MV2_DEBUG_CORESIZE=unlimited
> MV2_DEBUG_SHOW_BACKTRACE=1 ./mvpch221a_cellauto
> mpirun_rsh opening file /home/zz_ckpt/yea_.1.auto
> [Rank 7] opening file /home/zz_ckpt/yea_.1.7..
> [Rank 8] opening file /home/zz_ckpt/yea_.1.8..
> [Rank 5] opening file /home/zz_ckpt/yea_.1.5..
> [Rank 6] opening file /home/zz_ckpt/yea_.1.6..
> [Rank 4] opening file /home/zz_ckpt/yea_.1.4..
> [Rank 11] opening file /home/zz_ckpt/yea_.1.11..
> [Rank 1] opening file /home/zz_ckpt/yea_.1.1..
> [Rank 10] opening file /home/zz_ckpt/yea_.1.10..
> [Rank 2] opening file /home/zz_ckpt/yea_.1.2..
> [Rank 9] opening file /home/zz_ckpt/yea_.1.9..
> [Rank 3] opening file /home/zz_ckpt/yea_.1.3..
> [Rank 16] opening file /home/zz_ckpt/yea_.1.16..
> [Rank 15] opening file /home/zz_ckpt/yea_.1.15..
> [Rank 23] opening file /home/zz_ckpt/yea_.1.23..
> [Rank 14] opening file /home/zz_ckpt/yea_.1.14..
> [Rank 21] opening file /home/zz_ckpt/yea_.1.21..
> [Rank 13] opening file /home/zz_ckpt/yea_.1.13..
> [Rank 17] opening file /home/zz_ckpt/yea_.1.17..
> [Rank 19] opening file /home/zz_ckpt/yea_.1.19..
> [Rank 20] opening file /home/zz_ckpt/yea_.1.20..
> [Rank 22] opening file /home/zz_ckpt/yea_.1.22..
> [Rank 18] opening file /home/zz_ckpt/yea_.1.18..
> [Rank 0] opening file /home/zz_ckpt/yea_.1.0..
> [Rank 12] opening file /home/zz_ckpt/yea_.1.12..
> mpirun_rsh opening file /home/zz_ckpt/yea_.2.auto
> [goat1:mpi_rank_11][error_sighandler] Caught error: Segmentation fault
> (signal 11)
> [goat1:mpi_rank_11][print_backtrace]   0:
> /usr/local/lib/libmpi.so.12(print_backtrace+0x1e) [0x7f55b467dd6e]
> [goat1:mpi_rank_11][print_backtrace]   1:
> /usr/local/lib/libmpi.so.12(error_sighandler+0x59) [0x7f55b467de79]
> [goat1:mpi_rank_11][print_backtrace]   2: /lib64/libc.so.6()
> [0x3b4c6329a0]
> [goat1:mpi_rank_11][print_backtrace]   3:
> /lib64/libpthread.so.0(pthread_spin_lock+0) [0x3b4ca0c380]
> [goat1:mpi_rank_11][print_backtrace]   4:
> /usr/local/lib/libmpi.so.12(MPIDI_CH3I_SHMEM_COLL_finalize+0xc4)
> [0x7f55b43b3aa4]
> [goat1:mpi_rank_11][print_backtrace]   5:
> /usr/local/lib/libmpi.so.12(MPIDI_CH3I_SMP_finalize+0x2eb) [0x7f55b46379fb]
> [goat1:mpi_rank_11][print_backtrace]   6:
> /usr/local/lib/libmpi.so.12(CR_IBU_Release_network+0x2e8) [0x7f55b4680388]
> [goat1:mpi_rank_11][print_backtrace]   7:
> /usr/local/lib/libmpi.so.12(CR_IBU_Suspend_channels+0x127) [0x7f55b4680607]
> [goat1:mpi_rank_11][print_backtrace]   8:
> /usr/local/lib/libmpi.so.12(CR_Thread_loop+0x1dd) [0x7f55b46814ed]
> [goat1:mpi_rank_11][print_backtrace]   9:
> /usr/local/lib/libmpi.so.12(CR_Thread_entry+0xea) [0x7f55b4681caa]
> [goat1:mpi_rank_11][print_backtrace]  10: /lib64/libpthread.so.0()
> [0x3b4ca079d1]
> [goat1:mpi_rank_11][print_backtrace]  11: /lib64/libc.so.6(clone+0x6d)
> [0x3b4c6e8b6d]
> [goat1:mpi_rank_0][error_sighandler] Caught error: Segmentation fault
> (signal 11)
> [goat1:mpi_rank_0][print_backtrace]   0:
> /usr/local/lib/libmpi.so.12(print_backtrace+0x1e) [0x7f47b171cd6e]
> [goat1:mpi_rank_0][print_backtrace]   1:
> /usr/local/lib/libmpi.so.12(error_sighandler+0x59) [0x7f47b171ce79]
> [goat1:mpi_rank_0][print_backtrace]   2: /lib64/libc.so.6()
> [0x3b4c6329a0]
> [goat1:mpi_rank_0][print_backtrace]   3:
> /lib64/libpthread.so.0(pthread_spin_lock+0) [0x3b4ca0c380]
> [goat1:mpi_rank_0][print_backtrace]   4:
> /usr/local/lib/libmpi.so.12(MPIDI_CH3I_SHMEM_COLL_finalize+0xc4)
> [0x7f47b1452aa4]
> [goat1:mpi_rank_0][print_backtrace]   5:
> /usr/local/lib/libmpi.so.12(MPIDI_CH3I_SMP_finalize+0x2eb) [0x7f47b16d69fb]
> [goat1:mpi_rank_0][print_backtrace]   6:
> /usr/local/lib/libmpi.so.12(CR_IBU_Release_network+0x2e8) [0x7f47b171f388]
> [goat1:mpi_rank_0][print_backtrace]   7:
> /usr/local/lib/libmpi.so.12(CR_IBU_Suspend_channels+0x127) [0x7f47b171f607]
> [goat1:mpi_rank_0][print_backtrace]   8:
> /usr/local/lib/libmpi.so.12(CR_Thread_loop+0x1dd) [0x7f47b17204ed]
> [goat1:mpi_rank_0][print_backtrace]   9:
> /usr/local/lib/libmpi.so.12(CR_Thread_entry+0xea) [0x7f47b1720caa]
> [goat1:mpi_rank_0][print_backtrace]  10: /lib64/libpthread.so.0()
> [0x3b4ca079d1]
> [goat1:mpi_rank_0][print_backtrace]  11: /lib64/libc.so.6(clone+0x6d)
> [0x3b4c6e8b6d]
> [goat2:mpi_rank_23][error_sighandler] Caught error: Segmentation fault
> (signal 11)
> [goat2:mpi_rank_23][print_backtrace]   0:
> /usr/local/lib/libmpi.so.12(print_backtrace+0x1e) [0x7fed44810d6e]
> [goat2:mpi_rank_23][print_backtrace]   1:
> /usr/local/lib/libmpi.so.12(error_sighandler+0x59) [0x7fed44810e79]
> [goat2:mpi_rank_23][print_backtrace]   2: /lib64/libc.so.6()
> [0x3d8f0329a0]
> [goat2:mpi_rank_23][print_backtrace]   3:
> /lib64/libpthread.so.0(pthread_spin_lock+0) [0x3d8f40c380]
> [goat2:mpi_rank_23][print_backtrace]   4:
> /usr/local/lib/libmpi.so.12(MPIDI_CH3I_SHMEM_COLL_finalize+0xc4)
> [0x7fed44546aa4]
> [goat2:mpi_rank_23][print_backtrace]   5:
> /usr/local/lib/libmpi.so.12(MPIDI_CH3I_SMP_finalize+0x2eb) [0x7fed447ca9fb]
> [goat2:mpi_rank_23][print_backtrace]   6:
> /usr/local/lib/libmpi.so.12(CR_IBU_Release_network+0x2e8) [0x7fed44813388]
> [goat2:mpi_rank_23][print_backtrace]   7:
> /usr/local/lib/libmpi.so.12(CR_IBU_Suspend_channels+0x127) [0x7fed44813607]
> [goat2:mpi_rank_23][print_backtrace]   8:
> /usr/local/lib/libmpi.so.12(CR_Thread_loop+0x1dd) [0x7fed448144ed]
> [goat2:mpi_rank_23][print_backtrace]   9:
> /usr/local/lib/libmpi.so.12(CR_Thread_entry+0xea) [0x7fed44814caa]
> [goat2:mpi_rank_23][print_backtrace]  10: /lib64/libpthread.so.0()
> [0x3d8f4079d1]
> [goat2:mpi_rank_23][print_backtrace]  11: /lib64/libc.so.6(clone+0x6d)
> [0x3d8f0e8b6d]
> [goat2:mpi_rank_12][error_sighandler] Caught error: Segmentation fault
> (signal 11)
> [goat2:mpi_rank_12][print_backtrace]   0:
> /usr/local/lib/libmpi.so.12(print_backtrace+0x1e) [0x7fb2526f9d6e]
> [goat2:mpi_rank_12][print_backtrace]   1:
> /usr/local/lib/libmpi.so.12(error_sighandler+0x59) [0x7fb2526f9e79]
> [goat2:mpi_rank_12][print_backtrace]   2: /lib64/libc.so.6()
> [0x3d8f0329a0]
> [goat2:mpi_rank_12][print_backtrace]   3:
> /lib64/libpthread.so.0(pthread_spin_lock+0) [0x3d8f40c380]
> [goat2:mpi_rank_12][print_backtrace]   4:
> /usr/local/lib/libmpi.so.12(MPIDI_CH3I_SHMEM_COLL_finalize+0xc4)
> [0x7fb25242faa4]
> [goat2:mpi_rank_12][print_backtrace]   5:
> /usr/local/lib/libmpi.so.12(MPIDI_CH3I_SMP_finalize+0x2eb) [0x7fb2526b39fb]
> [goat2:mpi_rank_12][print_backtrace]   6:
> /usr/local/lib/libmpi.so.12(CR_IBU_Release_network+0x2e8) [0x7fb2526fc388]
> [goat2:mpi_rank_12][print_backtrace]   7:
> /usr/local/lib/libmpi.so.12(CR_IBU_Suspend_channels+0x127) [0x7fb2526fc607]
> [goat2:mpi_rank_12][print_backtrace]   8:
> /usr/local/lib/libmpi.so.12(CR_Thread_loop+0x1dd) [0x7fb2526fd4ed]
> [goat2:mpi_rank_12][print_backtrace]   9:
> /usr/local/lib/libmpi.so.12(CR_Thread_entry+0xea) [0x7fb2526fdcaa]
> [goat2:mpi_rank_12][print_backtrace]  10: /lib64/libpthread.so.0()
> [0x3d8f4079d1]
> [goat2:mpi_rank_12][print_backtrace]  11: /lib64/libc.so.6(clone+0x6d)
> [0x3d8f0e8b6d]
> [goat2:mpispawn_1][readline] Unexpected End-Of-File on file descriptor 13.
> MPI process died?
> [goat2:mpispawn_1][mtpmi_processops] Error while reading PMI socket. MPI
> process died?
> [goat2:mpispawn_1][child_handler] MPI process (rank: 23, pid: 3432)
> terminated with signal 11 -> abort job
> [goat1:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 22.
> MPI process died?
> [goat1:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI
> process died?
> [goat1:mpispawn_0][child_handler] MPI process (rank: 11, pid: 2940)
> terminated with signal 11 -> abort job
> [goat1:mpirun_rsh][process_mpispawn_connection] mpispawn_1 from node goat2
> aborted: MPI process error (1)
>
> The dmesg meanwhile prints out stuff like:
>
>
> blcr: warning: skipped a socket.
> .... repeated many times here ....
> blcr: warning: skipped a socket.
> blcr: warning: skipped a socket.
> blcr: chkpt_watchdog: 'mpirun_rsh' (tgid/pid 3138/3139) exited with code 0
> during checkpoint
> blcr: chkpt_watchdog: 'mpirun_rsh' (tgid/pid 3138/3144) exited with code 0
> during checkpoint
> blcr: chkpt_watchdog: 'mpirun_rsh' (tgid/pid 3138/3138) exited with code 1
> during checkpoint
> blcr: chkpt_watchdog: 'mpirun_rsh' (tgid/pid 3138/3140) exited with code 1
> during checkpoint
> blcr: chkpt_watchdog: 'mpirun_rsh' (tgid/pid 3138/3141) exited with code 1
> during checkpoint
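>
> (For completeness, a quick way to confirm that BLCR itself works outside of
> MPI on each node -- these are the standard BLCR 0.8.5 tools and module names,
> and serial_test is just a placeholder for any long-running serial program:
>
>     lsmod | grep blcr            # expect both blcr and blcr_imports to be loaded
>     cr_run ./serial_test &       # run the serial program under BLCR
>     cr_checkpoint --term $!      # checkpoint it and terminate
>     cr_restart context.<pid>     # restart from the context file BLCR wrote
>
> If this works on both nodes, the BLCR kernel side itself is probably fine.)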
>
> One particular run I did just before writing this mail gave slightly
> different output near the end:
>
> [root at goat1 cellauto]# mpirun_rsh -np 24 -hostfile goat.hosts
> MV2_CKPT_FILE=/home/zz_ckpt/yea_ MV2_CKPT_INTERVAL=1 MV2_DEBUG_FT_VERBOSE=1
> MV2_CKPT_USE_AGGREGATION=0 MV2_DEBUG_CORESIZE=unlimited
> MV2_DEBUG_SHOW_BACKTRACE=1 ./mvpch221a_cellauto
> mpirun_rsh opening file /home/zz_ckpt/yea_.1.auto
> [Rank 17] opening file /home/zz_ckpt/yea_.1.17..
> [Rank 14] opening file /home/zz_ckpt/yea_.1.14..
> [Rank 21] opening file /home/zz_ckpt/yea_.1.21..
> [Rank 20] opening file /home/zz_ckpt/yea_.1.20..
> [Rank 13] opening file /home/zz_ckpt/yea_.1.13..
> [Rank 16] opening file /home/zz_ckpt/yea_.1.16..
> [Rank 15] opening file /home/zz_ckpt/yea_.1.15..
> [Rank 18] opening file /home/zz_ckpt/yea_.1.18..
> [Rank 23] opening file /home/zz_ckpt/yea_.1.23..
> [Rank 19] opening file /home/zz_ckpt/yea_.1.19..
> [Rank 22] opening file /home/zz_ckpt/yea_.1.22..
> [Rank 9] opening file /home/zz_ckpt/yea_.1.9..
> [Rank 11] opening file /home/zz_ckpt/yea_.1.11..
> [Rank 10] opening file /home/zz_ckpt/yea_.1.10..
> [Rank 8] opening file /home/zz_ckpt/yea_.1.8..
> [Rank 2] opening file /home/zz_ckpt/yea_.1.2..
> [Rank 6] opening file /home/zz_ckpt/yea_.1.6..
> [Rank 3] opening file /home/zz_ckpt/yea_.1.3..
> [Rank 4] opening file /home/zz_ckpt/yea_.1.4..
> [Rank 5] opening file /home/zz_ckpt/yea_.1.5..
> [Rank 1] opening file /home/zz_ckpt/yea_.1.1..
> [Rank 7] opening file /home/zz_ckpt/yea_.1.7..
> [Rank 12] opening file /home/zz_ckpt/yea_.1.12..
> [Rank 0] opening file /home/zz_ckpt/yea_.1.0..
> mpirun_rsh opening file /home/zz_ckpt/yea_.2.auto
> [goat2:mpi_rank_12][error_sighandler] Caught error: Segmentation fault
> (signal 11)
> [goat2:mpi_rank_23][error_sighandler] Caught error: Segmentation fault
> (signal 11)
> [goat2:mpi_rank_12][print_backtrace]   0:
> /usr/local/lib/libmpi.so.12(print_backtrace+0x1e) [0x7f7b0e0e1d6e]
> [goat2:mpi_rank_12][print_backtrace]   1:
> /usr/local/lib/libmpi.so.12(error_sighandler+0x59) [0x7f7b0e0e1e79]
> [goat2:mpi_rank_12][print_backtrace]   2: /lib64/libc.so.6() [0x3d8f0329a0]
> [goat2:mpi_rank_12][print_backtrace]   3:
> /lib64/libpthread.so.0(pthread_spin_lock+0) [0x3d8f40c380]
> [goat2:mpi_rank_12][print_backtrace]   4:
> /usr/local/lib/libmpi.so.12(MPIDI_CH3I_SHMEM_COLL_finalize+0xc4)
> [0x7f7b0de17aa4]
> [goat2:mpi_rank_12][print_backtrace]   5:
> /usr/local/lib/libmpi.so.12(MPIDI_CH3I_SMP_finalize+0x2eb) [0x7f7b0e09b9fb]
> [goat2:mpi_rank_12][print_backtrace]   6:
> /usr/local/lib/libmpi.so.12(CR_IBU_Release_network+0x2e8) [0x7f7b0e0e4388]
> [goat2:mpi_rank_12][print_backtrace]   7:
> /usr/local/lib/libmpi.so.12(CR_IBU_Suspend_channels+0x127) [0x7f7b0e0e4607]
> [goat2:mpi_rank_12][print_backtrace]   8:
> /usr/local/lib/libmpi.so.12(CR_Thread_loop+0x1dd) [0x7f7b0e0e54ed]
> [goat2:mpi_rank_12][print_backtrace]   9:
> /usr/local/lib/libmpi.so.12(CR_Thread_entry+0xea) [0x7f7b0e0e5caa]
> [goat2:mpi_rank_12][print_backtrace]  10: /lib64/libpthread.so.0()
> [0x3d8f4079d1]
> [goat2:mpi_rank_12][print_backtrace]  11: /lib64/libc.so.6(clone+0x6d)
> [0x3d8f0e8b6d]
> [goat2:mpi_rank_23][print_backtrace]   0:
> /usr/local/lib/libmpi.so.12(print_backtrace+0x1e) [0x7f49e8d0fd6e]
> [goat2:mpi_rank_23][print_backtrace]   1:
> /usr/local/lib/libmpi.so.12(error_sighandler+0x59) [0x7f49e8d0fe79]
> [goat2:mpi_rank_23][print_backtrace]   2: /lib64/libc.so.6() [0x3d8f0329a0]
> [goat2:mpi_rank_23][print_backtrace]   3:
> /lib64/libpthread.so.0(pthread_spin_lock+0) [0x3d8f40c380]
> [goat2:mpi_rank_23][print_backtrace]   4:
> /usr/local/lib/libmpi.so.12(MPIDI_CH3I_SHMEM_COLL_finalize+0xc4)
> [0x7f49e8a45aa4]
> [goat2:mpi_rank_23][print_backtrace]   5:
> /usr/local/lib/libmpi.so.12(MPIDI_CH3I_SMP_finalize+0x2eb) [0x7f49e8cc99fb]
> [goat2:mpi_rank_23][print_backtrace]   6:
> /usr/local/lib/libmpi.so.12(CR_IBU_Release_network+0x2e8) [0x7f49e8d12388]
> [goat2:mpi_rank_23][print_backtrace]   7:
> /usr/local/lib/libmpi.so.12(CR_IBU_Suspend_channels+0x127) [0x7f49e8d12607]
> [goat2:mpi_rank_23][print_backtrace]   8:
> /usr/local/lib/libmpi.so.12(CR_Thread_loop+0x1dd) [0x7f49e8d134ed]
> [goat2:mpi_rank_23][print_backtrace]   9:
> /usr/local/lib/libmpi.so.12(CR_Thread_entry+0xea) [0x7f49e8d13caa]
> [goat2:mpi_rank_23][print_backtrace]  10: /lib64/libpthread.so.0()
> [0x3d8f4079d1]
> [goat2:mpi_rank_23][print_backtrace]  11: /lib64/libc.so.6(clone+0x6d)
> [0x3d8f0e8b6d]
> [goat1:mpi_rank_11][error_sighandler] Caught error: Segmentation fault
> (signal 11)
> [goat1:mpi_rank_0][error_sighandler] Caught error: Segmentation fault
> (signal 11)
> [goat1:mpi_rank_11][print_backtrace]   0:
> /usr/local/lib/libmpi.so.12(print_backtrace+0x1e) [0x7ff93f31bd6e]
> [goat1:mpi_rank_0][print_backtrace]   0:
> /usr/local/lib/libmpi.so.12(print_backtrace+0x1e) [0x7f3c5a915d6e]
> [goat1:mpi_rank_11][print_backtrace]   1:
> /usr/local/lib/libmpi.so.12(error_sighandler+0x59) [0x7ff93f31be79]
> [goat1:mpi_rank_0][print_backtrace]   1:
> /usr/local/lib/libmpi.so.12(error_sighandler+0x59) [0x7f3c5a915e79]
> [goat1:mpi_rank_11][print_backtrace]   2: /lib64/libc.so.6() [0x3b4c6329a0]
> [goat1:mpi_rank_0][print_backtrace]   2: /lib64/libc.so.6() [0x3b4c6329a0]
> [goat1:mpi_rank_11][print_backtrace]   3:
> /lib64/libpthread.so.0(pthread_spin_lock+0) [0x3b4ca0c380]
> [goat1:mpi_rank_0][print_backtrace]   3:
> /lib64/libpthread.so.0(pthread_spin_lock+0) [0x3b4ca0c380]
> [goat1:mpi_rank_11][print_backtrace]   4:
> /usr/local/lib/libmpi.so.12(MPIDI_CH3I_SHMEM_COLL_finalize+0xc4)
> [0x7ff93f051aa4]
> [goat1:mpi_rank_0][print_backtrace]   4:
> /usr/local/lib/libmpi.so.12(MPIDI_CH3I_SHMEM_COLL_finalize+0xc4)
> [0x7f3c5a64baa4]
> [goat1:mpi_rank_11][print_backtrace]   5:
> /usr/local/lib/libmpi.so.12(MPIDI_CH3I_SMP_finalize+0x2eb) [0x7ff93f2d59fb]
> [goat1:mpi_rank_0][print_backtrace]   5:
> /usr/local/lib/libmpi.so.12(MPIDI_CH3I_SMP_finalize+0x2eb) [0x7f3c5a8cf9fb]
> [goat1:mpi_rank_11][print_backtrace]   6:
> /usr/local/lib/libmpi.so.12(CR_IBU_Release_network+0x2e8) [0x7ff93f31e388]
> [goat1:mpi_rank_0][print_backtrace]   6:
> /usr/local/lib/libmpi.so.12(CR_IBU_Release_network+0x2e8) [0x7f3c5a918388]
> [goat1:mpi_rank_11][print_backtrace]   7:
> /usr/local/lib/libmpi.so.12(CR_IBU_Suspend_channels+0x127) [0x7ff93f31e607]
> [goat1:mpi_rank_0][print_backtrace]   7:
> /usr/local/lib/libmpi.so.12(CR_IBU_Suspend_channels+0x127) [0x7f3c5a918607]
> [goat1:mpi_rank_11][print_backtrace]   8:
> /usr/local/lib/libmpi.so.12(CR_Thread_loop+0x1dd) [0x7ff93f31f4ed]
> [goat1:mpi_rank_0][print_backtrace]   8:
> /usr/local/lib/libmpi.so.12(CR_Thread_loop+0x1dd) [0x7f3c5a9194ed]
> [goat1:mpi_rank_11][print_backtrace]   9:
> /usr/local/lib/libmpi.so.12(CR_Thread_entry+0xea) [0x7ff93f31fcaa]
> [goat1:mpi_rank_0][print_backtrace]   9:
> /usr/local/lib/libmpi.so.12(CR_Thread_entry+0xea) [0x7f3c5a919caa]
> [goat1:mpi_rank_11][print_backtrace]  10: /lib64/libpthread.so.0()
> [0x3b4ca079d1]
> [goat1:mpi_rank_0][print_backtrace]  10: /lib64/libpthread.so.0()
> [0x3b4ca079d1]
> [goat1:mpi_rank_11][print_backtrace]  11: /lib64/libc.so.6(clone+0x6d)
> [0x3b4c6e8b6d]
> [goat1:mpi_rank_0][print_backtrace]  11: /lib64/libc.so.6(clone+0x6d)
> [0x3b4c6e8b6d]
> [goat2:mpispawn_1][readline] Unexpected End-Of-File on file descriptor 16.
> MPI process died?
> [goat2:mpispawn_1][mtpmi_processops] Error while reading PMI socket. MPI
> process died?
> [goat2:mpispawn_1][child_handler] MPI process (rank: 23, pid: 4029)
> terminated with signal 11 -> abort job
> [goat1:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 22.
> MPI process died?
> [goat1:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI
> process died?
> [goat1:mpispawn_0][child_handler] MPI process (rank: 11, pid: 3235)
> terminated with signal 11 -> abort job
> [goat1:mpirun_rsh][process_mpispawn_connection] mpispawn_1 from node goat2
> aborted: MPI process error (1)
> [goat1:mpirun_rsh][CR_Callback] Unexpected results from 1: ""
> [goat1:mpirun_rsh][CR_Callback] Some processes failed to checkpoint.
> Abort checkpoint...
> [goat1:mpirun_rsh][request_checkpoint] BLCR call cr_poll_checkpoint()
> failed with error 2354: Temporary error: checkpoint cancelled
> [goat1:mpirun_rsh][CR_Loop] Checkpoint failed
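>
> One more thing I plan to try, to separate the interval-driven path from the
> checkpoint mechanism itself: if I understand the user guide correctly, a
> single checkpoint can also be requested manually by pointing BLCR's
> cr_checkpoint at the mpirun_rsh process (not yet verified here):
>
>     pgrep -f mpirun_rsh                  # PID of the launcher on goat1
>     cr_checkpoint <PID of mpirun_rsh>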
>
>
> From a cursory glance, it looks as if the problem involves the shared
> libraries. Could static linking of the libraries help?
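>
> Before trying that I will double-check how the binary is linked right now; a
> minimal check (the source file and output names below are just placeholders,
> and mpicc simply forwards -static to gcc, so a fully static link only works
> if static versions of all the needed libraries are installed):
>
>     ldd ./mvpch221a_cellauto                      # shows which libmpi.so.12 is loaded at run time
>     mpicc -o cellauto_static cellauto.c -static   # hypothetical source file / output name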
>
>
>
>
> On Wednesday 14 January 2015 09:30 PM, Jonathan Perkins wrote:
>
> On Wed, Jan 14, 2015 at 09:04:05AM +0530, Arjun J Rao wrote:
>
>  I'm trying to get some checkpointing done on my testing system of two
> nodes. Both systems have the following software installed.
>
> MVAPICH2 version: MVAPICH2-2.1a
> BLCR version        : BLCR 0.8.5
> Linux Kernel          : 2.6.32-431.el6.x86_64 (Scientific Linux 6.5)
> OFED version       : Mellanox OFED 2.2-1.0.1 for RHEL/CentOS 6.5
>
> SELinux and iptables are disabled on both the machines.
>
>  Thanks for the information above.  Can you also send the output of
> mpiname -a?  I'm looking for the options used to build MVAPICH2.
>
>
>  Trying to run checkpointing with environment variables for mpiexec or
> mpiexec.hydra doesn't seem to work at all.
>
>  Are there any failures or does the program just run normally?
>
>
>  However, with mpirun_rsh, I get the following output. (Each node has 12
> cores)
>
> mpirun_rsh -np 24 -hostfile hosts MV2_CKPT_FILE=/home/zz_ckpt/yea_
> MV2_CKPT_INTERVAL=1 MV2_DEBUG_FT_VERBOSE=1 MV2_USE_AGGREGATION=0
> ./mvpch221a_cellauto
>
>  One quick observation, you seem to want aggregation disabled.  Can you
> make the following replacement:
>
>     MV2_USE_AGGREGATION=0 -> MV2_CKPT_USE_AGGREGATION=0.
> mpirun_rsh opening file /home/zz_ckpt/yea_.1.auto
> [Rank 8] opening file /tmp/cr-1110124613559/wa/yea_.1.8..
> [Rank 6] opening file /tmp/cr-1110124613559/wa/yea_.1.6..
> [Rank 5] opening file /tmp/cr-1110124613559/wa/yea_.1.5..
> [Rank 4] opening file /tmp/cr-1110124613559/wa/yea_.1.4..
> [Rank 11] opening file /tmp/cr-1110124613559/wa/yea_.1.11..
> [Rank 3] opening file /tmp/cr-1110124613559/wa/yea_.1.3..
> [Rank 9] opening file /tmp/cr-1110124613559/wa/yea_.1.9..
> [Rank 1] opening file /tmp/cr-1110124613559/wa/yea_.1.1..
> [Rank 2] opening file /tmp/cr-1110124613559/wa/yea_.1.2..
> [Rank 10] opening file /tmp/cr-1110124613559/wa/yea_.1.10..
> [Rank 7] opening file /tmp/cr-1110124613559/wa/yea_.1.7..
> [Rank 0] opening file /tmp/cr-1110124613559/wa/yea_.1.0..
> [Rank 18] opening file /tmp/cr-1110124613559/wa/yea_.1.18..
> [Rank 20] opening file /tmp/cr-1110124613559/wa/yea_.1.20..
> [Rank 21] opening file /tmp/cr-1110124613559/wa/yea_.1.21..
> [Rank 22] opening file /tmp/cr-1110124613559/wa/yea_.1.22..
> [Rank 13] opening file /tmp/cr-1110124613559/wa/yea_.1.13..
> [Rank 16] opening file /tmp/cr-1110124613559/wa/yea_.1.16..
> [Rank 23] opening file /tmp/cr-1110124613559/wa/yea_.1.23..
> [Rank 17] opening file /tmp/cr-1110124613559/wa/yea_.1.17..
> [Rank 15] opening file /tmp/cr-1110124613559/wa/yea_.1.15..
> [Rank 19] opening file /tmp/cr-1110124613559/wa/yea_.1.19..
> [Rank 14] opening file /tmp/cr-1110124613559/wa/yea_.1.14..
> [Rank 12] opening file /tmp/cr-1110124613559/wa/yea_.1.12..
> mpirun_rsh opening file /home/zz_ckpt/yea_.2.auto
> [goat1:mpi_rank_11][error_sighandler] Caught error: Segmentation fault
> (signal 11)
> [goat1:mpi_rank_0][error_sighandler] Caught error: Segmentation fault
> (signal 11)
> [goat1:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 14.
> MPI process died?
> [goat1:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI
> process died?
> [goat1:mpispawn_0][child_handler] MPI process (rank: 11, pid: 13599)
> terminated with signal 11 -> abort job
> [goat2:mpi_rank_23][error_sighandler] Caught error: Segmentation fault
> (signal 11)
> [goat2:mpispawn_1][readline] Unexpected End-Of-File on file descriptor 14.
> MPI process died?
> [goat2:mpispawn_1][mtpmi_processops] Error while reading PMI socket. MPI
> process died?
> [goat2:mpispawn_1][child_handler] MPI process (rank: 23, pid: 14273)
> terminated with signal 11 -> abort job
> [goat2:mpi_rank_12][error_sighandler] Caught error: Segmentation fault
> (signal 11)
> [goat1:mpirun_rsh][process_mpispawn_connection] mpispawn_1 from node goat2
> aborted: MPI process error (1)
>
>
> It seems one of the MPI processes dies while writing out the *2.auto file
> and then the whole thing just crashes. What could be the reason?
>
> At this point we're not sure but getting a backtrace from the
> segmentation fault(s) would be helpful.
>
> Can you try adding the runtime options mentioned in
> <http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.1rc1-userguide.html#x1-1210009.1.11>?
>
> MV2_DEBUG_CORESIZE=unlimited
> MV2_DEBUG_SHOW_BACKTRACE=1
>
> You may need to make a debug build if these don't give us any more
> information.
>
>
>
>
>
>