[mvapich-discuss] Scientific Linux 6.5 with Mellanox OFED and BLCR 0.8.5

Arjun J Rao rectangle.king at gmail.com
Thu Jan 15 09:28:05 EST 2015


> Thanks for the information above.  Can you also send the output of
> mpiname -a?  I'm looking for the options used to build MVAPICH2.
The output of mpiname -a is:

[root at goat1 ~]# mpiname -a
MVAPICH2 2.1a Sun Sep 21 12:00:00 EDT 2014 ch3:mrail

Compilation
CC: gcc -DNDEBUG -DNVALGRIND -O2
CXX: g++ -DNDEBUG -DNVALGRIND -O2
F77: gfortran -L/lib -L/lib -O2
FC: gfortran -O2

Configuration
--enable-ckpt

>> Trying to run checkpointing with environment variables for mpiexec or
>> mpiexec.hydra doesn't seem to work at all.
> Are there any failures or does the program just run normally?
>
I tried running it both with the environment variables exported in the shell...

[root at goat1 cellauto]# env | grep MV2
MV2_CKPT_INTERVAL=1
MV2_CKPT_FILE=/home/zz_ckpt/mpiexec_yea
MV2_CKPT_MAX_SAVE_CKPTS=10
[root at goat1 cellauto]# mpiexec -n 24 -f goat.hosts ./mvpch221a_cellauto

... and also by passing them on the command line itself.

[root at goat1 cellauto]# mpiexec -n 24 -f goat.hosts -env MV2_CKPT_INTERVAL=1 -env MV2_CKPT_FILE=/home/zz_ckpt/mpiexec_yea ./mvpch221a_cellauto

Either way, mpiexec and mpiexec.hydra simply don't work: the program runs normally, but no checkpointing attempt is ever made.
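Incidentally, if hydra is supposed to drive checkpointing through its own interface rather than the MV2_CKPT_* variables, I would guess the invocation looks something like the following (a sketch based on the MPICH hydra checkpoint/restart flags; I have not verified them against this MVAPICH2 build):

    # hypothetical hydra-native BLCR checkpointing; interval is in seconds per the MPICH docs
    mpiexec.hydra -ckpointlib blcr -ckpoint-prefix /home/zz_ckpt \
        -ckpoint-interval 60 -n 24 -f goat.hosts ./mvpch221a_cellauto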


> One quick observation, you seem to want aggregation disabled.  Can you
> make the following replacement:
>
>      MV2_USE_AGGREGATION=0 -> MV2_CKPT_USE_AGGREGATION=0.

I got excited about this option, hoping I had simply used the wrong variable name to disable aggregation. But no: changing MV2_USE_AGGREGATION=0 to MV2_CKPT_USE_AGGREGATION=0 gave the same result as before.



> It seems one of the MPI processes dies while writing out the *2.auto file
> and then the whole thing just crashes. What could be the reason?
> At this point we're not sure but getting a backtrace from the
> segmentation fault(s) would be helpful.
>
> Can you try adding the runtime options mentioned in
> http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.1rc1-userguide.html#x1-1210009.1.11?
>
> MV2_DEBUG_CORESIZE=unlimited
> MV2_DEBUG_SHOW_BACKTRACE=1
>
> You may need to make a debug build if these don't give us any more
> information.
>
I then made a debug build with the options --enable-g=all and --enable-error-messages=all.
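For reference, the full build sequence was along these lines (a sketch; install prefix and make parallelism are whatever your site normally uses):

    # rebuild with checkpointing plus debug symbols and verbose error messages
    ./configure --enable-ckpt --enable-g=all --enable-error-messages=all
    make && make install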

mpiname -a shows the following output:
[root at goat1 ~]# mpiname -a
MVAPICH2 2.1a Sun Sep 21 12:00:00 EDT 2014 ch3:mrail

Compilation
CC: gcc -DNDEBUG -DNVALGRIND *-g* -O2
CXX: g++ -DNDEBUG -DNVALGRIND *-g* -O2
F77: gfortran -L/lib -L/lib *-g* -O2
FC: gfortran *-g* -O2

Configuration
--enable-ckpt *--enable-g=all --enable-error-messages=all*

Running mpirun_rsh now gives me the following result (this time with the options MV2_DEBUG_CORESIZE=unlimited and MV2_DEBUG_SHOW_BACKTRACE=1 added):

[root at goat1 cellauto]# mpirun_rsh -np 24 -hostfile goat.hosts MV2_CKPT_FILE=/home/zz_ckpt/yea_ MV2_CKPT_INTERVAL=1 MV2_DEBUG_FT_VERBOSE=1 MV2_CKPT_USE_AGGREGATION=0 MV2_DEBUG_CORESIZE=unlimited MV2_DEBUG_SHOW_BACKTRACE=1 ./mvpch221a_cellauto
mpirun_rsh opening file /home/zz_ckpt/yea_.1.auto
[Rank 7] opening file /home/zz_ckpt/yea_.1.7..
[Rank 8] opening file /home/zz_ckpt/yea_.1.8..
[Rank 5] opening file /home/zz_ckpt/yea_.1.5..
[Rank 6] opening file /home/zz_ckpt/yea_.1.6..
[Rank 4] opening file /home/zz_ckpt/yea_.1.4..
[Rank 11] opening file /home/zz_ckpt/yea_.1.11..
[Rank 1] opening file /home/zz_ckpt/yea_.1.1..
[Rank 10] opening file /home/zz_ckpt/yea_.1.10..
[Rank 2] opening file /home/zz_ckpt/yea_.1.2..
[Rank 9] opening file /home/zz_ckpt/yea_.1.9..
[Rank 3] opening file /home/zz_ckpt/yea_.1.3..
[Rank 16] opening file /home/zz_ckpt/yea_.1.16..
[Rank 15] opening file /home/zz_ckpt/yea_.1.15..
[Rank 23] opening file /home/zz_ckpt/yea_.1.23..
[Rank 14] opening file /home/zz_ckpt/yea_.1.14..
[Rank 21] opening file /home/zz_ckpt/yea_.1.21..
[Rank 13] opening file /home/zz_ckpt/yea_.1.13..
[Rank 17] opening file /home/zz_ckpt/yea_.1.17..
[Rank 19] opening file /home/zz_ckpt/yea_.1.19..
[Rank 20] opening file /home/zz_ckpt/yea_.1.20..
[Rank 22] opening file /home/zz_ckpt/yea_.1.22..
[Rank 18] opening file /home/zz_ckpt/yea_.1.18..
[Rank 0] opening file /home/zz_ckpt/yea_.1.0..
[Rank 12] opening file /home/zz_ckpt/yea_.1.12..
mpirun_rsh opening file /home/zz_ckpt/yea_.2.auto
[goat1:mpi_rank_11][error_sighandler] Caught error: Segmentation fault (signal 11)
[goat1:mpi_rank_11][print_backtrace]   0: /usr/local/lib/libmpi.so.12(print_backtrace+0x1e) [0x7f55b467dd6e]
[goat1:mpi_rank_11][print_backtrace]   1: /usr/local/lib/libmpi.so.12(error_sighandler+0x59) [0x7f55b467de79]
[goat1:mpi_rank_11][print_backtrace]   2: /lib64/libc.so.6() [0x3b4c6329a0]
[goat1:mpi_rank_11][print_backtrace]   3: /lib64/libpthread.so.0(pthread_spin_lock+0) [0x3b4ca0c380]
[goat1:mpi_rank_11][print_backtrace]   4: /usr/local/lib/libmpi.so.12(MPIDI_CH3I_SHMEM_COLL_finalize+0xc4) [0x7f55b43b3aa4]
[goat1:mpi_rank_11][print_backtrace]   5: /usr/local/lib/libmpi.so.12(MPIDI_CH3I_SMP_finalize+0x2eb) [0x7f55b46379fb]
[goat1:mpi_rank_11][print_backtrace]   6: /usr/local/lib/libmpi.so.12(CR_IBU_Release_network+0x2e8) [0x7f55b4680388]
[goat1:mpi_rank_11][print_backtrace]   7: /usr/local/lib/libmpi.so.12(CR_IBU_Suspend_channels+0x127) [0x7f55b4680607]
[goat1:mpi_rank_11][print_backtrace]   8: /usr/local/lib/libmpi.so.12(CR_Thread_loop+0x1dd) [0x7f55b46814ed]
[goat1:mpi_rank_11][print_backtrace]   9: /usr/local/lib/libmpi.so.12(CR_Thread_entry+0xea) [0x7f55b4681caa]
[goat1:mpi_rank_11][print_backtrace]  10: /lib64/libpthread.so.0() [0x3b4ca079d1]
[goat1:mpi_rank_11][print_backtrace]  11: /lib64/libc.so.6(clone+0x6d) [0x3b4c6e8b6d]
[goat1:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11)
[goat1:mpi_rank_0][print_backtrace]   0: /usr/local/lib/libmpi.so.12(print_backtrace+0x1e) [0x7f47b171cd6e]
[goat1:mpi_rank_0][print_backtrace]   1: /usr/local/lib/libmpi.so.12(error_sighandler+0x59) [0x7f47b171ce79]
[goat1:mpi_rank_0][print_backtrace]   2: /lib64/libc.so.6() [0x3b4c6329a0]
[goat1:mpi_rank_0][print_backtrace]   3: /lib64/libpthread.so.0(pthread_spin_lock+0) [0x3b4ca0c380]
[goat1:mpi_rank_0][print_backtrace]   4: /usr/local/lib/libmpi.so.12(MPIDI_CH3I_SHMEM_COLL_finalize+0xc4) [0x7f47b1452aa4]
[goat1:mpi_rank_0][print_backtrace]   5: /usr/local/lib/libmpi.so.12(MPIDI_CH3I_SMP_finalize+0x2eb) [0x7f47b16d69fb]
[goat1:mpi_rank_0][print_backtrace]   6: /usr/local/lib/libmpi.so.12(CR_IBU_Release_network+0x2e8) [0x7f47b171f388]
[goat1:mpi_rank_0][print_backtrace]   7: /usr/local/lib/libmpi.so.12(CR_IBU_Suspend_channels+0x127) [0x7f47b171f607]
[goat1:mpi_rank_0][print_backtrace]   8: /usr/local/lib/libmpi.so.12(CR_Thread_loop+0x1dd) [0x7f47b17204ed]
[goat1:mpi_rank_0][print_backtrace]   9: /usr/local/lib/libmpi.so.12(CR_Thread_entry+0xea) [0x7f47b1720caa]
[goat1:mpi_rank_0][print_backtrace]  10: /lib64/libpthread.so.0() [0x3b4ca079d1]
[goat1:mpi_rank_0][print_backtrace]  11: /lib64/libc.so.6(clone+0x6d) [0x3b4c6e8b6d]
[goat2:mpi_rank_23][error_sighandler] Caught error: Segmentation fault (signal 11)
[goat2:mpi_rank_23][print_backtrace]   0: /usr/local/lib/libmpi.so.12(print_backtrace+0x1e) [0x7fed44810d6e]
[goat2:mpi_rank_23][print_backtrace]   1: /usr/local/lib/libmpi.so.12(error_sighandler+0x59) [0x7fed44810e79]
[goat2:mpi_rank_23][print_backtrace]   2: /lib64/libc.so.6() [0x3d8f0329a0]
[goat2:mpi_rank_23][print_backtrace]   3: /lib64/libpthread.so.0(pthread_spin_lock+0) [0x3d8f40c380]
[goat2:mpi_rank_23][print_backtrace]   4: /usr/local/lib/libmpi.so.12(MPIDI_CH3I_SHMEM_COLL_finalize+0xc4) [0x7fed44546aa4]
[goat2:mpi_rank_23][print_backtrace]   5: /usr/local/lib/libmpi.so.12(MPIDI_CH3I_SMP_finalize+0x2eb) [0x7fed447ca9fb]
[goat2:mpi_rank_23][print_backtrace]   6: /usr/local/lib/libmpi.so.12(CR_IBU_Release_network+0x2e8) [0x7fed44813388]
[goat2:mpi_rank_23][print_backtrace]   7: /usr/local/lib/libmpi.so.12(CR_IBU_Suspend_channels+0x127) [0x7fed44813607]
[goat2:mpi_rank_23][print_backtrace]   8: /usr/local/lib/libmpi.so.12(CR_Thread_loop+0x1dd) [0x7fed448144ed]
[goat2:mpi_rank_23][print_backtrace]   9: /usr/local/lib/libmpi.so.12(CR_Thread_entry+0xea) [0x7fed44814caa]
[goat2:mpi_rank_23][print_backtrace]  10: /lib64/libpthread.so.0() [0x3d8f4079d1]
[goat2:mpi_rank_23][print_backtrace]  11: /lib64/libc.so.6(clone+0x6d) [0x3d8f0e8b6d]
[goat2:mpi_rank_12][error_sighandler] Caught error: Segmentation fault (signal 11)
[goat2:mpi_rank_12][print_backtrace]   0: /usr/local/lib/libmpi.so.12(print_backtrace+0x1e) [0x7fb2526f9d6e]
[goat2:mpi_rank_12][print_backtrace]   1: /usr/local/lib/libmpi.so.12(error_sighandler+0x59) [0x7fb2526f9e79]
[goat2:mpi_rank_12][print_backtrace]   2: /lib64/libc.so.6() [0x3d8f0329a0]
[goat2:mpi_rank_12][print_backtrace]   3: /lib64/libpthread.so.0(pthread_spin_lock+0) [0x3d8f40c380]
[goat2:mpi_rank_12][print_backtrace]   4: /usr/local/lib/libmpi.so.12(MPIDI_CH3I_SHMEM_COLL_finalize+0xc4) [0x7fb25242faa4]
[goat2:mpi_rank_12][print_backtrace]   5: /usr/local/lib/libmpi.so.12(MPIDI_CH3I_SMP_finalize+0x2eb) [0x7fb2526b39fb]
[goat2:mpi_rank_12][print_backtrace]   6: /usr/local/lib/libmpi.so.12(CR_IBU_Release_network+0x2e8) [0x7fb2526fc388]
[goat2:mpi_rank_12][print_backtrace]   7: /usr/local/lib/libmpi.so.12(CR_IBU_Suspend_channels+0x127) [0x7fb2526fc607]
[goat2:mpi_rank_12][print_backtrace]   8: /usr/local/lib/libmpi.so.12(CR_Thread_loop+0x1dd) [0x7fb2526fd4ed]
[goat2:mpi_rank_12][print_backtrace]   9: /usr/local/lib/libmpi.so.12(CR_Thread_entry+0xea) [0x7fb2526fdcaa]
[goat2:mpi_rank_12][print_backtrace]  10: /lib64/libpthread.so.0() [0x3d8f4079d1]
[goat2:mpi_rank_12][print_backtrace]  11: /lib64/libc.so.6(clone+0x6d) [0x3d8f0e8b6d]
[goat2:mpispawn_1][readline] Unexpected End-Of-File on file descriptor 13. MPI process died?
[goat2:mpispawn_1][mtpmi_processops] Error while reading PMI socket. MPI process died?
[goat2:mpispawn_1][child_handler] MPI process (rank: 23, pid: 3432) terminated with signal 11 -> abort job
[goat1:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 22. MPI process died?
[goat1:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI process died?
[goat1:mpispawn_0][child_handler] MPI process (rank: 11, pid: 2940) terminated with signal 11 -> abort job
[goat1:mpirun_rsh][process_mpispawn_connection] mpispawn_1 from node goat2 aborted: MPI process error (1)
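Since MV2_DEBUG_CORESIZE=unlimited should leave core files behind, I can also pull fuller backtraces out of them with gdb if that would help (a sketch; the actual core file name depends on kernel.core_pattern, and pid 2940 is rank 11 from the run above):

    # load the failed rank's core and dump every thread's stack
    gdb ./mvpch221a_cellauto core.2940
    (gdb) thread apply all bt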

Meanwhile, dmesg prints messages like the following:


blcr: warning: skipped a socket.
/.... repeated many times here..../
blcr: warning: skipped a socket.
blcr: warning: skipped a socket.
blcr: chkpt_watchdog: 'mpirun_rsh' (tgid/pid 3138/3139) exited with code 0 during checkpoint
blcr: chkpt_watchdog: 'mpirun_rsh' (tgid/pid 3138/3144) exited with code 0 during checkpoint
blcr: chkpt_watchdog: 'mpirun_rsh' (tgid/pid 3138/3138) exited with code 1 during checkpoint
blcr: chkpt_watchdog: 'mpirun_rsh' (tgid/pid 3138/3140) exited with code 1 during checkpoint
blcr: chkpt_watchdog: 'mpirun_rsh' (tgid/pid 3138/3141) exited with code 1 during checkpoint
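The "skipped a socket" warnings look expected to me, since BLCR does not checkpoint open sockets; that should be why MVAPICH2 suspends its channels first (CR_IBU_Suspend_channels in the backtraces above). To rule out BLCR itself, I can checkpoint and restart a plain serial process outside MPI, roughly like this (a sketch; serial_app stands for any non-MPI binary):

    # sanity check of BLCR alone, no MPI involved
    cr_run ./serial_app &
    cr_checkpoint --term $!      # writes context.<pid> in the current directory
    cr_restart context.<pid>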

One particular run I did just before writing this mail gave slightly different output near the end:

[root at goat1 cellauto]# mpirun_rsh -np 24 -hostfile goat.hosts MV2_CKPT_FILE=/home/zz_ckpt/yea_ MV2_CKPT_INTERVAL=1 MV2_DEBUG_FT_VERBOSE=1 MV2_CKPT_USE_AGGREGATION=0 MV2_DEBUG_CORESIZE=unlimited MV2_DEBUG_SHOW_BACKTRACE=1 ./mvpch221a_cellauto
mpirun_rsh opening file /home/zz_ckpt/yea_.1.auto
[Rank 17] opening file /home/zz_ckpt/yea_.1.17..
[Rank 14] opening file /home/zz_ckpt/yea_.1.14..
[Rank 21] opening file /home/zz_ckpt/yea_.1.21..
[Rank 20] opening file /home/zz_ckpt/yea_.1.20..
[Rank 13] opening file /home/zz_ckpt/yea_.1.13..
[Rank 16] opening file /home/zz_ckpt/yea_.1.16..
[Rank 15] opening file /home/zz_ckpt/yea_.1.15..
[Rank 18] opening file /home/zz_ckpt/yea_.1.18..
[Rank 23] opening file /home/zz_ckpt/yea_.1.23..
[Rank 19] opening file /home/zz_ckpt/yea_.1.19..
[Rank 22] opening file /home/zz_ckpt/yea_.1.22..
[Rank 9] opening file /home/zz_ckpt/yea_.1.9..
[Rank 11] opening file /home/zz_ckpt/yea_.1.11..
[Rank 10] opening file /home/zz_ckpt/yea_.1.10..
[Rank 8] opening file /home/zz_ckpt/yea_.1.8..
[Rank 2] opening file /home/zz_ckpt/yea_.1.2..
[Rank 6] opening file /home/zz_ckpt/yea_.1.6..
[Rank 3] opening file /home/zz_ckpt/yea_.1.3..
[Rank 4] opening file /home/zz_ckpt/yea_.1.4..
[Rank 5] opening file /home/zz_ckpt/yea_.1.5..
[Rank 1] opening file /home/zz_ckpt/yea_.1.1..
[Rank 7] opening file /home/zz_ckpt/yea_.1.7..
[Rank 12] opening file /home/zz_ckpt/yea_.1.12..
[Rank 0] opening file /home/zz_ckpt/yea_.1.0..
mpirun_rsh opening file /home/zz_ckpt/yea_.2.auto
[goat2:mpi_rank_12][error_sighandler] Caught error: Segmentation fault (signal 11)
[goat2:mpi_rank_23][error_sighandler] Caught error: Segmentation fault (signal 11)
[goat2:mpi_rank_12][print_backtrace]   0: /usr/local/lib/libmpi.so.12(print_backtrace+0x1e) [0x7f7b0e0e1d6e]
[goat2:mpi_rank_12][print_backtrace]   1: /usr/local/lib/libmpi.so.12(error_sighandler+0x59) [0x7f7b0e0e1e79]
[goat2:mpi_rank_12][print_backtrace]   2: /lib64/libc.so.6() [0x3d8f0329a0]
[goat2:mpi_rank_12][print_backtrace]   3: /lib64/libpthread.so.0(pthread_spin_lock+0) [0x3d8f40c380]
[goat2:mpi_rank_12][print_backtrace]   4: /usr/local/lib/libmpi.so.12(MPIDI_CH3I_SHMEM_COLL_finalize+0xc4) [0x7f7b0de17aa4]
[goat2:mpi_rank_12][print_backtrace]   5: /usr/local/lib/libmpi.so.12(MPIDI_CH3I_SMP_finalize+0x2eb) [0x7f7b0e09b9fb]
[goat2:mpi_rank_12][print_backtrace]   6: /usr/local/lib/libmpi.so.12(CR_IBU_Release_network+0x2e8) [0x7f7b0e0e4388]
[goat2:mpi_rank_12][print_backtrace]   7: /usr/local/lib/libmpi.so.12(CR_IBU_Suspend_channels+0x127) [0x7f7b0e0e4607]
[goat2:mpi_rank_12][print_backtrace]   8: /usr/local/lib/libmpi.so.12(CR_Thread_loop+0x1dd) [0x7f7b0e0e54ed]
[goat2:mpi_rank_12][print_backtrace]   9: /usr/local/lib/libmpi.so.12(CR_Thread_entry+0xea) [0x7f7b0e0e5caa]
[goat2:mpi_rank_12][print_backtrace]  10: /lib64/libpthread.so.0() [0x3d8f4079d1]
[goat2:mpi_rank_12][print_backtrace]  11: /lib64/libc.so.6(clone+0x6d) [0x3d8f0e8b6d]
[goat2:mpi_rank_23][print_backtrace]   0: /usr/local/lib/libmpi.so.12(print_backtrace+0x1e) [0x7f49e8d0fd6e]
[goat2:mpi_rank_23][print_backtrace]   1: /usr/local/lib/libmpi.so.12(error_sighandler+0x59) [0x7f49e8d0fe79]
[goat2:mpi_rank_23][print_backtrace]   2: /lib64/libc.so.6() [0x3d8f0329a0]
[goat2:mpi_rank_23][print_backtrace]   3: /lib64/libpthread.so.0(pthread_spin_lock+0) [0x3d8f40c380]
[goat2:mpi_rank_23][print_backtrace]   4: /usr/local/lib/libmpi.so.12(MPIDI_CH3I_SHMEM_COLL_finalize+0xc4) [0x7f49e8a45aa4]
[goat2:mpi_rank_23][print_backtrace]   5: /usr/local/lib/libmpi.so.12(MPIDI_CH3I_SMP_finalize+0x2eb) [0x7f49e8cc99fb]
[goat2:mpi_rank_23][print_backtrace]   6: /usr/local/lib/libmpi.so.12(CR_IBU_Release_network+0x2e8) [0x7f49e8d12388]
[goat2:mpi_rank_23][print_backtrace]   7: /usr/local/lib/libmpi.so.12(CR_IBU_Suspend_channels+0x127) [0x7f49e8d12607]
[goat2:mpi_rank_23][print_backtrace]   8: /usr/local/lib/libmpi.so.12(CR_Thread_loop+0x1dd) [0x7f49e8d134ed]
[goat2:mpi_rank_23][print_backtrace]   9: /usr/local/lib/libmpi.so.12(CR_Thread_entry+0xea) [0x7f49e8d13caa]
[goat2:mpi_rank_23][print_backtrace]  10: /lib64/libpthread.so.0() [0x3d8f4079d1]
[goat2:mpi_rank_23][print_backtrace]  11: /lib64/libc.so.6(clone+0x6d) [0x3d8f0e8b6d]
[goat1:mpi_rank_11][error_sighandler] Caught error: Segmentation fault (signal 11)
[goat1:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11)
[goat1:mpi_rank_11][print_backtrace]   0: /usr/local/lib/libmpi.so.12(print_backtrace+0x1e) [0x7ff93f31bd6e]
[goat1:mpi_rank_0][print_backtrace]   0: /usr/local/lib/libmpi.so.12(print_backtrace+0x1e) [0x7f3c5a915d6e]
[goat1:mpi_rank_11][print_backtrace]   1: /usr/local/lib/libmpi.so.12(error_sighandler+0x59) [0x7ff93f31be79]
[goat1:mpi_rank_0][print_backtrace]   1: /usr/local/lib/libmpi.so.12(error_sighandler+0x59) [0x7f3c5a915e79]
[goat1:mpi_rank_11][print_backtrace]   2: /lib64/libc.so.6() [0x3b4c6329a0]
[goat1:mpi_rank_0][print_backtrace]   2: /lib64/libc.so.6() [0x3b4c6329a0]
[goat1:mpi_rank_11][print_backtrace]   3: /lib64/libpthread.so.0(pthread_spin_lock+0) [0x3b4ca0c380]
[goat1:mpi_rank_0][print_backtrace]   3: /lib64/libpthread.so.0(pthread_spin_lock+0) [0x3b4ca0c380]
[goat1:mpi_rank_11][print_backtrace]   4: /usr/local/lib/libmpi.so.12(MPIDI_CH3I_SHMEM_COLL_finalize+0xc4) [0x7ff93f051aa4]
[goat1:mpi_rank_0][print_backtrace]   4: /usr/local/lib/libmpi.so.12(MPIDI_CH3I_SHMEM_COLL_finalize+0xc4) [0x7f3c5a64baa4]
[goat1:mpi_rank_11][print_backtrace]   5: /usr/local/lib/libmpi.so.12(MPIDI_CH3I_SMP_finalize+0x2eb) [0x7ff93f2d59fb]
[goat1:mpi_rank_0][print_backtrace]   5: /usr/local/lib/libmpi.so.12(MPIDI_CH3I_SMP_finalize+0x2eb) [0x7f3c5a8cf9fb]
[goat1:mpi_rank_11][print_backtrace]   6: /usr/local/lib/libmpi.so.12(CR_IBU_Release_network+0x2e8) [0x7ff93f31e388]
[goat1:mpi_rank_0][print_backtrace]   6: /usr/local/lib/libmpi.so.12(CR_IBU_Release_network+0x2e8) [0x7f3c5a918388]
[goat1:mpi_rank_11][print_backtrace]   7: /usr/local/lib/libmpi.so.12(CR_IBU_Suspend_channels+0x127) [0x7ff93f31e607]
[goat1:mpi_rank_0][print_backtrace]   7: /usr/local/lib/libmpi.so.12(CR_IBU_Suspend_channels+0x127) [0x7f3c5a918607]
[goat1:mpi_rank_11][print_backtrace]   8: /usr/local/lib/libmpi.so.12(CR_Thread_loop+0x1dd) [0x7ff93f31f4ed]
[goat1:mpi_rank_0][print_backtrace]   8: /usr/local/lib/libmpi.so.12(CR_Thread_loop+0x1dd) [0x7f3c5a9194ed]
[goat1:mpi_rank_11][print_backtrace]   9: /usr/local/lib/libmpi.so.12(CR_Thread_entry+0xea) [0x7ff93f31fcaa]
[goat1:mpi_rank_0][print_backtrace]   9: /usr/local/lib/libmpi.so.12(CR_Thread_entry+0xea) [0x7f3c5a919caa]
[goat1:mpi_rank_11][print_backtrace]  10: /lib64/libpthread.so.0() [0x3b4ca079d1]
[goat1:mpi_rank_0][print_backtrace]  10: /lib64/libpthread.so.0() [0x3b4ca079d1]
[goat1:mpi_rank_11][print_backtrace]  11: /lib64/libc.so.6(clone+0x6d) [0x3b4c6e8b6d]
[goat1:mpi_rank_0][print_backtrace]  11: /lib64/libc.so.6(clone+0x6d) [0x3b4c6e8b6d]
[goat2:mpispawn_1][readline] Unexpected End-Of-File on file descriptor 16. MPI process died?
[goat2:mpispawn_1][mtpmi_processops] Error while reading PMI socket. MPI process died?
[goat2:mpispawn_1][child_handler] MPI process (rank: 23, pid: 4029) terminated with signal 11 -> abort job
[goat1:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 22. MPI process died?
[goat1:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI process died?
[goat1:mpispawn_0][child_handler] MPI process (rank: 11, pid: 3235) terminated with signal 11 -> abort job
[goat1:mpirun_rsh][process_mpispawn_connection] mpispawn_1 from node goat2 aborted: MPI process error (1)
[goat1:mpirun_rsh][CR_Callback] Unexpected results from 1: ""
[goat1:mpirun_rsh][CR_Callback] Some processes failed to checkpoint. Abort checkpoint...
[goat1:mpirun_rsh][request_checkpoint] BLCR call cr_poll_checkpoint() failed with error 2354: Temporary error: checkpoint cancelled
[goat1:mpirun_rsh][CR_Loop] Checkpoint failed


At a cursory glance, it seems the problem involves the shared libraries. Could statically linking them help?
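If it is worth trying, I assume a static rebuild would look something like this (the --enable-static/--disable-shared flags are my assumption, not something I found in the userguide):

    # hypothetical static rebuild of the same debug configuration
    ./configure --enable-ckpt --enable-g=all --enable-error-messages=all \
        --enable-static --disable-shared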




On Wednesday 14 January 2015 09:30 PM, Jonathan Perkins wrote:
> On Wed, Jan 14, 2015 at 09:04:05AM +0530, Arjun J Rao wrote:
>> I'm trying to get some checkpointing done on my testing system of two
>> nodes. Both systems have the following software installed.
>>
>> MVAPICH2 version: MVAPICH2-2.1a
>> BLCR version        : BLCR 0.8.5
>> Linux Kernel          : 2.6.32-431.el6.x86_64 (Scientific Linux 6.5)
>> OFED version       : Mellanox OFED 2.2-1.0.1 for RHEL/CentOS 6.5
>>
>> SELinux and iptables are disabled on both the machines.
> Thanks for the information above.  Can you also send the output of
> mpiname -a?  I'm looking for the options used to build MVAPICH2.
>
>> Trying to run checkpointing with environment variables for mpiexec or
>> mpiexec.hydra doesn't seem to work at all.
> Are there any failures or does the program just run normally?
>
>> However, with mpirun_rsh, I get the following output. (Each node has 12
>> cores)
>>
>> mpirun_rsh -np 24 -hostfile hosts MV2_CKPT_FILE=/home/zz_ckpt/yea_
>> MV2_CKPT_INTERVAL=1 MV2_DEBUG_FT_VERBOSE=1 MV2_USE_AGGREGATION=0
>> ./mvpch221a_cellauto
> One quick observation, you seem to want aggregation disabled.  Can you
> make the following replacement:
>
>      MV2_USE_AGGREGATION=0 -> MV2_CKPT_USE_AGGREGATION=0.
> mpirun_rsh opening file /home/zz_ckpt/yea_.1.auto
> [Rank 8] opening file /tmp/cr-1110124613559/wa/yea_.1.8..
> [Rank 6] opening file /tmp/cr-1110124613559/wa/yea_.1.6..
> [Rank 5] opening file /tmp/cr-1110124613559/wa/yea_.1.5..
> [Rank 4] opening file /tmp/cr-1110124613559/wa/yea_.1.4..
> [Rank 11] opening file /tmp/cr-1110124613559/wa/yea_.1.11..
> [Rank 3] opening file /tmp/cr-1110124613559/wa/yea_.1.3..
> [Rank 9] opening file /tmp/cr-1110124613559/wa/yea_.1.9..
> [Rank 1] opening file /tmp/cr-1110124613559/wa/yea_.1.1..
> [Rank 2] opening file /tmp/cr-1110124613559/wa/yea_.1.2..
> [Rank 10] opening file /tmp/cr-1110124613559/wa/yea_.1.10..
> [Rank 7] opening file /tmp/cr-1110124613559/wa/yea_.1.7..
> [Rank 0] opening file /tmp/cr-1110124613559/wa/yea_.1.0..
> [Rank 18] opening file /tmp/cr-1110124613559/wa/yea_.1.18..
> [Rank 20] opening file /tmp/cr-1110124613559/wa/yea_.1.20..
> [Rank 21] opening file /tmp/cr-1110124613559/wa/yea_.1.21..
> [Rank 22] opening file /tmp/cr-1110124613559/wa/yea_.1.22..
> [Rank 13] opening file /tmp/cr-1110124613559/wa/yea_.1.13..
> [Rank 16] opening file /tmp/cr-1110124613559/wa/yea_.1.16..
> [Rank 23] opening file /tmp/cr-1110124613559/wa/yea_.1.23..
> [Rank 17] opening file /tmp/cr-1110124613559/wa/yea_.1.17..
> [Rank 15] opening file /tmp/cr-1110124613559/wa/yea_.1.15..
> [Rank 19] opening file /tmp/cr-1110124613559/wa/yea_.1.19..
> [Rank 14] opening file /tmp/cr-1110124613559/wa/yea_.1.14..
> [Rank 12] opening file /tmp/cr-1110124613559/wa/yea_.1.12..
> mpirun_rsh opening file /home/zz_ckpt/yea_.2.auto
> [goat1:mpi_rank_11][error_sighandler] Caught error: Segmentation fault
> (signal 11)
> [goat1:mpi_rank_0][error_sighandler] Caught error: Segmentation fault
> (signal 11)
> [goat1:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 14.
> MPI process died?
> [goat1:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI
> process died?
> [goat1:mpispawn_0][child_handler] MPI process (rank: 11, pid: 13599)
> terminated with signal 11 -> abort job
> [goat2:mpi_rank_23][error_sighandler] Caught error: Segmentation fault
> (signal 11)
> [goat2:mpispawn_1][readline] Unexpected End-Of-File on file descriptor 14.
> MPI process died?
> [goat2:mpispawn_1][mtpmi_processops] Error while reading PMI socket. MPI
> process died?
> [goat2:mpispawn_1][child_handler] MPI process (rank: 23, pid: 14273)
> terminated with signal 11 -> abort job
> [goat2:mpi_rank_12][error_sighandler] Caught error: Segmentation fault
> (signal 11)
> [goat1:mpirun_rsh][process_mpispawn_connection] mpispawn_1 from node goat2
> aborted: MPI process error (1)
>
>
> It seems one of the MPI processes dies while writing out the *2.auto file
> and then the whole thing just crashes. What could be the reason?
> At this point we're not sure but getting a backtrace from the
> segmentation fault(s) would be helpful.
>
> Can you try adding the runtime options mentioned in
> http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.1rc1-userguide.html#x1-1210009.1.11?
>
> MV2_DEBUG_CORESIZE=unlimited
> MV2_DEBUG_SHOW_BACKTRACE=1
>
> You may need to make a debug build if these don't give us any more
> information.
>
