[mvapich-discuss] Scientific Linux 6.5 with Mellanox OFED and BLCR 0.8.5

Arjun J Rao rectangle.king at gmail.com
Wed Jan 21 10:51:45 EST 2015


Sorry for the delay. Got seriously sidetracked at work. I tried the run 
with the suggested MV2_USE_SHMEM_COLL=0 option included. But it just gets 
stuck, with the processes still running but no further checkpoints happening.

[root at centos65_mlnx22101_mvpch221a_roach1 cellauto]# mpirun_rsh -np 24 
-hostfile ../zz_hosts/roach.hosts MV2_CKPT_FILE=/home/arjun 
MV2_CKPT_INTERVAL=3 MV2_MAX_SAVE_CKPTS=10 MV2_CKPT_USE_AGGREGATION=0 
MV2_DEBUG_FT_VERBOSE=1 MV2_SHOW_BACKTRACE=1 MV2_USE_SHMEM_COLL=0 
./cellauto_roach
mpirun_rsh opening file /home/arjun.1.auto
[Rank 5] opening file /home/arjun.1.5..
[Rank 6] opening file /home/arjun.1.6..
[Rank 11] opening file /home/arjun.1.11..
[Rank 7] opening file /home/arjun.1.7..
[Rank 8] opening file /home/arjun.1.8..
[Rank 9] opening file /home/arjun.1.9..
[Rank 10] opening file /home/arjun.1.10..
[Rank 0] opening file /home/arjun.1.0..
[Rank 3] opening file /home/arjun.1.3..
[Rank 2] opening file /home/arjun.1.2..
[Rank 4] opening file /home/arjun.1.4..
[Rank 1] opening file /home/arjun.1.1..
[Rank 18] opening file /home/arjun.1.18..
[Rank 16] opening file /home/arjun.1.16..
[Rank 19] opening file /home/arjun.1.19..
[Rank 23] opening file /home/arjun.1.23..
[Rank 17] opening file /home/arjun.1.17..
[Rank 20] opening file /home/arjun.1.20..
[Rank 22] opening file /home/arjun.1.22..
[Rank 21] opening file /home/arjun.1.21..
[Rank 13] opening file /home/arjun.1.13..
[Rank 12] opening file /home/arjun.1.12..
[Rank 15] opening file /home/arjun.1.15..
[Rank 14] opening file /home/arjun.1.14..
mpirun_rsh opening file /home/arjun.2.auto
[Rank 23] opening file /home/arjun.2.23..
[Rank 12] opening file /home/arjun.2.12..
[Rank 11] opening file /home/arjun.2.11..
[Rank 0] opening file /home/arjun.2.0..
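As an aside, it can help to first confirm that BLCR on its own can checkpoint a trivial process, taking MVAPICH2 out of the picture entirely. A minimal sketch, assuming BLCR's userland tools (cr_run, cr_checkpoint) are on the PATH; the sleep workload is purely illustrative:

```shell
# If checkpointing even a plain "sleep" fails, the problem is on the
# BLCR/kernel side rather than in MVAPICH2's CR integration.
if command -v cr_run >/dev/null 2>&1; then
    cr_run sleep 60 &                  # run the process under BLCR's preload library
    pid=$!
    sleep 2
    cr_checkpoint --save-all "$pid"    # writes context.<pid> in the current directory
    ls -l "context.$pid"
    blcr_status=ok
else
    blcr_status=missing                # BLCR userland tools not installed / not on PATH
fi
echo "blcr check: $blcr_status"
```

If this standalone checkpoint succeeds but the MPI run still fails, that points at the MVAPICH2/BLCR interaction rather than BLCR itself.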


Without the MV2_USE_SHMEM_COLL=0 option, I get the following error:


[root at centos65_mlnx22101_mvpch221a_roach1 cellauto]# mpirun_rsh -np 24 
-hostfile ../zz_hosts/roach.hosts MV2_CKPT_FILE=/home/arjun 
MV2_CKPT_INTERVAL=3 MV2_MAX_SAVE_CKPTS=10 MV2_CKPT_USE_AGGREGATION=0 
MV2_DEBUG_FT_VERBOSE=1 MV2_SHOW_BACKTRACE=1 ./cellauto_roach mpirun_rsh 
opening file /home/arjun.1.auto
[Rank 18] opening file /home/arjun.1.18..
[Rank 20] opening file /home/arjun.1.20..
[Rank 19] opening file /home/arjun.1.19..
[Rank 22] opening file /home/arjun.1.22..
[Rank 23] opening file /home/arjun.1.23..
[Rank 21] opening file /home/arjun.1.21..
[Rank 15] opening file /home/arjun.1.15..
[Rank 16] opening file /home/arjun.1.16..
[Rank 17] opening file /home/arjun.1.17..
[Rank 13] opening file /home/arjun.1.13..
[Rank 14] opening file /home/arjun.1.14..
[Rank 10] opening file /home/arjun.1.10..
[Rank 11] opening file /home/arjun.1.11..
[Rank 8] opening file /home/arjun.1.8..
[Rank 6] opening file /home/arjun.1.6..
[Rank 7] opening file /home/arjun.1.7..
[Rank 9] opening file /home/arjun.1.9..
[Rank 1] opening file /home/arjun.1.1..
[Rank 3] opening file /home/arjun.1.3..
[Rank 4] opening file /home/arjun.1.4..
[Rank 2] opening file /home/arjun.1.2..
[Rank 5] opening file /home/arjun.1.5..
[Rank 12] opening file /home/arjun.1.12..
[Rank 0] opening file /home/arjun.1.0..
mpirun_rsh opening file /home/arjun.2.auto
[centos65_mlnx22101_mvpch221a_roach1:mpirun_rsh][CR_Callback] Unexpected 
results from 1: ""
[centos65_mlnx22101_mvpch221a_roach1:mpirun_rsh][CR_Callback] Some 
processes failed to checkpoint. Abort checkpoint...
[centos65_mlnx22101_mvpch221a_roach1:mpirun_rsh][request_checkpoint] 
BLCR call cr_poll_checkpoint() failed with error 2354: Temporary error: 
checkpoint cancelled
[centos65_mlnx22101_mvpch221a_roach1:mpirun_rsh][CR_Loop] Checkpoint failed
[centos65_mlnx22101_mvpch221a_roach1:mpi_rank_11][handle_cqe] Send desc 
error in msg to 12, wc_opcode=0
[centos65_mlnx22101_mvpch221a_roach1:mpi_rank_0][handle_cqe] Send desc 
error in msg to 23, wc_opcode=0
[centos65_mlnx22101_mvpch221a_roach1:mpi_rank_0][handle_cqe] Msg from 
23: wc.status=12, wc.wr_id=0x12bdd70, wc.opcode=0, vbuf->phead->type=27 
= MPIDI_CH3_PKT_CM_SUSPEND
[centos65_mlnx22101_mvpch221a_roach1:mpi_rank_0][handle_cqe] 
src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:579: [] Got 
completion with error 12, vendor code=0x81, dest rank=23
: No such file or directory (2)
[centos65_mlnx22101_mvpch221a_roach1:mpi_rank_11][handle_cqe] Msg from 
12: wc.status=12, wc.wr_id=0x1ff8f90, wc.opcode=0, vbuf->phead->type=27 
= MPIDI_CH3_PKT_CM_SUSPEND
[centos65_mlnx22101_mvpch221a_roach1:mpi_rank_11][handle_cqe] 
src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:579: [] Got 
completion with error 12, vendor code=0x81, dest rank=12
: No such file or directory (2)
[centos65_mlnx22101_mvpch221a_roach1:mpispawn_0][readline] Unexpected 
End-Of-File on file descriptor 16. MPI process died?
[centos65_mlnx22101_mvpch221a_roach1:mpispawn_0][mtpmi_processops] Error 
while reading PMI socket. MPI process died?
[centos65_mlnx22101_mvpch221a_roach1:mpispawn_0][child_handler] MPI 
process (rank: 11, pid: 3183) exited with status 252
[centos65_mlnx22101_mvpch221a_roach1:mpispawn_0][child_handler] MPI 
process (rank: 0, pid: 3172) exited with status 252



Could it be a hardware problem? My Mellanox switch is a Grid Director 
4700 (with ConnectX-2 VPI), firmware version 2.9.1000.

Mellanox OFED version : 2.2-1.0.1
BLCR version          : 0.8.5
OS kernel version     : 2.6.32-431.el6.x86_64
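A BLCR kernel module built against a different kernel than the one running is a common cause of checkpoint trouble; a quick way to compare the two, assuming the blcr module is visible to modinfo:

```shell
# Compare the kernel BLCR was built against with the running kernel.
running=$(uname -r)
built_for=$(modinfo -F vermagic blcr 2>/dev/null | awk '{print $1}')
echo "running kernel : $running"
echo "blcr built for : ${built_for:-<blcr module not found>}"
# Both blcr and blcr_imports should show up here on a working node.
lsmod | grep '^blcr' || echo "blcr modules not currently loaded"
```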



Output of mpiname -a is:

MVAPICH2 2.1a Sun Sep 21 12:00:00 EDT 2014 ch3:mrail

Compilation
CC: gcc  -DNDEBUG   -DNVALGRIND -O2
CXX: g++ -DNDEBUG -DNVALGRIND -O2
F77: gfortran -L/lib -L/lib -O2
FC: gfortran -O2

Configuration
--enable-ckpt
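For reference, the checkpoint files in the output above appear to follow the pattern <MV2_CKPT_FILE>.<checkpoint number>.<rank> (the trailing dots in the log look like progress output, not part of the file name). A small illustration:

```shell
# Reconstruct a checkpoint file name as seen in the log above.
MV2_CKPT_FILE=/home/arjun   # the prefix passed on the mpirun_rsh command line
ckpt=1                      # first checkpoint taken in the run
rank=5                      # MPI rank
printf '%s.%s.%s\n' "$MV2_CKPT_FILE" "$ckpt" "$rank"   # prints /home/arjun.1.5
```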



On Thursday 15 January 2015 08:33 PM, 
mvapich-discuss-request at cse.ohio-state.edu wrote:
> Message: 1
> Date: Thu, 15 Jan 2015 10:03:19 -0500
> From: Hari Subramoni <subramoni.1 at osu.edu>
> To: Arjun J Rao <rectangle.king at gmail.com>
> Cc: "mvapich-discuss at cse.ohio-state.edu"
> 	<mvapich-discuss at cse.ohio-state.edu>
> Subject: Re: [mvapich-discuss] Scientific Linux 6.5 with Mellanox OFED
> 	and BLCR 0.8.5
> Message-ID:
> 	<CAGUk2tGKXRhAKcYWrJOSPn0BNPo3jB=fXrREtKpACxv+kqX3uQ at mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> Can you please try to rerun the application with MV2_USE_SHMEM_COLL
> <http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.1rc1-userguide.html#x1-24800011.94>
> =0?
>
> Thx,
> Hari
>
> On Thu, Jan 15, 2015 at 9:28 AM, Arjun J Rao <rectangle.king at gmail.com>
> wrote:
>
>>   Thanks for the information above.  Can you also send the output of
>> mpiname -a?  I'm looking for the options used to build MVAPICH2.
>>
>>   Output of mpiname -a is
>>
>> [root at goat1 ~]# mpiname -a
>> MVAPICH2 2.1a Sun Sep 21 12:00:00 EDT 2014 ch3:mrail
>>
>> Compilation
>> CC: gcc -DNDEBUG -DNVALGRIND -O2
>> CXX: g++ -DNDEBUG -DNVALGRIND -O2
>> F77: gfortran -L/lib -L/lib -O2
>> FC: gfortran -O2
>>
>> Configuration
>> --enable-ckpt
>>
>>   Trying to run checkpointing with environment variables for mpiexec or
>> mpiexec.hydra doesn't seem to work at all.
>>
>>   Are there any failures, or does the program just run normally?
>>
>>
>>   Tried running it both with environment variables set....
>>
>> [root at goat1 cellauto]# env | grep MV2
>> MV2_CKPT_INTERVAL=1
>> MV2_CKPT_FILE=/home/zz_ckpt/mpiexec_yea
>> MV2_CKPT_MAX_SAVE_CKPTS=10
>> [root at goat1 cellauto]# mpiexec -n 24 -f goat.hosts ./mvpch221a_cellauto
>>
>> ... and also by specifying them in the command itself.
>>
>> [root at goat1 cellauto]# mpiexec -n 24 -f goat.hosts -env
>> MV2_CKPT_INTERVAL=1 -env MV2_CKPT_FILE=/home/zz_ckpt/mpiexec_yea
>> ./mvpch221a_cellauto
>>
>> Using mpiexec or mpiexec.hydra simply doesn't work. The program runs
>> normally, but no checkpointing attempts are made.
>>
>>
>>   One quick observation, you seem to want aggregation disabled.  Can you
>> make the following replacement:
>>
>>      MV2_USE_AGGREGATION=0 -> MV2_CKPT_USE_AGGREGATION=0.
>>
>>
>> Got excited about this option; maybe I simply hadn't been using the correct
>> variable to disable aggregation. But no: I changed MV2_USE_AGGREGATION=0 to
>> MV2_CKPT_USE_AGGREGATION=0 and got the same result as before.
>>
>>
>>
>>   It seems one of the MPI processes dies while writing out the *2.auto file
>> and then the whole thing just crashes. What could be the reason?
>>
>> At this point we're not sure but getting a backtrace from the
>> segmentation fault(s) would be helpful.
>>
>> Can you try adding the runtime options mentioned in
>> http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.1rc1-userguide.html#x1-1210009.1.11?
>>
>> MV2_DEBUG_CORESIZE=unlimited
>> MV2_DEBUG_SHOW_BACKTRACE=1
>>
>> You may need to make a debug build if these don't give us any more
>> information.
>>
>>
>>   Then I made a debug build with the options --enable-g=all
>> --enable-error-messages=all
>>
>> mpiname -a shows the following output:
>> [root at goat1 ~]# mpiname -a
>> MVAPICH2 2.1a Sun Sep 21 12:00:00 EDT 2014 ch3:mrail
>>
>> Compilation
>> CC: gcc -DNDEBUG -DNVALGRIND -g -O2
>> CXX: g++ -DNDEBUG -DNVALGRIND -g -O2
>> F77: gfortran -L/lib -L/lib -g -O2
>> FC: gfortran -g -O2
>>
>> Configuration
>> --enable-ckpt --enable-g=all --enable-error-messages=all
>>
>> Running mpirun_rsh now gives me the following result (this time with the
>> options MV2_DEBUG_CORESIZE=unlimited and MV2_DEBUG_SHOW_BACKTRACE=1)
>>
>> root at goat1 cellauto]# mpirun_rsh -np 24 -hostfile goat.hosts
>> MV2_CKPT_FILE=/home/zz_ckpt/yea_ MV2_CKPT_INTERVAL=1 MV2_DEBUG_FT_VERBOSE=1
>> MV2_CKPT_USE_AGGREGATION=0 MV2_DEBUG_CORESIZE=unlimited
>> MV2_DEBUG_SHOW_BACKTRACE=1 ./mvpch221a_cellauto
>> mpirun_rsh opening file /home/zz_ckpt/yea_.1.auto
>> [Rank 7] opening file /home/zz_ckpt/yea_.1.7..
>> [Rank 8] opening file /home/zz_ckpt/yea_.1.8..
>> [Rank 5] opening file /home/zz_ckpt/yea_.1.5..
>> [Rank 6] opening file /home/zz_ckpt/yea_.1.6..
>> [Rank 4] opening file /home/zz_ckpt/yea_.1.4..
>> [Rank 11] opening file /home/zz_ckpt/yea_.1.11..
>> [Rank 1] opening file /home/zz_ckpt/yea_.1.1..
>> [Rank 10] opening file /home/zz_ckpt/yea_.1.10..
>> [Rank 2] opening file /home/zz_ckpt/yea_.1.2..
>> [Rank 9] opening file /home/zz_ckpt/yea_.1.9..
>> [Rank 3] opening file /home/zz_ckpt/yea_.1.3..
>> [Rank 16] opening file /home/zz_ckpt/yea_.1.16..
>> [Rank 15] opening file /home/zz_ckpt/yea_.1.15..
>> [Rank 23] opening file /home/zz_ckpt/yea_.1.23..
>> [Rank 14] opening file /home/zz_ckpt/yea_.1.14..
>> [Rank 21] opening file /home/zz_ckpt/yea_.1.21..
>> [Rank 13] opening file /home/zz_ckpt/yea_.1.13..
>> [Rank 17] opening file /home/zz_ckpt/yea_.1.17..
>> [Rank 19] opening file /home/zz_ckpt/yea_.1.19..
>> [Rank 20] opening file /home/zz_ckpt/yea_.1.20..
>> [Rank 22] opening file /home/zz_ckpt/yea_.1.22..
>> [Rank 18] opening file /home/zz_ckpt/yea_.1.18..
>> [Rank 0] opening file /home/zz_ckpt/yea_.1.0..
>> [Rank 12] opening file /home/zz_ckpt/yea_.1.12..
>> mpirun_rsh opening file /home/zz_ckpt/yea_.2.auto
>> [goat1:mpi_rank_11][error_sighandler] Caught error: Segmentation fault
>> (signal 11)
>> [goat1:mpi_rank_11][print_backtrace]   0:
>> /usr/local/lib/libmpi.so.12(print_backtrace+0x1e) [0x7f55b467dd6e]
>> [goat1:mpi_rank_11][print_backtrace]   1:
>> /usr/local/lib/libmpi.so.12(error_sighandler+0x59) [0x7f55b467de79]
>> [goat1:mpi_rank_11][print_backtrace]   2: /lib64/libc.so.6()
>> [0x3b4c6329a0]
>> [goat1:mpi_rank_11][print_backtrace]   3:
>> /lib64/libpthread.so.0(pthread_spin_lock+0) [0x3b4ca0c380]
>> [goat1:mpi_rank_11][print_backtrace]   4:
>> /usr/local/lib/libmpi.so.12(MPIDI_CH3I_SHMEM_COLL_finalize+0xc4)
>> [0x7f55b43b3aa4]
>> [goat1:mpi_rank_11][print_backtrace]   5:
>> /usr/local/lib/libmpi.so.12(MPIDI_CH3I_SMP_finalize+0x2eb) [0x7f55b46379fb]
>> [goat1:mpi_rank_11][print_backtrace]   6:
>> /usr/local/lib/libmpi.so.12(CR_IBU_Release_network+0x2e8) [0x7f55b4680388]
>> [goat1:mpi_rank_11][print_backtrace]   7:
>> /usr/local/lib/libmpi.so.12(CR_IBU_Suspend_channels+0x127) [0x7f55b4680607]
>> [goat1:mpi_rank_11][print_backtrace]   8:
>> /usr/local/lib/libmpi.so.12(CR_Thread_loop+0x1dd) [0x7f55b46814ed]
>> [goat1:mpi_rank_11][print_backtrace]   9:
>> /usr/local/lib/libmpi.so.12(CR_Thread_entry+0xea) [0x7f55b4681caa]
>> [goat1:mpi_rank_11][print_backtrace]  10: /lib64/libpthread.so.0()
>> [0x3b4ca079d1]
>> [goat1:mpi_rank_11][print_backtrace]  11: /lib64/libc.so.6(clone+0x6d)
>> [0x3b4c6e8b6d]
>> [goat1:mpi_rank_0][error_sighandler] Caught error: Segmentation fault
>> (signal 11)
>> [goat1:mpi_rank_0][print_backtrace]   0:
>> /usr/local/lib/libmpi.so.12(print_backtrace+0x1e) [0x7f47b171cd6e]
>> [goat1:mpi_rank_0][print_backtrace]   1:
>> /usr/local/lib/libmpi.so.12(error_sighandler+0x59) [0x7f47b171ce79]
>> [goat1:mpi_rank_0][print_backtrace]   2: /lib64/libc.so.6()
>> [0x3b4c6329a0]
>> [goat1:mpi_rank_0][print_backtrace]   3:
>> /lib64/libpthread.so.0(pthread_spin_lock+0) [0x3b4ca0c380]
>> [goat1:mpi_rank_0][print_backtrace]   4:
>> /usr/local/lib/libmpi.so.12(MPIDI_CH3I_SHMEM_COLL_finalize+0xc4)
>> [0x7f47b1452aa4]
>> [goat1:mpi_rank_0][print_backtrace]   5:
>> /usr/local/lib/libmpi.so.12(MPIDI_CH3I_SMP_finalize+0x2eb) [0x7f47b16d69fb]
>> [goat1:mpi_rank_0][print_backtrace]   6:
>> /usr/local/lib/libmpi.so.12(CR_IBU_Release_network+0x2e8) [0x7f47b171f388]
>> [goat1:mpi_rank_0][print_backtrace]   7:
>> /usr/local/lib/libmpi.so.12(CR_IBU_Suspend_channels+0x127) [0x7f47b171f607]
>> [goat1:mpi_rank_0][print_backtrace]   8:
>> /usr/local/lib/libmpi.so.12(CR_Thread_loop+0x1dd) [0x7f47b17204ed]
>> [goat1:mpi_rank_0][print_backtrace]   9:
>> /usr/local/lib/libmpi.so.12(CR_Thread_entry+0xea) [0x7f47b1720caa]
>> [goat1:mpi_rank_0][print_backtrace]  10: /lib64/libpthread.so.0()
>> [0x3b4ca079d1]
>> [goat1:mpi_rank_0][print_backtrace]  11: /lib64/libc.so.6(clone+0x6d)
>> [0x3b4c6e8b6d]
>> [goat2:mpi_rank_23][error_sighandler] Caught error: Segmentation fault
>> (signal 11)
>> [goat2:mpi_rank_23][print_backtrace]   0:
>> /usr/local/lib/libmpi.so.12(print_backtrace+0x1e) [0x7fed44810d6e]
>> [goat2:mpi_rank_23][print_backtrace]   1:
>> /usr/local/lib/libmpi.so.12(error_sighandler+0x59) [0x7fed44810e79]
>> [goat2:mpi_rank_23][print_backtrace]   2: /lib64/libc.so.6()
>> [0x3d8f0329a0]
>> [goat2:mpi_rank_23][print_backtrace]   3:
>> /lib64/libpthread.so.0(pthread_spin_lock+0) [0x3d8f40c380]
>> [goat2:mpi_rank_23][print_backtrace]   4:
>> /usr/local/lib/libmpi.so.12(MPIDI_CH3I_SHMEM_COLL_finalize+0xc4)
>> [0x7fed44546aa4]
>> [goat2:mpi_rank_23][print_backtrace]   5:
>> /usr/local/lib/libmpi.so.12(MPIDI_CH3I_SMP_finalize+0x2eb) [0x7fed447ca9fb]
>> [goat2:mpi_rank_23][print_backtrace]   6:
>> /usr/local/lib/libmpi.so.12(CR_IBU_Release_network+0x2e8) [0x7fed44813388]
>> [goat2:mpi_rank_23][print_backtrace]   7:
>> /usr/local/lib/libmpi.so.12(CR_IBU_Suspend_channels+0x127) [0x7fed44813607]
>> [goat2:mpi_rank_23][print_backtrace]   8:
>> /usr/local/lib/libmpi.so.12(CR_Thread_loop+0x1dd) [0x7fed448144ed]
>> [goat2:mpi_rank_23][print_backtrace]   9:
>> /usr/local/lib/libmpi.so.12(CR_Thread_entry+0xea) [0x7fed44814caa]
>> [goat2:mpi_rank_23][print_backtrace]  10: /lib64/libpthread.so.0()
>> [0x3d8f4079d1]
>> [goat2:mpi_rank_23][print_backtrace]  11: /lib64/libc.so.6(clone+0x6d)
>> [0x3d8f0e8b6d]
>> [goat2:mpi_rank_12][error_sighandler] Caught error: Segmentation fault
>> (signal 11)
>> [goat2:mpi_rank_12][print_backtrace]   0:
>> /usr/local/lib/libmpi.so.12(print_backtrace+0x1e) [0x7fb2526f9d6e]
>> [goat2:mpi_rank_12][print_backtrace]   1:
>> /usr/local/lib/libmpi.so.12(error_sighandler+0x59) [0x7fb2526f9e79]
>> [goat2:mpi_rank_12][print_backtrace]   2: /lib64/libc.so.6()
>> [0x3d8f0329a0]
>> [goat2:mpi_rank_12][print_backtrace]   3:
>> /lib64/libpthread.so.0(pthread_spin_lock+0) [0x3d8f40c380]
>> [goat2:mpi_rank_12][print_backtrace]   4:
>> /usr/local/lib/libmpi.so.12(MPIDI_CH3I_SHMEM_COLL_finalize+0xc4)
>> [0x7fb25242faa4]
>> [goat2:mpi_rank_12][print_backtrace]   5:
>> /usr/local/lib/libmpi.so.12(MPIDI_CH3I_SMP_finalize+0x2eb) [0x7fb2526b39fb]
>> [goat2:mpi_rank_12][print_backtrace]   6:
>> /usr/local/lib/libmpi.so.12(CR_IBU_Release_network+0x2e8) [0x7fb2526fc388]
>> [goat2:mpi_rank_12][print_backtrace]   7:
>> /usr/local/lib/libmpi.so.12(CR_IBU_Suspend_channels+0x127) [0x7fb2526fc607]
>> [goat2:mpi_rank_12][print_backtrace]   8:
>> /usr/local/lib/libmpi.so.12(CR_Thread_loop+0x1dd) [0x7fb2526fd4ed]
>> [goat2:mpi_rank_12][print_backtrace]   9:
>> /usr/local/lib/libmpi.so.12(CR_Thread_entry+0xea) [0x7fb2526fdcaa]
>> [goat2:mpi_rank_12][print_backtrace]  10: /lib64/libpthread.so.0()
>> [0x3d8f4079d1]
>> [goat2:mpi_rank_12][print_backtrace]  11: /lib64/libc.so.6(clone+0x6d)
>> [0x3d8f0e8b6d]
>> [goat2:mpispawn_1][readline] Unexpected End-Of-File on file descriptor 13.
>> MPI process died?
>> [goat2:mpispawn_1][mtpmi_processops] Error while reading PMI socket. MPI
>> process died?
>> [goat2:mpispawn_1][child_handler] MPI process (rank: 23, pid: 3432)
>> terminated with signal 11 -> abort job
>> [goat1:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 22.
>> MPI process died?
>> [goat1:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI
>> process died?
>> [goat1:mpispawn_0][child_handler] MPI process (rank: 11, pid: 2940)
>> terminated with signal 11 -> abort job
>> [goat1:mpirun_rsh][process_mpispawn_connection] mpispawn_1 from node goat2
>> aborted: MPI process error (1)
>>
>> Meanwhile, dmesg prints messages like:
>>
>>
>> blcr: warning: skipped a socket.
>> .... repeated many times here....
>> blcr: warning: skipped a socket.
>> blcr: warning: skipped a socket.
>> blcr: chkpt_watchdog: 'mpirun_rsh' (tgid/pid 3138/3139) exited with code 0
>> during checkpoint
>> blcr: chkpt_watchdog: 'mpirun_rsh' (tgid/pid 3138/3144) exited with code 0
>> during checkpoint
>> blcr: chkpt_watchdog: 'mpirun_rsh' (tgid/pid 3138/3138) exited with code 1
>> during checkpoint
>> blcr: chkpt_watchdog: 'mpirun_rsh' (tgid/pid 3138/3140) exited with code 1
>> during checkpoint
>> blcr: chkpt_watchdog: 'mpirun_rsh' (tgid/pid 3138/3141) exited with code 1
>> during checkpoint
>>
>> One particular run I did just before writing this mail gave slightly
>> different output near the end:
>>
>> [root at goat1 cellauto]# mpirun_rsh -np 24 -hostfile goat.hosts
>> MV2_CKPT_FILE=/home/zz_ckpt/yea_ MV2_CKPT_INTERVAL=1 MV2_DEBUG_FT_VERBOSE=1
>> MV2_CKPT_USE_AGGREGATION=0 MV2_DEBUG_CORESIZE=unlimited
>> MV2_DEBUG_SHOW_BACKTRACE=1 ./mvpch221a_cellauto
>> mpirun_rsh opening file /home/zz_ckpt/yea_.1.auto
>> [Rank 17] opening file /home/zz_ckpt/yea_.1.17..
>> [Rank 14] opening file /home/zz_ckpt/yea_.1.14..
>> [Rank 21] opening file /home/zz_ckpt/yea_.1.21..
>> [Rank 20] opening file /home/zz_ckpt/yea_.1.20..
>> [Rank 13] opening file /home/zz_ckpt/yea_.1.13..
>> [Rank 16] opening file /home/zz_ckpt/yea_.1.16..
>> [Rank 15] opening file /home/zz_ckpt/yea_.1.15..
>> [Rank 18] opening file /home/zz_ckpt/yea_.1.18..
>> [Rank 23] opening file /home/zz_ckpt/yea_.1.23..
>> [Rank 19] opening file /home/zz_ckpt/yea_.1.19..
>> [Rank 22] opening file /home/zz_ckpt/yea_.1.22..
>> [Rank 9] opening file /home/zz_ckpt/yea_.1.9..
>> [Rank 11] opening file /home/zz_ckpt/yea_.1.11..
>> [Rank 10] opening file /home/zz_ckpt/yea_.1.10..
>> [Rank 8] opening file /home/zz_ckpt/yea_.1.8..
>> [Rank 2] opening file /home/zz_ckpt/yea_.1.2..
>> [Rank 6] opening file /home/zz_ckpt/yea_.1.6..
>> [Rank 3] opening file /home/zz_ckpt/yea_.1.3..
>> [Rank 4] opening file /home/zz_ckpt/yea_.1.4..
>> [Rank 5] opening file /home/zz_ckpt/yea_.1.5..
>> [Rank 1] opening file /home/zz_ckpt/yea_.1.1..
>> [Rank 7] opening file /home/zz_ckpt/yea_.1.7..
>> [Rank 12] opening file /home/zz_ckpt/yea_.1.12..
>> [Rank 0] opening file /home/zz_ckpt/yea_.1.0..
>> mpirun_rsh opening file /home/zz_ckpt/yea_.2.auto
>> [goat2:mpi_rank_12][error_sighandler] Caught error: Segmentation fault
>> (signal 11)
>> [goat2:mpi_rank_23][error_sighandler] Caught error: Segmentation fault
>> (signal 11)
>> [goat2:mpi_rank_12][print_backtrace]   0:
>> /usr/local/lib/libmpi.so.12(print_backtrace+0x1e) [0x7f7b0e0e1d6e]
>> [goat2:mpi_rank_12][print_backtrace]   1:
>> /usr/local/lib/libmpi.so.12(error_sighandler+0x59) [0x7f7b0e0e1e79]
>> [goat2:mpi_rank_12][print_backtrace]   2: /lib64/libc.so.6() [0x3d8f0329a0]
>> [goat2:mpi_rank_12][print_backtrace]   3:
>> /lib64/libpthread.so.0(pthread_spin_lock+0) [0x3d8f40c380]
>> [goat2:mpi_rank_12][print_backtrace]   4:
>> /usr/local/lib/libmpi.so.12(MPIDI_CH3I_SHMEM_COLL_finalize+0xc4)
>> [0x7f7b0de17aa4]
>> [goat2:mpi_rank_12][print_backtrace]   5:
>> /usr/local/lib/libmpi.so.12(MPIDI_CH3I_SMP_finalize+0x2eb) [0x7f7b0e09b9fb]
>> [goat2:mpi_rank_12][print_backtrace]   6:
>> /usr/local/lib/libmpi.so.12(CR_IBU_Release_network+0x2e8) [0x7f7b0e0e4388]
>> [goat2:mpi_rank_12][print_backtrace]   7:
>> /usr/local/lib/libmpi.so.12(CR_IBU_Suspend_channels+0x127) [0x7f7b0e0e4607]
>> [goat2:mpi_rank_12][print_backtrace]   8:
>> /usr/local/lib/libmpi.so.12(CR_Thread_loop+0x1dd) [0x7f7b0e0e54ed]
>> [goat2:mpi_rank_12][print_backtrace]   9:
>> /usr/local/lib/libmpi.so.12(CR_Thread_entry+0xea) [0x7f7b0e0e5caa]
>> [goat2:mpi_rank_12][print_backtrace]  10: /lib64/libpthread.so.0()
>> [0x3d8f4079d1]
>> [goat2:mpi_rank_12][print_backtrace]  11: /lib64/libc.so.6(clone+0x6d)
>> [0x3d8f0e8b6d]
>> [goat2:mpi_rank_23][print_backtrace]   0:
>> /usr/local/lib/libmpi.so.12(print_backtrace+0x1e) [0x7f49e8d0fd6e]
>> [goat2:mpi_rank_23][print_backtrace]   1:
>> /usr/local/lib/libmpi.so.12(error_sighandler+0x59) [0x7f49e8d0fe79]
>> [goat2:mpi_rank_23][print_backtrace]   2: /lib64/libc.so.6() [0x3d8f0329a0]
>> [goat2:mpi_rank_23][print_backtrace]   3:
>> /lib64/libpthread.so.0(pthread_spin_lock+0) [0x3d8f40c380]
>> [goat2:mpi_rank_23][print_backtrace]   4:
>> /usr/local/lib/libmpi.so.12(MPIDI_CH3I_SHMEM_COLL_finalize+0xc4)
>> [0x7f49e8a45aa4]
>> [goat2:mpi_rank_23][print_backtrace]   5:
>> /usr/local/lib/libmpi.so.12(MPIDI_CH3I_SMP_finalize+0x2eb) [0x7f49e8cc99fb]
>> [goat2:mpi_rank_23][print_backtrace]   6:
>> /usr/local/lib/libmpi.so.12(CR_IBU_Release_network+0x2e8) [0x7f49e8d12388]
>> [goat2:mpi_rank_23][print_backtrace]   7:
>> /usr/local/lib/libmpi.so.12(CR_IBU_Suspend_channels+0x127) [0x7f49e8d12607]
>> [goat2:mpi_rank_23][print_backtrace]   8:
>> /usr/local/lib/libmpi.so.12(CR_Thread_loop+0x1dd) [0x7f49e8d134ed]
>> [goat2:mpi_rank_23][print_backtrace]   9:
>> /usr/local/lib/libmpi.so.12(CR_Thread_entry+0xea) [0x7f49e8d13caa]
>> [goat2:mpi_rank_23][print_backtrace]  10: /lib64/libpthread.so.0()
>> [0x3d8f4079d1]
>> [goat2:mpi_rank_23][print_backtrace]  11: /lib64/libc.so.6(clone+0x6d)
>> [0x3d8f0e8b6d]
>> [goat1:mpi_rank_11][error_sighandler] Caught error: Segmentation fault
>> (signal 11)
>> [goat1:mpi_rank_0][error_sighandler] Caught error: Segmentation fault
>> (signal 11)
>> [goat1:mpi_rank_11][print_backtrace]   0:
>> /usr/local/lib/libmpi.so.12(print_backtrace+0x1e) [0x7ff93f31bd6e]
>> [goat1:mpi_rank_0][print_backtrace]   0:
>> /usr/local/lib/libmpi.so.12(print_backtrace+0x1e) [0x7f3c5a915d6e]
>> [goat1:mpi_rank_11][print_backtrace]   1:
>> /usr/local/lib/libmpi.so.12(error_sighandler+0x59) [0x7ff93f31be79]
>> [goat1:mpi_rank_0][print_backtrace]   1:
>> /usr/local/lib/libmpi.so.12(error_sighandler+0x59) [0x7f3c5a915e79]
>> [goat1:mpi_rank_11][print_backtrace]   2: /lib64/libc.so.6() [0x3b4c6329a0]
>> [goat1:mpi_rank_0][print_backtrace]   2: /lib64/libc.so.6() [0x3b4c6329a0]
>> [goat1:mpi_rank_11][print_backtrace]   3:
>> /lib64/libpthread.so.0(pthread_spin_lock+0) [0x3b4ca0c380]
>> [goat1:mpi_rank_0][print_backtrace]   3:
>> /lib64/libpthread.so.0(pthread_spin_lock+0) [0x3b4ca0c380]
>> [goat1:mpi_rank_11][print_backtrace]   4:
>> /usr/local/lib/libmpi.so.12(MPIDI_CH3I_SHMEM_COLL_finalize+0xc4)
>> [0x7ff93f051aa4]
>> [goat1:mpi_rank_0][print_backtrace]   4:
>> /usr/local/lib/libmpi.so.12(MPIDI_CH3I_SHMEM_COLL_finalize+0xc4)
>> [0x7f3c5a64baa4]
>> [goat1:mpi_rank_11][print_backtrace]   5:
>> /usr/local/lib/libmpi.so.12(MPIDI_CH3I_SMP_finalize+0x2eb) [0x7ff93f2d59fb]
>> [goat1:mpi_rank_0][print_backtrace]   5:
>> /usr/local/lib/libmpi.so.12(MPIDI_CH3I_SMP_finalize+0x2eb) [0x7f3c5a8cf9fb]
>> [goat1:mpi_rank_11][print_backtrace]   6:
>> /usr/local/lib/libmpi.so.12(CR_IBU_Release_network+0x2e8) [0x7ff93f31e388]
>> [goat1:mpi_rank_0][print_backtrace]   6:
>> /usr/local/lib/libmpi.so.12(CR_IBU_Release_network+0x2e8) [0x7f3c5a918388]
>> [goat1:mpi_rank_11][print_backtrace]   7:
>> /usr/local/lib/libmpi.so.12(CR_IBU_Suspend_channels+0x127)
>> [0x7ff93f31e607]
>> [goat1:mpi_rank_0][print_backtrace]   7:
>> /usr/local/lib/libmpi.so.12(CR_IBU_Suspend_channels+0x127) [0x7f3c5a918607]
>> [goat1:mpi_rank_11][print_backtrace]   8:
>> /usr/local/lib/libmpi.so.12(CR_Thread_loop+0x1dd) [0x7ff93f31f4ed]
>> [goat1:mpi_rank_0][print_backtrace]   8:
>> /usr/local/lib/libmpi.so.12(CR_Thread_loop+0x1dd) [0x7f3c5a9194ed]
>> [goat1:mpi_rank_11][print_backtrace]   9:
>> /usr/local/lib/libmpi.so.12(CR_Thread_entry+0xea) [0x7ff93f31fcaa]
>> [goat1:mpi_rank_0][print_backtrace]   9:
>> /usr/local/lib/libmpi.so.12(CR_Thread_entry+0xea) [0x7f3c5a919caa]
>> [goat1:mpi_rank_11][print_backtrace]  10: /lib64/libpthread.so.0()
>> [0x3b4ca079d1]
>> [goat1:mpi_rank_0][print_backtrace]  10: /lib64/libpthread.so.0()
>> [0x3b4ca079d1]
>> [goat1:mpi_rank_11][print_backtrace]  11: /lib64/libc.so.6(clone+0x6d)
>> [0x3b4c6e8b6d]
>> [goat1:mpi_rank_0][print_backtrace]  11: /lib64/libc.so.6(clone+0x6d)
>> [0x3b4c6e8b6d]
>> [goat2:mpispawn_1][readline] Unexpected End-Of-File on file descriptor 16.
>> MPI process died?
>> [goat2:mpispawn_1][mtpmi_processops] Error while reading PMI socket. MPI
>> process died?
>> [goat2:mpispawn_1][child_handler] MPI process (rank: 23, pid: 4029)
>> terminated with signal 11 -> abort job
>> [goat1:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 22.
>> MPI process died?
>> [goat1:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI
>> process died?
>> [goat1:mpispawn_0][child_handler] MPI process (rank: 11, pid: 3235)
>> terminated with signal 11 -> abort job
>> [goat1:mpirun_rsh][process_mpispawn_connection] mpispawn_1 from node goat2
>> aborted: MPI process error (1)
>> [goat1:mpirun_rsh][CR_Callback] Unexpected results from 1: ""
>> [goat1:mpirun_rsh][CR_Callback] Some processes failed to checkpoint.
>> Abort checkpoint...
>> [goat1:mpirun_rsh][request_checkpoint] BLCR call cr_poll_checkpoint()
>> failed with error 2354: Temporary error: checkpoint cancelled
>> [goat1:mpirun_rsh][CR_Loop] Checkpoint failed
>>
>>
>>  From a cursory glance, it seems the problems come from the shared
>> libraries getting into a bad state. Could static linking of the
>> libraries help?
>>
>>
>>
>>
>> On Wednesday 14 January 2015 09:30 PM, Jonathan Perkins wrote:
>>
>> On Wed, Jan 14, 2015 at 09:04:05AM +0530, Arjun J Rao wrote:
>>
>>   I'm trying to get some checkpointing done on my testing system of two
>> nodes. Both systems have the following software installed.
>>
>> MVAPICH2 version : MVAPICH2-2.1a
>> BLCR version     : BLCR 0.8.5
>> Linux kernel     : 2.6.32-431.el6.x86_64 (Scientific Linux 6.5)
>> OFED version     : Mellanox OFED 2.2-1.0.1 for RHEL/CentOS 6.5
>>
>> SELinux and iptables are disabled on both the machines.
>>
>>   Thanks for the information above.  Can you also send the output of
>> mpiname -a?  I'm looking for the options used to build MVAPICH2.
>>
>>
>>   Trying to run checkpointing with environment variables for mpiexec or
>> mpiexec.hydra doesn't seem to work at all.
>>
>>   Are there any failures, or does the program just run normally?
>>
>>
>>   However, with mpirun_rsh, I get the following output. (Each node has 12
>> cores)
>>
>> mpirun_rsh -np 24 -hostfile hosts MV2_CKPT_FILE=/home/zz_ckpt/yea_
>> MV2_CKPT_INTERVAL=1 MV2_DEBUG_FT_VERBOSE=1 MV2_USE_AGGREGATION=0
>> ./mvpch221a_cellauto
>>
>>   One quick observation, you seem to want aggregation disabled.  Can you
>> make the following replacement:
>>
>>      MV2_USE_AGGREGATION=0 -> MV2_CKPT_USE_AGGREGATION=0.
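With that substitution, the invocation above would become the following (a sketch echoing the command rather than executing it; the hostfile and binary names are taken from the original command line):

```shell
# Corrected option list: MV2_CKPT_USE_AGGREGATION (the recognized name)
# instead of MV2_USE_AGGREGATION. Echoed here rather than executed.
CKPT_OPTS="MV2_CKPT_FILE=/home/zz_ckpt/yea_ MV2_CKPT_INTERVAL=1 \
MV2_DEBUG_FT_VERBOSE=1 MV2_CKPT_USE_AGGREGATION=0"
echo mpirun_rsh -np 24 -hostfile hosts $CKPT_OPTS ./mvpch221a_cellauto
```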
>> mpirun_rsh opening file /home/zz_ckpt/yea_.1.auto
>> [Rank 8] opening file /tmp/cr-1110124613559/wa/yea_.1.8..
>> [Rank 6] opening file /tmp/cr-1110124613559/wa/yea_.1.6..
>> [Rank 5] opening file /tmp/cr-1110124613559/wa/yea_.1.5..
>> [Rank 4] opening file /tmp/cr-1110124613559/wa/yea_.1.4..
>> [Rank 11] opening file /tmp/cr-1110124613559/wa/yea_.1.11..
>> [Rank 3] opening file /tmp/cr-1110124613559/wa/yea_.1.3..
>> [Rank 9] opening file /tmp/cr-1110124613559/wa/yea_.1.9..
>> [Rank 1] opening file /tmp/cr-1110124613559/wa/yea_.1.1..
>> [Rank 2] opening file /tmp/cr-1110124613559/wa/yea_.1.2..
>> [Rank 10] opening file /tmp/cr-1110124613559/wa/yea_.1.10..
>> [Rank 7] opening file /tmp/cr-1110124613559/wa/yea_.1.7..
>> [Rank 0] opening file /tmp/cr-1110124613559/wa/yea_.1.0..
>> [Rank 18] opening file /tmp/cr-1110124613559/wa/yea_.1.18..
>> [Rank 20] opening file /tmp/cr-1110124613559/wa/yea_.1.20..
>> [Rank 21] opening file /tmp/cr-1110124613559/wa/yea_.1.21..
>> [Rank 22] opening file /tmp/cr-1110124613559/wa/yea_.1.22..
>> [Rank 13] opening file /tmp/cr-1110124613559/wa/yea_.1.13..
>> [Rank 16] opening file /tmp/cr-1110124613559/wa/yea_.1.16..
>> [Rank 23] opening file /tmp/cr-1110124613559/wa/yea_.1.23..
>> [Rank 17] opening file /tmp/cr-1110124613559/wa/yea_.1.17..
>> [Rank 15] opening file /tmp/cr-1110124613559/wa/yea_.1.15..
>> [Rank 19] opening file /tmp/cr-1110124613559/wa/yea_.1.19..
>> [Rank 14] opening file /tmp/cr-1110124613559/wa/yea_.1.14..
>> [Rank 12] opening file /tmp/cr-1110124613559/wa/yea_.1.12..
>> mpirun_rsh opening file /home/zz_ckpt/yea_.2.auto
>> [goat1:mpi_rank_11][error_sighandler] Caught error: Segmentation fault
>> (signal 11)
>> [goat1:mpi_rank_0][error_sighandler] Caught error: Segmentation fault
>> (signal 11)
>> [goat1:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 14.
>> MPI process died?
>> [goat1:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI
>> process died?
>> [goat1:mpispawn_0][child_handler] MPI process (rank: 11, pid: 13599)
>> terminated with signal 11 -> abort job
>> [goat2:mpi_rank_23][error_sighandler] Caught error: Segmentation fault
>> (signal 11)
>> [goat2:mpispawn_1][readline] Unexpected End-Of-File on file descriptor 14.
>> MPI process died?
>> [goat2:mpispawn_1][mtpmi_processops] Error while reading PMI socket. MPI
>> process died?
>> [goat2:mpispawn_1][child_handler] MPI process (rank: 23, pid: 14273)
>> terminated with signal 11 -> abort job
>> [goat2:mpi_rank_12][error_sighandler] Caught error: Segmentation fault
>> (signal 11)
>> [goat1:mpirun_rsh][process_mpispawn_connection] mpispawn_1 from node goat2
>> aborted: MPI process error (1)
>>
>>
>> It seems one of the MPI processes dies while writing out the *2.auto file
>> and then the whole thing just crashes. What could be the reason ?
>>
>> At this point we're not sure but getting a backtrace from the
>> segmentation fault(s) would be helpful.
>>
>> Can you try adding the runtime options mentioned in
>> http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.1rc1-userguide.html#x1-1210009.1.11 ?
>>
>> MV2_DEBUG_CORESIZE=unlimited
>> MV2_DEBUG_SHOW_BACKTRACE=1
>>
>> You may need to make a debug build if these don't give us any more
>> information.
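Put together, a debug run with those two options added might look like the following (a sketch reusing the command from earlier in the thread; echoed rather than executed):

```shell
# Sketch: the earlier run with the user-guide debug options added so a
# core file and backtrace survive the segfault. Echoed, not executed.
DEBUG_OPTS="MV2_DEBUG_CORESIZE=unlimited MV2_DEBUG_SHOW_BACKTRACE=1"
echo mpirun_rsh -np 24 -hostfile hosts $DEBUG_OPTS \
    MV2_CKPT_FILE=/home/zz_ckpt/yea_ MV2_CKPT_INTERVAL=1 \
    MV2_CKPT_USE_AGGREGATION=0 ./mvpch221a_cellauto
```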
>>
>>
>>
>>
>> _______________________________________________
>> mvapich-discuss mailing list
>> mvapich-discuss at cse.ohio-state.edu
>> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>
>>
