[mvapich-discuss] Scientific Linux 6.5 with Mellanox OFED and BLCR 0.8.5

Jonathan Perkins perkinjo at cse.ohio-state.edu
Wed Jan 14 11:00:01 EST 2015


On Wed, Jan 14, 2015 at 09:04:05AM +0530, Arjun J Rao wrote:
> I'm trying to get some checkpointing done on my testing system of two
> nodes. Both systems have the following software installed.
> 
> MVAPICH2 version : MVAPICH2-2.1a
> BLCR version     : BLCR 0.8.5
> Linux Kernel     : 2.6.32-431.el6.x86_64 (Scientific Linux 6.5)
> OFED version     : Mellanox OFED 2.2-1.0.1 for RHEL/CentOS 6.5
> 
> SELinux and iptables are disabled on both the machines.

Thanks for the information above.  Can you also send the output of
mpiname -a?  I'm looking for the options used to build MVAPICH2.
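For reference, something along these lines (run on a node where the
MVAPICH2 bin directory is in your PATH) will print the version string
and the configure options used for the build:

    mpiname -a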

> 
> Trying to run checkpointing with environment variables for mpiexec or
> mpiexec.hydra doesn't seem to work at all.

Are there any failures, or does the program just run normally?

> 
> However, with mpirun_rsh, I get the following output. (Each node has 12
> cores)
> 
> mpirun_rsh -np 24 -hostfile hosts MV2_CKPT_FILE=/home/zz_ckpt/yea_
> MV2_CKPT_INTERVAL=1 MV2_DEBUG_FT_VERBOSE=1 MV2_USE_AGGREGATION=0
> ./mvpch221a_cellauto

One quick observation: you seem to want aggregation disabled.  Can you
make the following replacement:

    MV2_USE_AGGREGATION=0 -> MV2_CKPT_USE_AGGREGATION=0
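In other words, the full launch line would look roughly like this (same
arguments as your original run, with only the aggregation variable
renamed):

    mpirun_rsh -np 24 -hostfile hosts MV2_CKPT_FILE=/home/zz_ckpt/yea_ \
        MV2_CKPT_INTERVAL=1 MV2_DEBUG_FT_VERBOSE=1 \
        MV2_CKPT_USE_AGGREGATION=0 ./mvpch221a_cellauto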

> mpirun_rsh opening file /home/zz_ckpt/yea_.1.auto
> [Rank 8] opening file /tmp/cr-1110124613559/wa/yea_.1.8..
> [Rank 6] opening file /tmp/cr-1110124613559/wa/yea_.1.6..
> [Rank 5] opening file /tmp/cr-1110124613559/wa/yea_.1.5..
> [Rank 4] opening file /tmp/cr-1110124613559/wa/yea_.1.4..
> [Rank 11] opening file /tmp/cr-1110124613559/wa/yea_.1.11..
> [Rank 3] opening file /tmp/cr-1110124613559/wa/yea_.1.3..
> [Rank 9] opening file /tmp/cr-1110124613559/wa/yea_.1.9..
> [Rank 1] opening file /tmp/cr-1110124613559/wa/yea_.1.1..
> [Rank 2] opening file /tmp/cr-1110124613559/wa/yea_.1.2..
> [Rank 10] opening file /tmp/cr-1110124613559/wa/yea_.1.10..
> [Rank 7] opening file /tmp/cr-1110124613559/wa/yea_.1.7..
> [Rank 0] opening file /tmp/cr-1110124613559/wa/yea_.1.0..
> [Rank 18] opening file /tmp/cr-1110124613559/wa/yea_.1.18..
> [Rank 20] opening file /tmp/cr-1110124613559/wa/yea_.1.20..
> [Rank 21] opening file /tmp/cr-1110124613559/wa/yea_.1.21..
> [Rank 22] opening file /tmp/cr-1110124613559/wa/yea_.1.22..
> [Rank 13] opening file /tmp/cr-1110124613559/wa/yea_.1.13..
> [Rank 16] opening file /tmp/cr-1110124613559/wa/yea_.1.16..
> [Rank 23] opening file /tmp/cr-1110124613559/wa/yea_.1.23..
> [Rank 17] opening file /tmp/cr-1110124613559/wa/yea_.1.17..
> [Rank 15] opening file /tmp/cr-1110124613559/wa/yea_.1.15..
> [Rank 19] opening file /tmp/cr-1110124613559/wa/yea_.1.19..
> [Rank 14] opening file /tmp/cr-1110124613559/wa/yea_.1.14..
> [Rank 12] opening file /tmp/cr-1110124613559/wa/yea_.1.12..
> mpirun_rsh opening file /home/zz_ckpt/yea_.2.auto
> [goat1:mpi_rank_11][error_sighandler] Caught error: Segmentation fault
> (signal 11)
> [goat1:mpi_rank_0][error_sighandler] Caught error: Segmentation fault
> (signal 11)
> [goat1:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 14.
> MPI process died?
> [goat1:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI
> process died?
> [goat1:mpispawn_0][child_handler] MPI process (rank: 11, pid: 13599)
> terminated with signal 11 -> abort job
> [goat2:mpi_rank_23][error_sighandler] Caught error: Segmentation fault
> (signal 11)
> [goat2:mpispawn_1][readline] Unexpected End-Of-File on file descriptor 14.
> MPI process died?
> [goat2:mpispawn_1][mtpmi_processops] Error while reading PMI socket. MPI
> process died?
> [goat2:mpispawn_1][child_handler] MPI process (rank: 23, pid: 14273)
> terminated with signal 11 -> abort job
> [goat2:mpi_rank_12][error_sighandler] Caught error: Segmentation fault
> (signal 11)
> [goat1:mpirun_rsh][process_mpispawn_connection] mpispawn_1 from node goat2
> aborted: MPI process error (1)
> 
> 
> It seems one of the MPI processes dies while writing out the *2.auto file
> and then the whole thing just crashes. What could be the reason?

At this point we're not sure, but getting a backtrace from the
segmentation fault(s) would be helpful.

Can you try adding the runtime options mentioned in
http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.1rc1-userguide.html#x1-1210009.1.11?

MV2_DEBUG_CORESIZE=unlimited
MV2_DEBUG_SHOW_BACKTRACE=1
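With mpirun_rsh these can simply be added to the same command line as
the other environment variables, e.g.:

    mpirun_rsh -np 24 -hostfile hosts MV2_CKPT_FILE=/home/zz_ckpt/yea_ \
        MV2_CKPT_INTERVAL=1 MV2_DEBUG_FT_VERBOSE=1 \
        MV2_CKPT_USE_AGGREGATION=0 MV2_DEBUG_CORESIZE=unlimited \
        MV2_DEBUG_SHOW_BACKTRACE=1 ./mvpch221a_cellauto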

You may need to make a debug build if these don't give us any more
information.
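If it comes to that, a debug build is usually just a reconfigure with
debugging symbols kept and the fast-path optimizations disabled.  As a
rough sketch (reuse whatever options mpiname -a reports for your
current build, and point --with-blcr at your actual BLCR prefix):

    ./configure --enable-ckpt --with-blcr=<path to BLCR install> \
        --enable-g=dbg --disable-fast
    make && make install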

-- 
Jonathan Perkins

