[mvapich-discuss] Scientific Linux 6.5 with Mellanox OFED and BLCR 0.8.5

Arjun J Rao rectangle.king at gmail.com
Tue Jan 13 22:34:05 EST 2015


I'm trying to get checkpointing working on my two-node test system.
Both nodes have the following software installed:

MVAPICH2 version : MVAPICH2-2.1a
BLCR version     : BLCR 0.8.5
Linux kernel     : 2.6.32-431.el6.x86_64 (Scientific Linux 6.5)
OFED version     : Mellanox OFED 2.2-1.0.1 for RHEL/CentOS 6.5

SELinux and iptables are disabled on both machines.

Passing the checkpointing environment variables to mpiexec or
mpiexec.hydra doesn't seem to work at all.

With mpirun_rsh, however, I get the following output (each node has 12
cores):

mpirun_rsh -np 24 -hostfile hosts MV2_CKPT_FILE=/home/zz_ckpt/yea_ \
    MV2_CKPT_INTERVAL=1 MV2_DEBUG_FT_VERBOSE=1 MV2_USE_AGGREGATION=0 \
    ./mvpch221a_cellauto
mpirun_rsh opening file /home/zz_ckpt/yea_.1.auto
[Rank 8] opening file /tmp/cr-1110124613559/wa/yea_.1.8..
[Rank 6] opening file /tmp/cr-1110124613559/wa/yea_.1.6..
[Rank 5] opening file /tmp/cr-1110124613559/wa/yea_.1.5..
[Rank 4] opening file /tmp/cr-1110124613559/wa/yea_.1.4..
[Rank 11] opening file /tmp/cr-1110124613559/wa/yea_.1.11..
[Rank 3] opening file /tmp/cr-1110124613559/wa/yea_.1.3..
[Rank 9] opening file /tmp/cr-1110124613559/wa/yea_.1.9..
[Rank 1] opening file /tmp/cr-1110124613559/wa/yea_.1.1..
[Rank 2] opening file /tmp/cr-1110124613559/wa/yea_.1.2..
[Rank 10] opening file /tmp/cr-1110124613559/wa/yea_.1.10..
[Rank 7] opening file /tmp/cr-1110124613559/wa/yea_.1.7..
[Rank 0] opening file /tmp/cr-1110124613559/wa/yea_.1.0..
[Rank 18] opening file /tmp/cr-1110124613559/wa/yea_.1.18..
[Rank 20] opening file /tmp/cr-1110124613559/wa/yea_.1.20..
[Rank 21] opening file /tmp/cr-1110124613559/wa/yea_.1.21..
[Rank 22] opening file /tmp/cr-1110124613559/wa/yea_.1.22..
[Rank 13] opening file /tmp/cr-1110124613559/wa/yea_.1.13..
[Rank 16] opening file /tmp/cr-1110124613559/wa/yea_.1.16..
[Rank 23] opening file /tmp/cr-1110124613559/wa/yea_.1.23..
[Rank 17] opening file /tmp/cr-1110124613559/wa/yea_.1.17..
[Rank 15] opening file /tmp/cr-1110124613559/wa/yea_.1.15..
[Rank 19] opening file /tmp/cr-1110124613559/wa/yea_.1.19..
[Rank 14] opening file /tmp/cr-1110124613559/wa/yea_.1.14..
[Rank 12] opening file /tmp/cr-1110124613559/wa/yea_.1.12..
mpirun_rsh opening file /home/zz_ckpt/yea_.2.auto
[goat1:mpi_rank_11][error_sighandler] Caught error: Segmentation fault (signal 11)
[goat1:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11)
[goat1:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 14. MPI process died?
[goat1:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI process died?
[goat1:mpispawn_0][child_handler] MPI process (rank: 11, pid: 13599) terminated with signal 11 -> abort job
[goat2:mpi_rank_23][error_sighandler] Caught error: Segmentation fault (signal 11)
[goat2:mpispawn_1][readline] Unexpected End-Of-File on file descriptor 14. MPI process died?
[goat2:mpispawn_1][mtpmi_processops] Error while reading PMI socket. MPI process died?
[goat2:mpispawn_1][child_handler] MPI process (rank: 23, pid: 14273) terminated with signal 11 -> abort job
[goat2:mpi_rank_12][error_sighandler] Caught error: Segmentation fault (signal 11)
[goat1:mpirun_rsh][process_mpispawn_connection] mpispawn_1 from node goat2 aborted: MPI process error (1)


It seems that at least one of the MPI processes segfaults while the
second checkpoint (the *.2.auto file) is being written, and then the
whole job aborts. What could be the reason?
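
To see at a glance which ranks reached each checkpoint and which ranks
segfaulted on which host, the output can be summarised mechanically. The
following is a minimal Python sketch, assuming the mpirun_rsh output above
has been captured as text. The function name `summarize` and the regular
expressions are illustrative only: they match the message formats shown in
this post and are not a documented MVAPICH2 log format.

```python
import re
from collections import defaultdict

# Patterns written against the messages shown in this post (an assumption,
# not a stable or documented log format).
OPEN_RE = re.compile(r"\[Rank (\d+)\] opening file \S*?\.(\d+)\.\d+\.\.")
SEGV_RE = re.compile(
    r"\[(\w+):mpi_rank_(\d+)\]\[error_sighandler\] "
    r"Caught error: Segmentation fault"
)


def summarize(log_text):
    """Return (opened, crashed).

    opened  maps checkpoint index -> set of ranks that opened a file for it.
    crashed maps host name        -> set of ranks that reported a segfault.
    """
    opened = defaultdict(set)
    crashed = defaultdict(set)
    for line in log_text.splitlines():
        m = OPEN_RE.search(line)
        if m:
            opened[int(m.group(2))].add(int(m.group(1)))
            continue
        m = SEGV_RE.search(line)
        if m:
            crashed[m.group(1)].add(int(m.group(2)))
    return dict(opened), dict(crashed)


if __name__ == "__main__":
    # Short excerpt of the output above, used only to demonstrate the call.
    excerpt = (
        "[Rank 8] opening file /tmp/cr-1110124613559/wa/yea_.1.8..\n"
        "[Rank 0] opening file /tmp/cr-1110124613559/wa/yea_.1.0..\n"
        "[goat1:mpi_rank_11][error_sighandler] Caught error: "
        "Segmentation fault\n"
        "[goat2:mpi_rank_23][error_sighandler] Caught error: "
        "Segmentation fault\n"
    )
    opened, crashed = summarize(excerpt)
    print("ranks per checkpoint:", opened)
    print("segfaults per host:", crashed)
```

Run on the full output above, a summary like this makes it easy to compare
the ranks that opened a file for checkpoint 1 with those that did so for
checkpoint 2, and to check whether the segfaults are confined to one node.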