[mvapich-discuss] MVAPICH job fails with 'Unexpected End-Of-File on file descriptor' error
Chaitra Kumar
chaitragkumar at gmail.com
Mon Aug 11 22:14:51 EDT 2014
Hi Team,
I am trying to run Graph500 on MVAPICH2 over InfiniBand. It works with a
smaller number of cores, but it crashes when I increase the number of
cores.
I have configured MVAPICH2 in debug mode:
./configure --enable-cxx --enable-threads=multiple \
    --with-device=ch3:mrail --with-rdma=gen2 --disable-fast \
    --enable-g=all --enable-error-messages=all
The command I am using is:
mpirun_rsh -np 72 -hostfile hostfile MV2_DEBUG_CORESIZE=unlimited \
    MV2_DEBUG_SHOW_BACKTRACE=1 MV2_ENABLE_AFFINITY=0 \
    ./graph500_mpi_custom_72 28
But the core dump that is generated is truncated, so gdb cannot read it:
gdb graph500_mpi_custom_72 core.132419
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-64.el6_5.2)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/padmanac/graph/mpi/graph500_mpi_custom_72...done.
BFD: Warning: /home/padmanac/graph/mpi/core.132419 is truncated: expected
core file size >= 2893844480, found: 1149763584.
warning: core file may not match specified executable file.
[New Thread 132419]
Cannot access memory at address 0x7f12c350d760
(gdb) bt
#0 0x00007f12c27ce6ea in ?? ()
Cannot access memory at address 0x7fff4ff06390
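To put the truncation warning in numbers, here is a minimal sketch that uses only the two sizes gdb reported above. The final line prints the soft core-size limit of the current shell, which may differ from the limit the MPI processes actually run under on the compute nodes, so treat it as a starting point rather than a diagnosis.

```shell
# Sizes taken from the BFD warning in the gdb output (bytes).
expected=2893844480
found=1149763584

# How much of the expected core image actually reached disk.
missing=$((expected - found))
percent=$((found * 100 / expected))
echo "core file holds ${percent}% of the expected image (${missing} bytes missing)"

# Soft core-size limit of this shell: "unlimited" or a block count.
# Assumption: the limit seen by the MPI ranks may be set elsewhere.
echo "soft core limit in this shell: $(ulimit -c)"
```

If the limit is already unlimited, other possible causes worth checking are free disk space and quotas on the directory where the core file is written.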
The logs have the following information:
[polaris-1:mpispawn_0][child_handler] MPI process (rank: 1, pid: 136169) terminated with signal 11 -> abort job
[polaris-1:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 15. MPI process died?
[polaris-1:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI process died?
[polaris-1:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node polaris-1 aborted: MPI process error (1)
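The first log line names the rank that died and the signal it received; the later End-Of-File and PMI socket messages appear to follow from that process exiting. As a small sketch, the rank and signal can be pulled out of that line and the signal number turned into a name. The variable names here are illustrative only, and the log line is copied from above with the wrapped words rejoined.

```shell
# First log line from mpispawn, with the line wrapping undone.
line='[polaris-1:mpispawn_0][child_handler] MPI process (rank: 1, pid: 136169) terminated with signal 11 -> abort job'

# Extract the rank and the signal number with sed.
rank=$(printf '%s\n' "$line" | sed -n 's/.*(rank: \([0-9][0-9]*\),.*/\1/p')
sig=$(printf '%s\n' "$line" | sed -n 's/.*terminated with signal \([0-9][0-9]*\).*/\1/p')

# kill -l translates a signal number into its name (without the SIG prefix).
name=$(kill -l "$sig")
echo "rank $rank died with signal $sig (SIG$name)"
```

On Linux, signal 11 is SIGSEGV, a segmentation fault in the MPI process itself.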
ibstat output:
ibstat
CA 'mlx4_0'
    CA type: MT26428
    Number of ports: 2
    Firmware version: 2.9.1200
    Hardware version: b0
    Node GUID: 0x0002c9030028078c
    System image GUID: 0x0002c9030028078f
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 40
        Base lid: 2
        LMC: 0
        SM lid: 1
        Capability mask: 0x0251086a
        Port GUID: 0x0002c9030028078d
        Link layer: InfiniBand
    Port 2:
        State: Active
        Physical state: LinkUp
        Rate: 40
        Base lid: 3
        LMC: 0
        SM lid: 1
        Capability mask: 0x02510868
        Port GUID: 0x0002c9030028078e
        Link layer: InfiniBand
Please help me in resolving this problem. Thanks in advance.
Regards,
Chaitra