[mvapich-discuss] MVAPICH job fails with 'Unexpected End-Of-File on file descriptor' error

Chaitra Kumar chaitragkumar at gmail.com
Mon Aug 11 22:14:51 EDT 2014


Hi Team,



I am trying to run Graph500 on MVAPICH2 over InfiniBand. It works for a smaller
number of cores, but when I increase the number of cores it crashes.

I have configured MVAPICH2 in debug mode:

./configure --enable-cxx --enable-threads=multiple --with-device=ch3:mrail \
    --with-rdma=gen2 --disable-fast --enable-g=all --enable-error-messages=all
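
To confirm that this debug build is the one actually picked up at run time, I am
checking the installation on the nodes like this (assuming the MVAPICH2 bin
directory is first in PATH everywhere; as far as I know, mpiname -a prints the
version and configure options of the active build):

which mpirun_rsh
mpiname -a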



The command I am using is:

mpirun_rsh -np 72 -hostfile hostfile MV2_DEBUG_CORESIZE=unlimited \
    MV2_DEBUG_SHOW_BACKTRACE=1 MV2_ENABLE_AFFINITY=0 ./graph500_mpi_custom_72 28



But the core dump that gets generated is truncated, so it cannot be read:
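
In case it is relevant, this is how I am checking that nothing on the nodes caps
the core size and that there is enough disk space in the directory where the
cores are written (a rough sketch, assuming bash and passwordless ssh; I repeat
the same check for the other hosts in my hostfile):

ssh polaris-1 'hostname; ulimit -c; df -h /home/padmanac/graph/mpi'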



gdb graph500_mpi_custom_72 core.132419
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-64.el6_5.2)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/padmanac/graph/mpi/graph500_mpi_custom_72...done.
BFD: Warning: /home/padmanac/graph/mpi/core.132419 is truncated: expected core file size >= 2893844480, found: 1149763584.

warning: core file may not match specified executable file.
[New Thread 132419]
Cannot access memory at address 0x7f12c350d760
(gdb) bt
#0  0x00007f12c27ce6ea in ?? ()
Cannot access memory at address 0x7fff4ff06390





The logs have the following information:

[polaris-1:mpispawn_0][child_handler] MPI process (rank: 1, pid: 136169) terminated with signal 11 -> abort job
[polaris-1:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 15. MPI process died?
[polaris-1:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI process died?
[polaris-1:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node polaris-1 aborted: MPI process error (1)
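
Since the core file is truncated, would it make sense to wrap each rank in gdb
so that a backtrace is printed at the moment of the segfault instead? Something
like the following is what I had in mind (a rough sketch; gdb_wrap.sh is just a
name I made up, and it assumes gdb is installed on every node):

#!/bin/bash
# gdb_wrap.sh: run the real program under gdb in batch mode so that a
# backtrace is printed if the process receives a fatal signal.
exec gdb -batch -ex run -ex bt --args "$@"

and then launch it as:

mpirun_rsh -np 72 -hostfile hostfile MV2_ENABLE_AFFINITY=0 ./gdb_wrap.sh ./graph500_mpi_custom_72 28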





ibstat output:

CA 'mlx4_0'
        CA type: MT26428
        Number of ports: 2
        Firmware version: 2.9.1200
        Hardware version: b0
        Node GUID: 0x0002c9030028078c
        System image GUID: 0x0002c9030028078f
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 40
                Base lid: 2
                LMC: 0
                SM lid: 1
                Capability mask: 0x0251086a
                Port GUID: 0x0002c9030028078d
                Link layer: InfiniBand
        Port 2:
                State: Active
                Physical state: LinkUp
                Rate: 40
                Base lid: 3
                LMC: 0
                SM lid: 1
                Capability mask: 0x02510868
                Port GUID: 0x0002c9030028078e
                Link layer: InfiniBand






Please help me resolve this problem. Thanks in advance.

Regards,
Chaitra