[mvapich-discuss] Segmentation fault

Jonathan Perkins perkinjo at cse.ohio-state.edu
Tue Jun 17 08:51:50 EDT 2014


Hi, sorry to hear that you're experiencing trouble with MVAPICH2.  Can
you provide us with some information on how the application was
launched, such as the exact command used?  It looks like the MPI
application itself is segfaulting, so it would also be helpful to have
a backtrace from one or more of the crashing processes.
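
For example, a minimal sketch of capturing a backtrace from a core dump
(the core-file name is a placeholder, the hosts are taken from your log
below, and you may need to enable core dumps on the compute nodes first):

  $ ulimit -c unlimited          # allow core dumps before launching
  $ mpirun -np 2 -hosts atlas4-c77,atlas4-c78 ./cpi
  $ gdb ./cpi core.<pid>         # open the core left by the crashing rank
  (gdb) bt                       # print the backtrace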

You may need to rebuild the library, in addition to your application,
with debugging symbols.  Please take a look at the following section of
our 1.9 user guide:
https://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.9.html#x1-1230009.1.11
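
For reference, a typical debug rebuild looks roughly like the following
(standard MPICH-style configure options; please check the user guide
section above for the exact recommended flags and paths):

  $ ./configure --prefix=/app1/centos6.3/gnu/mvapich2-1.9 \
        --enable-g=dbg --disable-fast
  $ make && make install
  $ mpicc -g -O0 -o cpi cpi.c    # rebuild the application with debug symbols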

On Tue, Jun 17, 2014 at 12:42 AM, Srikanth Gumma <sri4mailing at gmail.com> wrote:
> Hi,
>
> I have been trying to run a simple cpi.c program with mvapich2-1.9b and am
> having trouble executing it across multiple nodes.
>
> I tried all the options given in the FAQs and online forums without any
> success.
>
> I got the error message below when I executed the command with mpirun -v.
>
> I'm sure I can get some help from some of the experts here. I have installed
> mvapich2 at several other customer sites and never faced this strange issue.
>
> Thanks in advance.
>
> [mpiexec at atlas4-c77] Launch arguments:
> /app1/centos6.3/gnu/mvapich2-1.9/bin/hydra_pmi_proxy --control-port
> 172.18.185.212:45735 --debug --rmk user --launcher ssh --demux poll --iface
> eth1 --pgid 0 --retries 10 --usize -2 --proxy-id 0
> [mpiexec at atlas4-c77] Launch arguments: /usr/bin/ssh -x atlas4-c78
> "/app1/centos6.3/gnu/mvapich2-1.9/bin/hydra_pmi_proxy" --control-port
> 172.18.185.212:45735 --debug --rmk user --launcher ssh --demux poll --iface
> eth1 --pgid 0 --retries 10 --usize -2 --proxy-id 1
> [proxy:0:0 at atlas4-c77] got pmi command (from 0): init
> pmi_version=1 pmi_subversion=1
> [proxy:0:0 at atlas4-c77] PMI response: cmd=response_to_init pmi_version=1
> pmi_subversion=1 rc=0
> [proxy:0:0 at atlas4-c77] got pmi command (from 0): get_maxes
>
> [proxy:0:0 at atlas4-c77] PMI response: cmd=maxes kvsname_max=256 keylen_max=64
> vallen_max=1024
> [proxy:0:0 at atlas4-c77] got pmi command (from 0): get_appnum
>
> [proxy:0:0 at atlas4-c77] PMI response: cmd=appnum appnum=0
> [proxy:0:0 at atlas4-c77] got pmi command (from 0): get_my_kvsname
>
> [proxy:0:0 at atlas4-c77] PMI response: cmd=my_kvsname kvsname=kvs_20136_0
> [proxy:0:0 at atlas4-c77] got pmi command (from 0): get_my_kvsname
>
> [proxy:0:0 at atlas4-c77] PMI response: cmd=my_kvsname kvsname=kvs_20136_0
> [proxy:0:0 at atlas4-c77] got pmi command (from 0): get
> kvsname=kvs_20136_0 key=PMI_process_mapping
> [proxy:0:0 at atlas4-c77] PMI response: cmd=get_result rc=0 msg=success
> value=(vector,(0,2,1))
> [proxy:0:0 at atlas4-c77] got pmi command (from 0): put
> kvsname=kvs_20136_0 key=MVAPICH2_0000 value=000000cd:002e0406:002e0407:
> [proxy:0:0 at atlas4-c77] we don't understand this command put; forwarding
> upstream
> [mpiexec at atlas4-c77] [pgid: 0] got PMI command: cmd=put kvsname=kvs_20136_0
> key=MVAPICH2_0000 value=000000cd:002e0406:002e0407:
> [mpiexec at atlas4-c77] PMI response to fd 6 pid 0: cmd=put_result rc=0
> msg=success
> [proxy:0:0 at atlas4-c77] we don't understand the response put_result;
> forwarding downstream
> [proxy:0:0 at atlas4-c77] got pmi command (from 0): barrier_in
>
> [proxy:0:0 at atlas4-c77] forwarding command (cmd=barrier_in) upstream
> [mpiexec at atlas4-c77] [pgid: 0] got PMI command: cmd=barrier_in
> [proxy:0:1 at atlas4-c78] got pmi command (from 4): init
> pmi_version=1 pmi_subversion=1
> [proxy:0:1 at atlas4-c78] PMI response: cmd=response_to_init pmi_version=1
> pmi_subversion=1 rc=0
> [proxy:0:1 at atlas4-c78] got pmi command (from 4): get_maxes
>
> [proxy:0:1 at atlas4-c78] PMI response: cmd=maxes kvsname_max=256 keylen_max=64
> vallen_max=1024
> [proxy:0:1 at atlas4-c78] got pmi command (from 4): get_appnum
>
> [proxy:0:1 at atlas4-c78] PMI response: cmd=appnum appnum=0
> [proxy:0:1 at atlas4-c78] got pmi command (from 4): get_my_kvsname
>
> [proxy:0:1 at atlas4-c78] PMI response: cmd=my_kvsname kvsname=kvs_20136_0
> [proxy:0:1 at atlas4-c78] got pmi command (from 4): get_my_kvsname
>
> [proxy:0:1 at atlas4-c78] PMI response: cmd=my_kvsname kvsname=kvs_20136_0
> [proxy:0:1 at atlas4-c78] got pmi command (from 4): get
> kvsname=kvs_20136_0 key=PMI_process_mapping
> [proxy:0:1 at atlas4-c78] PMI response: cmd=get_result rc=0 msg=success
> value=(vector,(0,2,1))
> [mpiexec at atlas4-c77] [pgid: 0] got PMI command: cmd=put kvsname=kvs_20136_0
> key=MVAPICH2_0001 value=00000129:002b0405:002b0406:
> [mpiexec at atlas4-c77] PMI response to fd 7 pid 4: cmd=put_result rc=0
> msg=success
> [proxy:0:1 at atlas4-c78] got pmi command (from 4): put
> kvsname=kvs_20136_0 key=MVAPICH2_0001 value=00000129:002b0405:002b0406:
> [proxy:0:1 at atlas4-c78] we don't understand this command put; forwarding
> upstream
> [proxy:0:1 at atlas4-c78] we don't understand the response put_result;
> forwarding downstream
> [mpiexec at atlas4-c77] [pgid: 0] got PMI command: cmd=barrier_in
> [mpiexec at atlas4-c77] PMI response to fd 6 pid 4: cmd=barrier_out
> [mpiexec at atlas4-c77] PMI response to fd 7 pid 4: cmd=barrier_out
> [proxy:0:1 at atlas4-c78] got pmi command (from 4): barrier_in
>
> [proxy:0:1 at atlas4-c78] forwarding command (cmd=barrier_in) upstream
> [proxy:0:0 at atlas4-c77] PMI response: cmd=barrier_out
> [proxy:0:0 at atlas4-c77] got pmi command (from 0): get
> kvsname=kvs_20136_0 key=MVAPICH2_0001
> [mpiexec at atlas4-c77] [pgid: 0] got PMI command: cmd=get kvsname=kvs_20136_0
> key=MVAPICH2_0001
> [mpiexec at atlas4-c77] PMI response to fd 6 pid 0: cmd=get_result rc=0
> msg=success value=00000129:002b0405:002b0406:
> [proxy:0:0 at atlas4-c77] forwarding command (cmd=get kvsname=kvs_20136_0
> key=MVAPICH2_0001) upstream
> [proxy:0:0 at atlas4-c77] we don't understand the response get_result;
> forwarding downstream
> [mpiexec at atlas4-c77] [pgid: 0] got PMI command: cmd=get kvsname=kvs_20136_0
> key=MVAPICH2_0000
> [mpiexec at atlas4-c77] PMI response to fd 7 pid 4: cmd=get_result rc=0
> msg=success value=000000cd:002e0406:002e0407:
> [proxy:0:0 at atlas4-c77] got pmi command (from 0): get
> kvsname=kvs_20136_0 key=MVAPICH2_0001
> [proxy:0:0 at atlas4-c77] forwarding command (cmd=get kvsname=kvs_20136_0
> key=MVAPICH2_0001) upstream
> [proxy:0:1 at atlas4-c78] PMI response: cmd=barrier_out
> [proxy:0:1 at atlas4-c78] got pmi command (from 4): get
> kvsname=kvs_20136_0 key=MVAPICH2_0000
> [proxy:0:1 at atlas4-c78] forwarding command (cmd=get kvsname=kvs_20136_0
> key=MVAPICH2_0000) upstream
> [mpiexec at atlas4-c77] [pgid: 0] got PMI command: cmd=get kvsname=kvs_20136_0
> key=MVAPICH2_0001
> [mpiexec at atlas4-c77] PMI response to fd 6 pid 0: cmd=get_result rc=0
> msg=success value=00000129:002b0405:002b0406:
> [proxy:0:0 at atlas4-c77] we don't understand the response get_result;
> forwarding downstream
> [proxy:0:0 at atlas4-c77] got pmi command (from 0): barrier_in
>
> [proxy:0:0 at atlas4-c77] forwarding command (cmd=barrier_in) upstream
> [mpiexec at atlas4-c77] [pgid: 0] got PMI command: cmd=barrier_in
> [proxy:0:1 at atlas4-c78] we don't understand the response get_result;
> forwarding downstream
> [mpiexec at atlas4-c77] [pgid: 0] got PMI command: cmd=get kvsname=kvs_20136_0
> key=MVAPICH2_0000
> [mpiexec at atlas4-c77] PMI response to fd 7 pid 4: cmd=get_result rc=0
> msg=success value=000000cd:002e0406:002e0407:
> [proxy:0:1 at atlas4-c78] got pmi command (from 4): get
> kvsname=kvs_20136_0 key=MVAPICH2_0000
> [proxy:0:1 at atlas4-c78] forwarding command (cmd=get kvsname=kvs_20136_0
> key=MVAPICH2_0000) upstream
> [proxy:0:1 at atlas4-c78] we don't understand the response get_result;
> forwarding downstream
> [mpiexec at atlas4-c77] [pgid: 0] got PMI command: cmd=barrier_in
> [mpiexec at atlas4-c77] PMI response to fd 6 pid 4: cmd=barrier_out
> [mpiexec at atlas4-c77] PMI response to fd 7 pid 4: cmd=barrier_out
> [proxy:0:0 at atlas4-c77] PMI response: cmd=barrier_out
> [proxy:0:1 at atlas4-c78] got pmi command (from 4): barrier_in
>
> [proxy:0:1 at atlas4-c78] forwarding command (cmd=barrier_in) upstream
> [proxy:0:1 at atlas4-c78] PMI response: cmd=barrier_out
> [proxy:0:0 at atlas4-c77] got pmi command (from 0): barrier_in
>
> [proxy:0:0 at atlas4-c77] forwarding command (cmd=barrier_in) upstream
> [mpiexec at atlas4-c77] [pgid: 0] got PMI command: cmd=barrier_in
> [mpiexec at atlas4-c77] [pgid: 0] got PMI command: cmd=barrier_in
> [mpiexec at atlas4-c77] PMI response to fd 6 pid 4: cmd=barrier_out
> [mpiexec at atlas4-c77] PMI response to fd 7 pid 4: cmd=barrier_out
> [proxy:0:0 at atlas4-c77] PMI response: cmd=barrier_out
> [proxy:0:1 at atlas4-c78] got pmi command (from 4): barrier_in
>
> [proxy:0:1 at atlas4-c78] forwarding command (cmd=barrier_in) upstream
> [proxy:0:1 at atlas4-c78] PMI response: cmd=barrier_out
> [atlas4-c77:mpi_rank_0][error_sighandler] Caught error: Segmentation fault
> (signal 11)
> [atlas4-c78:mpi_rank_1][error_sighandler] Caught error: Segmentation fault
> (signal 11)
>
> ===================================================================================
> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> =   EXIT CODE: 139
> =   CLEANING UP REMAINING PROCESSES
> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> ===================================================================================
> [proxy:0:0 at atlas4-c77] HYD_pmcd_pmip_control_cmd_cb
> (./pm/pmiserv/pmip_cb.c:913): assert (!closed) failed
> [proxy:0:0 at atlas4-c77] HYDT_dmxu_poll_wait_for_event
> (./tools/demux/demux_poll.c:77): callback returned error status
> [proxy:0:0 at atlas4-c77] main (./pm/pmiserv/pmip.c:206): demux engine error
> waiting for event
> [mpiexec at atlas4-c77] HYDT_bscu_wait_for_completion
> (./tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated
> badly; aborting
> [mpiexec at atlas4-c77] HYDT_bsci_wait_for_completion
> (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for
> completion
> [mpiexec at atlas4-c77] HYD_pmci_wait_for_completion
> (./pm/pmiserv/pmiserv_pmci.c:217): launcher returned error waiting for
> completion
> [mpiexec at atlas4-c77] main (./ui/mpich/mpiexec.c:331): process manager error
> waiting for completion
>
>
> Regards
> Srikanth
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>



-- 
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo
