[mvapich-discuss] mpiexec and mpirun_rsh as non Root issues

Diego Humberto Kalegari kalegari at lactec.org.br
Wed Aug 17 12:50:05 EDT 2011


Oops... sorry, I put the log in the wrong position.

It works as root: I can see the DETest process on all nodes, and I also see the output file.

I took a look at the info you sent:
* soft memlock phys_mem_in_KB

I configured this:

* soft memlock unlimited  
* hard memlock unlimited
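
A note on verifying this (a rough check, assuming the two lines above were added to /etc/security/limits.conf on every node): limits.conf settings only apply to newly started sessions, mpirun_rsh launches the remote ranks through non-interactive ssh, and root can usually lock memory beyond the limit anyway, which would explain runs that succeed as root but fail as l0626. The locked-memory limit can be read both locally and over ssh:

ulimit -l             # in a fresh login as l0626; should now report "unlimited"
ssh n01 ulimit -l     # what a non-interactive ssh session on a compute node gets
ssh n02 ulimit -l

If the ssh form still reports a small value such as 32 or 64, the new limit is not reaching the MPI processes even though the interactive shell looks fine.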

No more MPI_Init errors, but I now get the following when running as l0626, and also as root:

mpirun_rsh -ssh -np 48 -hostfile hosts ./DETest 13 4700 10000 sequence.txt 47 0.8 0

[n01:mpi_rank_3][error_sighandler] Caught error: Segmentation fault (signal 11)
[n01:mpi_rank_4][error_sighandler] Caught error: Segmentation fault (signal 11)
[n01:mpi_rank_7][error_sighandler] Caught error: Segmentation fault (signal 11)
[n01:mpi_rank_9][error_sighandler] Caught error: Segmentation fault (signal 11)
[n01:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 7. MPI process died?
[n01:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI process died?
[n01:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node n01 aborted: Error while reading a PMI socket (4)
[n01:mpi_rank_11][error_sighandler] Caught error: Segmentation fault (signal 11)
[n01:mpi_rank_14][error_sighandler] Caught error: Segmentation fault (signal 11)
[n01:mpi_rank_15][error_sighandler] Caught error: Segmentation fault (signal 11)
[n01:mpi_rank_16][error_sighandler] Caught error: Segmentation fault (signal 11)
[n01:mpi_rank_18][error_sighandler] Caught error: Segmentation fault (signal 11)
[n01:mpi_rank_20][error_sighandler] Caught error: Segmentation fault (signal 11)
[n01:mpi_rank_21][error_sighandler] Caught error: Segmentation fault (signal 11)
[n02:mpi_rank_25][error_sighandler] Caught error: Segmentation fault (signal 11)
[n02:mpi_rank_26][error_sighandler] Caught error: Segmentation fault (signal 11)
[n02:mpi_rank_27][error_sighandler] Caught error: Segmentation fault (signal 11)
[n02:mpi_rank_28][error_sighandler] Caught error: Segmentation fault (signal 11)
[n02:mpi_rank_29][error_sighandler] Caught error: Segmentation fault (signal 11)
[n02:mpispawn_1][readline] Unexpected End-Of-File on file descriptor 18. MPI process died?
[n02:mpispawn_1][mtpmi_processops] Error while reading PMI socket. MPI process died?
[n02:mpi_rank_32][error_sighandler] Caught error: Segmentation fault (signal 11)
[n02:mpi_rank_34][error_sighandler] Caught error: Segmentation fault (signal 11)
[n02:mpi_rank_36][error_sighandler] Caught error: Segmentation fault (signal 11)
[n02:mpi_rank_38][error_sighandler] Caught error: Segmentation fault (signal 11)
[n01:mpispawn_0][child_handler] MPI process (rank: 3, pid: 9026) terminated with signal 11 -> abort job
[n02:mpi_rank_44][error_sighandler] Caught error: Segmentation fault (signal 11)
[n02:mpi_rank_46][error_sighandler] Caught error: Segmentation fault (signal 11)
[n02:mpispawn_1][child_handler] MPI process (rank: 26, pid: 7211) terminated with signal 11 -> abort job



________________________________________
From: Jonathan Perkins [perkinjo at cse.ohio-state.edu]
Sent: Wednesday, August 17, 2011 1:19 PM
To: Diego Humberto Kalegari
Cc: mvapich-discuss at cse.ohio-state.edu
Subject: Re: [mvapich-discuss] mpiexec and mpirun_rsh as non Root issues

What I see from your output seems to indicate that it is not working
as root either, due to cq creation problems.  Please take a look at
section 9.4.3 of our userguide for more information on how to resolve
this.

http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.7rc1.html#x1-1090009.4.3

On Wed, Aug 17, 2011 at 12:04 PM, Diego Humberto Kalegari
<kalegari at lactec.org.br> wrote:
> Hello Jonathan
>
> I shared the MVAPICH2 installation from dvse-cluster to all nodes, and I added its path to the $PATH of all nodes.
> This is the path where MVAPICH2 is installed -> /home/MPI/bin
>
>  echo $PATH
> /home/l0626/bin:/home/MPI/bin:/usr/local/bin:/usr/bin:/bin:/usr/bin/X11:/usr/X11R6/bin:/usr/games:/usr/lib/mit/bin:/usr/lib/mit/sbin
> l0626 at n01:~>
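>
> A related check (just a sketch; /home/MPI/bin is the shared install directory mentioned above): non-interactive ssh sessions can read a different profile than a login shell, and mpirun_rsh needs to find the mpispawn helper on every node, so it is worth confirming the remote PATH too:
>
> ssh n03 'echo $PATH'       # should include /home/MPI/bin
> ssh n03 which mpispawn     # should resolve to /home/MPI/bin/mpispawn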
>
> This is the ssh hostname result to some of my nodes
>
> l0626 at n01:~> ssh n03 hostname
> n03
>
> l0626 at n02:~> ssh n03 hostname
> n03
> l0626 at n02:~>
>
>
> Running the following command as root works.
> mpirun_rsh -np 11 -hostfile hosts ./DETest 55 1000 10000 sequence.txt 10 0.8 0
>
> Fatal error in MPI_Init:
> Other MPI error
>
> Fatal error in MPI_Init:
> Other MPI error
>
> cannot create cq
> cannot create cq
> Fatal error in MPI_Init:
> Other MPI error
>
> but as l0626, for example, it does not. But if I run it on only a single node, it works. Below are the logs.
>
> This command also works as root
>
> mpirun -hosts n01:24,n03:24,n04:24,n05:24,n06:24,n07:24,n08:24,n09:24,n10:24,n11:24 -np 240 ./DETest 13 2300 1000 sequence.txt 230 0.8 0
>
> but not as l0626; below are the logs
>
> Initializing MPI
> Initializing MPI
> Fatal error in MPI_Init:
> Other MPI error
>
> Fatal error in MPI_Init:
> Other MPI error
>
> Thanks,
>
> I really appreciate your help.
>
> Diego
>
> ________________________________________
> From: Jonathan Perkins [perkinjo at cse.ohio-state.edu]
> Sent: Wednesday, August 17, 2011 11:52 AM
> To: Diego Humberto Kalegari
> Cc: mvapich-discuss at cse.ohio-state.edu
> Subject: Re: [mvapich-discuss] mpiexec and mpirun_rsh as non Root issues
>
> Hello,
>
> To determine what may be wrong I'll just ask that you double check on
> a few things.  Are you just trying to run a 2 process job?  If so, are
> you able to log in as the user mpiexec on the host named dvse-cluster
> and then `ssh second_hostname'?  If so, have you installed mvapich2 on
> all machines in the same location, or do you use a shared filesystem?
>
> If any of these things fail, please send back the full failure message,
> including the command that caused the failure.
>
> On Wed, Aug 17, 2011 at 10:12 AM, Diego Humberto Kalegari
> <kalegari at lactec.org.br> wrote:
>> Hello All,
>>
>> I'm trying to set up an environment with MVAPICH2.
>>
>> I installed it and made all the SSH configuration required to avoid being asked for a user-specific password. When I ssh to another system as any user, it logs in automatically.
>>
>> I was successfully able to run mpiexec and mpirun_rsh as the root user. But when I try to run it as another user, any other user on my system, I can't; it gives me the following:
>>
>> Fatal error in MPI_Init:
>> Other MPI error
>>
>> [mpiexec at dvse-cluster] control_cb (./pm/pmiserv/pmiserv_cb.c:215): assert (!closed) failed
>> [mpiexec at dvse-cluster] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
>> [mpiexec at dvse-cluster] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:179): error waiting for event
>> [mpiexec at dvse-cluster] main (./ui/mpich/mpiexec.c:397): process manager error waiting for completion
>>
>> Could someone please provide me with any support?
>>
>> Best Regards
>>
>> Diego
>> _______________________________________________
>> mvapich-discuss mailing list
>> mvapich-discuss at cse.ohio-state.edu
>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>
>>
>
>
>
> --
> Jonathan Perkins
> http://www.cse.ohio-state.edu/~perkinjo
>
>



--
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo


