[mvapich-discuss] (no subject)
Jim Galarowicz
jeg at krellinst.org
Mon Nov 2 11:17:23 EST 2015
Hi Jonathan,
Thanks for this advice!
I will try and let you know.
Thanks again!
Jim G
On 11/02/2015 07:57 AM, Jonathan Perkins wrote:
> Hi Jim. In addition to what Andy has suggested you may want to try
> adding the following lines to /etc/security/limits.conf on all machines.
> * soft memlock unlimited
> * hard memlock unlimited
>
> After this restart your sshd and slurm services. This is related to
> the following FAQ item in our userguide:
> http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.1-userguide.html#x1-1380009.4.3
>
> Please let us know if this helps.
>
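As a quick check that the new limit actually reaches job tasks, a one-line script launched through srun can report what each task sees (a minimal sketch; the srun flags mirror the thread, and the script name is illustrative):

```shell
#!/bin/sh
# check_memlock.sh -- print each task's locked-memory limit.
# Launch it across the allocation, e.g.:
#   srun -n 2 --mpi=pmi2 ./check_memlock.sh
# After the limits.conf change and daemon restarts, every line
# should read "unlimited" instead of the default "64".
echo "$(hostname): $(ulimit -l)"
```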
> On Mon, Nov 2, 2015 at 10:47 AM Andy Riebs <andy.riebs at hpe.com
> <mailto:andy.riebs at hpe.com>> wrote:
>
> Hi Jim,
>
> I assume you did, but just in case... did you restart slurm on the
> compute nodes, as well?
>
> Andy
>
> On 11/02/2015 10:42 AM, Jim Galarowicz wrote:
> > Hi Andy,
> >
> > Thanks for the reply.
> >
> > I restarted slurm with this command:
> >
> > $ sudo /etc/init.d/slurm start
> > [sudo] password for jeg:
> > starting slurmctld:
> >
> > $ !sru
> > srun -n 2 --mpi=pmi2 ulimit.sh
> > ccn001.cc.nx: 64
> > ccn001.cc.nx: 64
> >
> > $ cat ulimit.sh
> > #!/bin/sh
> > echo $(hostname): $(ulimit -l)
> >
> >
> > It looks like I'm still not getting "unlimited" on the compute
> > nodes, but when I do the salloc and run ulimit -l, I see unlimited.
> >
> > [jeg at hdn nbody]$ ulimit -l
> > unlimited
> >
> >
> > [jeg at hdn nbody]$ cat /etc/sysconfig/slurm
> > ulimit -l unlimited
> >
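One way to verify whether the running slurmd actually inherited that limit is to read its entry under /proc (a sketch assuming a Linux /proc layout; pass slurmd's PID on a compute node, e.g. from `pgrep -o slurmd`):

```shell
#!/bin/sh
# show_memlock.sh -- report a process's "max locked memory" limit.
# Usage: ./show_memlock.sh [pid]    (defaults to the current shell)
pid=${1:-$$}
grep -i 'max locked memory' "/proc/$pid/limits"
```

If this still reports 65536 bytes for slurmd after editing /etc/sysconfig/slurm, the daemon was not restarted with the new limit.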
> > Do you see anything wrong in what I'm doing?
> >
> > Thanks again for the reply!
> >
> > Jim G
> >
> > On 11/01/2015 02:41 PM, Andy Riebs wrote:
> >> Jim,
> >>
> >> Did you restart Slurm on the compute nodes after setting up
> >> /etc/sysconfig/slurm?
> >>
> >> Also, in your local job, what does "ulimit -l" show? That will get
> >> propagated to the computes.
> >>
> >> Andy
> >>
> >> On 11/01/2015 05:02 PM, Jim Galarowicz wrote:
> >>>
> >>> Hi everyone,
> >>>
> >>> I'm running on a small cluster that has slurm and mvapich2
> version 2.1
> >>> installed.
> >>> However, I'm seeing this error when I try to run a simple MPI
> >>> application.
> >>>
> >>> srun -n 2 --mpi=pmi2 ./nbody-mvapich2
> >>>
> >>> In: PMI_Abort(1, Fatal error in MPI_Init:
> >>> Other MPI error, error stack:
> >>> MPIR_Init_thread(514).......:
> >>> MPID_Init(367)..............: channel initialization failed
> >>> MPIDI_CH3_Init(492).........:
> >>> MPIDI_CH3I_RDMA_init(224)...:
> >>> rdma_setup_startup_ring(410): cannot create cq
> >>> )
> >>> In: PMI_Abort(1, Fatal error in MPI_Init:
> >>> Other MPI error, error stack:
> >>> MPIR_Init_thread(514).......:
> >>> MPID_Init(367)..............: channel initialization failed
> >>> MPIDI_CH3_Init(492).........:
> >>> MPIDI_CH3I_RDMA_init(224)...:
> >>> rdma_setup_startup_ring(410): cannot create cq
> >>> )
> >>>
> >>>
> >>>
> >>> I searched the internet and found this URL
> >>> (http://mailman.cse.ohio-state.edu/pipermail/mvapich-discuss/2012-September/004027.html)
> >>> on the "cannot create cq" issue, which suggests we need to set
> >>>
> >>> ulimit -l unlimited in /etc/sysconfig/slurm
> >>>
> >>>> If it doesn't show unlimited (or some other number much higher
> >>>> than 64) then you'll need to do something to update the limits
> >>>> slurm is using. On redhat systems you can put the following in
> >>>> /etc/sysconfig/slurm.
> >>>>
> >>>> ulimit -l unlimited
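For reference, the sequence that advice implies is roughly the following (a sketch: the init-script path is a RHEL-era assumption, and the restart must happen on every compute node, not only the head node):

```shell
# Append the limit to the file sourced by Slurm's RHEL init script,
# then restart the daemons so slurmd inherits it.
echo 'ulimit -l unlimited' | sudo tee -a /etc/sysconfig/slurm
sudo /etc/init.d/slurm restart    # repeat on each compute node
```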
> >>> So, I created that file with the "ulimit -l unlimited" statement,
> >>> but it didn't seem to make any difference.
> >>>
> >>> Does anyone have any hints on what might be wrong?
> >>>
> >>> Thank you,
> >>> Jim G
> >>>
> >>>
> >>>
> >>>
> >>>
> >>
> >
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> <mailto:mvapich-discuss at cse.ohio-state.edu>
> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>