[mvapich-discuss] (no subject)

Jim Galarowicz jeg at krellinst.org
Mon Nov 2 11:17:23 EST 2015


Hi Jonathan,

Thanks for this advice!

I will try this and let you know.

Thanks again!
Jim G

On 11/02/2015 07:57 AM, Jonathan Perkins wrote:
> Hi Jim.  In addition to what Andy has suggested, you may want to try 
> adding the following lines to /etc/security/limits.conf on all machines:
> * soft memlock unlimited
> * hard memlock unlimited
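>
> For example, something along these lines (hypothetical host names; it
> assumes password-less ssh and sudo on every node) would append those
> entries everywhere:
>
>     for host in ccn001 ccn002; do
>         # append both memlock entries to limits.conf on each machine
>         ssh "$host" 'printf "* soft memlock unlimited\n* hard memlock unlimited\n" | sudo tee -a /etc/security/limits.conf'
>     done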
>
> After this, restart your sshd and slurm services.  This is related to 
> the following FAQ item in our user guide:
> http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.1-userguide.html#x1-1380009.4.3
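>
> For example, assuming the SysV init scripts from your earlier output
> (adjust the commands for systemd-based nodes):
>
>     sudo /etc/init.d/sshd restart     # so new ssh logins pick up limits.conf
>     sudo /etc/init.d/slurm restart    # on the head node and on every compute node
>
> Afterwards a job step should see the new limit, e.g.:
>
>     srun -n 2 --mpi=pmi2 bash -c 'echo $(hostname): $(ulimit -l)'
>
> and each node should print "unlimited" instead of 64.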
>
> Please let us know if this helps.
>
> On Mon, Nov 2, 2015 at 10:47 AM Andy Riebs <andy.riebs at hpe.com> wrote:
>
>     Hi Jim,
>
>     I assume you did, but just in case... did you restart slurm on the
>     compute nodes, as well?
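>
>     One way to do that (hypothetical host names; assumes ssh access to the
>     computes and the same init script) and to confirm the new limit took
>     effect would be:
>
>         for host in ccn001 ccn002; do
>             ssh "$host" 'sudo /etc/init.d/slurm restart'
>             # check the memlock limit slurmd itself is running with
>             ssh "$host" 'sudo grep "locked memory" "/proc/$(pgrep -o slurmd)/limits"'
>         done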
>
>     Andy
>
>     On 11/02/2015 10:42 AM, Jim Galarowicz wrote:
>     > Hi Andy,
>     >
>     > Thanks for the reply.
>     >
>     > I restarted slurm with this command:
>     >
>     > $ sudo /etc/init.d/slurm start
>     > [sudo] password for jeg:
>     > starting slurmctld:
>     >
>     > $ !sru
>     > srun -n 2 --mpi=pmi2 ulimit.sh
>     > ccn001.cc.nx: 64
>     > ccn001.cc.nx: 64
>     >
>     > $  cat ulimit.sh
>     > #!/bin/sh
>     >     echo $(hostname): $(ulimit -l)
>     >
>     >
>     > It looks like I'm still not getting unlimited on the compute nodes,
>     > but when I do the salloc and run ulimit -l, I see unlimited.
>     >
>     > [jeg at hdn nbody]$ ulimit -l
>     > unlimited
>     >
>     >
>     > [jeg at hdn nbody]$ cat   /etc/sysconfig/slurm
>     > ulimit -l unlimited
>     >
>     > Do you see anything wrong in what I'm doing?
>     >
>     > Thanks again for the reply!
>     >
>     > Jim G
>     >
>     > On 11/01/2015 02:41 PM, Andy Riebs wrote:
>     >> Jim,
>     >>
>     >> Did you restart Slurm on the compute nodes after setting up
>     >> /etc/sysconfig/slurm?
>     >>
>     >> Also, in your local job, what does "ulimit -l" show? That will get
>     >> propagated to the computes.
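>     >>
>     >> A quick way to compare the two (no site-specific assumptions):
>     >>
>     >>     ulimit -l                       # limit in the shell you launch from
>     >>     srun -n 1 bash -c 'ulimit -l'   # limit a job step actually inherits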
>     >>
>     >> Andy
>     >>
>     >> On 11/01/2015 05:02 PM, Jim Galarowicz wrote:
>     >>>
>     >>> Hi everyone,
>     >>>
>     >>> I'm running on a small cluster that has slurm and mvapich2 version 2.1
>     >>> installed.
>     >>> However, I'm seeing this error when I try to run a simple MPI
>     >>> application.
>     >>>
>     >>>      srun -n 2 --mpi=pmi2 ./nbody-mvapich2
>     >>>
>     >>>      In: PMI_Abort(1, Fatal error in MPI_Init:
>     >>>      Other MPI error, error stack:
>     >>>      MPIR_Init_thread(514).......:
>     >>>      MPID_Init(367)..............: channel initialization failed
>     >>>      MPIDI_CH3_Init(492).........:
>     >>>      MPIDI_CH3I_RDMA_init(224)...:
>     >>>      rdma_setup_startup_ring(410): cannot create cq
>     >>>      )
>     >>>      In: PMI_Abort(1, Fatal error in MPI_Init:
>     >>>      Other MPI error, error stack:
>     >>>      MPIR_Init_thread(514).......:
>     >>>      MPID_Init(367)..............: channel initialization failed
>     >>>      MPIDI_CH3_Init(492).........:
>     >>>      MPIDI_CH3I_RDMA_init(224)...:
>     >>>      rdma_setup_startup_ring(410): cannot create cq
>     >>>      )
>     >>>
>     >>>
>     >>>
>     >>> I searched the internet and found this URL
>     >>> (http://mailman.cse.ohio-state.edu/pipermail/mvapich-discuss/2012-September/004027.html)
>     >>> on the "cannot create cq" issue, which suggests we need to set
>     >>> "ulimit -l unlimited" in /etc/sysconfig/slurm:
>     >>>
>     >>>> If it doesn't show unlimited (or some other number much higher
>     >>>> than 64) then you'll need to do something to update the limits
>     >>>> slurm is using.  On redhat systems you can put the following in
>     >>>> /etc/sysconfig/slurm.
>     >>>>
>     >>>>       ulimit -l unlimited
>     >>> So, I created that file containing the "ulimit -l unlimited" line,
>     >>> but it didn't seem to make any difference.
>     >>>
>     >>> Does anyone have any hints on what might be wrong?
>     >>>
>     >>> Thank you,
>     >>> Jim G
>     >>>
>     >>
>     >
>
>     _______________________________________________
>     mvapich-discuss mailing list
>     mvapich-discuss at cse.ohio-state.edu
>     http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>


