[mvapich-discuss] slurm and mvapich2

Jim Galarowicz jeg at krellinst.org
Mon Nov 2 15:19:10 EST 2015


Hi Jonathan, Andy,

Your suggestions worked.   I'm able to run srun with mvapich2 on the 
cluster I referenced in the previous emails.
Thank you very much for your help!

Jim G


On 11/02/2015 08:17 AM, Jim Galarowicz wrote:
> Hi Jonathan,
>
> Thanks for this advice!
>
> I will try and let you know.
>
> Thanks again!
> Jim G
>
> On 11/02/2015 07:57 AM, Jonathan Perkins wrote:
>> Hi Jim.  In addition to what Andy has suggested, you may want to try
>> adding the following lines to /etc/security/limits.conf on all machines:
>> * soft memlock unlimited
>> * hard memlock unlimited
>>
>> After this, restart your sshd and slurm services.  This is related to
>> the following FAQ item in our user guide:
>> http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.1-userguide.html#x1-1380009.4.3
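>>
>> For example, roughly (the restart commands are illustrative and may
>> differ by distribution; this cluster appears to use /etc/init.d scripts):
>>
>>     # /etc/security/limits.conf (on every node): add these two lines
>>     * soft memlock unlimited
>>     * hard memlock unlimited
>>
>>     # then restart the services so they pick up the new limit
>>     sudo /etc/init.d/sshd restart
>>     sudo /etc/init.d/slurm restart
>>
>>     # verify from the head node with the ulimit.sh script shown further
>>     # down in this thread; both ranks should print "unlimited"
>>     srun -n 2 --mpi=pmi2 ulimit.sh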
>>
>> Please let us know if this helps.
>>
>> On Mon, Nov 2, 2015 at 10:47 AM Andy Riebs <andy.riebs at hpe.com> wrote:
>>
>>     Hi Jim,
>>
>>     I assume you did, but just in case... did you restart slurm on the
>>     compute nodes, as well?
>>
>>     Andy
>>
>>     On 11/02/2015 10:42 AM, Jim Galarowicz wrote:
>>     > Hi Andy,
>>     >
>>     > Thanks for the reply.
>>     >
>>     > I restarted slurm with this command:
>>     >
>>     > $ sudo /etc/init.d/slurm start
>>     > [sudo] password for jeg:
>>     > starting slurmctld:
>>     >
>>     > $ !sru
>>     > srun -n 2 --mpi=pmi2 ulimit.sh
>>     > ccn001.cc.nx: 64
>>     > ccn001.cc.nx: 64
>>     >
>>     > $  cat ulimit.sh
>>     > #!/bin/sh
>>     >     echo $(hostname): $(ulimit -l)
>>     >
>>     >
>>     > It looks like I'm still not getting unlimited on the compute nodes,
>>     > but when I do the salloc and run ulimit -l, I see unlimited.
>>     >
>>     > [jeg at hdn nbody]$ ulimit -l
>>     > unlimited
>>     >
>>     >
>>     > [jeg at hdn nbody]$ cat   /etc/sysconfig/slurm
>>     > ulimit -l unlimited
>>     >
>>     > Do you see anything wrong in what I'm doing?
>>     >
>>     > Thanks again for the reply!
>>     >
>>     > Jim G
>>     >
>>     > On 11/01/2015 02:41 PM, Andy Riebs wrote:
>>     >> Jim,
>>     >>
>>     >> Did you restart Slurm on the compute nodes after setting up
>>     >> /etc/sysconfig/slurm?
>>     >>
>>     >> Also, in your local job, what does "ulimit -l" show? That will get
>>     >> propagated to the computes.
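>>     >>
>>     >> For instance (the host list and ssh loop here are only illustrative):
>>     >>
>>     >>     # restart the slurm daemon on each compute node as well
>>     >>     for host in ccn001 ccn002; do
>>     >>         ssh $host sudo /etc/init.d/slurm restart
>>     >>     done
>>     >>
>>     >>     # then compare the local limit with what srun propagates
>>     >>     ulimit -l
>>     >>     srun -n 2 bash -c 'echo $(hostname): $(ulimit -l)'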
>>     >>
>>     >> Andy
>>     >>
>>     >> On 11/01/2015 05:02 PM, Jim Galarowicz wrote:
>>     >>>
>>     >>> Hi everyone,
>>     >>>
>>     >>> I'm running on a small cluster that has slurm and mvapich2
>>     >>> version 2.1 installed.
>>     >>> However, I'm seeing this error when I try to run a simple MPI
>>     >>> application:
>>     >>>
>>     >>>      srun -n 2 --mpi=pmi2 ./nbody-mvapich2
>>     >>>
>>     >>>      In: PMI_Abort(1, Fatal error in MPI_Init:
>>     >>>      Other MPI error, error stack:
>>     >>>      MPIR_Init_thread(514).......:
>>     >>>      MPID_Init(367)..............: channel initialization failed
>>     >>>      MPIDI_CH3_Init(492).........:
>>     >>>      MPIDI_CH3I_RDMA_init(224)...:
>>     >>>      rdma_setup_startup_ring(410): cannot create cq
>>     >>>      )
>>     >>>
>>     >>>      (the same error is printed by the second rank)
>>     >>>
>>     >>>
>>     >>>
>>     >>> I searched the internet and found this URL
>>     >>> (http://mailman.cse.ohio-state.edu/pipermail/mvapich-discuss/2012-September/004027.html)
>>     >>> on the "cannot create cq" issue, which suggests we need to set
>>     >>>
>>     >>>     ulimit -l unlimited    in /etc/sysconfig/slurm
>>     >>>
>>     >>>> If it doesn't show unlimited (or some other number much higher
>>     >>>> than 64) then you'll need to do something to update the limits
>>     >>>> slurm is using.
>>     >>>> On redhat systems you can put the following in /etc/sysconfig/slurm.
>>     >>>>
>>     >>>>       ulimit -l unlimited
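>>     >>>
>>     >>> For example (the exact commands here are only an illustration):
>>     >>>
>>     >>>     echo "ulimit -l unlimited" | sudo tee /etc/sysconfig/slurm
>>     >>>     sudo /etc/init.d/slurm restart
>>     >>>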
>>     >>> So, I created that file with the "ulimit -l unlimited" statement
>>     >>> in it, but it didn't seem to make any difference.
>>     >>>
>>     >>> Does anyone have any hints on what might be wrong?
>>     >>>
>>     >>> Thank you,
>>     >>> Jim G
>>     >>>
>>     >>
>>     >
>>
>>
>
