[mvapich-discuss] Slurm

Andy Riebs andy.riebs at hpe.com
Mon Nov 2 10:46:16 EST 2015


Hi Jim,

I assume you did, but just in case... did you restart Slurm on the
compute nodes as well?
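
From your output below, "/etc/init.d/slurm start" on the head node only
started slurmctld; slurmd on each compute reads /etc/sysconfig/slurm
when its own init script runs, so the computes need a restart too. As a
sketch of one way to hit them all at once (assuming pdsh can reach the
nodes, and with ccn[001-002] standing in for your real node list):

    pdsh -w ccn[001-002] /etc/init.d/slurm restart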

Andy

On 11/02/2015 10:42 AM, Jim Galarowicz wrote:
> Hi Andy,
>
> Thanks for the reply.
>
> I restarted slurm with this command:
>
> $ sudo /etc/init.d/slurm start
> [sudo] password for jeg:
> starting slurmctld:
>
> $ !sru
> srun -n 2 --mpi=pmi2 ulimit.sh
> ccn001.cc.nx: 64
> ccn001.cc.nx: 64
>
> $  cat ulimit.sh
> #!/bin/sh
>     echo $(hostname): $(ulimit -l)
>
>
> It looks like I'm still not getting "unlimited" on the compute nodes,
> but when I do an salloc and run "ulimit -l" in that shell, I see
> unlimited.
>
> [jeg at hdn nbody]$ ulimit -l
> unlimited
>
>
> [jeg at hdn nbody]$ cat   /etc/sysconfig/slurm
> ulimit -l unlimited
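>
> One more data point that might help: the soft and hard limits a task
> actually inherits, which something along these lines should show
> (a sketch, using /proc/self/limits):
>
> $ srun -n 1 --mpi=pmi2 sh -c 'grep "locked memory" /proc/self/limits'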
>
> Do you see anything wrong in what I'm doing?
>
> Thanks again for the reply!
>
> Jim G
>
> On 11/01/2015 02:41 PM, Andy Riebs wrote:
>> Jim,
>>
>> Did you restart Slurm on the compute nodes after setting up 
>> /etc/sysconfig/slurm?
>>
>> Also, what does "ulimit -l" show in the local shell you launch the
>> job from? That will get propagated to the computes.
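>>
>> (If I remember right, that propagation is controlled by the
>> PropagateResourceLimits setting in slurm.conf, whose default is
>>
>>      PropagateResourceLimits=ALL
>>
>> so tasks request whatever the submitting shell had, though slurmd
>> can't grant more than its own hard limits allow.)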
>>
>> Andy
>>
>> On 11/01/2015 05:02 PM, Jim Galarowicz wrote:
>>>
>>> Hi everyone,
>>>
>>> I'm running on a small cluster that has Slurm and MVAPICH2 2.1
>>> installed. However, I'm seeing this error when I try to run a simple
>>> MPI application:
>>>
>>>      srun -n 2 --mpi=pmi2 ./nbody-mvapich2
>>>
>>>      In: PMI_Abort(1, Fatal error in MPI_Init:
>>>      Other MPI error, error stack:
>>>      MPIR_Init_thread(514).......:
>>>      MPID_Init(367)..............: channel initialization failed
>>>      MPIDI_CH3_Init(492).........:
>>>      MPIDI_CH3I_RDMA_init(224)...:
>>>      rdma_setup_startup_ring(410): cannot create cq
>>>      )
>>>      In: PMI_Abort(1, Fatal error in MPI_Init:
>>>      Other MPI error, error stack:
>>>      MPIR_Init_thread(514).......:
>>>      MPID_Init(367)..............: channel initialization failed
>>>      MPIDI_CH3_Init(492).........:
>>>      MPIDI_CH3I_RDMA_init(224)...:
>>>      rdma_setup_startup_ring(410): cannot create cq
>>>      )
>>>
>>>
>>> I searched the internet and found this URL on the "cannot create cq"
>>> issue (the "cq" is an InfiniBand completion queue, which can't be
>>> created when the process's locked-memory limit is too low):
>>>
>>> http://mailman.cse.ohio-state.edu/pipermail/mvapich-discuss/2012-September/004027.html
>>>
>>> It suggests setting "ulimit -l unlimited" in /etc/sysconfig/slurm:
>>>
>>>> If it doesn't show unlimited (or some other number much higher
>>>> than 64) then you'll need to do something to update the limits
>>>> slurm is using. On redhat systems you can put the following in
>>>> /etc/sysconfig/slurm.
>>>>
>>>>       ulimit -l unlimited
>>> So I created that file with the "ulimit -l unlimited" line in it,
>>> but it didn't seem to make any difference.
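>>>
>>> A quick sanity check of what limit the launched tasks actually get
>>> should be something along these lines:
>>>
>>>      srun -n 1 sh -c 'ulimit -l'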
>>>
>>> Does anyone have any hints on what might be wrong?
>>>
>>> Thank you,
>>> Jim G
>>>
>>>
>>
>


