[mvapich-discuss] Slurm

Jim Galarowicz jeg at krellinst.org
Mon Nov 2 11:16:40 EST 2015


Hi Andy,

Thanks! - I did not.

I tried now, but I don't have su privileges.  I will ask the 
administrators of the cluster to do that.
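For reference, here is roughly what I will ask them to run. This is only a sketch: the node names are placeholders, and it assumes the compute nodes use the same /etc/init.d/slurm script as the head node.

```shell
# Restart slurmd on every compute node so it starts under the new
# "ulimit -l unlimited" from /etc/sysconfig/slurm. Hostnames below
# are placeholders for the actual compute nodes.
for node in ccn001 ccn002; do
    ssh "$node" 'sudo /etc/init.d/slurm restart'
done

# Afterwards, tasks launched through slurmd should see the raised limit:
#   srun -n 2 --mpi=pmi2 ulimit.sh
# should print "unlimited" instead of 64.
```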

Thanks much!

Jim G
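P.S. A quick standalone check of why the slurmd-side limit matters (no Slurm needed): resource limits are inherited from parent to child process, so, roughly speaking, an srun-launched task starts from the limits of the slurmd that spawned it on the compute node, not from the shell where srun was typed.

```shell
#!/bin/sh
# Demonstrate rlimit inheritance: a child shell reports the same
# locked-memory limit as its parent, because limits are copied
# across fork/exec.
parent=$(ulimit -l)
child=$(sh -c 'ulimit -l')
echo "parent: $parent"
echo "child:  $child"
```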

On 11/02/2015 07:46 AM, Andy Riebs wrote:
> Hi Jim,
>
> I assume you did, but just in case... did you restart slurm on the 
> compute nodes, as well?
>
> Andy
>
> On 11/02/2015 10:42 AM, Jim Galarowicz wrote:
>> Hi Andy,
>>
>> Thanks for the reply.
>>
>> I restarted slurm with this command:
>>
>> $ sudo /etc/init.d/slurm start
>> [sudo] password for jeg:
>> starting slurmctld:
>>
>> $ !sru
>> srun -n 2 --mpi=pmi2 ulimit.sh
>> ccn001.cc.nx: 64
>> ccn001.cc.nx: 64
>>
>> $  cat ulimit.sh
>> #!/bin/sh
>>     echo $(hostname): $(ulimit -l)
>>
>>
>> It looks like I'm still not getting "unlimited" on the compute nodes,
>> but when I do salloc and run ulimit -l, I see unlimited.
>>
>> [jeg at hdn nbody]$ ulimit -l
>> unlimited
>>
>>
>> [jeg at hdn nbody]$ cat   /etc/sysconfig/slurm
>> ulimit -l unlimited
>>
>> Do you see anything wrong in what I'm doing?
>>
>> Thanks again for the reply!
>>
>> Jim G
>>
>> On 11/01/2015 02:41 PM, Andy Riebs wrote:
>>> Jim,
>>>
>>> Did you restart Slurm on the compute nodes after setting up 
>>> /etc/sysconfig/slurm?
>>>
>>> Also, in your local job, what does "ulimit -l" show? That will get 
>>> propagated to the computes.
>>>
>>> Andy
>>>
>>> On 11/01/2015 05:02 PM, Jim Galarowicz wrote:
>>>> Hi everyone,
>>>>
>>>> I'm running on a small cluster that has slurm and mvapich2 version 2.1
>>>> installed.
>>>> However, I'm seeing this error when I try to run a simple mpi 
>>>> application.
>>>>
>>>>      srun -n 2 --mpi=pmi2 ./nbody-mvapich2
>>>>
>>>>      In: PMI_Abort(1, Fatal error in MPI_Init:
>>>>      Other MPI error, error stack:
>>>>      MPIR_Init_thread(514).......:
>>>>      MPID_Init(367)..............: channel initialization failed
>>>>      MPIDI_CH3_Init(492).........:
>>>>      MPIDI_CH3I_RDMA_init(224)...:
>>>>      rdma_setup_startup_ring(410): cannot create cq
>>>>      )
>>>>      In: PMI_Abort(1, Fatal error in MPI_Init:
>>>>      Other MPI error, error stack:
>>>>      MPIR_Init_thread(514).......:
>>>>      MPID_Init(367)..............: channel initialization failed
>>>>      MPIDI_CH3_Init(492).........:
>>>>      MPIDI_CH3I_RDMA_init(224)...:
>>>>      rdma_setup_startup_ring(410): cannot create cq
>>>>      )
>>>>
>>>>
>>>>
>>>> I searched the internet and found this post on the "cannot create cq"
>>>> issue:
>>>> http://mailman.cse.ohio-state.edu/pipermail/mvapich-discuss/2012-September/004027.html
>>>> It suggests setting "ulimit -l unlimited" in /etc/sysconfig/slurm:
>>>>
>>>>> If it doesn't show unlimited (or some other number much higher 
>>>>> than 64)
>>>>> then you'll need to do something to update the limits slurm is using.
>>>>> On redhat systems you can put the following in /etc/sysconfig/slurm.
>>>>>
>>>>>       ulimit -l unlimited
>>>> So I created that file with the "ulimit -l unlimited" line, but it
>>>> didn't seem to make any difference.
>>>>
>>>> Does anyone have any hints on what might be wrong?
>>>>
>>>> Thank you,
>>>> Jim G
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> mvapich-discuss mailing list
>>>> mvapich-discuss at cse.ohio-state.edu
>>>> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>>>
>>>
>>
>


