[mvapich-discuss] Slurm
Jim Galarowicz
jeg at krellinst.org
Mon Nov 2 11:16:40 EST 2015
Hi Andy,
Thanks! - I did not.
I tried now, but I don't have su privileges. I will ask the
administrators of the cluster to do that.
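In the meantime, here is a quick sanity check I can ask them to run on the compute nodes. The `memlock_ok` helper is just something I sketched, not part of any tool, and the 65536 KB (64 MB) threshold is my own assumption of a "comfortably large" memlock value; the point is only that the 64 KB default is far too small for mvapich2's RDMA completion-queue setup:

```shell
#!/bin/sh
# memlock_ok: hypothetical helper that checks whether a `ulimit -l`
# value (a number in KB, or the word "unlimited") looks large enough.
# The 64 KB default is far too small for RDMA CQ creation.
memlock_ok() {
    case "$1" in
        unlimited)   return 0 ;;   # no limit at all
        ''|*[!0-9]*) return 1 ;;   # empty or not a number
        *) [ "$1" -ge 65536 ] ;;   # require at least 64 MB (in KB)
    esac
}

# Check the current shell's limit; to check the compute nodes,
# run this script via: srun -n 2 --mpi=pmi2 sh memlock_check.sh
if memlock_ok "$(ulimit -l)"; then
    echo "$(hostname): memlock looks OK ($(ulimit -l))"
else
    echo "$(hostname): memlock too low ($(ulimit -l))" >&2
fi
```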
Thanks much!
Jim G
On 11/02/2015 07:46 AM, Andy Riebs wrote:
> Hi Jim,
>
> I assume you did, but just in case... did you restart slurm on the
> compute nodes, as well?
>
> Andy
>
> On 11/02/2015 10:42 AM, Jim Galarowicz wrote:
>> Hi Andy,
>>
>> Thanks for the reply.
>>
>> I restarted slurm with this command:
>>
>> $ sudo /etc/init.d/slurm start
>> [sudo] password for jeg:
>> starting slurmctld:
>>
>> $ !sru
>> srun -n 2 --mpi=pmi2 ulimit.sh
>> ccn001.cc.nx: 64
>> ccn001.cc.nx: 64
>>
>> $ cat ulimit.sh
>> #!/bin/sh
>> echo $(hostname): $(ulimit -l)
>>
>>
>> It looks like I'm still not getting "unlimited" on the compute nodes,
>> but when I do an salloc and run ulimit -l, I see unlimited.
>>
>> [jeg at hdn nbody]$ ulimit -l
>> unlimited
>>
>>
>> [jeg at hdn nbody]$ cat /etc/sysconfig/slurm
>> ulimit -l unlimited
>>
>> Do you see anything wrong in what I'm doing?
>>
>> Thanks again for the reply!
>>
>> Jim G
>>
>> On 11/01/2015 02:41 PM, Andy Riebs wrote:
>>> Jim,
>>>
>>> Did you restart Slurm on the compute nodes after setting up
>>> /etc/sysconfig/slurm?
>>>
>>> Also, in your local job, what does "ulimit -l" show? That will get
>>> propagated to the computes.
>>>
>>> Andy
>>>
>>> On 11/01/2015 05:02 PM, Jim Galarowicz wrote:
>>>> Hi everyone,
>>>>
>>>> I'm running on a small cluster that has Slurm and mvapich2 version 2.1
>>>> installed.
>>>> However, I'm seeing this error when I try to run a simple MPI
>>>> application.
>>>>
>>>> srun -n 2 --mpi=pmi2 ./nbody-mvapich2
>>>>
>>>> In: PMI_Abort(1, Fatal error in MPI_Init:
>>>> Other MPI error, error stack:
>>>> MPIR_Init_thread(514).......:
>>>> MPID_Init(367)..............: channel initialization failed
>>>> MPIDI_CH3_Init(492).........:
>>>> MPIDI_CH3I_RDMA_init(224)...:
>>>> rdma_setup_startup_ring(410): cannot create cq
>>>> )
>>>> In: PMI_Abort(1, Fatal error in MPI_Init:
>>>> Other MPI error, error stack:
>>>> MPIR_Init_thread(514).......:
>>>> MPID_Init(367)..............: channel initialization failed
>>>> MPIDI_CH3_Init(492).........:
>>>> MPIDI_CH3I_RDMA_init(224)...:
>>>> rdma_setup_startup_ring(410): cannot create cq
>>>> )
>>>>
>>>>
>>>>
>>>> I searched the internet and found this URL
>>>> (http://mailman.cse.ohio-state.edu/pipermail/mvapich-discuss/2012-September/004027.html)
>>>>
>>>> about the "cannot create cq" issue, which suggests setting
>>>>
>>>> ulimit -l unlimited in /etc/sysconfig/slurm
>>>>
>>>>> If it doesn't show unlimited (or some other number much higher
>>>>> than 64)
>>>>> then you'll need to do something to update the limits slurm is using.
>>>>> On redhat systems you can put the following in /etc/sysconfig/slurm.
>>>>>
>>>>> ulimit -l unlimited
>>>> So, I created that file containing the "ulimit -l unlimited" line.
>>>> But it didn't seem to make any difference.
>>>>
>>>> Does anyone have any hints on what might be wrong?
>>>>
>>>> Thank you,
>>>> Jim G
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>
>