[mvapich-discuss] slurm and mvapich2
Jim Galarowicz
jeg at krellinst.org
Mon Nov 2 15:19:10 EST 2015
Hi Jonathan, Andy,
Your suggestions worked. I'm able to run srun with mvapich2 on the
cluster I referenced in the previous emails.
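(For the archive: the failing command from the original report,
    srun -n 2 --mpi=pmi2 ./nbody-mvapich2
now runs cleanly with the memlock limit raised and the services restarted on
the compute nodes.)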
Thank you very much for your help!
Jim G
On 11/02/2015 08:17 AM, Jim Galarowicz wrote:
> Hi Jonathan,
>
> Thanks for this advice!
>
> I will try and let you know.
>
> Thanks again!
> Jim G
>
> On 11/02/2015 07:57 AM, Jonathan Perkins wrote:
>> Hi Jim. In addition to what Andy has suggested, you may want to try
>> adding the following lines to /etc/security/limits.conf on all machines.
>> * soft memlock unlimited
>> * hard memlock unlimited
>>
>> After this, restart your sshd and slurm services. This is related to
>> the following FAQ item in our userguide:
>> http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.1-userguide.html#x1-1380009.4.3
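>>
>> A rough sketch of those steps (assuming the same SysV init scripts used
>> later in this thread; adjust for systemd-based systems), run on every
>> compute node after editing limits.conf:
>>
>>     sudo /etc/init.d/sshd restart
>>     sudo /etc/init.d/slurm restart
>>
>> Then re-check the limit that srun-launched tasks actually inherit, e.g.
>> srun -n 2 --mpi=pmi2 bash -c 'ulimit -l', which should report "unlimited"
>> instead of 64.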
>>
>> Please let us know if this helps.
>>
>> On Mon, Nov 2, 2015 at 10:47 AM Andy Riebs <andy.riebs at hpe.com> wrote:
>>
>> Hi Jim,
>>
>> I assume you did, but just in case... did you restart slurm on the
>> compute nodes, as well?
>>
>> Andy
>>
>> On 11/02/2015 10:42 AM, Jim Galarowicz wrote:
>> > Hi Andy,
>> >
>> > Thanks for the reply.
>> >
>> > I restarted slurm with this command:
>> >
>> > $ sudo /etc/init.d/slurm start
>> > [sudo] password for jeg:
>> > starting slurmctld:
>> >
>> > $ !sru
>> > srun -n 2 --mpi=pmi2 ulimit.sh
>> > ccn001.cc.nx: 64
>> > ccn001.cc.nx: 64
>> >
>> > $ cat ulimit.sh
>> > #!/bin/sh
>> > echo $(hostname): $(ulimit -l)
>> >
>> >
>> > It looks like I'm still not getting "unlimited" on the compute nodes,
>> > but when I do an salloc and run ulimit -l in that shell, I see unlimited.
>> >
>> > [jeg at hdn nbody]$ ulimit -l
>> > unlimited
>> >
>> >
>> > [jeg at hdn nbody]$ cat /etc/sysconfig/slurm
>> > ulimit -l unlimited
>> >
>> > Do you see anything wrong in what I'm doing?
>> >
>> > Thanks again for the reply!
>> >
>> > Jim G
>> >
>> > On 11/01/2015 02:41 PM, Andy Riebs wrote:
>> >> Jim,
>> >>
>> >> Did you restart Slurm on the compute nodes after setting up
>> >> /etc/sysconfig/slurm?
>> >>
>> >> Also, in your local job, what does "ulimit -l" show? That will get
>> >> propagated to the computes.
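>> >>
>> >> A quick way to compare the two (just a sketch):
>> >>
>> >>     ulimit -l                      # limit in the shell where srun runs
>> >>     srun -n 1 bash -c 'ulimit -l'  # limit the compute-node task inherits
>> >>
>> >> If the first says unlimited but the second still says 64, the limit is
>> >> being clamped on the compute-node side (slurmd/sshd) rather than by the
>> >> submitting shell.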
>> >>
>> >> Andy
>> >>
>> >> On 11/01/2015 05:02 PM, Jim Galarowicz wrote:
>> >>>
>> >>> Hi everyone,
>> >>>
>> >>> I'm running on a small cluster that has slurm and mvapich2 version 2.1
>> >>> installed.
>> >>> However, I'm seeing this error when I try to run a simple mpi
>> >>> application.
>> >>>
>> >>>    srun -n 2 --mpi=pmi2 ./nbody-mvapich2
>> >>>
>> >>>    In: PMI_Abort(1, Fatal error in MPI_Init:
>> >>>    Other MPI error, error stack:
>> >>>    MPIR_Init_thread(514).......:
>> >>>    MPID_Init(367)..............: channel initialization failed
>> >>>    MPIDI_CH3_Init(492).........:
>> >>>    MPIDI_CH3I_RDMA_init(224)...:
>> >>>    rdma_setup_startup_ring(410): cannot create cq
>> >>>    )
>> >>>    In: PMI_Abort(1, Fatal error in MPI_Init:
>> >>>    Other MPI error, error stack:
>> >>>    MPIR_Init_thread(514).......:
>> >>>    MPID_Init(367)..............: channel initialization failed
>> >>>    MPIDI_CH3_Init(492).........:
>> >>>    MPIDI_CH3I_RDMA_init(224)...:
>> >>>    rdma_setup_startup_ring(410): cannot create cq
>> >>>    )
>> >>>
>> >>>
>> >>>
>> >>> I searched the internet and found this url
>> >>> (http://mailman.cse.ohio-state.edu/pipermail/mvapich-discuss/2012-September/004027.html)
>> >>> on the "cannot create cq" issue, which suggests we need to set
>> >>>
>> >>>     ulimit -l unlimited
>> >>>
>> >>> in /etc/sysconfig/slurm.
>> >>>
>> >>>> If it doesn't show unlimited (or some other number much higher than 64)
>> >>>> then you'll need to do something to update the limits slurm is using.
>> >>>> On redhat systems you can put the following in /etc/sysconfig/slurm.
>> >>>>
>> >>>> ulimit -l unlimited
>> >>> So, I created that file with the "ulimit -l unlimited" statement in it,
>> >>> but it didn't seem to make any difference on the issue.
>> >>>
>> >>> Does anyone have any hints on what might be wrong?
>> >>>
>> >>> Thank you,
>> >>> Jim G
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>
>> >
>>
>> _______________________________________________
>> mvapich-discuss mailing list
>> mvapich-discuss at cse.ohio-state.edu
>> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>
>