[mvapich-discuss] mvapich2+slurm+blcr Oh, My! (fwd)

xiangyong ouyang ouyangx at cse.ohio-state.edu
Thu Jun 24 15:10:06 EDT 2010


Currently "MV2_CKPT_INTERVAL" is in minutes.  Sonya, is that correct?
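
If the interval is indeed in minutes (still to be confirmed), a launch with the tested mvapich2 + mpirun_rsh + blcr combination might look like the sketch below. The hostfile name, checkpoint path, interval value and process count are illustrative assumptions, not values from this thread. The script only prints the command so it can be reviewed before running.

```shell
#!/bin/sh
# Sketch only: every value below is an illustrative assumption.
NP=8
HOSTFILE=./hosts              # hypothetical hostfile, one node name per line
CKPT_FILE=/lustre/ckpt/ior    # hypothetical checkpoint file prefix
CKPT_INTERVAL=30              # assumed to be minutes, pending confirmation

# mpirun_rsh accepts NAME=VALUE environment settings before the executable.
CMD="mpirun_rsh -np $NP -hostfile $HOSTFILE MV2_CKPT_FILE=$CKPT_FILE MV2_CKPT_INTERVAL=$CKPT_INTERVAL ./IOR -o /lustre/testFile"

# Print the command instead of executing it.
echo "$CMD"
```

Once the units are confirmed, adjust CKPT_INTERVAL and run the printed command by hand.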


-xiangyong ouyang


On Thu, Jun 24, 2010 at 11:58 AM, David Brown <dmlb2000 at gmail.com> wrote:
> Thanks, I rebuilt mvapich2 to use mpirun_rsh and now it's running fine.
>
> Question about the environment variables.
>
> Is MV2_CKPT_INTERVAL in minutes, seconds, or hours?
>
> Also, what would need to be done to get the slurm + blcr combination
> working? Is slurm supposed to set up the controlling daemon and open all
> the ports to the processes for communication?
>
> Just curious, Thanks.
>  - David Brown
>
> On Wed, Jun 23, 2010 at 9:29 PM, Dhabaleswar Panda
> <panda at cse.ohio-state.edu> wrote:
>> Hi,
>>
>> Thanks for reporting this issue. This is to let you know that the
>> following combinations (in addition to other combinations like hydra,
>> etc.) have been tested:
>>
>>   - mvapich2 + slurm
>>   - mvapich2 + mpirun_rsh
>>   - mvapich2 + mpirun_rsh + blcr
>>
>> However, the combination you are trying to use (mvapich2 + slurm + blcr)
>> has not been tested. We will take a look at it.
>>
>> In the meantime, I suggest you try the `mvapich2 + mpirun_rsh +
>> blcr' combination.
>>
>> Sections 5.2.1 and 6.3 of the latest MVAPICH2 1.5 user guide describe the
>> detailed steps for using mpirun_rsh and checkpoint-restart with BLCR,
>> respectively.
>>
>> The User Guide can be obtained from the following URL:
>>
>> http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.5rc2.html
>>
>> Thanks,
>>
>> DK
>>
>>
>> On Wed, 23 Jun 2010, David Brown wrote:
>>
>>> I'm not sure whom to ask about this or which part of my setup is
>>> wrong, but something isn't working right.
>>>
>>> I built mvapich2 1.5rc2 this way:
>>>
>>> ./configure \
>>>                 --prefix=%{mpidir} \
>>>                 --mandir=%{mpidir}/man \
>>>                 --enable-error-checking=runtime \
>>>                 --enable-timing=none \
>>>                 --enable-g=mem,dbg,meminit \
>>>                 --enable-sharedlibs=gcc \
>>>                 --with-rdma=gen2 \
>>>                 --enable-romio \
>>>                 --with-file-system=lustre+nfs \
>>>                 --with-slurm=/usr \
>>>                 --with-pmi=slurm \
>>>                 --with-pm=no \
>>>                 --enable-threads=multiple \
>>>                 --with-thread-package=pthreads \
>>>                 --disable-mpe \
>>>                 --without-mpe \
>>>                 --disable-nmpi-as-mpi \
>>>                 --enable-f77 \
>>>                 --enable-f90 \
>>>                 --enable-cxx \
>>>                 --enable-blcr
>>>
>>> I built slurm this way:
>>>
>>> rpmbuild -ta --with blcr --with postgresql slurm-2.1.9.tar.bz2
>>>
>>> I configured slurm with:
>>>
>>> CheckpointType=checkpoint/blcr
>>>
>>> And I built IOR and launched it this way:
>>>
>>> $ srun --checkpoint-dir=/lustre -n 8 -N 4  ./IOR -i 4 -b 32g -T 600 -E
>>> -k -e -t 1m -o /lustre/testFile
>>> [Rank 1][cr.c: line 781]MV2_CKPT_MPD_BASE_PORT is not set
>>> [Rank 3][cr.c: line 781]MV2_CKPT_MPD_BASE_PORT is not set
>>> [Rank 0][cr.c: line 781]MV2_CKPT_MPD_BASE_PORT is not set
>>> [Rank 2][cr.c: line 781]MV2_CKPT_MPD_BASE_PORT is not set
>>> [Rank 5][cr.c: line 781]MV2_CKPT_MPD_BASE_PORT is not set
>>> [Rank 4][cr.c: line 781]MV2_CKPT_MPD_BASE_PORT is not set
>>> [Rank 7][cr.c: line 781]MV2_CKPT_MPD_BASE_PORT is not set
>>> [Rank 6][cr.c: line 781]MV2_CKPT_MPD_BASE_PORT is not set
>>> srun: error: x11: tasks 2-3: Exited with exit code 255
>>> srun: error: x10: tasks 0-1: Exited with exit code 255
>>> srun: error: x12: tasks 4-5: Exited with exit code 255
>>> srun: error: x13: tasks 6-7: Exited with exit code 255
>>>
>>> So I'm confused; this is obviously not working. Do I need to use some
>>> other mechanism to launch? Am I missing configuration somewhere?
>>>
>>> Thanks,
>>> - David Brown
>>> _______________________________________________
>>> mvapich-discuss mailing list
>>> mvapich-discuss at cse.ohio-state.edu
>>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>>
>>


