[mvapich-discuss] mvapich2+slurm+blcr Oh, My! (fwd)

Sonya smarcare at cse.ohio-state.edu
Thu Jun 24 15:15:15 EDT 2010


On 24/06/2010 15:10, xiangyong ouyang wrote:
> Currently "MV2_CKPT_INTERVAL" is in minutes.  Sonya, is that correct?
>
>
>    
Yes, it is in minutes.

Sonya
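
Because the unit is minutes, a period worked out in seconds has to be
converted before it is passed on. The lines below are only a minimal
sketch of that conversion, not a tested recipe: the helper name
seconds_to_ckpt_minutes, the ./hosts file and the process count are
made up for illustration, and the launch line is printed rather than
executed. It assumes the mpirun_rsh convention of putting NAME=VALUE
settings before the executable.

```shell
#!/bin/sh
# MV2_CKPT_INTERVAL is given in minutes, so convert a period in
# seconds to whole minutes, rounding up.
# (seconds_to_ckpt_minutes is a hypothetical helper, not part of MVAPICH2.)
seconds_to_ckpt_minutes() {
    secs=$1
    echo $(( (secs + 59) / 60 ))
}

# Example: checkpoint every 1800 seconds, i.e. every 30 minutes.
interval=$(seconds_to_ckpt_minutes 1800)

# Hypothetical host file and process count; the command is only printed
# so it can be inspected before it is run for real.
echo "mpirun_rsh -np 8 -hostfile ./hosts MV2_CKPT_INTERVAL=$interval ./IOR"
```

Rounding up keeps the interval from collapsing to 0 when the requested
period is shorter than one minute.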
> -xiangyong ouyang
>
>
> On Thu, Jun 24, 2010 at 11:58 AM, David Brown<dmlb2000 at gmail.com>  wrote:
>    
>> Thanks, I rebuilt mvapich2 to use mpirun_rsh and now it's running fine.
>>
>> Question about the environment variables.
>>
>> Is MV2_CKPT_INTERVAL in minutes, seconds, or hours?
>>
>> Also, what would need to be done to get this working? Is slurm
>> supposed to set up the controlling daemon and open all the ports to the
>> processes for communication?
>>
>> Just curious, Thanks.
>>   - David Brown
>>
>> On Wed, Jun 23, 2010 at 9:29 PM, Dhabaleswar Panda
>> <panda at cse.ohio-state.edu>  wrote:
>>      
>>> Hi,
>>>
>>> Thanks for reporting this issue. This is to let you know that the
>>> following combinations (in addition to other combinations like hydra,
>>> etc.) have been tested:
>>>
>>>    - mvapich2 + slurm
>>>    - mvapich2 + mpirun_rsh
>>>    - mvapich2 + mpirun_rsh + blcr
>>>
>>> However, the combination you are trying to use (mvapich2 + slurm + blcr)
>>> has not been tested. We will take a look at it.
>>>
>>> In the meantime, I suggest you try the `mvapich2 + mpirun_rsh +
>>> blcr' combination.
>>>
>>> Sections 5.2.1 and 6.3 of the latest MVAPICH2 1.5 user guide describe
>>> the detailed steps for using mpirun_rsh and checkpoint-restart with
>>> BLCR, respectively.
>>>
>>> The User Guide can be obtained from the following URL:
>>>
>>> http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.5rc2.html
>>>
>>> Thanks,
>>>
>>> DK
>>>
>>>
>>> On Wed, 23 Jun 2010, David Brown wrote:
>>>
>>>        
>>>> I'm not sure whom to ask or which part of my setup is wrong, but
>>>> something isn't working right.
>>>>
>>>> I built mvapich2 1.5rc2 this way:
>>>>
>>>> ./configure \
>>>>                  --prefix=%{mpidir} \
>>>>                  --mandir=%{mpidir}/man \
>>>>                  --enable-error-checking=runtime \
>>>>                  --enable-timing=none \
>>>>                  --enable-g=mem,dbg,meminit \
>>>>                  --enable-sharedlibs=gcc \
>>>>                  --with-rdma=gen2 \
>>>>                  --enable-romio \
>>>>                  --with-file-system=lustre+nfs \
>>>>                  --with-slurm=/usr \
>>>>                  --with-pmi=slurm \
>>>>                  --with-pm=no \
>>>>                  --enable-threads=multiple \
>>>>                  --with-thread-package=pthreads \
>>>>                  --disable-mpe \
>>>>                  --without-mpe \
>>>>                  --disable-nmpi-as-mpi \
>>>>                  --enable-f77 \
>>>>                  --enable-f90 \
>>>>                  --enable-cxx \
>>>>                  --enable-blcr
>>>>
>>>> I built slurm this way:
>>>>
>>>> rpmbuild -ta --with blcr --with postgresql slurm-2.1.9.tar.bz2
>>>>
>>>> I configured slurm with:
>>>>
>>>> CheckpointType=checkpoint/blcr
>>>>
>>>> And I built IOR and launched it this way:
>>>>
>>>> $ srun --checkpoint-dir=/lustre -n 8 -N 4  ./IOR -i 4 -b 32g -T 600 -E \
>>>>     -k -e -t 1m -o /lustre/testFile
>>>> [Rank 1][cr.c: line 781]MV2_CKPT_MPD_BASE_PORT is not set
>>>> [Rank 3][cr.c: line 781]MV2_CKPT_MPD_BASE_PORT is not set
>>>> [Rank 0][cr.c: line 781]MV2_CKPT_MPD_BASE_PORT is not set
>>>> [Rank 2][cr.c: line 781]MV2_CKPT_MPD_BASE_PORT is not set
>>>> [Rank 5][cr.c: line 781]MV2_CKPT_MPD_BASE_PORT is not set
>>>> [Rank 4][cr.c: line 781]MV2_CKPT_MPD_BASE_PORT is not set
>>>> [Rank 7][cr.c: line 781]MV2_CKPT_MPD_BASE_PORT is not set
>>>> [Rank 6][cr.c: line 781]MV2_CKPT_MPD_BASE_PORT is not set
>>>> srun: error: x11: tasks 2-3: Exited with exit code 255
>>>> srun: error: x10: tasks 0-1: Exited with exit code 255
>>>> srun: error: x12: tasks 4-5: Exited with exit code 255
>>>> srun: error: x13: tasks 6-7: Exited with exit code 255
>>>>
>>>> So I'm confused; this is obviously not working. Do I need to use some
>>>> other mechanism to launch? Am I missing configuration somewhere?
>>>>
>>>> Thanks,
>>>> - David Brown
>>>> _______________________________________________
>>>> mvapich-discuss mailing list
>>>> mvapich-discuss at cse.ohio-state.edu
>>>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>>>
>>>>          


