[mvapich-discuss] MPI-IO Inconsistency over Lustre using MVAPICH

Weikuan Yu weikuan.yu at gmail.com
Wed Mar 4 10:42:36 EST 2009


Hi, Nathan,

Thanks for reporting the problem. I will look into it. I saw that you 
had three outputs from the same program. Could you please clarify 
whether the program ran correctly the first time but failed on later 
runs?

--Weikuan

Nathan Baca wrote:
> Hello,
> 
> I am seeing inconsistent MPI-IO behavior when writing to a Lustre file 
> system using MVAPICH2 1.2p1 and MVAPICH 1.1, both with ROMIO. What 
> follows is a simple reproducer and its output. Essentially, one or more 
> of the running processes does not read or write the correct amount of 
> data to its part of a file residing on a Lustre (parallel) file system.
> 
> I have tried both isolating the output to a single OST and striping 
> across multiple OSTs; both reproduce the same failure. I have also tried 
> compiling with multiple versions of both the PathScale and Intel 
> compilers, all with the same result.
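> 
> In case it is useful, the striping can also be requested from inside the 
> code by attaching hints to the info object passed to MPI_FILE_OPEN. A 
> minimal, self-contained sketch, assuming the reserved ROMIO hints 
> "striping_factor" and "striping_unit" are honored on this Lustre setup 
> (the file name and hint values below are only examples):
> 
>       program set_stripe_hints
>       implicit none
>       include "mpif.h"
>       integer my_info, fileID, ierr
> 
>       call mpi_init(ierr)
> !     striping hints only take effect when the file is created
>       call mpi_info_create(my_info, ierr)
>       call mpi_info_set(my_info, "striping_factor", "4",       ierr)  ! e.g. spread over 4 OSTs
>       call mpi_info_set(my_info, "striping_unit",   "1048576", ierr)  ! e.g. 1 MB stripe size
>       call MPI_FILE_OPEN(MPI_COMM_WORLD, "stripe_test.dat", &
>           MPI_MODE_CREATE+MPI_MODE_WRONLY, my_info, fileID, ierr)
>       call MPI_FILE_CLOSE(fileID, ierr)
>       call mpi_info_free(my_info, ierr)
>       call mpi_finalize(ierr)
>       end program set_stripe_hints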
> 
> The odd thing is that this seems to work with HP-MPI 2.03 using PathScale 
> 3.2 and Intel 10.1.018. The operating system is XC 3.2.1, which is 
> essentially RHEL 4.5. The kernel is 2.6.9-67.9hp.7sp.XCsmp. The Lustre 
> version is lustre-1.4.11-2.3_0.6_xc3.2.1_k2.6.9_67.9hp.7sp.XCsmp.
> 
> Any help figuring out what is happening is greatly appreciated. Thanks, Nate
> 
> program gcrm_test_io
>   implicit none
>   include "mpif.h"
>  
>   integer X_SIZE
>  
>       integer w_me, w_nprocs
>       integer  my_info
>  
>       integer i
>       integer (kind=4) :: ierr
>       integer (kind=4) :: fileID
>       
>       integer (kind=MPI_OFFSET_KIND)        :: mylen
>       integer (kind=MPI_OFFSET_KIND)        :: offset
>       integer status(MPI_STATUS_SIZE)
>       integer count
>       integer ncells
>       real (kind=4), allocatable, dimension (:)     :: array2
>       logical sync
>  
>       call mpi_init(ierr)
>       call MPI_COMM_SIZE(MPI_COMM_WORLD,w_nprocs,ierr)
>       call MPI_COMM_RANK(MPI_COMM_WORLD,w_me,ierr)
>  
>       call mpi_info_create(my_info, ierr)
> !     optional ways to set things in mpi-io
> !     call mpi_info_set   (my_info, "romio_ds_read" , "enable"   , ierr)
> !     call mpi_info_set   (my_info, "romio_ds_write", "enable"   , ierr)
> !     call mpi_info_set   (my_info, "romio_cb_write", "enable"    , ierr)
>  
>       x_size = 410011  ! A 'big' number, with bigger numbers it is more likely to fail
>       sync = .true.  ! Extra file synchronization
>  
>       ncells = (X_SIZE * w_nprocs)
>  
> !  Use node zero to fill it with nines
>       if (w_me .eq. 0) then
>           call MPI_FILE_OPEN  (MPI_COMM_SELF, "output.dat", &
>               MPI_MODE_CREATE+MPI_MODE_WRONLY, my_info, fileID, ierr)
>           allocate (array2(ncells)) 
>           array2(:) = 9.0
>           mylen = ncells
>           offset = 0 * 4
>           call MPI_FILE_SET_VIEW(fileID,offset, MPI_REAL,MPI_REAL, &
>               "native",MPI_INFO_NULL,ierr)
>           call MPI_File_write(fileID, array2, mylen , MPI_REAL, &
>               status,ierr)
>           call MPI_Get_count(status,MPI_INTEGER, count, ierr)
>           if (count .ne. mylen) print*, "Wrong initial write count:", &
>               count,mylen
>           deallocate(array2)
>           if (sync) call MPI_FILE_SYNC (fileID,ierr)
>           call MPI_FILE_CLOSE (fileID,ierr)
>       endif
>  
> !  All nodes now fill their area with ones
>       call MPI_BARRIER(MPI_COMM_WORLD,ierr)
>       allocate (array2( X_SIZE))
>       array2(:) = 1.0
>       offset = (w_me * X_SIZE) * 4 ! multiply by four, since it is real*4
>       mylen = X_SIZE
>       call MPI_FILE_OPEN  (MPI_COMM_WORLD,"output.dat",MPI_MODE_WRONLY, &
>           my_info, fileID, ierr)
>       print*,"node",w_me,"starting",(offset/4) + 1, &
>           "ending",(offset/4)+mylen
>       call MPI_FILE_SET_VIEW(fileID,offset, MPI_REAL,MPI_REAL, &
>           "native",MPI_INFO_NULL,ierr)
>       call MPI_File_write(fileID, array2, mylen , MPI_REAL, status,ierr)
>       call MPI_Get_count(status,MPI_INTEGER, count, ierr)
>       if (count .ne. mylen) print*, "Wrong write count:", count,mylen,w_me
>       deallocate(array2)
>       if (sync) call MPI_FILE_SYNC (fileID,ierr)
>       call MPI_FILE_CLOSE (fileID,ierr)
>  
> !  Read it back on node zero to see if it is ok data
>       if (w_me .eq. 0) then
>           call MPI_FILE_OPEN  (MPI_COMM_SELF, "output.dat", &
>               MPI_MODE_RDONLY, my_info, fileID, ierr)
>           mylen = ncells
>           allocate (array2(ncells))
>           call MPI_File_read(fileID, array2, mylen , MPI_REAL, status,ierr)
>           call MPI_Get_count(status,MPI_INTEGER, count, ierr)
>           if (count .ne. mylen) print*, "Wrong read count:", count,mylen
>           do i=1,ncells
>                if (array2(i) .ne. 1) then
>                   print*, "ERROR", i,array2(i), ((i-1)*4), &
>                       ((i-1)*4)/(1024d0*1024d0) ! Index, value, # of good bytes,MB
>                   goto 999
>                end if
>           end do
>           print*, "All done with nothing wrong"
>  999      deallocate(array2)
>           call MPI_FILE_CLOSE (fileID,ierr)
>           call MPI_file_delete ("output.dat",MPI_INFO_NULL,ierr)
>       endif
>  
>       call mpi_finalize(ierr)
>  
> end program gcrm_test_io  
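> 
> One more thing that may help with diagnosis: the default error handler 
> for files is MPI_ERRORS_RETURN, so the ierr coming back from the write 
> call carries information the reproducer currently ignores. A rough 
> sketch of the write step with the error code decoded, and with the 
> collective MPI_File_write_all swapped in as an alternative to try, 
> reusing the reproducer's variables:
> 
>       character (len=MPI_MAX_ERROR_STRING) :: errmsg
>       integer :: errlen, ierr2
> 
> !     collective variant of the per-rank write; every rank must make this call
> !     (the count argument must be a default integer, hence int(mylen))
>       call MPI_File_write_all(fileID, array2, int(mylen), MPI_REAL, status, ierr)
>       if (ierr .ne. MPI_SUCCESS) then
>           call MPI_Error_string(ierr, errmsg, errlen, ierr2)
>           print*, "node", w_me, "write error: ", errmsg(1:errlen)
>       end if
>       call MPI_Get_count(status, MPI_REAL, count, ierr)
>       if (count .ne. mylen) print*, "Wrong write count:", count, mylen, w_me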
> 
> MVAPICH2 1.2p1:
>  node 1 starting 410012 ending 820022
>  node 2 starting 820023 ending 1230033
>  node 3 starting 1230034 ending 1640044
>  node 4 starting 1640045 ending 2050055
>  node 5 starting 2050056 ending 2460066
>  node 0 starting 1 ending 410011
>  All done with nothing wrong
> 
> 
>  node 1 starting 410012 ending 820022
>  node 4 starting 1640045 ending 2050055
>  node 3 starting 1230034 ending 1640044
>  node 5 starting 2050056 ending 2460066
>  node 2 starting 820023 ending 1230033
>  Wrong write count: 228554 410011 2
>  node 0 starting 1 ending 410011
>  Wrong read count: 1048576 2460066
>  ERROR 1048577 0.E+0 4194304 4.
> 
> 
>  node 1 starting 410012 ending 820022
>  node 3 starting 1230034 ending 1640044
>  node 4 starting 1640045 ending 2050055
>  node 2 starting 820023 ending 1230033
>  node 5 starting 2050056 ending 2460066
>  node 0 starting 1 ending 410011
>  Wrong read count: 1048576 2460066
>  ERROR 1048577 0.E+0 4194304 4.
> 
> 
> MVAPICH 1.1:
>  node           0 starting                     1 ending                410011
>  node           4 starting               1640045 ending               2050055
>  node           3 starting               1230034 ending               1640044
>  node           2 starting                820023 ending               1230033
>  node           1 starting                410012 ending                820022
>  node           5 starting               2050056 ending               2460066
>  All done with nothing wrong
> 
> 
>  node           0 starting                     1 ending                410011
>  node           5 starting               2050056 ending               2460066
>  node           2 starting                820023 ending               1230033
>  node           1 starting                410012 ending                820022
>  Wrong write count:      228554                410011           2
>  node           3 starting               1230034 ending               1640044
>  node           4 starting               1640045 ending               2050055
>  Wrong read count:     1048576               2460066
>  ERROR     1048577  0.0000000E+00     4194304   4.00000000000000
> 
> 
>  node           0 starting                     1 ending                410011
>  node           3 starting               1230034 ending               1640044
>  node           4 starting               1640045 ending               2050055
>  node           1 starting                410012 ending                820022
>  node           5 starting               2050056 ending               2460066
>  node           2 starting                820023 ending               1230033
>  Wrong read count:     1229824               2460066
>  ERROR     1229825  0.0000000E+00     4919296   4.69140625000000
> 
>  
> -- 
> Nathan Baca
> nathan.baca at gmail.com
> 
> 