[mvapich-discuss] MPI-IO Inconsistency over Lustre using MVAPICH
Rajeev Thakur
thakur at mcs.anl.gov
Wed Mar 4 10:50:16 EST 2009
Nathan,
Can you check if it works if you add the prefix "ufs:" to the file
name in all opens?
Rajeev
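
(For reference — assuming ROMIO's usual file-system prefix convention, in which a "ufs:" prefix selects the generic Unix file-system driver and is stripped from the path before the file is opened — each open in the reproducer below would change along these lines; this fragment is illustrative, not a complete program:)

```fortran
! hypothetical change to the reproducer: prefix every file name with "ufs:"
call MPI_FILE_OPEN(MPI_COMM_SELF, "ufs:output.dat", MPI_MODE_CREATE+MPI_MODE_WRONLY, my_info, fileID, ierr)
```

If the problem disappears with the prefix, that would suggest the issue lies in ROMIO's Lustre-specific driver rather than in the application code.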
> From: Nathan Baca <nathan.baca at gmail.com>
> Subject: [mvapich-discuss] MPI-IO Inconsistency over Lustre using
> MVAPICH
> To: mvapich-discuss at cse.ohio-state.edu
> Message-ID:
> <d1196de80903031945k3e7ac0c4yc04f2fad7f1a8b3b at mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> Hello,
>
> I am seeing inconsistent MPI-IO behavior when writing to a Lustre
> file system using MVAPICH2 1.2p1 and MVAPICH 1.1, both with ROMIO.
> What follows is a simple reproducer and its output. Essentially, one
> or more of the running processes does not read or write the correct
> amount of data to its part of a file residing on a Lustre (parallel)
> file system.
>
> I have tried both isolating the output to a single OST and striping
> across multiple OSTs; both reproduce the same result. I have also
> tried compiling with multiple versions of the PathScale and Intel
> compilers, all with the same result.
>
> The odd thing is that this seems to work using HP-MPI 2.03 with
> PathScale 3.2 and Intel 10.1.018. The operating system is XC 3.2.1,
> which is essentially RHEL 4.5. The kernel is 2.6.9-67.9hp.7sp.XCsmp,
> and the Lustre version is
> lustre-1.4.11-2.3_0.6_xc3.2.1_k2.6.9_67.9hp.7sp.XCsmp.
>
> Any help figuring out what is happening is greatly appreciated.
>
> Thanks, Nate
>
> program gcrm_test_io
>    implicit none
>    include "mpif.h"
>
>    integer X_SIZE
>
>    integer w_me, w_nprocs
>    integer my_info
>
>    integer i
>    integer (kind=4) :: ierr
>    integer (kind=4) :: fileID
>
>    integer (kind=MPI_OFFSET_KIND) :: mylen
>    integer (kind=MPI_OFFSET_KIND) :: offset
>    integer status(MPI_STATUS_SIZE)
>    integer count
>    integer ncells
>    real (kind=4), allocatable, dimension (:) :: array2
>    logical sync
>
>    call mpi_init(ierr)
>    call MPI_COMM_SIZE(MPI_COMM_WORLD, w_nprocs, ierr)
>    call MPI_COMM_RANK(MPI_COMM_WORLD, w_me, ierr)
>
>    call mpi_info_create(my_info, ierr)
>    ! optional ways to set things in MPI-IO
>    ! call mpi_info_set(my_info, "romio_ds_read" , "enable", ierr)
>    ! call mpi_info_set(my_info, "romio_ds_write", "enable", ierr)
>    ! call mpi_info_set(my_info, "romio_cb_write", "enable", ierr)
>
>    x_size = 410011   ! A 'big' number; with bigger numbers it is more likely to fail
>    sync = .true.     ! Extra file synchronization
>
>    ncells = (X_SIZE * w_nprocs)
>
>    ! Use node zero to fill it with nines
>    if (w_me .eq. 0) then
>       call MPI_FILE_OPEN(MPI_COMM_SELF, "output.dat", MPI_MODE_CREATE+MPI_MODE_WRONLY, my_info, fileID, ierr)
>       allocate (array2(ncells))
>       array2(:) = 9.0
>       mylen  = ncells
>       offset = 0 * 4
>       call MPI_FILE_SET_VIEW(fileID, offset, MPI_REAL, MPI_REAL, "native", MPI_INFO_NULL, ierr)
>       call MPI_File_write(fileID, array2, mylen, MPI_REAL, status, ierr)
>       call MPI_Get_count(status, MPI_INTEGER, count, ierr)
>       if (count .ne. mylen) print*, "Wrong initial write count:", count, mylen
>       deallocate(array2)
>       if (sync) call MPI_FILE_SYNC(fileID, ierr)
>       call MPI_FILE_CLOSE(fileID, ierr)
>    endif
>
>    ! All nodes now fill their area with ones
>    call MPI_BARRIER(MPI_COMM_WORLD, ierr)
>    allocate (array2(X_SIZE))
>    array2(:) = 1.0
>    offset = (w_me * X_SIZE) * 4   ! multiply by four, since it is real*4
>    mylen  = X_SIZE
>    call MPI_FILE_OPEN(MPI_COMM_WORLD, "output.dat", MPI_MODE_WRONLY, my_info, fileID, ierr)
>    print*, "node", w_me, "starting", (offset/4) + 1, "ending", (offset/4) + mylen
>
>    call MPI_FILE_SET_VIEW(fileID, offset, MPI_REAL, MPI_REAL, "native", MPI_INFO_NULL, ierr)
>    call MPI_File_write(fileID, array2, mylen, MPI_REAL, status, ierr)
>    call MPI_Get_count(status, MPI_INTEGER, count, ierr)
>    if (count .ne. mylen) print*, "Wrong write count:", count, mylen, w_me
>    deallocate(array2)
>    if (sync) call MPI_FILE_SYNC(fileID, ierr)
>    call MPI_FILE_CLOSE(fileID, ierr)
>
>    ! Read it back on node zero to see if it is ok data
>    if (w_me .eq. 0) then
>       call MPI_FILE_OPEN(MPI_COMM_SELF, "output.dat", MPI_MODE_RDONLY, my_info, fileID, ierr)
>       mylen = ncells
>       allocate (array2(ncells))
>       call MPI_File_read(fileID, array2, mylen, MPI_REAL, status, ierr)
>       call MPI_Get_count(status, MPI_INTEGER, count, ierr)
>       if (count .ne. mylen) print*, "Wrong read count:", count, mylen
>       do i = 1, ncells
>          if (array2(i) .ne. 1) then
>             print*, "ERROR", i, array2(i), ((i-1)*4), ((i-1)*4)/(1024d0*1024d0)   ! index, value, # of good bytes, MB
>             goto 999
>          end if
>       end do
>       print*, "All done with nothing wrong"
> 999   deallocate(array2)
>       call MPI_FILE_CLOSE(fileID, ierr)
>       call MPI_file_delete("output.dat", MPI_INFO_NULL, ierr)
>    endif
>
>    call mpi_finalize(ierr)
>
> end program gcrm_test_io
>
> MVAPICH2 1.2p1:
> node 1 starting 410012 ending 820022
> node 2 starting 820023 ending 1230033
> node 3 starting 1230034 ending 1640044
> node 4 starting 1640045 ending 2050055
> node 5 starting 2050056 ending 2460066
> node 0 starting 1 ending 410011
> All done with nothing wrong
>
>
> node 1 starting 410012 ending 820022
> node 4 starting 1640045 ending 2050055
> node 3 starting 1230034 ending 1640044
> node 5 starting 2050056 ending 2460066
> node 2 starting 820023 ending 1230033
> Wrong write count: 228554 410011 2
> node 0 starting 1 ending 410011
> Wrong read count: 1048576 2460066
> ERROR 1048577 0.E+0 4194304 4.
>
>
> node 1 starting 410012 ending 820022
> node 3 starting 1230034 ending 1640044
> node 4 starting 1640045 ending 2050055
> node 2 starting 820023 ending 1230033
> node 5 starting 2050056 ending 2460066
> node 0 starting 1 ending 410011
> Wrong read count: 1048576 2460066
> ERROR 1048577 0.E+0 4194304 4.
>
>
> MVAPICH 1.1:
> node 0 starting 1 ending 410011
> node 4 starting 1640045 ending 2050055
> node 3 starting 1230034 ending 1640044
> node 2 starting 820023 ending 1230033
> node 1 starting 410012 ending 820022
> node 5 starting 2050056 ending 2460066
> All done with nothing wrong
>
>
> node 0 starting 1 ending 410011
> node 5 starting 2050056 ending 2460066
> node 2 starting 820023 ending 1230033
> node 1 starting 410012 ending 820022
> Wrong write count: 228554 410011 2
> node 3 starting 1230034 ending 1640044
> node 4 starting 1640045 ending 2050055
> Wrong read count: 1048576 2460066
> ERROR 1048577 0.0000000E+00 4194304 4.00000000000000
>
>
> node 0 starting 1 ending 410011
> node 3 starting 1230034 ending 1640044
> node 4 starting 1640045 ending 2050055
> node 1 starting 410012 ending 820022
> node 5 starting 2050056 ending 2460066
> node 2 starting 820023 ending 1230033
> Wrong read count: 1229824 2460066
> ERROR 1229825 0.0000000E+00 4919296 4.69140625000000
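
For what it's worth, the numbers in the failing runs are internally consistent: the reproducer prints the 1-based index of the first bad value, the number of good bytes ((index-1)*4), and the same figure in MB, and each "Wrong read count" equals index-1. A quick sketch (plain Python, with the values taken from the runs above) checking that arithmetic and the per-rank decomposition:

```python
X_SIZE = 410011   # reals per rank (4 bytes each), from the reproducer
NPROCS = 6        # number of ranks in the runs shown above

# Per-rank element ranges, as printed by the reproducer
for rank in range(NPROCS):
    start = rank * X_SIZE + 1
    end = (rank + 1) * X_SIZE
    print(f"node {rank} starting {start} ending {end}")
# rank 1 -> 410012..820022, rank 5 -> 2050056..2460066, matching the logs

# ERROR lines from the failing runs: first bad index -> good bytes -> MB
for first_bad in (1048577, 1229825):
    good_bytes = (first_bad - 1) * 4
    print(first_bad, good_bytes, good_bytes / (1024.0 * 1024.0))
# 1048577 -> 4194304 bytes = exactly 4.0 MB; 1229825 -> 4919296 bytes = 4.69140625 MB
```

Note that the first failing run stops at exactly 4 MiB into the file, a power-of-two boundary.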