[mvapich-discuss] MPI-IO Inconsistency over Lustre using MVAPICH

Rajeev Thakur thakur at mcs.anl.gov
Wed Mar 4 10:50:16 EST 2009


Nathan,
       Can you check if it works if you add the prefix "ufs:" to the file
name in all opens?
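
For context, ROMIO selects its file-system driver from an optional "type:" prefix on the file name; "ufs:" forces the generic Unix driver in place of the one autodetected for Lustre. A minimal sketch of the change against the reproducer's first open (only the name string differs):

```fortran
! Prefix the file name with "ufs:" so ROMIO uses its generic UFS driver
! instead of the driver it autodetects for the Lustre mount.
call MPI_FILE_OPEN (MPI_COMM_SELF, "ufs:output.dat", &
                    MPI_MODE_CREATE+MPI_MODE_WRONLY, my_info, fileID, ierr)
```

The same prefix would go on the other two opens, and presumably on the MPI_File_delete call as well.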

Rajeev 


> From: Nathan Baca <nathan.baca at gmail.com>
> Subject: [mvapich-discuss] MPI-IO Inconsistency over Lustre using
> 	MVAPICH
> To: mvapich-discuss at cse.ohio-state.edu
> Message-ID:
> 	<d1196de80903031945k3e7ac0c4yc04f2fad7f1a8b3b at mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
> 
> Hello,
> 
> I am seeing inconsistent MPI-IO behavior when writing to a Lustre file
> system using mvapich2 1.2p1 and mvapich 1.1, both with ROMIO. What
> follows is a simple reproducer and its output. Essentially, one or more
> of the running processes does not read or write the correct amount of
> data to its part of a file residing on a Lustre (parallel) file system.
> 
> I have tried both isolating the output to a single OST and striping
> across multiple OSTs; both reproduce the same result. I have also tried
> compiling with multiple versions of both the PathScale and Intel
> compilers, all with the same result.
> 
> The odd thing is that this seems to work using HP-MPI 2.03 with both
> PathScale 3.2 and Intel 10.1.018. The operating system is XC 3.2.1,
> which is essentially RHEL 4.5. The kernel is 2.6.9-67.9hp.7sp.XCsmp,
> and the Lustre version is
> lustre-1.4.11-2.3_0.6_xc3.2.1_k2.6.9_67.9hp.7sp.XCsmp.
> 
> Any help figuring out what is happening is greatly appreciated.
> 
> Thanks, Nate
> 
> program gcrm_test_io
>   implicit none
>   include "mpif.h"
> 
>   integer X_SIZE
> 
>       integer w_me, w_nprocs
>       integer my_info
> 
>       integer i
>       integer (kind=4) :: ierr
>       integer (kind=4) :: fileID
> 
>       integer (kind=MPI_OFFSET_KIND)        :: mylen
>       integer (kind=MPI_OFFSET_KIND)        :: offset
>       integer status(MPI_STATUS_SIZE)
>       integer count
>       integer ncells
>       real (kind=4), allocatable, dimension (:)     :: array2
>       logical sync
> 
>       call mpi_init(ierr)
>       call MPI_COMM_SIZE(MPI_COMM_WORLD, w_nprocs, ierr)
>       call MPI_COMM_RANK(MPI_COMM_WORLD, w_me, ierr)
> 
>       call mpi_info_create(my_info, ierr)
> !     optional ways to set things in mpi-io
> !     call mpi_info_set(my_info, "romio_ds_read" , "enable", ierr)
> !     call mpi_info_set(my_info, "romio_ds_write", "enable", ierr)
> !     call mpi_info_set(my_info, "romio_cb_write", "enable", ierr)
> 
>       x_size = 410011  ! A 'big' number; with bigger numbers it is more likely to fail
>       sync = .true.    ! Extra file synchronization
> 
>       ncells = (X_SIZE * w_nprocs)
> 
> !  Use node zero to fill it with nines
>       if (w_me .eq. 0) then
>           call MPI_FILE_OPEN(MPI_COMM_SELF, "output.dat", &
>                MPI_MODE_CREATE+MPI_MODE_WRONLY, my_info, fileID, ierr)
>           allocate (array2(ncells))
>           array2(:) = 9.0
>           mylen = ncells
>           offset = 0 * 4
>           call MPI_FILE_SET_VIEW(fileID, offset, MPI_REAL, MPI_REAL, &
>                "native", MPI_INFO_NULL, ierr)
>           call MPI_File_write(fileID, array2, mylen, MPI_REAL, status, ierr)
> 
>           ! Count elements of the type actually written (MPI_REAL)
>           call MPI_Get_count(status, MPI_REAL, count, ierr)
>           if (count .ne. mylen) print*, "Wrong initial write count:", count, mylen
>           deallocate(array2)
>           if (sync) call MPI_FILE_SYNC(fileID, ierr)
>           call MPI_FILE_CLOSE(fileID, ierr)
>       endif
> 
> !  All nodes now fill their area with ones
>       call MPI_BARRIER(MPI_COMM_WORLD, ierr)
>       allocate (array2(X_SIZE))
>       array2(:) = 1.0
>       offset = (w_me * X_SIZE) * 4 ! multiply by four, since it is real*4
>       mylen = X_SIZE
>       call MPI_FILE_OPEN(MPI_COMM_WORLD, "output.dat", MPI_MODE_WRONLY, &
>            my_info, fileID, ierr)
>       print*, "node", w_me, "starting", (offset/4) + 1, "ending", (offset/4) + mylen
> 
>       call MPI_FILE_SET_VIEW(fileID, offset, MPI_REAL, MPI_REAL, &
>            "native", MPI_INFO_NULL, ierr)
>       call MPI_File_write(fileID, array2, mylen, MPI_REAL, status, ierr)
>       call MPI_Get_count(status, MPI_REAL, count, ierr)
>       if (count .ne. mylen) print*, "Wrong write count:", count, mylen, w_me
>       deallocate(array2)
>       if (sync) call MPI_FILE_SYNC(fileID, ierr)
>       call MPI_FILE_CLOSE(fileID, ierr)
> 
> !  Read it back on node zero to see if it is ok data
>       if (w_me .eq. 0) then
>           call MPI_FILE_OPEN(MPI_COMM_SELF, "output.dat", MPI_MODE_RDONLY, &
>                my_info, fileID, ierr)
>           mylen = ncells
>           allocate (array2(ncells))
>           call MPI_File_read(fileID, array2, mylen, MPI_REAL, status, ierr)
>           call MPI_Get_count(status, MPI_REAL, count, ierr)
>           if (count .ne. mylen) print*, "Wrong read count:", count, mylen
>           do i = 1, ncells
>                if (array2(i) .ne. 1) then
>                   print*, "ERROR", i, array2(i), ((i-1)*4), &
>                        ((i-1)*4)/(1024d0*1024d0) ! Index, value, # of good bytes, MB
>                   goto 999
>                end if
>           end do
>           print*, "All done with nothing wrong"
>  999      deallocate(array2)
>           call MPI_FILE_CLOSE(fileID, ierr)
>           call MPI_file_delete("output.dat", MPI_INFO_NULL, ierr)
>       endif
> 
>       call mpi_finalize(ierr)
> 
> end program gcrm_test_io
> 
> MVAPICH2 1.2p1
>  node 1 starting 410012 ending 820022
>  node 2 starting 820023 ending 1230033
>  node 3 starting 1230034 ending 1640044
>  node 4 starting 1640045 ending 2050055
>  node 5 starting 2050056 ending 2460066
>  node 0 starting 1 ending 410011
>  All done with nothing wrong
> 
> 
>  node 1 starting 410012 ending 820022
>  node 4 starting 1640045 ending 2050055
>  node 3 starting 1230034 ending 1640044
>  node 5 starting 2050056 ending 2460066
>  node 2 starting 820023 ending 1230033
>  Wrong write count: 228554 410011 2
>  node 0 starting 1 ending 410011
>  Wrong read count: 1048576 2460066
>  ERROR 1048577 0.E+0 4194304 4.
> 
> 
>  node 1 starting 410012 ending 820022
>  node 3 starting 1230034 ending 1640044
>  node 4 starting 1640045 ending 2050055
>  node 2 starting 820023 ending 1230033
>  node 5 starting 2050056 ending 2460066
>  node 0 starting 1 ending 410011
>  Wrong read count: 1048576 2460066
>  ERROR 1048577 0.E+0 4194304 4.
> 
> 
> MVAPICH 1.1
>  node           0 starting                     1 ending  410011
>  node           4 starting               1640045 ending 2050055
>  node           3 starting               1230034 ending 1640044
>  node           2 starting                820023 ending 1230033
>  node           1 starting                410012 ending  820022
>  node           5 starting               2050056 ending 2460066
>  All done with nothing wrong
> 
> 
>  node           0 starting                     1 ending  410011
>  node           5 starting               2050056 ending 2460066
>  node           2 starting                820023 ending 1230033
>  node           1 starting                410012 ending  820022
>  Wrong write count:      228554                410011           2
>  node           3 starting               1230034 ending 1640044
>  node           4 starting               1640045 ending 2050055
>  Wrong read count:     1048576               2460066
>  ERROR     1048577  0.0000000E+00     4194304   4.00000000000000
> 
> 
>  node           0 starting                     1 ending  410011
>  node           3 starting               1230034 ending 1640044
>  node           4 starting               1640045 ending 2050055
>  node           1 starting                410012 ending  820022
>  node           5 starting               2050056 ending 2460066
>  node           2 starting                820023 ending 1230033
>  Wrong read count:     1229824               2460066
>  ERROR     1229825  0.0000000E+00     4919296   4.69140625000000
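
One way to make the short writes in the output above fail loudly where they occur is to compare the MPI_Get_count result against the requested length immediately after each write and retry the remainder. A correct MPI-IO implementation should not return short counts, so this is only a diagnostic workaround, not a fix. A minimal sketch, assuming the reproducer's file view plus a new default-integer variable `written` (hypothetical, not in the original program):

```fortran
! Hypothetical workaround sketch: retry the elements a short
! MPI_File_write left unwritten, resuming from the next element.
written = 0
do while (written < mylen)
   call MPI_FILE_WRITE(fileID, array2(written+1), &
                       int(mylen - written), MPI_REAL, status, ierr)
   call MPI_GET_COUNT(status, MPI_REAL, count, ierr)
   if (ierr .ne. MPI_SUCCESS .or. count .le. 0) exit  ! no progress: give up
   written = written + count
end do
if (written .ne. mylen) print*, "short write persists:", written, mylen
```

This relies on the individual file pointer advancing by exactly the number of etypes written, so the retry continues where the short write stopped.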
