[mvapich-discuss] parallel file read -> "cannot allocate memory for the file buffer"

Nathan Dauchy Nathan.Dauchy at noaa.gov
Thu Oct 18 13:37:56 EDT 2007


Greetings all, and apologies in advance for the long posting,

Some of the Fortran MPI applications at our site rely on having every
process open the same file for input data.  (I know this is not
necessarily optimal; the usual alternative, having one rank read the
file and broadcast the data, is sketched below for reference, but we
cannot change all of the codes at this time.)
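
For reference, the kind of per-application change we are trying to
avoid would look something like this (an untested sketch only; the
array names and sizes match the test case further below):

=====================================================
program bcastRead
implicit none
include 'mpif.h'
integer,parameter :: size=100000
integer,parameter :: u=101
integer           :: Rank,MyError
real              :: a(size),b(size),c(size),d(size)

call MPI_INIT( MyError )
call MPI_COMM_RANK( MPI_COMM_WORLD, Rank, MyError )

! only rank 0 touches the file; all other ranks get the data over MPI
if (Rank == 0) then
   open(u,file="ParaData",form='unformatted')
   read(u) a,b,c,d
   close(u)
end if
call MPI_BCAST( a, size, MPI_REAL, 0, MPI_COMM_WORLD, MyError )
call MPI_BCAST( b, size, MPI_REAL, 0, MPI_COMM_WORLD, MyError )
call MPI_BCAST( c, size, MPI_REAL, 0, MPI_COMM_WORLD, MyError )
call MPI_BCAST( d, size, MPI_REAL, 0, MPI_COMM_WORLD, MyError )

call MPI_FINALIZE( MyError )
end program bcastRead
=====================================================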

Several of the applications fail with *some* (5-20) of the MPI tasks
crashing with an error like the following:

forrtl: severe (98): cannot allocate memory for the file buffer - out of
memory, unit 101, file /misc/whome/ndauchy/src/ParaRead/ParaData
Image              PC                Routine            Line        Source
ParaRead           00000000004832AB  Unknown               Unknown  Unknown
ParaRead           0000000000481E5E  Unknown               Unknown  Unknown
ParaRead           0000000000466C3E  Unknown               Unknown  Unknown
ParaRead           0000000000445C2E  Unknown               Unknown  Unknown
ParaRead           000000000044588F  Unknown               Unknown  Unknown
ParaRead           0000000000452D30  Unknown               Unknown  Unknown
ParaRead           0000000000404C45  MAIN__                     18  ParaRead.F90
ParaRead           00000000004049AA  Unknown               Unknown  Unknown
libc.so.6          0000002A95C6C4BB  Unknown               Unknown  Unknown
ParaRead           00000000004048EA  Unknown               Unknown  Unknown

It takes only a moderately sized file (1.6 MB) and 36 to 65 MPI tasks
to trigger the error.  At smaller sizes everything works correctly.  We
have seen this problem on both our Rapidscale/Terragrid filesystem and
on NFS.

We have constructed a simple Fortran test case that duplicates the
problem.  The first program creates the data file; the second reads it
from many nodes simultaneously.

=====================================================
program writeParaRead
implicit none

integer,parameter :: size=100000
integer,parameter :: u=101
real              :: a(size),b(size),c(size),d(size)

a=1
b=2
c=3
d=4
open(u,file="ParaData",form='unformatted')
write(u) a,b,c,d
close(u)
print*,a(size),b(size),c(size),d(size)

end program writeParaRead
=====================================================
program testParaRead
implicit none
include 'mpif.h'
integer           :: Rank,numprocs,MyError,i
integer,parameter :: size=100000
integer,parameter :: u=101
real              :: a(size),b(size),c(size),d(size)

call MPI_INIT( MyError )
call MPI_COMM_RANK( MPI_COMM_WORLD, Rank, MyError )
call MPI_COMM_SIZE( MPI_COMM_WORLD, numprocs, MyError )
print *, 'Process ', Rank, ' of ', numprocs, ' is alive'

open(u,file="ParaData",form='unformatted')

call MPI_BARRIER( MPI_COMM_WORLD, MyError )

read(u) a,b,c,d
close(u)
print"('output',1i5,4f10.5)",rank,a(size),b(size),c(size),d(size)

call MPI_FINALIZE( MyError)

end program testParaRead
=====================================================


The error has shown up on several combinations of:
  * kernel 2.6.9-55.ELsmp, 2.6.9-55.0.6ELsmp, 2.6.20.20
  * OFED-1.2, OFED-1.2.5.1
  * MVAPICH-0.9.9, MVAPICH2-0.9.8, MVAPICH2-1.0
All tests use the Intel ifort compiler, and the code was simply built
with "mpif90".

Why do I think this is an MVAPICH problem?  The error DID NOT occur when
using MVAPICH-0.9.8 with the Shared Receive Queue (SRQ) feature disabled!

We disabled SRQ with the following simple change:

# diff mvapich-0.9.8_clean/mpid/ch_gen2/viaparam.h \
       mvapich-0.9.8_single_rail_intel_9.1/mpid/ch_gen2/viaparam.h
50a51
> #if 0
53a55
> #endif

I have not yet figured out how to disable SRQ in MVAPICH2.

Initial testing with linux-2.6.20.20, OFED-1.2.5.1, and MVAPICH2-1.0
seemed to raise the number of MPI tasks necessary to trigger the problem
from roughly 36 up to 65.

One last note: I ported the second Fortran program to C to try to
duplicate the error there, but it ran to completion cleanly on 256
cores, so perhaps the problem is specific to the Fortran runtime
libraries.  A sketch of the C version is included below for reference.
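
The C version was essentially as follows (an illustrative sketch rather
than the exact code I ran; the 4-byte record-marker skip reflects an
assumption about ifort's unformatted record layout):

=====================================================
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N 100000          /* matches "size" in the Fortran test */

int main(int argc, char **argv)
{
    int    rank, nprocs;
    size_t want = 4 * N, got;
    float  *buf;
    FILE   *fp;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    printf("Process %d of %d is alive\n", rank, nprocs);

    buf = malloc(want * sizeof(float));
    fp  = fopen("ParaData", "rb");
    if (buf == NULL || fp == NULL)
        MPI_Abort(MPI_COMM_WORLD, 1);

    MPI_Barrier(MPI_COMM_WORLD);

    /* skip the leading 4-byte record marker written by the Fortran
       unformatted WRITE, then read a, b, c, d back to back */
    fseek(fp, 4, SEEK_SET);
    got = fread(buf, sizeof(float), want, fp);
    fclose(fp);

    printf("output %5d %10.5f %10.5f (read %lu of %lu)\n",
           rank, (double)buf[0], (double)buf[want - 1],
           (unsigned long)got, (unsigned long)want);

    free(buf);
    MPI_Finalize();
    return 0;
}
=====================================================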


SO, now the questions:

1) Can anyone duplicate our problem with the above code?

2) Does the code violate the MPI standard or exceed any MVAPICH limitations?

3) Is there a change to the MPI stack or runtime environment that will
avoid the problem?

4) Is there a *simple* change that can be made to the user code to avoid
the problem?

5) How do I disable SRQ in MVAPICH2 to see if that helps at all?


Thanks for your help,
Nathan

