[mvapich-discuss] parallel file read -> "cannot allocate memory for the file buffer"

Abhinav Vishnu vishnu at cse.ohio-state.edu
Thu Oct 18 15:57:40 EDT 2007


Hi Nathan,

Thanks for trying MVAPICH/MVAPICH2 and reporting the problem to
us.

> The error has shown up on several combinations of:
>   * kernel 2.6.9-55.ELsmp, 2.6.9-55.0.6ELsmp, 2.6.20.20
>   * OFED-1.2, OFED-1.2.5.1
>   * MVAPICH-0.9.9, MVAPICH2-0.9.8, MVAPICH2-1.0
> All tests use the Intel ifort compiler, and the code was simply built
> with "mpif90".
>
> Why do I think this is an MVAPICH problem?  The error DID NOT occur when
> using MVAPICH-0.9.8 with Shared Receive Queue disabled!
>
> We disabled SRQ with the following simple change:
>
> # diff mvapich-0.9.8_clean/mpid/ch_gen2/viaparam.h \
>        mvapich-0.9.8_single_rail_intel_9.1/mpid/ch_gen2/viaparam.h
> 50a51
> > #if 0
> 53a55
> > #endif
>
Sorry to hear that you had to modify the MPI source code to disable
SRQ. From MVAPICH 0.9.9 onwards, we have converted almost all features
into run-time variables, so you no longer need to change the MPI source
to enable or disable features. For MVAPICH, SRQ usage is controlled by
the VIADEV_USE_SRQ environment variable:
http://mvapich.cse.ohio-state.edu/support/mvapich_user_guide.html#x1-1100009.4.1
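For example, with the mpirun_rsh launcher the variable can be passed on the command line; the hostnames, process count, and binary name below are hypothetical:

```shell
# Disable the Shared Receive Queue at run time for MVAPICH
# (no source changes or rebuild needed):
mpirun_rsh -np 4 node01 node02 node03 node04 VIADEV_USE_SRQ=0 ./a.out
```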
> I have not yet figured out how to disable SRQ in MVAPICH2.
>   
In MVAPICH2, SRQ usage can be controlled with the MV2_USE_SRQ
environment variable. Details on this variable are available here:
http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2.html#x1-12000010.46
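As a sketch, assuming the MPD-based mpiexec launcher (the process count and binary name are hypothetical):

```shell
# Disable SRQ at run time for MVAPICH2 by passing the
# variable through mpiexec's -env option:
mpiexec -n 4 -env MV2_USE_SRQ 0 ./a.out

# Alternatively, export it in the environment before launching:
export MV2_USE_SRQ=0
```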

> Initial testing with linux-2.6.20.20, OFED-1.2.5.1, and MVAPICH2-1.0
> seemed to raise the number of MPI tasks necessary to trigger the problem
> from roughly 36 up to 65.
>
> One last note: I ported the 2nd fortran program to C to try to duplicate
> the error there.  However, it ran to completion cleanly on 256 cores.
> So perhaps the problem is specific to the fortran libraries.
>
>
> 1) Can anyone duplicate our problem with the above code?
>   
We are taking a look at it.
> 2) Does the code violate MPI standards or exceed MVAPICH limitations?
>   
From the MVAPICH/MVAPICH2 perspective, there should be no such
limitations.
> 3) Is there a change to the MPI stack or runtime environment that will
> avoid the problem?
>   
As you mentioned earlier, disabling SRQ seems to avoid the problem for
you. Unfortunately, at this point we do not have much insight into the
root cause. Please let us know the outcome of your experiments with SRQ
disabled for MVAPICH/MVAPICH2.

Thanks,

:- Abhinav
