[mvapich-discuss] MPI and posix shared memory.

Devendar Bureddy bureddy at cse.ohio-state.edu
Thu Sep 26 23:00:55 EDT 2013


Good to know that things are working fine.

I am closing this issue on the mvapich-discuss list for everybody's
information. The issue is caused by an incorrect interaction between the
InfiniBand (IB) registration cache and the application's shared-memory
allocations, which do not go through ptmalloc. Using the R3 rendezvous
(rndv) protocol works around the issue.
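
For anyone hitting something similar, the failing pattern is roughly the
following minimal sketch (hypothetical code, not the reporter's module): a
fresh shm segment is created for every broadcast, and a later segment can
attach at the same virtual address as an earlier, already-detached one.

    /* Minimal sketch of the problematic pattern (illustrative only). */
    #include <mpi.h>
    #include <string.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>

    #define N (1 << 22)   /* large enough to take the rendezvous path */

    static void shm_bcast_once(MPI_Comm comm, int rank)
    {
        /* A new segment per call; it may attach at the same virtual
         * address as the previously detached one. */
        int id = shmget(IPC_PRIVATE, N, IPC_CREAT | 0600);
        char *buf = (char *) shmat(id, NULL, 0);

        if (rank == 0)
            memset(buf, 1, N);

        /* The IB registration cache can still hold an entry for the old
         * segment at this address, so the transfer may use a stale
         * registration and deliver wrong data without any MPI error. */
        MPI_Bcast(buf, N, MPI_BYTE, 0, comm);

        shmdt(buf);
        shmctl(id, IPC_RMID, NULL);
    }

    int main(int argc, char **argv)
    {
        int i, rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        for (i = 0; i < 4; i++)
            shm_bcast_once(MPI_COMM_WORLD, rank);
        MPI_Finalize();
        return 0;
    }

With mpirun_rsh the workaround can be passed on the command line, for
example (node names and binary are placeholders):
mpirun_rsh -np 2 node1 node2 MV2_RNDV_PROTOCOL=R3 ./a.out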

-Devendar

On Thu, Sep 26, 2013 at 1:46 PM, Ben <Benjamin.M.Auer at nasa.gov> wrote:

>  Using the R3 protocol did indeed fix the problem when I tested it a
> little while ago.
> Thanks
>
>
> On 09/26/2013 11:25 AM, Devendar Bureddy wrote:
>
> It seems the testbcast() method in this program allocates a new shared
> memory buffer each time. The problem occurs when the virtual address of a
> shm segment overlaps with that of an earlier segment. In this case, the
> InfiniBand registration cache maintained in the library picks a stale
> registration entry, since we do not have a mechanism to track memory
> allocation/deallocation done with shmget/shmdt.
>
>  You should be able to work around this problem with the run-time
> parameter MV2_RNDV_PROTOCOL=R3.
>
>  Let us know if you are able to run the complete application with this
> flag.
>
>  -Devendar
>
>
> On Tue, Sep 17, 2013 at 3:56 PM, Ben <Benjamin.M.Auer at nasa.gov> wrote:
>
>>  Hi Devendar,
>> I finally had some time and was able to put together a reproducer for the
>> issue described below. I have attached a tar file with a makefile, our
>> shared memory module, and a small program that uses it. The shared memory
>> module uses MPI_Get_processor_name to determine which node each process
>> is on. All the program does is make several calls to a subroutine that
>> allocates shared memory, sets it to a constant on the first node, does
>> the node broadcast, checks whether the broadcast was successful, and
>> deallocates. I consistently find that the first node broadcast works fine
>> but the subsequent ones fail (although the status from the broadcast
>> returns no error), even with barriers between the broadcasts and with all
>> shared memory collective operations turned off. Just running on 2 nodes
>> is sufficient in my experience. The other interesting thing is that this
>> also fails with Open MPI and Intel MPI. I also have access to a second
>> cluster, an SGI machine, where I tested the reproducer: it failed with
>> Open MPI there as well, but it worked with MPT, the SGI MPI stack, so
>> there is at least one MPI stack/hardware configuration that does work.
>>
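>> For reference, the node-detection part of the module boils down to the
>> following minimal sketch (illustrative only, with simplified names; the
>> attached module is the real thing, and it also handles collisions):
>>
>>     /* Simplified sketch of splitting MPI_COMM_WORLD into per-node
>>      * communicators using MPI_Get_processor_name (illustrative only;
>>      * a real implementation should guard against hash collisions). */
>>     #include <mpi.h>
>>
>>     static int node_color(void)
>>     {
>>         char name[MPI_MAX_PROCESSOR_NAME];
>>         int len, i;
>>         unsigned int h = 5381;
>>         MPI_Get_processor_name(name, &len);
>>         for (i = 0; i < len; i++)
>>             h = h * 33u + (unsigned char) name[i];
>>         /* MPI_Comm_split requires a non-negative color */
>>         return (int) (h & 0x7fffffffu);
>>     }
>>
>>     void make_node_comm(MPI_Comm world, MPI_Comm *node_comm)
>>     {
>>         int rank;
>>         MPI_Comm_rank(world, &rank);
>>         MPI_Comm_split(world, node_color(), rank, node_comm);
>>     }
>>
>> Ranks that compute the same color (i.e. are on the same node) end up in
>> the same node communicator.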
>>
>> On 08/30/2013 05:19 PM, Devendar Bureddy wrote:
>>
>>  Hi Ben
>>
>>  Since you are using your own shared-memory design, we are not sure what
>> could be happening here. Could you please send us a reproducer? Also, can
>> you try adding a barrier between the broadcasts and see whether the
>> behavior changes?
>>
>>  -Devendar
>>
>>
>>
>> On Fri, Aug 30, 2013 at 3:02 PM, Ben <Benjamin.M.Auer at nasa.gov> wrote:
>>
>>> We have a code that makes use of POSIX shared memory on each node to
>>> help with the memory footprint. As part of the larger shared memory
>>> package in the code, we have been trying to add a set of node broadcast
>>> routines that broadcast a piece of data in shared memory on one node to
>>> the shared memory on the other nodes. This code has not been working,
>>> and we have traced the failure to the actual call to MPI_Bcast. We also
>>> tried just doing sends and receives, with no luck either. It seems as
>>> though the routine functions properly the first time the broadcast is
>>> called but fails on subsequent calls. The MPI status itself returns
>>> without error, but the results of the broadcast are just plain wrong.
>>>
>>> However, if before calling MPI_Bcast we allocate a local, non-shared
>>> memory variable of the same size as the data to be broadcast on each
>>> process in the communicator, copy from the shared memory to the local
>>> memory, call MPI_Bcast on the local copy, and finally copy from the
>>> local copy back to the shared memory, the routine functions properly. It
>>> seems as though the broadcast itself just does not function properly
>>> when the data is in POSIX shared memory. I tried setting
>>> MV2_USE_SHARED_MEM=0 to turn off the shared memory routines in MVAPICH
>>> itself, but that did not fix the bcasts.
>>>
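>>> Schematically (not our actual code, just the pattern), the failing and
>>> working variants look like this:
>>>
>>>     /* Schematic of the two variants (illustrative only). 'shm' points
>>>      * into a POSIX shared-memory segment holding n doubles. */
>>>     #include <mpi.h>
>>>     #include <stdlib.h>
>>>     #include <string.h>
>>>
>>>     /* Fails for us: broadcast directly from/into the shared segment. */
>>>     void bcast_direct(double *shm, int n, int root, MPI_Comm comm)
>>>     {
>>>         MPI_Bcast(shm, n, MPI_DOUBLE, root, comm);
>>>     }
>>>
>>>     /* Works: stage through a private heap buffer on every rank. */
>>>     void bcast_staged(double *shm, int n, int root, MPI_Comm comm)
>>>     {
>>>         double *tmp = malloc(n * sizeof *tmp);
>>>         memcpy(tmp, shm, n * sizeof *tmp);   /* shared -> private */
>>>         MPI_Bcast(tmp, n, MPI_DOUBLE, root, comm);
>>>         memcpy(shm, tmp, n * sizeof *tmp);   /* private -> shared */
>>>         free(tmp);
>>>     }
>>>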
>>> Are there known issues with doing MPI communication on shared-memory
>>> data? Is it possible this is a bug? We are using MVAPICH2 1.9. If this
>>> is a possible bug, I can try to come up with a reproducer.
>>>
>>> --
>>> Ben Auer, PhD   SSAI, Scientific Programmer/Analyst
>>> NASA GSFC,  Global Modeling and Assimilation Office
>>> Code 610.1, 8800 Greenbelt Rd, Greenbelt, MD  20771
>>> Phone: 301-286-9176               Fax: 301-614-6246
>>>
>>>
>>
>>
>>
>>
>>
>>
>>
>>
>
>
>
>
>
>
>


-- 
Devendar