[mvapich-discuss] MVAPICH2 (2.1rc1) on PSM error

Balaji, Pavan balaji at anl.gov
Tue Oct 4 04:24:37 EDT 2016


> On Oct 4, 2016, at 12:12 AM, Wasko, Wojciech <wojciech.wasko at intel.com> wrote:
> However, it doesn’t solve the fundamental problem, it just pushes the boundary further out. Bottomline, I think any transport will quit when queuing up billions of puts on it… To deal with the problem correctly, I think the workload should insert a barrier every certain number of puts. That’d make sure you don’t overflow the transport.

I believe that's the responsibility of the MPI stack.  Either PSM should provide such flow control or MPI should do it internally.  FWIW, adding a barrier is a bad idea.  I think what you meant was to add a "flush local" call.  That is possible, but it seems like a weird workaround for something the MPI implementation should really take care of.  The user does not know what network is being used, so she cannot estimate after how many operations a flush local is needed: depending on what message rate the network adapter delivers and how many cores are driving it, that number can be very small or very large.
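For concreteness, the "flush local" workaround being discussed would look roughly like the sketch below: a passive-target epoch issuing many MPI_Put operations, with an MPI_Win_flush_local every FLUSH_INTERVAL puts to complete them locally before the transport's queues overflow.  The interval value is an arbitrary assumption here, which is exactly the problem: the right number is network- and core-count-dependent, so the user cannot pick it portably.

```c
/* Sketch only: periodically draining outstanding puts with
 * MPI_Win_flush_local.  FLUSH_INTERVAL is an arbitrary illustrative
 * value; the "right" value depends on the network and is not portable. */
#include <mpi.h>

#define FLUSH_INTERVAL 4096   /* assumption, not a recommended value */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    long     nops   = 1000000;  /* number of puts to issue */
    int      target = 0;        /* target rank of the RMA operations */
    double   src    = 42.0;
    double  *base;
    MPI_Win  win;

    MPI_Win_allocate(sizeof(double), sizeof(double), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &base, &win);
    MPI_Win_lock_all(0, win);   /* passive-target epoch on all ranks */

    for (long i = 0; i < nops; i++) {
        MPI_Put(&src, 1, MPI_DOUBLE, target, 0, 1, MPI_DOUBLE, win);
        /* Locally complete outstanding puts every FLUSH_INTERVAL
         * operations so the transport's send queues are not overrun. */
        if (i % FLUSH_INTERVAL == FLUSH_INTERVAL - 1)
            MPI_Win_flush_local(target, win);
    }

    MPI_Win_unlock_all(win);    /* completes everything remaining */
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```

Note that MPI_Win_flush_local only guarantees local completion (the source buffers are reusable); remote completion still happens at the flush/unlock, so this is strictly a resource-throttling device, not a synchronization point like a barrier.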

Note that blocking for resources for flow control would still be "nonblocking" in the sense MPI defines it, i.e., the operation does not depend on any other process in order to eventually complete.  So that would still be compliant with the MPI standard.

  -- Pavan



