[mvapich-discuss] RMA failure with many win_allocate

Min Si msi at anl.gov
Tue Jun 28 18:43:43 EDT 2016


Hi Mingzhe,

Thanks a lot for your prompt response.

The program works fine after increasing MV2_SHMEM_COLL_NUM_COMM to the
maximum number of concurrently outstanding windows.

However, for large applications like NWChem I would have to determine
this number manually. Does it waste too much memory if I just set it to
a very large number?
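
One way to find this number without inspecting the application by hand
might be to count windows through the MPI profiling interface. The
following is only a minimal sketch: the win_tracker_* names and the
WIN_TRACKER_WITH_MPI macro are hypothetical and not part of MVAPICH2 or
MPI, and only MPI_Win_allocate and MPI_Win_free are intercepted. Other
window creation routines (e.g. MPI_Win_create) would need the same
treatment.

```c
#include <assert.h>

/* Hypothetical helper: tracks the high-water mark of live windows. */
static int live_windows = 0;   /* windows allocated and not yet freed */
static int peak_windows = 0;   /* largest value live_windows reached  */

/* Call after a window has been created successfully. */
void win_tracker_on_alloc(void)
{
    live_windows++;
    if (live_windows > peak_windows)
        peak_windows = live_windows;
}

/* Call after a window has been freed successfully. */
void win_tracker_on_free(void)
{
    assert(live_windows > 0);
    live_windows--;
}

/* Peak number of concurrently outstanding windows seen so far. */
int win_tracker_peak(void)
{
    return peak_windows;
}

#ifdef WIN_TRACKER_WITH_MPI
/* Optional PMPI wrappers; compile with -DWIN_TRACKER_WITH_MPI and
 * link this file ahead of the MPI library. */
#include <mpi.h>
#include <stdio.h>

int MPI_Win_allocate(MPI_Aint size, int disp_unit, MPI_Info info,
                     MPI_Comm comm, void *baseptr, MPI_Win *win)
{
    int rc = PMPI_Win_allocate(size, disp_unit, info, comm, baseptr, win);
    if (rc == MPI_SUCCESS)
        win_tracker_on_alloc();
    return rc;
}

int MPI_Win_free(MPI_Win *win)
{
    int rc = PMPI_Win_free(win);
    if (rc == MPI_SUCCESS)
        win_tracker_on_free();
    return rc;
}

int MPI_Finalize(void)
{
    /* Each process reports its own peak before shutting down. */
    fprintf(stderr, "peak concurrent windows: %d\n", win_tracker_peak());
    return PMPI_Finalize();
}
#endif
```

The profiling run itself would need MV2_SHMEM_COLL_NUM_COMM set high
enough to complete, since the counter only advances on successful
allocations. The reported peak could then be used as the value for
later runs.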

Best regards,
Min

On 6/28/16 5:28 PM, Mingzhe Li wrote:
> Hi Min,
>
> Could you please try setting this parameter (MV2_SHMEM_COLL_NUM_COMM) 
> to a value that is larger than the number of windows in your program?
>
> Thanks,
> Mingzhe
>
> On Tue, Jun 28, 2016 at 6:09 PM, Mingzhe Li <li.2192 at osu.edu> wrote:
>
>     Hi Min,
>
>     Thanks for your note. We are looking at it and will get back to
>     you soon.
>
>     Thanks,
>     Mingzhe
>
>     On Tue, Jun 28, 2016 at 6:00 PM, Min Si <msi at anl.gov> wrote:
>
>         Hi MVAPICH team,
>
>         I got an error when running an RMA program with many
>         win_allocate calls using MVAPICH2 over the PSM network. I see
>         the same error in both MVAPICH2-2.2rc1 and MVAPICH2-2.2b. The
>         attached program reproduces it. When allocating the
>         9th window, the following error is reported:
>
>         $ mpiexec -np 2 -ppn 2 ./win_allocate.mva_psm_222rc1
>         Assertion failed in file
>         ../src/mpid/ch3/channels/psm/src/ch3_win_fns.c at line 273:
>         node_comm_ptr != NULL
>         Assertion failed in file
>         ../src/mpid/ch3/channels/psm/src/ch3_win_fns.c at line 273:
>         node_comm_ptr != NULL
>         internal ABORT - process 0
>         internal ABORT - process 1
>
>         This issue seems related to the shared memory collective
>         routines, which are called internally by win_allocate. At the
>         9th win_allocate, the function
>         MPIDI_CH3I_SHMEM_Coll_get_free_block cannot return a valid
>         block because all of the blocks are in use by the
>         outstanding windows. As a result, the shmem_comm cannot be
>         created and node_comm_ptr is NULL.
>
>         I also tried disabling the shared memory collective
>         optimization, but that does not solve the issue:
>         export MV2_USE_SHMEM_COLL=0
>
>         This issue breaks the NWChem application with ARMCI/MPI, which
>         allocates many windows. Could you please look into this
>         problem? Thanks!
>
>         Best regards,
>         Min
>
