[mvapich-discuss] Oversubscription support

Maksym Planeta mplaneta at os.inf.tu-dresden.de
Mon Oct 26 12:16:28 EDT 2015


Sorry for the somewhat long response.

With up to 250 processes it works:

MV2_ON_DEMAND_THRESHOLD=250 MV2_ENABLE_AFFINITY=0 MV2_USE_BLOCKING=1 mpirun -prepend-rank -n 250 ./hello_world

But with 400 it breaks:

[165] [cli_165]: aborting job:
[165] Fatal error in MPI_Init:
[165] Other MPI error, error stack:
[165] MPIR_Init_thread(514)....:
[165] MPID_Init(359)...........: channel initialization failed
[165] MPIDI_CH3_Init(429)......:
[165] MPIDI_CH3I_RDMA_init(340):
[165] rdma_iba_hca_init(1053)..: Failed to create qp for rank 152
[165]
[181] [cli_181]: aborting job:
[181] Fatal error in MPI_Init:
[181] Other MPI error, error stack:
[181] MPIR_Init_thread(514)....:
[181] MPID_Init(359)...........: channel initialization failed
[181] MPIDI_CH3_Init(429)......:
[181] MPIDI_CH3I_RDMA_init(340):
[181] rdma_iba_hca_init(1053)..: Failed to create qp for rank 153

I have to admit I do not know exactly what oversubscription factor I want 
to achieve (currently I am testing on 24 cores). Rather, I would like to 
know the possible limit.

On 10/26/2015 03:57 PM, Jonathan Perkins wrote:
> One more question before we may have to take this offline for further
> debugging.
>
> Does this error still happen when you set MV2_ENABLE_AFFINITY to 0 in
> addition to setting MV2_USE_BLOCKING to 1?
>
> If so, can you share how you are launching your app (the full command line)?
>
> On Mon, Oct 26, 2015 at 10:47 AM Maksym Planeta
> <mplaneta at os.inf.tu-dresden.de> wrote:
>
>
>
>     On 10/26/2015 03:45 PM, Jonathan Perkins wrote:
>      > Sorry, I meant to ask if you were setting MV2_USE_BLOCKING to 1.
>      >
>     No problem, I've got it.
>
>     Anyway, the error is not related to blocking:
>
>     [54] Error parsing CPU mapping string
>     [54] INTERNAL ERROR: invalid error code ffffffff (Ring Index out of
>     range) in MPIDI_CH3I_set_affinity:119
>     [54] [cli_54]: aborting job:
>     [54] Fatal error in MPI_Init:
>     [54] Other MPI error, error stack:
>     [54] MPIR_Init_thread(514):
>     [54] MPID_Init(359).......: channel initialization failed
>     [54] MPIDI_CH3_Init(469)..:
>     [54]
>
>     And it happens because mv2_get_assigned_cpu_core returns -1 for ranks
>     whose local_id is greater than the number of cores.
>
>      > On Mon, Oct 26, 2015 at 10:41 AM Jonathan Perkins
>      > <perkinjo at cse.ohio-state.edu> wrote:
>      >
>      >     When you're running with oversubscription, were you
>      >     setting MV2_USE_BLOCKING to 0?  If so, what type of errors
>      >     were you hitting?
>      >
>      >     On Mon, Oct 26, 2015 at 10:34 AM Maksym Planeta
>      >     <mplaneta at os.inf.tu-dresden.de> wrote:
>      >
>      >         Hi,
>      >
>      >         I'm interested in using the MVAPICH library with
>      >         oversubscription, i.e. with more than one rank per core.
>      >         In version 2.1 oversubscription worked up to a certain
>      >         limit and then the library simply broke because of bugs.
>      >
>      >         So I updated to 2.2a and found out that the new version
>      >         contains additional checks (for example in the function
>      >         mv2_get_assigned_cpu_core), which basically forbid having
>      >         more than one rank per core.
>      >
>      >         Could you tell me the reason for that? Have you ever
>      >         tried running MVAPICH with oversubscription? And would
>      >         you at least consider patches for oversubscription
>      >         support?
>      >
>      >         --
>      >         Regards,
>      >         Maksym Planeta
>      >
>      >         _______________________________________________
>      >         mvapich-discuss mailing list
>      >         mvapich-discuss at cse.ohio-state.edu
>      > http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>      >
>
>     --
>     Regards,
>     Maksym Planeta
>
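
Coming back to the affinity check discussed in the quoted thread: one
possible direction for oversubscription support (only a sketch of the idea
from my side, with a made-up function name, not the actual MVAPICH
internals) would be to wrap the local rank around the core count instead
of returning -1:

    /* Hypothetical helper: round-robin core assignment under
     * oversubscription instead of failing when local_id >= num_cores. */
    static int assign_cpu_core_oversub(int local_id, int num_cores)
    {
        if (num_cores <= 0 || local_id < 0)
            return -1;               /* still treat this as an error */
        return local_id % num_cores; /* extra ranks wrap around the cores */
    }

Pinning several ranks to the same core this way would presumably only make
sense together with MV2_USE_BLOCKING=1, otherwise the polling ranks starve
each other.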

-- 
Regards,
Maksym Planeta


