[mvapich-discuss] Oversubscription support
Maksym Planeta
mplaneta at os.inf.tu-dresden.de
Mon Oct 26 12:16:28 EDT 2015
Sorry for the somewhat long response.
With up to 250 processes it works:
MV2_ON_DEMAND_THRESHOLD=250 MV2_ENABLE_AFFINITY=0 MV2_USE_BLOCKING=1
mpirun -prepend-rank -n 250 ./hello_world
But with 400 it breaks:
[165] [cli_165]: aborting job:
[165] Fatal error in MPI_Init:
[165] Other MPI error, error stack:
[165] MPIR_Init_thread(514)....:
[165] MPID_Init(359)...........: channel initialization failed
[165] MPIDI_CH3_Init(429)......:
[165] MPIDI_CH3I_RDMA_init(340):
[165] rdma_iba_hca_init(1053)..: Failed to create qp for rank 152
[165]
[181] [cli_181]: aborting job:
[181] Fatal error in MPI_Init:
[181] Other MPI error, error stack:
[181] MPIR_Init_thread(514)....:
[181] MPID_Init(359)...........: channel initialization failed
[181] MPIDI_CH3_Init(429)......:
[181] MPIDI_CH3I_RDMA_init(340):
[181] rdma_iba_hca_init(1053)..: Failed to create qp for rank 153
I have to admit, I do not know the exact oversubscription factor I want
to achieve (currently I am testing on 24 cores). Rather, I would like to
know the possible limit.
On 10/26/2015 03:57 PM, Jonathan Perkins wrote:
> One more question before we may have to take this offline for further
> debugging.
>
> Does this error still happen when you set MV2_ENABLE_AFFINITY to 0 in
> addition to setting MV2_USE_BLOCKING to 1?
>
> If so, can you provide how you are launching your app (full command line).
>
> On Mon, Oct 26, 2015 at 10:47 AM Maksym Planeta
> <mplaneta at os.inf.tu-dresden.de> wrote:
>
>
>
> On 10/26/2015 03:45 PM, Jonathan Perkins wrote:
> > Sorry, I meant to ask if you were setting MV2_USE_BLOCKING to 1.
> >
> No problem, got it.
>
> Anyway, the error is not related to blocking:
>
> [54] Error parsing CPU mapping string
> [54] INTERNAL ERROR: invalid error code ffffffff (Ring Index out of
> range) in MPIDI_CH3I_set_affinity:119
> [54] [cli_54]: aborting job:
> [54] Fatal error in MPI_Init:
> [54] Other MPI error, error stack:
> [54] MPIR_Init_thread(514):
> [54] MPID_Init(359).......: channel initialization failed
> [54] MPIDI_CH3_Init(469)..:
> [54]
>
> It happens because mv2_get_assigned_cpu_core returns -1 for ranks
> whose local_id is greater than the number of cores.
>
> > On Mon, Oct 26, 2015 at 10:41 AM Jonathan Perkins
> > <perkinjo at cse.ohio-state.edu> wrote:
> >
> > When you're running with oversubscription, were you
> > setting MV2_USE_BLOCKING to 0? If so, what type of errors
> were you
> > hitting?
> >
> > On Mon, Oct 26, 2015 at 10:34 AM Maksym Planeta
> > <mplaneta at os.inf.tu-dresden.de> wrote:
> >
> > Hi,
> >
> > I'm interested in using the MVAPICH library with oversubscription,
> > i.e. with more than one rank per core. In version 2.1
> > oversubscription worked up to a certain limit, and beyond it the
> > library would break because of bugs.
> >
> > So I updated to 2.2a and found out that the new version contains
> > additional checks (for example in the function
> > mv2_get_assigned_cpu_core), which basically forbid having more
> > than one rank per core.
> >
> > Could you tell me the reason for that? Have you ever tried
> > running MVAPICH with oversubscription? And would you at least
> > consider patches for oversubscription support?
> >
> > --
> > Regards,
> > Maksym Planeta
> >
> > _______________________________________________
> > mvapich-discuss mailing list
> > mvapich-discuss at cse.ohio-state.edu
> > http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> >
>
> --
> Regards,
> Maksym Planeta
>
--
Regards,
Maksym Planeta