[mvapich-discuss] hangs when running MUMPS w/ MVAPICH2.2 built for PSM

Hari Subramoni subramoni.1 at osu.edu
Thu Oct 13 13:54:23 EDT 2016


Hello John,

Thanks for the report. Sorry to hear that MV2 2.2 is hanging. We've not
seen this before.

Can you send us the following details:

1. Output of mpiname -a
2. At what scale (number of processes / nodes) does the issue occur?
3. The source code and build instructions for "MUMPS", so that we can try it
out locally?

Thx,
Hari.

On Thu, Oct 13, 2016 at 1:41 PM, Westlund, John A <john.a.westlund at intel.com> wrote:

> Also posted this to the MUMPS list, but I’m only seeing these hangs on
> v2.2 -- it works with v2.1:
>
> I’ve been running the MUMPS example tests (csimpletest.F,
> dsimpletest.F, ssimpletest.F and zsimpletest.F), and I’m getting
> successful runs with OpenMPI, and with MVAPICH2 v2.2 (built for verbs) on
> Mellanox. But on QLogic HW with an MVAPICH2 v2.2 built for PSM, these
> tests hang in the factorization step:
>
> #  =================================================
> #  MUMPS compiled with option -DALLOW_NON_INIT
> #  =================================================
> # L U Solver for unsymmetric matrices
> # Type of parallelism: Working host
> #
> #  ****** ANALYSIS STEP ********
> #
> #  ... Structural symmetry (in percent)=   92
> #  ... No column permutation
> #  Ordering based on AMF
> #
> # Leaving analysis phase with  ...
> # INFOG(1)                                       =               0
> # INFOG(2)                                       =               0
> #  -- (20) Number of entries in factors (estim.) =              15
> #  --  (3) Storage of factors  (REAL, estimated) =              15
> #  --  (4) Storage of factors  (INT , estimated) =              59
> #  --  (5) Maximum frontal size      (estimated) =               3
> #  --  (6) Number of nodes in the tree           =               3
> #  -- (32) Type of analysis effectively used     =               1
> #  --  (7) Ordering option effectively used      =               2
> # ICNTL(6) Maximum transversal option            =               0
> # ICNTL(7) Pivot order option                    =               7
> # Percentage of memory relaxation (effective)    =              20
> # Number of level 2 nodes                        =               0
> # Number of split nodes                          =               0
> # RINFOG(1) Operations during elimination (estim)=   1.900D+01
> #  ** Rank of proc needing largest memory in IC facto        :         0
> #  ** Estimated corresponding MBYTES for IC facto            :         1
> #  ** Estimated avg. MBYTES per work. proc at facto (IC)     :         1
> #  ** TOTAL     space in MBYTES for IC factorization         :         4
> #  ** Rank of proc needing largest memory for OOC facto      :         0
> #  ** Estimated corresponding MBYTES for OOC facto           :         1
> #  ** Estimated avg. MBYTES per work. proc at facto (OOC)    :         1
> #  ** TOTAL     space in MBYTES for OOC factorization        :         4
> #  ELAPSED TIME IN ANALYSIS DRIVER=       0.0020
> #
> #  ****** FACTORIZATION STEP ********
> #
> #
> #  GLOBAL STATISTICS PRIOR NUMERICAL FACTORIZATION ...
> #  NUMBER OF WORKING PROCESSES              =             4
> #  OUT-OF-CORE OPTION (ICNTL(22))           =             0
> #  REAL SPACE FOR FACTORS                   =            15
> #  INTEGER SPACE FOR FACTORS                =            59
> #  MAXIMUM FRONTAL SIZE (ESTIMATED)         =             3
> #  NUMBER OF NODES IN THE TREE              =             3
> #  MEMORY ALLOWED (MB -- 0: N/A )           =             0
> #  Convergence error after scaling for ONE-NORM (option 7/8)   = 0.38D+00
> #  Maximum effective relaxed size of S              =           359
> #  Average effective relaxed size of S              =           351
> #  GLOBAL TIME FOR MATRIX DISTRIBUTION       =      0.0000
> #  ** Memory relaxation parameter ( ICNTL(14)  )            :        20
> #  ** Rank of processor needing largest memory in facto     :         0
> #  ** Space in MBYTES used by this processor for facto      :         1
> #  ** Avg. Space in MBYTES per working proc during facto    :         1
> # srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> # slurmstepd: error: *** JOB 104 ON c3 CANCELLED AT 2016-10-08T19:51:29 DUE TO TIME LIMIT ***
> # slurmstepd: error: *** STEP 104.0 ON c3 CANCELLED AT 2016-10-08T19:51:29 DUE TO TIME LIMIT ***
>
> Not sure yet why the factorization step never completes and prints the
> next expected message:
>
> ELAPSED TIME FOR FACTORIZATION           =      0.0013
>
> Thoughts?
>
> John