[mvapich-discuss] [MVAPICH2] 0.9.5 reduce_scatter

Yann K. yann.kalemkarian at bull.net
Thu Nov 23 09:19:34 EST 2006


Hello everybody,

Has anybody had problems with the IMB Reduce_scatter test using the 
MVAPICH2 0.9.5 library? The test runs fine on 15 processes but hangs on 
16, both with 16 processes on 2 machines (2 x 8) and with 16 processes 
spread over 4 machines. Below are the stacks I get when the 16 processes 
(2 x 8) hang in Reduce_scatter.

The cluster is 4 machines with 8 IA64 cores each, connected by 4x DDR 
Voltaire InfiniBand hardware.



Thanks for any feedback.

Yann





10 processes have completed the Reduce_scatter and are waiting in MPI_Barrier:
-------------------------------------------------------------------------------------

0x4000000000086531 in MPIDI_CH3I_SMP_read_progress??unw ()
#0  0x4000000000086531 in MPIDI_CH3I_SMP_read_progress??unw ()
#1  0x400000000007ee20 in MPIDI_CH3I_Progress??unw ()
#2  0x4000000000043a80 in MPIC_Sendrecv??unw ()
#3  0x400000000001d320 in PMPI_Barrier??unw ()
#4  0x4000000000004490 in main (argc=-57548248, argv=0x600ffffffc91e1dc)
    at IMB.c:277



6 processes are stuck inside Reduce_scatter, in one of the two stacks below:
----------------------------------------------
0x4000000000085a10 in MPIDI_CH3I_SMP_write_progress??unw ()
#0  0x4000000000085a10 in MPIDI_CH3I_SMP_write_progress??unw ()
#1  0x400000000007ee40 in MPIDI_CH3I_Progress??unw ()
#2  0x4000000000042e60 in MPIC_Recv??unw ()
#3  0x4000000000041600 in MPIR_Reduce_scatter??unw ()
#4  0x400000000003bee0 in PMPI_Reduce_scatter??unw ()
#5  0x4000000000012320 in IMB_reduce_scatter (c_info=0x6000000000013890,
    size=-40771196, n_sample=1000, RUN_MODE=0x600ffffffd91e188,
    time=0x600ffffffd91e220) at IMB_reduce_scatter.c:150
#6  0x4000000000004480 in main (argc=-40771032, argv=0x600ffffffd91e1dc)
    at IMB.c:273

0x20000000000b2782 in pthread_spin_lock () from /lib/tls/libpthread.so.0
#0  0x20000000000b2782 in pthread_spin_lock () from /lib/tls/libpthread.so.0
#1  0x2000000000d832e0 in mthca_poll_cq (ibcq=0x600000000004f160, ne=1,
    wc=0x600ffffffb2adf80) at src/cq.c:472
#2  0x40000000000a2b30 in MPIDI_CH3I_MRAILI_Cq_poll??unw ()
#3  0x40000000000806f0 in MPIDI_CH3I_read_progress??unw ()
#4  0x400000000007ed00 in MPIDI_CH3I_Progress??unw ()
#5  0x4000000000042910 in MPIC_Send??unw ()
#6  0x40000000000410c0 in MPIR_Reduce_scatter??unw ()
#7  0x400000000003bee0 in PMPI_Reduce_scatter??unw ()
#8  0x4000000000012370 in IMB_reduce_scatter (c_info=0x6000000000013890,
    size=-81075836, n_sample=1000, RUN_MODE=0x600ffffffb2ae188,
    time=0x600ffffffb2ae220) at IMB_reduce_scatter.c:150
#9  0x4000000000004480 in main (argc=-81075672, argv=0x600ffffffb2ae1dc)
    at IMB.c:273



-- 
Yann Kalemkarian
HPC Software Engineer
Open Software R&D
Bull, Architect of an Open World TM
Phone: +33 4 7629 7393
www.bull.com


