[mvapich-discuss] [MVAPICH2] OneSided
Yann K.
yann.kalemkarian at bull.net
Thu Nov 23 09:09:42 EST 2006
Hello Mr Panda,
I don't know where to submit bugs like this other than the mailing list
(do you have a bug tracker?), so could you please forward this report to
whoever is in charge of one-sided communication in MVAPICH2?
Case 1: I ran IMB-EXT on IA64 with the MVAPICH2 0.9.8 stack, and all
the one-sided routines seem to be broken. After window creation, every
operation hangs. See the stacks of the 4 processes below.
Case 2: Because of that, I tried IMB on 0.9.5 and found deadlocks there
as well, in the Accumulate part, when more than 4 processes run on 2
machines (see below for details and stacks). Window creation also
clearly performs better in 0.9.5.
Configuration: 2 IA64 machines with Voltaire switches and 4x DDR boards
(ask if you need more information).
CASE 1 :
Loaded symbols for /opt/slurm/lib/slurm/auth_none.so
0x2000000000173851 in MPIDI_CH3I_MRAILI_Get_next_vbuf ()
from
/home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
#0 0x2000000000173851 in MPIDI_CH3I_MRAILI_Get_next_vbuf ()
from
/home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
#1 0x20000000000f5320 in MPIDI_CH3I_read_progress ()
from
/home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
#2 0x20000000000f4ac0 in MPIDI_CH3I_Progress ()
from
/home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
#3 0x200000000016e2e0 in MPIC_Wait ()
from
/home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
#4 0x200000000016e610 in MPIC_Sendrecv ()
from
/home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
#5 0x20000000000ddb00 in MPIR_Barrier ()
from
/home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
#6 0x20000000000de680 in PMPI_Barrier ()
from
/home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
#7 0x400000000000aef0 in IMB_window (c_info=0x600ffffffae4e230,
size=4194304, n_sample=2, RUN_MODE=0x600ffffffae4e1f0,
time=0x600ffffffae4e200) at IMB_window.c:126
#8 0x4000000000008750 in IMB_warm_up (c_info=0x600ffffffae4e210,
Bmark=0x600000000002e078, iter=0) at IMB_warm_up.c:132
#9 0x4000000000002630 in main (argc=1, argv=0x600ffffffae4e598) at
IMB.c:271
(gdb)
0x20000000000c1861 in ?? ()
from
/home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
#0 0x20000000000c1861 in ?? ()
from
/home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
#1 0x20000000000f5350 in MPIDI_CH3I_read_progress ()
from
/home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
#2 0x20000000000f4ac0 in MPIDI_CH3I_Progress ()
from
/home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
#3 0x20000000001e1400 in PMPI_Recv ()
from
/home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
#4 0x4000000000004e80 in IMB_init_communicator (c_info=0x600ffffffae4e210,
NP=3) at IMB_init.c:620
#5 0x4000000000002240 in main (argc=1, argv=0x600ffffffae4e598) at
IMB.c:166
(gdb)
0x2000000000101ce0 in MPIDI_CH3I_SMP_write_progress ()
from
/home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
#0 0x2000000000101ce0 in MPIDI_CH3I_SMP_write_progress ()
from
/home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
#1 0x20000000000f4bc0 in MPIDI_CH3I_Progress ()
from
/home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
#2 0x200000000016e2e0 in MPIC_Wait ()
from
/home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
#3 0x200000000016e610 in MPIC_Sendrecv ()
from
/home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
#4 0x20000000000c60d0 in MPIR_Allgather ()
from
/home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
#5 0x20000000000c7dd0 in PMPI_Allgather ()
from
/home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
#6 0x20000000001392c0 in create_2level_comm ()
from
/home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
#7 0x20000000001375f0 in PMPI_Comm_split ()
from
/home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
#8 0x4000000000004880 in IMB_set_communicator (c_info=0x600ffffff938e210)
at IMB_init.c:726
#9 0x4000000000004920 in IMB_init_communicator (c_info=0x600ffffff938e210,
NP=4) at IMB_init.c:551
#10 0x4000000000002240 in main (argc=1, argv=0x600ffffff938e598) at
IMB.c:166
(gdb)
#2 0x2000000000170b90 in MPIDI_CH3I_MRAILI_Cq_poll ()
from
/home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
#3 0x20000000000f5350 in MPIDI_CH3I_read_progress ()
from
/home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
#4 0x20000000000f4ac0 in MPIDI_CH3I_Progress ()
from
/home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
#5 0x200000000016e2e0 in MPIC_Wait ()
from
/home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
#6 0x200000000016e610 in MPIC_Sendrecv ()
from
/home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
#7 0x20000000000c60d0 in MPIR_Allgather ()
from
/home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
#8 0x20000000000c7dd0 in PMPI_Allgather ()
from
/home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
#9 0x20000000001392c0 in create_2level_comm ()
from
/home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
#10 0x20000000001375f0 in PMPI_Comm_split ()
from
/home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
#11 0x4000000000004880 in IMB_set_communicator (c_info=0x600ffffffa31e210)
at IMB_init.c:726
#12 0x4000000000004920 in IMB_init_communicator (c_info=0x600ffffffa31e210,
NP=4) at IMB_init.c:551
#13 0x4000000000002240 in main (argc=1, argv=0x600ffffffa31e598) at
IMB.c:166
(gdb)
CASE 2 :
On 0.9.5, everything works until the Accumulate message size goes over
16K with more than one pair of processes communicating (here, 4
processes). I am also sending the call stacks for the 4 processes.
Loaded symbols for /opt/ofed.1.0/lib/infiniband/mthca.so
Reading symbols from /lib/libnss_files.so.2...done.
Loaded symbols for /lib/libnss_files.so.2
Reading symbols from /opt/slurm/lib/slurm/auth_none.so...done.
Loaded symbols for /opt/slurm/lib/slurm/auth_none.so
0x20000000001764c0 in MPIDI_CH3I_MRAILI_Get_next_vbuf??unw ()
from
/home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.5/distrib/lib/libmpich.so
#0 0x20000000001764c0 in MPIDI_CH3I_MRAILI_Get_next_vbuf??unw ()
from
/home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.5/distrib/lib/libmpich.so
#1 0x20000000000f4ae0 in MPIDI_CH3I_read_progress??unw ()
from
/home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.5/distrib/lib/libmpich.so
#2 0x20000000000f34c0 in MPIDI_CH3I_Progress??unw ()
from
/home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.5/distrib/lib/libmpich.so
#3 0x2000000000119640 in MPIDI_Win_fence??unw ()
from
/home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.5/distrib/lib/libmpich.so
#4 0x20000000001a0530 in MPID_Win_fence??unw ()
from
/home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.5/distrib/lib/libmpich.so
#5 0x200000000024b470 in PMPI_Win_fence??unw ()
from
/home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.5/distrib/lib/libmpich.so
#6 0x4000000000015290 in IMB_accumulate (c_info=0x600ffffffb33e1e0,
size=32768, n_sample=1000, RUN_MODE=0x4000, time=0x600ffffffb33e2a0)
at IMB_ones_accu.c:198
#7 0x4000000000002d00 in main (argc=63568, argv=0x0) at IMB.c:273
(gdb)
--
Yann Kalemkarian
HPC Software Engineer
Open Software R&D
Bull, Architect of an Open World TM
Phone: +33 4 7629 7393
www.bull.com