[mvapich-discuss] [MVAPICH2] OneSided

Yann K. yann.kalemkarian at bull.net
Thu Nov 23 09:09:42 EST 2006


Hello Mr Panda,

I don't know where to submit bugs like these other than this mailing 
list (do you have a bug tracker?), so could you please forward this to 
whoever is in charge of one-sided communication in MVAPICH2?
 
Case 1: I ran IMB-EXT on IA64 with the MVAPICH2 0.9.8 stack, and all 
the one-sided routines seem to be broken. After window creation, every 
operation appears to hang. See the stacks of the 4 processes below.

Case 2: Because of that, I tried IMB on 0.9.5 and found deadlocks there 
as well, in the Accumulate part, with more than 4 processes on 2 
machines (see below for details and stacks). Window creation also 
clearly performs better in 0.9.5.

Configuration: 2 IA64 machines with Voltaire switches and 4x DDR boards 
(ask if you need more information).


CASE 1 :

Loaded symbols for /opt/slurm/lib/slurm/auth_none.so
0x2000000000173851 in MPIDI_CH3I_MRAILI_Get_next_vbuf ()
   from /home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
#0  0x2000000000173851 in MPIDI_CH3I_MRAILI_Get_next_vbuf ()
   from /home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
#1  0x20000000000f5320 in MPIDI_CH3I_read_progress ()
   from /home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
#2  0x20000000000f4ac0 in MPIDI_CH3I_Progress ()
   from /home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
#3  0x200000000016e2e0 in MPIC_Wait ()
   from /home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
#4  0x200000000016e610 in MPIC_Sendrecv ()
   from /home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
#5  0x20000000000ddb00 in MPIR_Barrier ()
   from /home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
#6  0x20000000000de680 in PMPI_Barrier ()
   from /home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
#7  0x400000000000aef0 in IMB_window (c_info=0x600ffffffae4e230,
    size=4194304, n_sample=2, RUN_MODE=0x600ffffffae4e1f0,
    time=0x600ffffffae4e200) at IMB_window.c:126
#8  0x4000000000008750 in IMB_warm_up (c_info=0x600ffffffae4e210,
    Bmark=0x600000000002e078, iter=0) at IMB_warm_up.c:132
#9  0x4000000000002630 in main (argc=1, argv=0x600ffffffae4e598) at IMB.c:271
(gdb)


0x20000000000c1861 in ?? ()
   from /home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
#0  0x20000000000c1861 in ?? ()
   from /home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
#1  0x20000000000f5350 in MPIDI_CH3I_read_progress ()
   from /home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
#2  0x20000000000f4ac0 in MPIDI_CH3I_Progress ()
   from /home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
#3  0x20000000001e1400 in PMPI_Recv ()
   from /home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
#4  0x4000000000004e80 in IMB_init_communicator (c_info=0x600ffffffae4e210,
    NP=3) at IMB_init.c:620
#5  0x4000000000002240 in main (argc=1, argv=0x600ffffffae4e598) at IMB.c:166
(gdb)

0x2000000000101ce0 in MPIDI_CH3I_SMP_write_progress ()
   from /home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
#0  0x2000000000101ce0 in MPIDI_CH3I_SMP_write_progress ()
   from /home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
#1  0x20000000000f4bc0 in MPIDI_CH3I_Progress ()
   from /home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
#2  0x200000000016e2e0 in MPIC_Wait ()
   from /home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
#3  0x200000000016e610 in MPIC_Sendrecv ()
   from /home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
#4  0x20000000000c60d0 in MPIR_Allgather ()
   from /home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
#5  0x20000000000c7dd0 in PMPI_Allgather ()
   from /home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
#6  0x20000000001392c0 in create_2level_comm ()
   from /home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
#7  0x20000000001375f0 in PMPI_Comm_split ()
   from /home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
#8  0x4000000000004880 in IMB_set_communicator (c_info=0x600ffffff938e210)
    at IMB_init.c:726
#9  0x4000000000004920 in IMB_init_communicator (c_info=0x600ffffff938e210,
    NP=4) at IMB_init.c:551
#10 0x4000000000002240 in main (argc=1, argv=0x600ffffff938e598) at IMB.c:166
(gdb)

#2  0x2000000000170b90 in MPIDI_CH3I_MRAILI_Cq_poll ()
   from /home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
#3  0x20000000000f5350 in MPIDI_CH3I_read_progress ()
   from /home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
#4  0x20000000000f4ac0 in MPIDI_CH3I_Progress ()
   from /home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
#5  0x200000000016e2e0 in MPIC_Wait ()
   from /home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
#6  0x200000000016e610 in MPIC_Sendrecv ()
   from /home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
#7  0x20000000000c60d0 in MPIR_Allgather ()
   from /home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
#8  0x20000000000c7dd0 in PMPI_Allgather ()
   from /home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
#9  0x20000000001392c0 in create_2level_comm ()
   from /home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
#10 0x20000000001375f0 in PMPI_Comm_split ()
   from /home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
#11 0x4000000000004880 in IMB_set_communicator (c_info=0x600ffffffa31e210)
    at IMB_init.c:726
#12 0x4000000000004920 in IMB_init_communicator (c_info=0x600ffffffa31e210,
    NP=4) at IMB_init.c:551
#13 0x4000000000002240 in main (argc=1, argv=0x600ffffffa31e598) at IMB.c:166
(gdb)

CASE 2 :

On 0.9.5 everything works until the Accumulate size goes above 16K with 
more than a single pair of processes communicating (here 4 processes).
I am also sending the call stacks for the 4 processes.


Loaded symbols for /opt/ofed.1.0/lib/infiniband/mthca.so
Reading symbols from /lib/libnss_files.so.2...done.
Loaded symbols for /lib/libnss_files.so.2
Reading symbols from /opt/slurm/lib/slurm/auth_none.so...done.
Loaded symbols for /opt/slurm/lib/slurm/auth_none.so
0x20000000001764c0 in MPIDI_CH3I_MRAILI_Get_next_vbuf??unw ()
   from /home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.5/distrib/lib/libmpich.so
#0  0x20000000001764c0 in MPIDI_CH3I_MRAILI_Get_next_vbuf??unw ()
   from /home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.5/distrib/lib/libmpich.so
#1  0x20000000000f4ae0 in MPIDI_CH3I_read_progress??unw ()
   from /home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.5/distrib/lib/libmpich.so
#2  0x20000000000f34c0 in MPIDI_CH3I_Progress??unw ()
   from /home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.5/distrib/lib/libmpich.so
#3  0x2000000000119640 in MPIDI_Win_fence??unw ()
   from /home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.5/distrib/lib/libmpich.so
#4  0x20000000001a0530 in MPID_Win_fence??unw ()
   from /home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.5/distrib/lib/libmpich.so
#5  0x200000000024b470 in PMPI_Win_fence??unw ()
   from /home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.5/distrib/lib/libmpich.so
#6  0x4000000000015290 in IMB_accumulate (c_info=0x600ffffffb33e1e0,
    size=32768, n_sample=1000, RUN_MODE=0x4000, time=0x600ffffffb33e2a0)
    at IMB_ones_accu.c:198
#7  0x4000000000002d00 in main (argc=63568, argv=0x0) at IMB.c:273
(gdb)

-- 
Yann Kalemkarian
HPC Software Engineer
Open Software R&D
Bull, Architect of an Open World TM
Phone: +33 4 7629 7393
www.bull.com


