[mvapich-discuss] [MVAPICH2] OneSided

Dhabaleswar Panda panda at cse.ohio-state.edu
Thu Nov 23 11:22:10 EST 2006


Hi Yann, 

Thanks for reporting these errors.

> I don't know where to submit such bugs (do you have a tracker system ?) 
> (besides the diflist), so I ask you to dispatch it to the guy in charge 
> of OneSided on MVAPICH2.

Yes, we have a Bugzilla system and will track these error reports there.

> case 1 : I run IMB-EXT on IA64, with the 0.9.8 MVAPICH2 stack, and all 
> the 1sd routines seem to be broken. After window creations, all 
> operation goes in a funk. Check the 4 processes stacks below

Are you using the OFED/Gen2 stack? 

Before the release, we tested IMB with the Gen2 stack on IA32, EM64T
and Opteron platforms, and it appeared to work fine on all of them.

This error might be specific to IA64 systems. Unfortunately, we do
not have any working IA64 systems here. Would it be possible to get
remote access to your IA64 systems for some time? That would help us
significantly in resolving this problem more quickly.
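
In the meantime, when comparing the four stacks quoted below, it may
help to reduce each gdb backtrace to its chain of function names, so
that the call in which each process is blocked stands out. The
following is only a rough sketch: `summarize_backtrace` and `FRAME_RE`
are hypothetical names, not part of MVAPICH2 or IMB, and the pattern
assumes frame lines shaped like those in the quoted gdb output.

```python
import re

# Assumed frame format, taken from the gdb output quoted in this message:
#   "#3  0x200000000016e2e0 in MPIC_Wait ()"
# Leading "> " quote markers are stripped before matching.
FRAME_RE = re.compile(r"^#(\d+)\s+0x[0-9a-fA-F]+\s+in\s+(\S+)\s*\(")


def summarize_backtrace(text):
    """Return the function names of one gdb backtrace, innermost frame first."""
    frames = []
    for raw in text.splitlines():
        # Drop mail quote markers and surrounding whitespace.
        line = raw.lstrip("> ").rstrip()
        match = FRAME_RE.match(line)
        if match:
            frames.append(match.group(2))
    return frames


if __name__ == "__main__":
    # A few frames copied from the first stack of case 1.
    sample = (
        "> #0  0x2000000000173851 in MPIDI_CH3I_MRAILI_Get_next_vbuf ()\n"
        ">    from\n"
        "> #1  0x20000000000f5320 in MPIDI_CH3I_read_progress ()\n"
        "> #2  0x20000000000f4ac0 in MPIDI_CH3I_Progress ()\n"
    )
    print(" <- ".join(summarize_backtrace(sample)))
```

Running this over each of the four stacks gives one line per process,
which makes it easier to see which processes are in the same
collective call and which are not.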

> case 2 : Because of that, I tried IMB on 0.9.5 and found deadlocks as 
> well in the Accumulate part when more than 4 processes on 2 machines 
> (see after for details and stacks). The windows creation code is 
> obviously better in performance also here in the 0.9.5.

Once again, this could be related to IA64 system specifics. 
 
> conf : 2 IA64 machines with voltaire switches and DDR x 4 boards (for 
> more info ask).

One of my students, Wei, will follow up with you on this.

Thanks, 

DK

> CASE 1 :
> 
> Loaded symbols for /opt/slurm/lib/slurm/auth_none.so
> 0x2000000000173851 in MPIDI_CH3I_MRAILI_Get_next_vbuf ()
>    from 
> /home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
> #0  0x2000000000173851 in MPIDI_CH3I_MRAILI_Get_next_vbuf ()
>    from 
> /home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
> #1  0x20000000000f5320 in MPIDI_CH3I_read_progress ()
>    from 
> /home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
> #2  0x20000000000f4ac0 in MPIDI_CH3I_Progress ()
>    from 
> /home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
> #3  0x200000000016e2e0 in MPIC_Wait ()
>    from 
> /home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
> #4  0x200000000016e610 in MPIC_Sendrecv ()
>    from 
> /home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
> #5  0x20000000000ddb00 in MPIR_Barrier ()
>    from 
> /home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
> #6  0x20000000000de680 in PMPI_Barrier ()
>    from 
> /home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
> #7  0x400000000000aef0 in IMB_window (c_info=0x600ffffffae4e230,
>     size=4194304, n_sample=2, RUN_MODE=0x600ffffffae4e1f0,
>     time=0x600ffffffae4e200) at IMB_window.c:126
> #8  0x4000000000008750 in IMB_warm_up (c_info=0x600ffffffae4e210,
>     Bmark=0x600000000002e078, iter=0) at IMB_warm_up.c:132
> #9  0x4000000000002630 in main (argc=1, argv=0x600ffffffae4e598) at 
> IMB.c:271
> (gdb)
> 
> 
> 0x20000000000c1861 in ?? ()
>    from 
> /home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
> #0  0x20000000000c1861 in ?? ()
>    from 
> /home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
> #1  0x20000000000f5350 in MPIDI_CH3I_read_progress ()
>    from 
> /home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
> #2  0x20000000000f4ac0 in MPIDI_CH3I_Progress ()
>    from 
> /home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
> #3  0x20000000001e1400 in PMPI_Recv ()
>    from 
> /home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
> #4  0x4000000000004e80 in IMB_init_communicator (c_info=0x600ffffffae4e210,
>     NP=3) at IMB_init.c:620
> #5  0x4000000000002240 in main (argc=1, argv=0x600ffffffae4e598) at 
> IMB.c:166
> (gdb)
> 
> 0x2000000000101ce0 in MPIDI_CH3I_SMP_write_progress ()
>    from 
> /home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
> #0  0x2000000000101ce0 in MPIDI_CH3I_SMP_write_progress ()
>    from 
> /home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
> #1  0x20000000000f4bc0 in MPIDI_CH3I_Progress ()
>    from 
> /home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
> #2  0x200000000016e2e0 in MPIC_Wait ()
>    from 
> /home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
> #3  0x200000000016e610 in MPIC_Sendrecv ()
>    from 
> /home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
> #4  0x20000000000c60d0 in MPIR_Allgather ()
>    from 
> /home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
> #5  0x20000000000c7dd0 in PMPI_Allgather ()
>    from 
> /home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
> #6  0x20000000001392c0 in create_2level_comm ()
>    from 
> /home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
> #7  0x20000000001375f0 in PMPI_Comm_split ()
>    from 
> /home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
> #8  0x4000000000004880 in IMB_set_communicator (c_info=0x600ffffff938e210)
>     at IMB_init.c:726
> #9  0x4000000000004920 in IMB_init_communicator (c_info=0x600ffffff938e210,
>     NP=4) at IMB_init.c:551
> #10 0x4000000000002240 in main (argc=1, argv=0x600ffffff938e598) at 
> IMB.c:166
> (gdb)
> 
> #2  0x2000000000170b90 in MPIDI_CH3I_MRAILI_Cq_poll ()
>    from 
> /home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
> #3  0x20000000000f5350 in MPIDI_CH3I_read_progress ()
>    from 
> /home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
> #4  0x20000000000f4ac0 in MPIDI_CH3I_Progress ()
>    from 
> /home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
> #5  0x200000000016e2e0 in MPIC_Wait ()
>    from 
> /home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
> #6  0x200000000016e610 in MPIC_Sendrecv ()
>    from 
> /home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
> #7  0x20000000000c60d0 in MPIR_Allgather ()
>    from 
> /home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
> #8  0x20000000000c7dd0 in PMPI_Allgather ()
>    from 
> /home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
> #9  0x20000000001392c0 in create_2level_comm ()
>    from 
> /home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
> #10 0x20000000001375f0 in PMPI_Comm_split ()
>    from 
> /home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.8/distrib/lib/libmpich.so
> #11 0x4000000000004880 in IMB_set_communicator (c_info=0x600ffffffa31e210)
>     at IMB_init.c:726
> #12 0x4000000000004920 in IMB_init_communicator (c_info=0x600ffffffa31e210,
>     NP=4) at IMB_init.c:551
> #13 0x4000000000002240 in main (argc=1, argv=0x600ffffffa31e598) at 
> IMB.c:166
> (gdb)
> 
> 
> 
> 
> 
> 
> 
> CASE 2 :
> 
> Obviously, all goes okay on 0.9.5 until Accumulates goes over 16K when 
> more than 1 on 1 processes are dialoging (here 4).
> I send as well the call stacks for the 4 processes
> 
> 
> Loaded symbols for /opt/ofed.1.0/lib/infiniband/mthca.so
> Reading symbols from /lib/libnss_files.so.2...done.
> Loaded symbols for /lib/libnss_files.so.2
> Reading symbols from /opt/slurm/lib/slurm/auth_none.so...done.
> Loaded symbols for /opt/slurm/lib/slurm/auth_none.so
> 0x20000000001764c0 in MPIDI_CH3I_MRAILI_Get_next_vbuf??unw ()
>    from 
> /home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.5/distrib/lib/libmpich.so
> #0  0x20000000001764c0 in MPIDI_CH3I_MRAILI_Get_next_vbuf??unw ()
>    from 
> /home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.5/distrib/lib/libmpich.so
> #1  0x20000000000f4ae0 in MPIDI_CH3I_read_progress??unw ()
>    from 
> /home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.5/distrib/lib/libmpich.so
> #2  0x20000000000f34c0 in MPIDI_CH3I_Progress??unw ()
>    from 
> /home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.5/distrib/lib/libmpich.so
> #3  0x2000000000119640 in MPIDI_Win_fence??unw ()
>    from 
> /home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.5/distrib/lib/libmpich.so
> #4  0x20000000001a0530 in MPID_Win_fence??unw ()
>    from 
> /home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.5/distrib/lib/libmpich.so
> #5  0x200000000024b470 in PMPI_Win_fence??unw ()
>    from 
> /home_nfs_nsadmin/yann/SOUCHES/mvapich2-0.9.5/distrib/lib/libmpich.so
> #6  0x4000000000015290 in IMB_accumulate (c_info=0x600ffffffb33e1e0,
>     size=32768, n_sample=1000, RUN_MODE=0x4000, time=0x600ffffffb33e2a0)
>     at IMB_ones_accu.c:198
> #7  0x4000000000002d00 in main (argc=63568, argv=0x0) at IMB.c:273
> (gdb)
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> -- 
> Yann Kalemkarian
> HPC Software Engineer
> Open Software R&D
> Bull, Architect of an Open World TM
> Phone: +33 4 7629 7393
> www.bull.com
> 
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> 


