[mvapich-discuss] mvapich2 warning: Rndv Receiver is receiving less than as expected

Bernd Kallies kallies at zib.de
Mon Mar 9 11:11:17 EDT 2009


On Fri, 2009-03-06 at 20:25 -0500, Dhabaleswar Panda wrote:
> Thanks for reporting this issue in depth. We will try to take a look at it
> and get back to you. In the meantime, could you try setting the runtime
> environment variable MV2_USE_SHMEM_COLL to 0 (MV2_USE_SHMEM_COLL=0) and let
> us know whether the problem still persists?

I tried several setups, without success.
The problem persists with

- MV2_USE_SHMEM_COLL=0
- MV2_USE_SHARED_MEM=0
- MV2_USE_SHMEM_BCAST=0
- running on 128 nodes, 1 task per node
- MV2_USE_LAZY_MEM_UNREGISTER=0
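
For anyone who wants to repeat such trials, here is a minimal Python sketch that builds one launch command per switch. It assumes the job is started with mpirun_rsh, which accepts NAME=VALUE pairs before the executable. The launcher actually used is not stated in this message, and the hostfile name and executable path below are hypothetical.

```python
import shlex

# Runtime switches named in this message, each to be set to 0 (disabled).
MV2_SWITCHES = [
    "MV2_USE_SHMEM_COLL",
    "MV2_USE_SHARED_MEM",
    "MV2_USE_SHMEM_BCAST",
    "MV2_USE_LAZY_MEM_UNREGISTER",
]


def launch_command(nprocs, hostfile, exe, disabled):
    """Build an mpirun_rsh argument list with the given switches set to 0."""
    env_args = [f"{name}=0" for name in disabled]
    return ["mpirun_rsh", "-np", str(nprocs), "-hostfile", hostfile, *env_args, exe]


if __name__ == "__main__":
    # One trial per switch; "hosts" and "./cp2k.popt" are hypothetical names.
    for name in MV2_SWITCHES:
        print(shlex.join(launch_command(128, "hosts", "./cp2k.popt", [name])))
```

Passing several names in the last argument disables them together in a single run, which may be closer to what was actually tried.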

I used mvapich2-1.2.0 (dated 06-Nov-2008) as well as the development
snapshot dated 02-March-2009, which I believe is 1.2p1.

My last trial (16 nodes, 8 tasks per node, mvapich2-1.2.0 static libs
compiled with Intel compilers, -O1 -g -traceback,
MV2_USE_LAZY_MEM_UNREGISTER=0) crashed with SIGSEGV after printing this
warning:

Warning! Rndv Receiver is receiving (110592 < 221184) less than as expected
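
When scanning job output for this warning, the two byte counts can be pulled out mechanically. The following is a minimal Python sketch; the pattern is written against the message text quoted above, and it is an assumption that other MVAPICH2 versions print the same wording.

```python
import re

# Pattern written against the warning text quoted above; the wording in
# other MVAPICH2 versions may differ (assumption).
RNDV_WARNING = re.compile(r"Rndv Receiver is receiving \((\d+) < (\d+)\)")


def parse_rndv_warning(text):
    """Return (received, expected) byte counts, or None if no warning is found."""
    match = RNDV_WARNING.search(text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


if __name__ == "__main__":
    line = "Warning! Rndv Receiver is receiving (110592 < 221184) less than as expected"
    received, expected = parse_rndv_warning(line)
    print(received, expected, expected / received)
```

For the message above this yields 110592 and 221184, so the receiver expected exactly twice the number of bytes that arrived.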

I used the Intel Fortran RTE to install a SIGSEGV handler, which
generates a core dump. The gdb command "where" on this core dump gives:

#0  0x00002b81a2ae6bb5 in raise () from /lib64/libc.so.6
#1  0x00002b81a2ae7fb0 in abort () from /lib64/libc.so.6
#2  0x0000000001ac4db5 in for__signal_handler ()
#3  <signal handler called>
#4  0x00002b81a2ae6bb5 in raise () from /lib64/libc.so.6
#5  0x00002b81a2ae7fb0 in abort () from /lib64/libc.so.6
#6  0x0000000001ac1740 in for__issue_diagnostic ()
#7  0x0000000001ac4b53 in for__signal_handler ()
#8  <signal handler called>
#9  MPIDI_CH3I_SMP_readv_rndv_cont (recv_vc_ptr=0x231d4d0, iov=0x459bd011b0, iovlen=-1404046976, index=37369648, num_bytes_ptr=0x22b3e12) at ch3_smp_progress.c:1761
#10 0x0000000001a6d62d in MPIDI_CH3I_SMP_read_progress (pg=0x231d4d0) at ch3_smp_progress.c:493
#11 0x0000000001a6a5fc in MPIDI_CH3I_Progress (is_blocking=36820176, state=0x459bd011b0) at ch3_progress.c:184
#12 0x0000000001a5b701 in PMPI_Recv (buf=0x231d4d0, count=-1680862800, datatype=-1404046976, source=37369648, tag=36388370, comm=0, status=0x36906c0) at recv.c:156
#13 0x00000000011980e0 in BI_Srecv (ctxt=0x231d4d0, src=-1680862800, msgid=-1404046976, bp=0x23a3730) at BI_Srecv.c:8
#14 0x0000000001197ae5 in BI_IdringBR (ctxt=0x231d4d0, bp=0x459bd011b0, send=0x2aaaac4ff180, src=37369648, step=36388370) at BI_IdringBR.c:12
#15 0x00000000011936e5 in Cdgebr2d (ConTxt=36820176, scope=0x459bd011b0 <Address 0x459bd011b0 out of bounds>, top=0x2aaaac4ff180 "", m=37369648, n=36388370, A=0x0, lda=864, rsrc=0, csrc=5)
    at dgebr2d_.c:192
#16 0x00000000010e78ab in PB_CInV (TYPE=0x231d4d0, CONJUG=0x459bd011b0 <Address 0x459bd011b0 out of bounds>, ROWCOL=0x2aaaac4ff180 "", M=37369648, N=36388370, DESCA=0x0, K=13, 
    X=0x2aaad58ce900 "[binary data omitted]"..., IX=0, JX=17748848, DESCX=0x20, 
    XROC=0x0, XAPTR=0x0, DXA=0x0, XAFREE=0x7fff08794ac8) at PB_CInV.c:490
#17 0x00000000010ed370 in PB_CpgemmAC (TYPE=0x231d4d0, DIRECA=0x459bd011b0 <Address 0x459bd011b0 out of bounds>, DIRECC=0x2aaaac4ff180 "", TRANSA=0x23a3730 "", TRANSB=0x22b3e12 "", M=0, N=865, 
    K=13553, ALPHA=0x1c3dd60 "", 
    A=0x2aaad79ac3f0 "[binary data omitted]"..., IA=0, JA=0, DESCA=0x7fff08794f38, 
    B=0x2aaad7a4e400 "[binary data omitted]", IB=0, JB=0, DESCB=0x7fff08794f0c, BETA=0x1c3dd58 "", 
    C=0x2aaad58ce800 "[binary data omitted]"..., IC=0, JC=0, 
    DESCC=0x7fff08794ee0) at PB_CpgemmAC.c:505
#18 0x00000000010c48c6 in pdgemm_ (TRANSA=0x231d4d0 "\r", TRANSB=0x459bd011b0 <Address 0x459bd011b0 out of bounds>, M=0x2aaaac4ff180, N=0x23a3730, K=0x22b3e12, ALPHA=0x0, A=0x2aaad79ac3f0, 
    IA=0x7fff08795034, JA=0x1c070e8, DESCA=0x21eb220, B=0x2aaad7a4e400, IB=0x7fff08795038, JB=0x7fff0879503c, DESCB=0x21eb244, BETA=0x1c3dd58, C=0x2aaad58ce800, IC=0x7fff08795040, JC=0x7fff08795044, 
    DESCC=0x21eb268) at pdgemm_.c:490
#19 0x0000000000b6e30e in cp_fm_basic_linalg_mp_cp_fm_gemm_ ()
#20 0x0000000000ec9f7b in qs_ot_mp_qs_ot_get_p_ ()
#21 0x000000000078e1a4 in qs_ot_scf_mp_ot_scf_mini_ ()
#22 0x00000000007d81ce in qs_scf_mp_qs_scf_loop_do_ot_ ()
#23 0x00000000007d6538 in qs_scf_mp_scf_env_do_scf_ ()
#24 0x00000000007d4fb0 in qs_scf_mp_scf_ ()
#25 0x00000000006b1e34 in qs_energy_mp_qs_energies_ ()
#26 0x00000000006bca89 in qs_force_mp_qs_forces_ ()
#27 0x0000000000468847 in force_env_methods_mp_force_env_calc_energy_force_ ()
#28 0x0000000000d28a2d in integrator_mp_nve_ ()
#29 0x00000000008c7db8 in velocity_verlet_control_mp_velocity_verlet_ ()
#30 0x00000000005d4701 in md_run_mp_qs_mol_dyn_low_ ()
#31 0x00000000005d344f in md_run_mp_qs_mol_dyn_ ()
#32 0x0000000000415583 in cp2k_runs_mp_cp2k_run_ ()
#33 0x000000000041a9aa in cp2k_runs_mp_run_input_ ()
#34 0x0000000000413d17 in cp2k () at /gfs1/work/bzfbbk/CP2K/cp2k/makefiles/../src/cp2k.F:272
#35 0x0000000000412ae2 in main ()


-- 
Dr. Bernd Kallies
Konrad-Zuse-Zentrum für Informationstechnik Berlin
Takustr. 7
14195 Berlin
Tel: +49-30-84185-270
Fax: +49-30-84185-311
e-mail: kallies at zib.de



