[mvapich-discuss] mvapich-1.0(gen2) panic our IA64 cluster, but mvapich-1.0(tcp) not

Dhabaleswar Panda panda at cse.ohio-state.edu
Wed Sep 3 13:49:29 EDT 2008


Are you able to run Gen2-level tests (not MPI-level tests) on this cluster
for a long period of time without any panic? Your earlier posting
indicates that you have many modules loaded (blcr, lustre, etc.). When
running mvapich 1.0 in TCP/IP mode, the IB adapters are accessed
through IPoIB. When running mvapich 1.0 in gen2 mode, the IB adapters
are accessed through the native libibverbs library. Thus, it would be
good to run the Gen2-level tests (rdma_latency, rdma_bandwidth, etc.)
first for a long period of time to see whether they run smoothly without
any panic.
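
One way to run such a test "for a long period of time" is a small driver
that reruns the benchmark until a time budget is used up and stops at the
first failure. The sketch below is only an illustration: the function name
soak and its parameters are made up for this example, and the benchmark
command line is whatever Gen2-level test binary and arguments your OFED
installation provides (it is not filled in here).

```python
import subprocess
import time


def soak(cmd, duration_s, pause_s=1.0, timeout_s=300):
    """Rerun `cmd` until `duration_s` has elapsed or a run fails.

    Returns (completed_runs, failure), where failure is None if every
    run exited with status 0. `cmd` is an argument list, for example the
    client side of a Gen2-level benchmark (an assumption: pass your own).
    """
    deadline = time.monotonic() + duration_s
    runs = 0
    while time.monotonic() < deadline:
        try:
            result = subprocess.run(
                cmd, capture_output=True, text=True, timeout=timeout_s
            )
        except subprocess.TimeoutExpired:
            # A hung benchmark is treated as a failure, not retried.
            return runs, f"run {runs + 1} timed out after {timeout_s}s"
        runs += 1
        if result.returncode != 0:
            return runs, (
                f"run {runs} exited with {result.returncode}: "
                f"{result.stderr.strip()}"
            )
        time.sleep(pause_s)
    return runs, None


# Hypothetical usage, with the benchmark command supplied by you:
#   runs, failure = soak(["<gen2-benchmark>", "<server-host>"], 3600)
#   print(runs, failure or "no failures")
```

If the node itself panics, the driver dies with it, so it helps to keep a
console or netconsole log on another machine. The matching server side of
the benchmark has to be restarted for each run as well, if the tool needs
one.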

DK

On Wed, 3 Sep 2008, 强 马 wrote:

> Hello.
>
>   My NAS programs run with mvapich-1.0(gen2) on an IA64 cluster. The kernel now panics every time, but the programs run well with mvapich-1.0(tcp).
>
>   ibstat shows:
>   CA 'mthca0'
>         CA type: MT25204
>         Number of ports: 1
>         Firmware version: 1.1.0
>         Hardware version: a0
>   panic information:
>
>   Kernel panic - not syncing: arch/ia64/hp/common/sba_iommu.c: I/O MMU @ c0000000fed01000 is out of mapping resources
>   kernel BUG at kernel/panic.c:75!
> ft.C.4[3367]: bugcheck! 0 [1]
> Modules linked in: blcr(U) blcr_vmadump(U) blcr_imports(U) nfs(U) lockd(U) nfs_acl(U) osc(U) mgc(U) lustre(U) lov(U) lquota(U) mdc(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) rdma_cm(U) netconsole(U) ib_addr(U) netdump(U) md5(U) ipv6(U) parport_pc(U) lp(U) parport(U) autofs4(U) ipmi_devintf(U) ipmi_si(U) ipmi_msghandler(U) sunrpc(U) ib_ipoib(U) ds(U) yenta_socket(U) pcmcia_core(U) vfat(U) fat(U) dm_mirror(U) dm_multipath(U) dm_mod(U) button(U) ib_mthca(U) ib_umad(U) ib_ucm(U) ib_uverbs(U) ib_cm(U) ib_sa(U) ib_mad(U) ib_core(U) tg3(U) ext3(U) jbd(U) mptscsih(U) mptfc(U) mptsas(U) mptspi(U) mptscsi(U) mptbase(U) usb_storage(U) uhci_hcd(U) ohci_hcd(U) ehci_hcd(U) sd_mod(U) scsi_mod(U)
>   Pid: 3367, CPU 3, comm:               ft.C.4
> psr : 0000101008122030 ifs : 8000000000000814 ip  : [<a000000100077410>]    Tainted: GF
> ip is at panic+0x5f0/0x6a0
> unat: 0000000000000000 pfs : 0000000000000814 rsc : 0000000000000003
> rnat: 0000000000000000 bsps: 0000000000000000 pr  : fa0166a6855a59a9
> ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70433f
> csd : 0000000000000000 ssd : 0000000000000000
> b0  : a000000100077410 b6  : a00000010025ebe0 b7  : a00000010025ebe0
> f6  : 1003e00000000000000a0 f7  : 1003e0000000000000001
> f8  : 1003e00000000000000a0 f9  : 10002a000000000000000
> f10 : 0fffeb33333332fa80000 f11 : 1003e0000000000000000
> r1  : a0000001009cc240 r2  : 000000000005bac7 r3  : a0000001007cc898
> r8  : 0000000000000021 r9  : a0000001007df5b0 r10 : 0000000000000fff
> r11 : 0000000000ffffff r12 : e00001006797fd40 r13 : e000010067978000
> r14 : 0000000000004000 r15 : a000000100778bd8 r16 : 0000000000000001
> r17 : a0000001007e0108 r18 : ffffffffffc66d68 r19 : a000000100611258
> r20 : a000000100611248 r21 : a0000001007dbd68 r22 : e0000000066e0404
> r23 : e0000000066e0380 r24 : 0000000000000002 r25 : 0000000000000002
> r26 : e0000000066e03d4 r27 : 0000001008122030 r28 : e0000000066e03d4
> r29 : a000000100669e28 r30 : 0000000000000000 r31 : a0000001007df588
>   Call Trace:
>  [<a000000100016da0>] show_stack+0x80/0xa0
>                                 sp=e00001006797f8b0 bsp=e000010067979470
>  [<a0000001000176b0>] show_regs+0x890/0x8c0
>                                 sp=e00001006797fa80 bsp=e000010067979428
>  [<a00000010003e8f0>] die+0x150/0x240
>                                 sp=e00001006797faa0 bsp=e0000100679793e0
>  [<a00000010003ea20>] die_if_kernel+0x40/0x60
>                                 sp=e00001006797faa0 bsp=e0000100679793b0
>  [<a00000010003ebc0>] ia64_bad_break+0x180/0x600
>                                 sp=e00001006797faa0 bsp=e000010067979388
>  [<a00000010000f600>] ia64_leave_kernel+0x0/0x260
>                                 sp=e00001006797fb70 bsp=e000010067979388
>  [<a000000100077410>] panic+0x5f0/0x6a0
>                                 sp=e00001006797fd40 bsp=e0000100679792e8
>  [<a00000010045b5e0>] sba_alloc_range+0xa80/0x16e0
>                                 sp=e00001006797fda0 bsp=e000010067979278
>  [<a00000010045d440>] sba_map_sg+0x380/0x760
>                                 sp=e00001006797fda0 bsp=e0000100679791e0
>  [<a0000002002e74f0>] ib_umem_get+0x770/0xa80 [ib_uverbs]
>                                 sp=e00001006797fdb0 bsp=e000010067979120
>  [<a0000002002de900>] ib_uverbs_reg_mr+0x2a0/0x9a0 [ib_uverbs]
>                                 sp=e00001006797fdb0 bsp=e0000100679790a8
>  [<a0000002002da8b0>] ib_uverbs_write+0x210/0x280 [ib_uverbs]
>                                 sp=e00001006797fe10 bsp=e000010067979078
>  [<a0000001001222d0>] vfs_write+0x290/0x360
>                                 sp=e00001006797fe20 bsp=e000010067979028
>  [<a0000001001224f0>] sys_write+0x70/0xe0
>                                 sp=e00001006797fe20 bsp=e000010067978fa8
>  [<a00000010000f4a0>] ia64_ret_from_syscall+0x0/0x20
>                                 sp=e00001006797fe30 bsp=e000010067978fa8
>  [<a000000000010640>] 0xa000000000010640
>                                 sp=e000010067980000 bsp=e000010067978fa8



