[mvapich-discuss] problem about ibv_dealloc_pd

Matthew Koop koop at cse.ohio-state.edu
Fri Apr 18 16:57:16 EDT 2008


Hi,

So you are trying to implement your own checkpointing library in MVAPICH?
You may be interested in MVAPICH2, which already has multirail as well as
checkpointing support.

Checkpointing over InfiniBand raises a number of issues: for example, the
QPs (connections) need to be torn down and registered memory needs to be
unregistered. The following paper has additional information:

http://nowlab.cse.ohio-state.edu/publications/conf-papers/2006/gaoq-icpp06.pdf
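
One note on the return value reported in the quoted message below. As far
as I know, ibv_dealloc_pd() returns an errno value on failure, not a work
completion status, so 16 is more likely EBUSY (which is 16 on Linux) than
IBV_WC_REM_ABORT_ERR, which only happens to share that numeric value in
enum ibv_wc_status. EBUSY would fit the point above: something created on
the PD, such as a QP or a registered memory region, has probably not been
destroyed yet. A minimal sketch of decoding such a return code, using only
the C library; describe_verbs_rc is a hypothetical helper name and is not
part of libibverbs or MVAPICH:

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of libibverbs or MVAPICH): format the
 * return code of a verbs teardown call, treating a nonzero value as an
 * errno number. Writes into buf and returns buf for convenience. */
static const char *describe_verbs_rc(int rc, char *buf, size_t len)
{
    if (rc == 0)
        snprintf(buf, len, "success");
    else
        snprintf(buf, len, "errno %d: %s", rc, strerror(rc));
    return buf;
}
```

In the signal handler this could be called as, for example,
describe_verbs_rc(ibv_dealloc_pd(pd), buf, sizeof buf), where pd stands for
whatever protection domain pointer the code holds, and the result logged
per rank to see which tasks still have resources attached.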

Thanks,

Matt

On Thu, 17 Apr 2008, 强 马 wrote:

> I built mvapich-1.0 with make.mvapich.gen2_multirail. I first ran my MPI program on a single HCA (setting NUM_HCAS=1).
> I have all MPI tasks catch a signal. The steps in the signal handler are:
> 1) flush all pending messages;
> 2) MPIR_BsendRelease(,)
> 3) MPI_Barrier()
> 4) MPID_End()
> 5) checkpoint
> 6) exit
>
>   As a result, sometimes a few of the MPI tasks fail in ibv_dealloc_pd() at viainit.c:516 while the others succeed.
> Sometimes all tasks finish all of the above steps and exit successfully.
>
> When it fails, ibv_dealloc_pd() always returns 16 (IBV_WC_REM_ABORT_ERR).
>
> Which InfiniBand resources are still associated with the PD?
>
>   I have spent almost two weeks checking and debugging my code, and I'm tired.
> I tested with bt.C.36 in the following InfiniBand environment:
> CA type: MT25204, ports: 1, rate: 20
>
>   Please help me.
> Thanks in advance.



