[mvapich-discuss] CM and listening ports

Hari Subramoni subramon at cse.ohio-state.edu
Thu Jul 15 18:59:03 EDT 2010


Hi,

Thanks for reporting this.

Since the current mechanism inside MVAPICH2 works fine for regular
iWARP-capable devices, we need to investigate how this change might
affect them. Once we find a solution that works for all devices (both
softiWARP and regular iWARP-capable devices), we can check this
mechanism into the code.

Thx,
Hari.

On Thu, 15 Jul 2010, TJC Ward wrote:

> I have a kind of device which doesn't have TCP offload. In this case
> rdma_bind_addr doesn't go as far as the device to enable it to open a TCP
> socket, so it is possible for rdma_listen to get an EADDRINUSE error
> even after rdma_bind_addr has come back clean (if something else on the
> node is already listening on the chosen port).
>
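> For reference, a minimal standalone reproducer of this might look like
> the following (a sketch only: the port number is illustrative and
> assumed to be held already by an ordinary TCP listener, and error
> handling is omitted):
>
>     #include <stdio.h>
>     #include <string.h>
>     #include <errno.h>
>     #include <netinet/in.h>
>     #include <rdma/rdma_cma.h>
>
>     int main(void)
>     {
>         struct rdma_event_channel *ch = rdma_create_event_channel();
>         struct rdma_cm_id *id;
>         struct sockaddr_in sin;
>         int ret;
>
>         rdma_create_id(ch, &id, NULL, RDMA_PS_TCP);
>
>         memset(&sin, 0, sizeof(sin));
>         sin.sin_family = AF_INET;
>         sin.sin_addr.s_addr = INADDR_ANY;
>         sin.sin_port = htons(5000);  /* assumed to be in use already */
>
>         /* On a device without TCP offload, the bind can come back
>          * clean even though the port is taken... */
>         ret = rdma_bind_addr(id, (struct sockaddr *) &sin);
>         printf("rdma_bind_addr returned %d\n", ret);
>
>         /* ...and the conflict only surfaces here, as EADDRINUSE. */
>         ret = rdma_listen(id, 1);
>         printf("rdma_listen returned %d (errno %d)\n", ret, errno);
>
>         rdma_destroy_id(id);
>         rdma_destroy_event_channel(ch);
>         return 0;
>     }
>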
> This upsets the code in 'rdma_cm.c'. As a workaround, I'm trying to push
> the rdma_listen inside the recovery loop, as below
>
> static int bind_listen_port(int pg_rank, int pg_size)
> {
>     struct sockaddr_in sin;
>     int ret, count = 0;
>     MPIDI_CH3I_RDMA_Process_t *proc = &MPIDI_CH3I_RDMA_Process;
>
>     int mpi_errno = get_base_listen_port(pg_rank,
>                                          &rdma_base_listen_port[pg_rank]);
>
>     if (mpi_errno != MPI_SUCCESS)
>     {
>         MPIU_ERR_POP(mpi_errno);
>     }
>
>     MPIU_Memset(&sin, 0, sizeof(sin));
>     sin.sin_family = AF_INET;
>     sin.sin_addr.s_addr = 0;
>     sin.sin_port = rdma_base_listen_port[pg_rank];
>
>     ret = rdma_bind_addr(proc->cm_listen_id, (struct sockaddr *) &sin);
>
>     if (0 == ret)
>     {
>         /* On devices without TCP offload the port conflict may only
>          * show up here, so a failed listen also triggers a retry. */
>         ret = rdma_listen(proc->cm_listen_id, 2 * pg_size * rdma_num_rails);
>         if (ret)
>         {
>             DEBUG_PRINT("[%d] Port listen failed - %d. retrying %d\n",
>                         pg_rank, rdma_base_listen_port[pg_rank], count++);
>         }
>     }
>
>     while (ret)
>     {
>         if ((mpi_errno = get_base_listen_port(pg_rank,
>                 &rdma_base_listen_port[pg_rank])) != MPI_SUCCESS)
>         {
>             MPIU_ERR_POP(mpi_errno);
>         }
>
>         sin.sin_port = rdma_base_listen_port[pg_rank];
>         ret = rdma_bind_addr(proc->cm_listen_id, (struct sockaddr *) &sin);
>         if (ret)
>         {
>             DEBUG_PRINT("[%d] Port bind failed - %d. retrying %d\n",
>                         pg_rank, rdma_base_listen_port[pg_rank], count++);
>         }
>
>         if (count > 1000)
>         {
>             ibv_error_abort(IBV_RETURN_ERR, "Port bind failed\n");
>         }
>
>         /* 20100715 tjcw: move the listen inside the retry loop, for siw */
>         if (0 == ret)
>         {
>             ret = rdma_listen(proc->cm_listen_id,
>                               2 * pg_size * rdma_num_rails);
>             if (ret)
>             {
>                 DEBUG_PRINT("[%d] Port listen failed - %d. retrying %d\n",
>                             pg_rank, rdma_base_listen_port[pg_rank],
>                             count++);
>             }
>         }
>     }
>
>     /* Original unconditional listen, replaced by the retries above:
>      *
>      *     ret = rdma_listen(proc->cm_listen_id,
>      *                       2 * pg_size * rdma_num_rails);
>      *     if (ret) {
>      *         ibv_va_error_abort(IBV_RETURN_ERR,
>      *                            "rdma_listen failed: %d\n", ret);
>      *     }
>      */
>
>     DEBUG_PRINT("Listen port bind on %d\n", sin.sin_port);
>
> fn_fail:
>     return mpi_errno;
> }
>
> Please could you consider incorporating this revision into your code.
>
> I'm still not running cleanly at scale: sometimes I get a connection
> reject, apparently from the OFA core, for reasons I don't understand,
> and when this happens my application hangs during connection
> establishment. It's possible that I have misunderstood how these
> functions are intended to work.
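>
> In case it helps with diagnosis, this is roughly how the reject shows
> up on the connection manager's event channel (a sketch, not the
> MVAPICH2 code; 'ch' stands for whatever event channel the CM created,
> and error handling is omitted):
>
>     struct rdma_cm_event *event;
>
>     if (rdma_get_cm_event(ch, &event) == 0)
>     {
>         if (event->event == RDMA_CM_EVENT_REJECTED)
>         {
>             /* event->status carries the reject reason reported by
>              * the CM, which distinguishes a consumer reject from a
>              * reject generated by the core/transport itself. */
>             fprintf(stderr, "connection rejected, status %d\n",
>                     event->status);
>         }
>         rdma_ack_cm_event(event);
>     }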
>
> T J (Chris) Ward, IBM Research
> Scalable Data-Centric Computing - Active Storage Fabrics - IBM System
> BlueGene
> IBM United Kingdom Ltd., Hursley Park, Winchester, Hants, SO21 2JN
> 011-44-1962-818679
> IBM Intranet http://hurgsa.ibm.com/~tjcw/
>


