[mvapich-discuss] CM and listening ports
Hari Subramoni
subramon at cse.ohio-state.edu
Thu Jul 15 18:59:03 EDT 2010
Hi,
Thanks for reporting this.
The current mechanism inside MVAPICH2 works fine for regular
iWARP-capable devices, so we need to investigate how your change might
affect them. Once we have a solution that works for all devices
(softiWARP and regular iWARP-capable devices), we can check it into
the code.
Thx,
Hari.
On Thu, 15 Jul 2010, TJC Ward wrote:
> I have a kind of device which doesn't have TCP offload. In this case
> rdma_bind_addr doesn't go as far as the device to enable it to open a TCP
> socket, and so it is possible for rdma_listen to get an EADDRINUSE error
> even after rdma_bind_addr has come back clean (if something else on the
> node is already listening on the chosen port).
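
A similar effect can be reproduced with plain TCP sockets, which may help when reasoning about the report above. The sketch below is only an analogy and does not touch rdma_cm: it assumes Linux semantics for SO_REUSEADDR, where two sockets may bind the same port while neither is listening, and the second listen() then fails. The helper names bound_socket and second_listen_errno are made up for this illustration.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Create a TCP socket with SO_REUSEADDR set and bind it to 127.0.0.1:port.
 * With port == 0 the kernel picks a free port. Returns the fd, or -1. */
static int bound_socket(unsigned short port)
{
    struct sockaddr_in sin;
    int one = 1;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;
    setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));

    memset(&sin, 0, sizeof(sin));
    sin.sin_family = AF_INET;
    sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    sin.sin_port = htons(port);

    if (bind(fd, (struct sockaddr *) &sin, sizeof(sin)) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Bind two sockets to the same port, listen on the first, then try to
 * listen on the second. Returns the errno of the second listen() (0 if
 * it unexpectedly succeeded), or -1 if the setup itself failed. */
static int second_listen_errno(void)
{
    struct sockaddr_in sin;
    socklen_t len = sizeof(sin);
    int a, b, err = -1;

    a = bound_socket(0);
    if (a < 0)
        return -1;
    if (getsockname(a, (struct sockaddr *) &sin, &len) != 0) {
        close(a);
        return -1;
    }

    /* The second bind comes back clean: nobody is listening yet. */
    b = bound_socket(ntohs(sin.sin_port));
    if (b < 0) {
        close(a);
        return -1;
    }

    if (listen(a, 1) == 0)
        err = (listen(b, 1) == 0) ? 0 : errno;

    close(a);
    close(b);
    return err;
}
```

On a Linux host, second_listen_errno() should report EADDRINUSE even though both bind() calls succeeded. The point is that a clean bind is not always a guarantee that the later listen will work.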
>
> This upsets the code in 'rdma_cm.c'. As a workaround, I'm trying to push
> the rdma_listen inside the recovery loop, as below
>
> static int bind_listen_port(int pg_rank, int pg_size)
> {
>     struct sockaddr_in sin;
>     int ret, count = 0;
>     MPIDI_CH3I_RDMA_Process_t *proc = &MPIDI_CH3I_RDMA_Process;
>
>     int mpi_errno = get_base_listen_port(pg_rank,
>                                          &rdma_base_listen_port[pg_rank]);
>
>     if (mpi_errno != MPI_SUCCESS)
>     {
>         MPIU_ERR_POP(mpi_errno);
>     }
>
>     MPIU_Memset(&sin, 0, sizeof(sin));
>     sin.sin_family = AF_INET;
>     sin.sin_addr.s_addr = 0;
>     sin.sin_port = rdma_base_listen_port[pg_rank];
>
>     ret = rdma_bind_addr(proc->cm_listen_id, (struct sockaddr *) &sin);
>
>     if( 0 == ret )
>     {
>         ret = rdma_listen(proc->cm_listen_id,
>                           2 * (pg_size) * rdma_num_rails);
>         if( ret)
>         {
>             DEBUG_PRINT("[%d] Port listen failed - %d. retrying %d\n",
>                         pg_rank, rdma_base_listen_port[pg_rank], count++);
>         }
>     }
>
>     while (ret)
>     {
>         if ((mpi_errno = get_base_listen_port(pg_rank,
>                 &rdma_base_listen_port[pg_rank])) != MPI_SUCCESS)
>         {
>             MPIU_ERR_POP(mpi_errno);
>         }
>
>         sin.sin_port = rdma_base_listen_port[pg_rank];
>         ret = rdma_bind_addr(proc->cm_listen_id, (struct sockaddr *) &sin);
>         DEBUG_PRINT("[%d] Port bind failed - %d. retrying %d\n", pg_rank,
>                     rdma_base_listen_port[pg_rank], count++);
>         if (count > 1000){
>             ibv_error_abort(IBV_RETURN_ERR,
>                             "Port bind failed\n");
>         }
>         // 20100715 tjcw move the listen inside the retry loop, for siw
>         if( 0 == ret )
>         {
>             ret = rdma_listen(proc->cm_listen_id,
>                               2 * (pg_size) * rdma_num_rails);
>             if( ret)
>             {
>                 DEBUG_PRINT("[%d] Port listen failed - %d. retrying %d\n",
>                             pg_rank, rdma_base_listen_port[pg_rank],
>                             count++);
>             }
>         }
>     }
>
>     // ret = rdma_listen(proc->cm_listen_id, 2 * (pg_size) * rdma_num_rails);
>     // if (ret) {
>     //     ibv_va_error_abort(IBV_RETURN_ERR,
>     //                        "rdma_listen failed: %d\n", ret);
>     // }
>
>     DEBUG_PRINT("Listen port bind on %d\n", sin.sin_port);
>
> fn_fail:
>     return mpi_errno;
> }
>
> Please could you consider whether it might be useful to incorporate this
> revision into your code.
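
For discussion, the idea of treating bind and listen as a single retryable step can be sketched independently of rdma_cm. Everything named here (struct listen_ops, bind_and_listen, and the fake_* toy backend) is hypothetical and is not MVAPICH2 code. The callbacks only stand in for calls such as rdma_create_id, rdma_bind_addr, rdma_listen and rdma_destroy_id. The sketch opens a fresh endpoint for every attempt because, as far as I understand, an rdma_cm_id that has already been bound cannot simply be bound again; that is an assumption worth checking against librdmacm.

```c
#include <stddef.h>

/* Hypothetical callbacks standing in for the real endpoint operations.
 * open_ep, bind_ep and listen_ep return 0 on success, nonzero on failure. */
struct listen_ops {
    int  (*open_ep)(void *ctx);
    int  (*bind_ep)(void *ctx, int port);
    int  (*listen_ep)(void *ctx, int backlog);
    void (*close_ep)(void *ctx);
    int  (*next_port)(void *ctx);
};

/* Try up to max_tries candidate ports. A port only counts as taken once
 * both bind and listen have succeeded on it. Returns that port, or -1. */
static int bind_and_listen(const struct listen_ops *ops, void *ctx,
                           int backlog, int max_tries)
{
    int attempt;

    for (attempt = 0; attempt < max_tries; attempt++) {
        int port = ops->next_port(ctx);

        if (ops->open_ep(ctx) != 0)
            return -1;
        if (ops->bind_ep(ctx, port) == 0 &&
            ops->listen_ep(ctx, backlog) == 0)
            return port;

        /* Either step failed: discard the endpoint and try another port. */
        ops->close_ep(ctx);
    }
    return -1;
}

/* Toy backend for illustration only: port 5000 fails at bind, port 5001
 * binds but fails at listen, and port 5002 works. */
struct fake_state {
    int next;   /* offset of the next candidate port */
    int bound;  /* port of the most recent successful bind */
};

static int fake_open(void *c) { (void) c; return 0; }
static void fake_close(void *c) { ((struct fake_state *) c)->bound = 0; }
static int fake_next_port(void *c)
{
    return 5000 + ((struct fake_state *) c)->next++;
}
static int fake_bind(void *c, int port)
{
    if (port == 5000)
        return 1;
    ((struct fake_state *) c)->bound = port;
    return 0;
}
static int fake_listen(void *c, int backlog)
{
    (void) backlog;
    return ((struct fake_state *) c)->bound == 5001;
}

/* Run the retry loop against the toy backend. */
static int demo_fake_listen(int max_tries)
{
    static const struct listen_ops ops = {
        fake_open, fake_bind, fake_listen, fake_close, fake_next_port
    };
    struct fake_state st = { 0, 0 };

    return bind_and_listen(&ops, &st, 8, max_tries);
}
```

With the toy backend, demo_fake_listen(1000) settles on port 5002 after rejecting 5000 at bind and 5001 at listen, while demo_fake_listen(2) gives up and returns -1. Keeping both steps inside one attempt also means the log message can say which of the two actually failed.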
>
> I'm still not running cleanly at scale; sometimes I get a connection
> reject, apparently from the OFA core for reasons which I don't understand,
> and when this happens my application hangs during connection
> establishment; so it's possible that I may have misunderstood something
> about how these functions are intended to work.
>
> T J (Chris) Ward, IBM Research
> Scalable Data-Centric Computing - Active Storage Fabrics - IBM System
> BlueGene
> IBM United Kingdom Ltd., Hursley Park, Winchester, Hants, SO21 2JN
> 011-44-1962-818679
> IBM Intranet http://hurgsa.ibm.com/~tjcw/