[mvapich-discuss] vbuf problem

David Race drace at appro.com
Tue Sep 16 14:42:27 EDT 2008


"ulimit -l" reports "unlimited" on all of the compute nodes and management nodes.

We saw this error with a benchmark that implements a transpose algorithm.  (I have included the application in the attached tar file.)

The configure file and the runtime files are also in the attached tar file.

I also saw the error with 1024 CPUs and two HCAs, using the same application.

Do you need any more information?

Thanks

David Race, Ph.D.
Principal Engineer
Appro International, Inc.
25003 Pitkin Road, Suite F600
Spring, TX  77386
Phone:  469-212-4860
Email:   drace at appro.com
________________________________
From: Lei Chai [chai.15 at osu.edu]
Sent: Monday, September 15, 2008 10:26 PM
To: David Race
Cc: mvapich-discuss at cse.ohio-state.edu
Subject: Re: [mvapich-discuss] vbuf problem

Hi David,

Thanks for reporting the error. We have not tested it with 4 HCAs per node. Could you run the command "ulimit -l" on your system and let us know the output? If it's not "unlimited", please follow the instructions in section 9.3.4 of the user guide (http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.2rc2.html#x1-530009.3.4), set the limit to "unlimited", and try again.
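
[Editor's sketch] The limit that matters is the one seen by the MPI processes themselves, which can differ from what an interactive shell reports if the job launcher was started under different limits. A minimal sketch of checking it from inside a process, assuming a Linux system where RLIMIT_MEMLOCK is available; the function names memlock_is_unlimited and format_memlock_limit and the MEMLOCK_DEMO_MAIN macro are hypothetical and are not part of MVAPICH2:

```c
#include <stdio.h>
#include <sys/resource.h>

/* Returns 1 if the soft locked-memory limit is unlimited, 0 if it is
 * finite, and -1 if getrlimit() fails. */
int memlock_is_unlimited(void)
{
    struct rlimit lim;
    if (getrlimit(RLIMIT_MEMLOCK, &lim) != 0)
        return -1;
    return lim.rlim_cur == RLIM_INFINITY ? 1 : 0;
}

/* Writes the soft limit into buf in the style of "ulimit -l":
 * either "unlimited" or the limit in units of 1024 bytes.
 * Returns 0 on success, -1 if getrlimit() fails. */
int format_memlock_limit(char *buf, size_t len)
{
    struct rlimit lim;
    if (getrlimit(RLIMIT_MEMLOCK, &lim) != 0)
        return -1;
    if (lim.rlim_cur == RLIM_INFINITY)
        snprintf(buf, len, "unlimited");
    else
        snprintf(buf, len, "%llu", (unsigned long long)lim.rlim_cur / 1024);
    return 0;
}

#ifdef MEMLOCK_DEMO_MAIN
/* Optional standalone driver: build with -DMEMLOCK_DEMO_MAIN. */
int main(void)
{
    char buf[64];
    if (format_memlock_limit(buf, sizeof buf) != 0) {
        perror("getrlimit");
        return 1;
    }
    printf("max locked memory: %s\n", buf);
    return 0;
}
#endif
```

Calling format_memlock_limit from each rank early in the benchmark and printing the result would show whether any process runs with a finite limit even though the shell reports "unlimited".
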

If you still see the error, then may I ask you the following questions:

- Did you see the error with a benchmark or an application? And what benchmark/application is it?

- What configure/make/run-time options did you use?

- Do you see the error when using fewer than 4 HCAs?

These will help us get more insight into the problem.

Thanks,
Lei


David Race wrote:

> Hello,
>
> We are using mvapich2-1.2rc2 on a system that has four Mellanox DDR interfaces and 16 CPUs in each computer.  When we define
>
> MV2_NUM_HCAS=4
>
> we get a failure at line 230 of vbuf.c, which points to the following code:
>
>     for (i = 0; i < rdma_num_hcas; ++i)
>     {
>         reg->mem_handle[i] = ibv_reg_mr(
>             ptag_save[i],
>             vbuf_dma_buffer,
>             nvbufs * rdma_vbuf_total_size,
>             IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
>         if (!reg->mem_handle[i])
>         {
>             fprintf(stderr, "[%s %d] Cannot register vbuf region\n", __FILE__, __LINE__);
>             return -1;
>         }
>     }
>
> We get this failure with as few as 289 processors.  Has anyone run across this problem before?  Is there a suggested set of environment variables that might help prevent the failure?
>
> Thanks
>
> David Race, Ph.D.
> Principal Engineer
> Appro International, Inc.
> 25003 Pitkin Road, Suite F600
> Spring, TX  77386
> Phone:  469-212-4860
> Email:   drace at appro.com
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>
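
[Editor's sketch] The registration loop quoted above registers the same vbuf region of nvbufs * rdma_vbuf_total_size bytes once per HCA in every process. A rough way to see how that scales per node is the arithmetic below. It assumes that each per-HCA registration is accounted separately and that only one vbuf region is allocated per process, neither of which the thread confirms. The function name is hypothetical, and the nvbufs and vbuf size values in the usage note are placeholders, not MVAPICH2 defaults.

```c
#include <stddef.h>
#include <stdint.h>

/* Estimated bytes registered on one node by the vbuf loop:
 * nvbufs * vbuf_total_size * num_hcas * procs_per_node.
 * Returns 0 if any input is 0 or if the product overflows 64 bits. */
uint64_t vbuf_pinned_bytes_per_node(uint64_t nvbufs, uint64_t vbuf_total_size,
                                    uint64_t num_hcas, uint64_t procs_per_node)
{
    uint64_t factors[] = { nvbufs, vbuf_total_size, num_hcas, procs_per_node };
    uint64_t total = 1;
    for (size_t i = 0; i < sizeof factors / sizeof factors[0]; ++i) {
        /* Check before multiplying so the product never wraps around. */
        if (factors[i] != 0 && total > UINT64_MAX / factors[i])
            return 0;
        total *= factors[i];
    }
    return total;
}
```

For example, vbuf_pinned_bytes_per_node(512, 12288, 4, 16) uses the 4 HCAs and 16 processes per node from the report together with placeholder values of 512 vbufs of 12288 bytes each. Comparing the result for num_hcas = 2 and num_hcas = 4 shows that the estimate grows linearly with the number of HCAs.
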

-------------- next part --------------
A non-text attachment was scrubbed...
Name: bug.tar
Type: application/x-tar
Size: 20480 bytes
Desc: bug.tar
Url : http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20080916/2d00e3d6/bug-0001.tar

