[mvapich-discuss] vbuf problem
Lei Chai
chai.15 at osu.edu
Wed Sep 17 16:16:40 EDT 2008
Thanks for sending us the program and the related files. We are taking a
look at the problem and will get back to you.
In the meantime, could you try mvapich-1.0.3? Also, do you see this
error at all with 1 HCA?
Thanks,
Lei
David Race wrote:
> The "ulimit -l" is unlimited on all of the compute nodes and management nodes.
>
> We saw this error with a benchmark. It was a transpose algorithm. (I have included the application in the attached tar file.)
>
> I have attached the configure file and the runtime files in the tar file also.
>
> I saw the error with 1024 CPUs and two HCAs with the same application.
>
> Do you need any more information?
>
> Thanks
>
> David Race, Ph.D.
> Principal Engineer
> Appro International, Inc.
> 25003 Pitkin Road, Suite F600
> Spring, TX 77386
> Phone: 469-212-4860
> Email: drace at appro.com
> ________________________________
> From: Lei Chai [chai.15 at osu.edu]
> Sent: Monday, September 15, 2008 10:26 PM
> To: David Race
> Cc: mvapich-discuss at cse.ohio-state.edu
> Subject: Re: [mvapich-discuss] vbuf problem
>
> Hi David,
>
> Thanks for reporting the error. We have not tested it with 4 HCAs per node. Could you run the command "ulimit -l" on your system and let us know the output? If it's not "unlimited", please follow the instructions in the userguide section 9.3.4 (
> http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.2rc2.html#x1-530009.3.4
> ) and set the limit to "unlimited" and try again.
>
> If you still see the error, then may I ask you the following questions:
>
> - Did you see the error with a benchmark or an application? And what benchmark/application is it?
>
> - What configure/make/run-time options did you use?
>
> - Do you see the error when using fewer than 4 HCAs?
>
> These will help us get more insight into the problem.
>
> Thanks,
> Lei
>
>
> David Race wrote:
>
>
>> Hello,
>>
>> We are using mvapich2-1.2rc2 on a system that has four Mellanox DDR interfaces and 16 CPUs in each node. When we define
>>
>> MV2_NUM_HCAS=4
>>
>> we get a failure at line 230 of vbuf.c, which points to the following code:
>>
>> for (i = 0; i < rdma_num_hcas; ++i)
>> {
>>     reg->mem_handle[i] = ibv_reg_mr(
>>         ptag_save[i],
>>         vbuf_dma_buffer,
>>         nvbufs * rdma_vbuf_total_size,
>>         IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
>>     if (!reg->mem_handle[i])
>>     {
>>         fprintf(stderr, "[%s %d] Cannot register vbuf region\n", __FILE__, __LINE__);
>>         return -1;
>>     }
>> }
>> We get this failure with as few as 289 processors. Has anyone run across this problem before? Is there a suggested set of environment variables that might help prevent the failure?
>>
>> Thanks
>>
>> David Race, Ph.D.
>> Principal Engineer
>> Appro International, Inc.
>> 25003 Pitkin Road, Suite F600
>> Spring, TX 77386
>> Phone: 469-212-4860
>> Email: drace at appro.com
>>
>> _______________________________________________
>> mvapich-discuss mailing list
>> mvapich-discuss at cse.ohio-state.edu
>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>
>>
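The "ulimit -l" check requested in this thread reports the limit of the shell it is run in. A process started by a job launcher or daemon can inherit a different limit, so it may be worth reading the limit from inside the program itself. The following is a minimal sketch of that check, assuming a Linux system where RLIMIT_MEMLOCK is available; the function name memlock_is_unlimited is hypothetical and is not part of MVAPICH.

```c
#include <stdio.h>
#include <sys/resource.h>

/*
 * Hypothetical helper (not part of MVAPICH): reports whether the calling
 * process may lock an unlimited amount of memory.
 * Returns 1 if the soft limit is unlimited, 0 if it is finite,
 * and -1 if the limit could not be read.
 */
int memlock_is_unlimited(void)
{
    struct rlimit lim;

    /* Read the locked-memory limit of this process, not of the login shell. */
    if (getrlimit(RLIMIT_MEMLOCK, &lim) != 0) {
        perror("getrlimit(RLIMIT_MEMLOCK)");
        return -1;
    }

    if (lim.rlim_cur == RLIM_INFINITY) {
        return 1;
    }

    /* A finite limit is one possible reason for memory registration to fail. */
    fprintf(stderr, "locked-memory soft limit is %llu bytes\n",
            (unsigned long long)lim.rlim_cur);
    return 0;
}
```

Calling this helper early in each process and logging the result would show whether the limit seen by the launched processes matches the "unlimited" value reported by the shell. A return value of 1 rules out this particular cause only; it does not by itself explain the failure reported above.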