[mvapich-discuss] vbuf problem

Lei Chai chai.15 at osu.edu
Wed Sep 17 16:16:40 EDT 2008


Thanks for sending us the program and the related files. We are taking a 
look at the problem and will get back to you.

In the meantime, could you try mvapich-1.0.3? Also, do you see this 
error at all with 1 HCA?

Thanks,
Lei


David Race wrote:
> The "ulimit -l" is unlimited on all of the compute nodes and management nodes.
>
> We saw this error with a benchmark.  It was a transpose algorithm.  (I have included the application in the attached tar file.)
>
> I have attached the configure file and the runtime files in the tar file also.
>
> I saw the error with 1024 CPUs and two HCAs with the same application.
>
> Do you need any more information?
>
> Thanks
>
> David Race, Ph.D.
> Principal Engineer
> Appro International, Inc.
> 25003 Pitkin Road, Suite F600
> Spring, TX  77386
> Phone:  469-212-4860
> Email:   drace at appro.com
> ________________________________
> From: Lei Chai [chai.15 at osu.edu]
> Sent: Monday, September 15, 2008 10:26 PM
> To: David Race
> Cc: mvapich-discuss at cse.ohio-state.edu
> Subject: Re: [mvapich-discuss] vbuf problem
>
> Hi David,
>
> Thanks for reporting the error. We have not tested it with 4 HCAs per node. Could you run the command "ulimit -l" on your system and let us know the output? If it's not "unlimited", please follow the instructions in section 9.3.4 of the user guide
> (http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.2rc2.html#x1-530009.3.4),
> set the limit to "unlimited", and try again.
>
> If you still see the error, then may I ask you the following questions:
>
> - Did you see the error with a benchmark or an application? And what benchmark/application is it?
>
> - What configure/make/run-time options did you use?
>
> - Do you see the error when using fewer than 4 HCAs?
>
> These will help us get more insight into the problem.
>
> Thanks,
> Lei
>
>
> David Race wrote:
>
>   
>> Hello,
>>
>> We are using mvapich2-1.2rc2 on a system that has four Mellanox DDR interfaces and 16 CPUs in each computer.  When we define
>>
>> MV2_NUM_HCAS=4
>>
>> we get a failure at line 230 of vbuf.c, which points to the following code:
>>
>>     for (i = 0; i < rdma_num_hcas; ++i)
>>     {
>>         reg->mem_handle[i] = ibv_reg_mr(
>>             ptag_save[i],
>>             vbuf_dma_buffer,
>>             nvbufs * rdma_vbuf_total_size,
>>             IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
>>         if (!reg->mem_handle[i])
>>         {
>>             fprintf(stderr, "[%s %d] Cannot register vbuf region\n", __FILE__, __LINE__);
>>             return -1;
>>         }
>>     }
>>
>> We get this failure with as few as 289 processors. Has anyone run across this problem before?  Is there a suggested set of environment variables that might help prevent the failure?
>>
>> Thanks
>>
>> David Race, Ph.D.
>> Principal Engineer
>> Appro International, Inc.
>> 25003 Pitkin Road, Suite F600
>> Spring, TX  77386
>> Phone:  469-212-4860
>> Email:   drace at appro.com
>>
>> _______________________________________________
>> mvapich-discuss mailing list
>> mvapich-discuss at cse.ohio-state.edu
>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>
>>     
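
The "ulimit -l" check requested in this thread reports the locked-memory
limit of the shell in which it is typed. Memory registered with ibv_reg_mr
is pinned and counts against that limit (RLIMIT_MEMLOCK), and a process
started by a job launcher does not necessarily inherit the same value as an
interactive shell. The following is a minimal sketch, not part of MVAPICH,
of reading the limit from inside a process with the POSIX getrlimit call.
The function names format_memlock_limit and print_memlock_limit are
hypothetical helpers introduced here for illustration only.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/resource.h>

/* Hypothetical helper: formats a limit in the style of "ulimit -l",
 * i.e. "unlimited" or a count of 1024-byte units. */
int format_memlock_limit(rlim_t limit, char *buf, size_t len)
{
    assert(buf != NULL && len > 0);
    if (limit == RLIM_INFINITY)
        return snprintf(buf, len, "unlimited");
    return snprintf(buf, len, "%llu", (unsigned long long)(limit / 1024));
}

/* Hypothetical helper: prints the soft and hard RLIMIT_MEMLOCK values
 * of the calling process. Returns 0 on success, -1 if getrlimit fails. */
int print_memlock_limit(FILE *out)
{
    struct rlimit lim;
    char soft[32], hard[32];

    if (getrlimit(RLIMIT_MEMLOCK, &lim) != 0) {
        perror("getrlimit(RLIMIT_MEMLOCK)");
        return -1;
    }
    format_memlock_limit(lim.rlim_cur, soft, sizeof soft);
    format_memlock_limit(lim.rlim_max, hard, sizeof hard);
    fprintf(out, "max locked memory (KiB): soft=%s hard=%s\n", soft, hard);
    return 0;
}
```

Calling print_memlock_limit(stderr) at the start of each MPI rank would show
the limit that rank actually runs under. Whether that limit is related to
the registration failure reported above is not established in this thread.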


