[mvapich-discuss] How do I start the IB modules?

Christopher Tanner christopher.tanner at gatech.edu
Fri Apr 11 14:58:28 EDT 2008


All -

How do I make sure that the pertinent IB modules (e.g. rdma_ucm,
ib_uverbs) are loaded? I am getting the following error when I try to
execute the OSU benchmarks:

libibverbs: Fatal: couldn't read uverbs ABI version.
Fatal error in MPI_Init:
Other MPI error, error stack:
MPIR_Init_thread(259)...........: Initialization failed
MPID_Init(102)..................: channel initialization failed
MPIDI_CH3_Init(178).............:
MPIDI_CH3I_RMDA_init(115).......: rdma_get_control_parameters
rdma_get_control_parameters(432):
rdma_open_hca(367)..............: No IB device found
rank 0 in job 15  master.cl.ae.gatech.edu_42042   caused collective abort of all ranks
exit status of rank 0: return code 1

-------------------------------------------
Chris Tanner
Space Systems Design Lab
Georgia Institute of Technology
christopher.tanner at gatech.edu
-------------------------------------------
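
One way to check whether the modules named in this thread are loaded on a Linux node is to look them up in /proc/modules (the same data lsmod prints). The following is a minimal sketch, not an official MVAPICH or OFED tool: the function names and the MODULES_FILE variable are made up for illustration, and ib_mthca is only the right HCA driver if the card actually uses it.

```shell
#!/bin/sh
# Report which of the kernel modules named in this thread are loaded.
# MODULES_FILE is a hypothetical override so the check can be pointed at a
# saved copy; /proc/modules is the standard location on Linux.
MODULES_FILE="${MODULES_FILE:-/proc/modules}"

# Succeeds if the named module appears in the first column of MODULES_FILE.
module_loaded() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }' "$MODULES_FILE"
}

# Print one status line per module; return non-zero if any is missing.
check_ib_modules() {
    missing=0
    for mod in "$@"; do
        if module_loaded "$mod"; then
            echo "$mod: loaded"
        else
            echo "$mod: NOT loaded (try, as root: modprobe $mod)"
            missing=1
        fi
    done
    return $missing
}

# The HCA driver module (ib_mthca here) depends on the installed hardware.
check_ib_modules ib_uverbs rdma_ucm ib_mthca || echo "Some modules are missing."
```

If ib_uverbs turns out to be missing, that is probably what the "couldn't read uverbs ABI version" message reflects, since libibverbs reads /sys/class/infiniband_verbs/abi_version, which is only present once that module is loaded. If the libibverbs utilities are installed, ibv_devinfo should then list the HCA.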



On Apr 10, 2008, at 1:49 PM, wei huang wrote:
> Hi Chris,
>
> You have to make sure related kernel modules are loaded (including
> rdma_ucm, ib_uverbs, ib_mthca, etc). Thanks.
>
> Regards,
> Wei Huang
>
> 774 Dreese Lab, 2015 Neil Ave,
> Dept. of Computer Science and Engineering
> Ohio State University
> OH 43210
> Tel: (614)292-8501
>
>
> On Thu, 10 Apr 2008, Christopher Tanner wrote:
>
>> Ok Wei -
>>
>> Even though I've copied the libib* libraries from the master node to
>> all of the other nodes and included the /usr/local/lib directory in
>> the LD_LIBRARY_PATH, it seems that osu_latency still cannot find
>> libibverbs.so.1. I'm kind of stuck... Any thoughts?
>>
>> Also, whenever I try to execute osu_latency using just one core on the
>> master node (mpiexec -n 1 ./osu_latency), I receive the following error:
>>
>> libibverbs: Fatal: couldn't read uverbs ABI version.
>> Fatal error in MPI_Init:
>> Other MPI error, error stack:
>> MPIR_Init_thread(259)...........: Initialization failed
>> MPID_Init(102)..................: channel initialization failed
>> MPIDI_CH3_Init(178).............:
>> MPIDI_CH3I_RMDA_init(115).......: rdma_get_control_parameters
>> rdma_get_control_parameters(432):
>> rdma_open_hca(367)..............: No IB device found
>> rank 0 in job 15  master.cl.ae.gatech.edu_42042   caused collective abort of all ranks
>> exit status of rank 0: return code 1
>>
>> Does this output help solve the other problem?
>>
>> -------------------------------------------
>> Chris Tanner
>> Space Systems Design Lab
>> Georgia Institute of Technology
>> christopher.tanner at gatech.edu
>> -------------------------------------------
>>
>>
>>
>> On Apr 10, 2008, at 11:53 AM, wei huang wrote:
>>>
>>> Do you see the same error?
>>>
>>> Try:
>>> export LD_LIBRARY_PATH=/usr/local/lib64:$LD_LIBRARY_PATH
>>>
>>> Regards,
>>> Wei Huang
>>>
>>> 774 Dreese Lab, 2015 Neil Ave,
>>> Dept. of Computer Science and Engineering
>>> Ohio State University
>>> OH 43210
>>> Tel: (614)292-8501
>>>
>>>
>>> On Thu, 10 Apr 2008, Christopher Tanner wrote:
>>>
>>>> Thanks Wei. Of course, the problem isn't solved yet...
>>>>
>>>> So I found the file in the /usr/local/lib64 directory on the master
>>>> node only. I copied the file to the /usr/local/lib64 directory on the
>>>> rest of the nodes and included the directory in my path. When I tried
>>>> to execute the osu_latency program, it gave me the same error. A
>>>> 'which librdmacm.so.1' command reveals that it can indeed find the
>>>> library.
>>>>
>>>> Any clues? Or perhaps, any other ways to determine if the InfiniBand
>>>> is working?
>>>>
>>>> -------------------------------------------
>>>> Chris Tanner
>>>> Space Systems Design Lab
>>>> Georgia Institute of Technology
>>>> christopher.tanner at gatech.edu
>>>> -------------------------------------------
>>>>
>>>>
>>>>
>>>> On Apr 10, 2008, at 11:18 AM, wei huang wrote:
>>>>> Hi Chris,
>>>>>
>>>>> It seems that some IB libraries are not in your default path. You
>>>>> may need to explicitly export the path to the IB libraries in your
>>>>> environment variables (bash profile or similar places). To find
>>>>> where those libraries are, you may check the /etc/infiniband/info
>>>>> file. Or you can ask your system administrator about the path.
>>>>>
>>>>> Thanks.
>>>>>
>>>>> Regards,
>>>>> Wei Huang
>>>>>
>>>>> 774 Dreese Lab, 2015 Neil Ave,
>>>>> Dept. of Computer Science and Engineering
>>>>> Ohio State University
>>>>> OH 43210
>>>>> Tel: (614)292-8501
>>>>>
>>>>>
>>>>> On Thu, 10 Apr 2008, Dhabaleswar Panda wrote:
>>>>>
>>>>>> ---------- Forwarded message ----------
>>>>>> Date: Wed, 9 Apr 2008 20:01:00 -0400
>>>>>> From: Christopher Tanner <christopher.tanner at gatech.edu>
>>>>>> To: mvapich-discuss at cse.ohio-state.edu
>>>>>> Subject: [mvapich-discuss] Running latency tests
>>>>>>
>>>>>> All -
>>>>>>
>>>>>> I believe I am gravy with the mvapich2 install so now I'm trying to
>>>>>> run the latency tests to see if it's really working. But, I'm a
>>>>>> dummy and can't get it to work. Here's what I've done so far:
>>>>>>
>>>>>> a) Initiated an mpd ring with 16 hosts (i.e. mpdboot --rsh=rsh -n 16
>>>>>> -1). I have multiple processors, each with multiple cores on each
>>>>>> node, thus the '-1'.
>>>>>> b) Compiled osu_latency.c using mpicc (to an executable called
>>>>>> osu_latency)
>>>>>> c) Tried to execute the compiled file via 'mpiexec -machinefile
>>>>>> machine.list -n 16 ./osu_latency'
>>>>>>
>>>>>> I receive the following error (16 times, naturally):
>>>>>> ./osu_latency: error while loading shared libraries: librdmacm.so.1:
>>>>>> cannot open shared object file: No such file or directory
>>>>>>
>>>>>> I don't know where this file would be -- it's not in /usr/lib with
>>>>>> all of the other *.so.* files.
>>>>>> Any thoughts? Thanks.
>>>>>>
>>>>>> -------------------------------------------
>>>>>> Chris Tanner
>>>>>> Space Systems Design Lab
>>>>>> Georgia Institute of Technology
>>>>>> christopher.tanner at gatech.edu
>>>>>> -------------------------------------------
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Apr 9, 2008, at 2:17 PM, Matthew Koop wrote:
>>>>>>> Hi Fred,
>>>>>>>
>>>>>>> If InfiniBand is not working then the job will not run. There is
>>>>>>> currently no method by which it will fall back to TCP/IP.
>>>>>>>
>>>>>>> Does this answer your question?
>>>>>>>
>>>>>>> Matt
>>>>>>>
>>>>>>> On Wed, 9 Apr 2008, Stecher, Fred wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>> When I installed MVAPICH, I used the default. If InfiniBand is
>>>>>>>> not working will my executable still run?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Fred
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> mvapich-discuss mailing list
>>>>>>> mvapich-discuss at cse.ohio-state.edu
>>>>>>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>
>
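
For the earlier librdmacm.so.1 loading error quoted above, note that 'which' searches PATH for executables, so it does not show what the dynamic loader will find. A rough sketch of a closer check follows, assuming a POSIX shell; find_lib is a hypothetical helper that walks a colon-separated directory list the way LD_LIBRARY_PATH is searched, and it deliberately ignores the ld.so cache and any RPATH in the binary.

```shell
#!/bin/sh
# Look for a shared library in a colon-separated directory list, in order,
# and print the first match. This mimics only the LD_LIBRARY_PATH part of
# the loader's search; ld.so.cache and RPATH are not consulted.
find_lib() {
    lib=$1
    dirs=$2
    old_ifs=$IFS
    IFS=:
    for dir in $dirs; do
        # An empty entry in LD_LIBRARY_PATH means the current directory.
        [ -n "$dir" ] || dir=.
        if [ -e "$dir/$lib" ]; then
            IFS=$old_ifs
            echo "$dir/$lib"
            return 0
        fi
    done
    IFS=$old_ifs
    return 1
}

# Check the two libraries mentioned in this thread against the current setting.
for lib in libibverbs.so.1 librdmacm.so.1; do
    find_lib "$lib" "${LD_LIBRARY_PATH:-}" || echo "$lib: not found via LD_LIBRARY_PATH"
done
```

Running this on each compute node, not just the master, would show whether the copied libraries are visible there. 'ldd ./osu_latency' gives the loader's own answer, including any library it reports as "not found".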


