[mvapich-discuss] Troubles starting simple 2 process IMB-MPI1 using mvapich2-2.1

Hari Subramoni subramoni.1 at osu.edu
Mon Apr 20 11:35:57 EDT 2015


Glad to know that it helped.

Regards,
Hari.

On Mon, Apr 20, 2015 at 11:33 AM, Devesh Sharma <devesh28 at gmail.com> wrote:

> Hi Hari,
>
> Thanks for your quick help. Changing the env variable to MV2_USE_RoCE helped,
> and it started working.
>
> On Mon, Apr 20, 2015 at 8:52 PM, Hari Subramoni <subramoni.1 at osu.edu>
> wrote:
>
>> Hello Devesh,
>>
>> MVAPICH2 does not support Emulex HCAs at this point in time.
>>
>> One side note: the environment variable to enable RoCE is MV2_USE_RoCE, not
>> MV2_USE_RDMAOE.
>>
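>> For example, with the corrected variable name, the launch line from your
>> mail would look something like this (a sketch based on the command you
>> posted, with all other options unchanged):
>>
>>   /usr/local/mpi/mvapich2/bin/mpirun -np 2 -f /root/hostfile \
>>       -env MV2_USE_RoCE 1 -env MV2_USE_RDMA_CM 1 \
>>       -env MV2_IBA_HCA ocrdma2 /usr/local/imb/mvapich2/IMB-MPI1
>>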
>> Best Regards,
>> Hari.
>>
>> On Mon, Apr 20, 2015 at 11:01 AM, Devesh Sharma <devesh28 at gmail.com>
>> wrote:
>>
>>> Hello list,
>>>
>>> I am trying to run IMB 4.0.2 with the latest MVAPICH2-2.1 release on an
>>> Emulex RoCE device. I have set the ulimits to unlimited. However, I am
>>> seeing that my job is stuck: it neither aborts and returns to the shell
>>> nor makes progress.
>>>
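>>> (By "unlimited" I mean in particular the locked-memory limit, which RDMA
>>> needs for memory registration; roughly along these lines on both nodes --
>>> the exact entries may differ:
>>>
>>>   # /etc/security/limits.conf
>>>   *    soft    memlock    unlimited
>>>   *    hard    memlock    unlimited
>>>
>>> or "ulimit -l unlimited" in the launching shell.)
>>>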
>>> I tried attaching gdb on the other node and found that it is stuck
>>> somewhere in the exit path. The stack trace follows:
>>>
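>>> (The debugger was attached to the stuck rank with something along the
>>> lines of "gdb -p 7099", 7099 being the PID of the IMB-MPI1 process as
>>> shown in the detach message below.)
>>>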
>>> (gdb) bt
>>> #0  0x00002adae3df04f6 in malloc_consolidate () from
>>> /usr/local/mpi/mvapich2/lib/libmpi.so
>>> #1  0x00002adae3df2003 in _int_malloc () from
>>> /usr/local/mpi/mvapich2/lib/libmpi.so
>>> #2  0x00002adae3df2d0a in malloc () from
>>> /usr/local/mpi/mvapich2/lib/libmpi.so
>>> #3  0x00002adae41b68bd in __fopen_internal () from /lib64/libc.so.6
>>> #4  0x00002adae41f99dd in __tzfile_read () from /lib64/libc.so.6
>>> #5  0x00002adae41f8d24 in tzset_internal () from /lib64/libc.so.6
>>> #6  0x00002adae41f96d3 in __tz_convert () from /lib64/libc.so.6
>>> #7  0x00002adae3d96207 in MPID_Abort () from
>>> /usr/local/mpi/mvapich2/lib/libmpi.so
>>> #8  0x00002adae3d6366f in handleFatalError () from
>>> /usr/local/mpi/mvapich2/lib/libmpi.so
>>> #9  0x00002adae3d63771 in MPIR_Err_return_comm () from
>>> /usr/local/mpi/mvapich2/lib/libmpi.so
>>> #10 0x00002adae3d1efc4 in PMPI_Init () from
>>> /usr/local/mpi/mvapich2/lib/libmpi.so
>>> #11 0x00000000004018f9 in main ()
>>> (gdb) q
>>> A debugging session is active.
>>>
>>>         Inferior 1 [process 7099] will be detached.
>>>
>>> Quit anyway? (y or n) y
>>>
>>>
>>> Following is LDD output:
>>>
>>> -bash-4.2# ldd /usr/local/imb/mvapich2/IMB-MPI1
>>>         linux-vdso.so.1 =>  (0x00007fff5d5fe000)
>>>         libmpi.so.12 => /usr/local/mpi/mvapich2/lib/libmpi.so.12
>>> (0x00002b1bdf26d000)
>>>         libc.so.6 => /lib64/libc.so.6 (0x00002b1bdf904000)
>>>         libibmad.so.5 => /lib64/libibmad.so.5 (0x00002b1bdfcc5000)
>>>         librdmacm.so.1 => /lib64/librdmacm.so.1 (0x00002b1bdfedf000)
>>>         libibumad.so.3 => /lib64/libibumad.so.3 (0x00002b1be00f6000)
>>>         libibverbs.so.1 => /lib64/libibverbs.so.1 (0x00002b1be02ff000)
>>>         libdl.so.2 => /lib64/libdl.so.2 (0x00002b1be050d000)
>>>         librt.so.1 => /lib64/librt.so.1 (0x00002b1be0711000)
>>>         libgfortran.so.3 => /lib64/libgfortran.so.3 (0x00002b1be0919000)
>>>         libm.so.6 => /lib64/libm.so.6 (0x00002b1be0c3a000)
>>>         libpthread.so.0 => /lib64/libpthread.so.0 (0x00002b1be0f3c000)
>>>         libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00002b1be1158000)
>>>         libquadmath.so.0 => /lib64/libquadmath.so.0 (0x00002b1be136e000)
>>>         /lib64/ld-linux-x86-64.so.2 (0x00002b1bdf04a000)
>>> -bash-4.2#
>>>
>>> Command Used:
>>>
>>> /usr/local/mpi/mvapich2/bin/mpirun --verbose -np 2 -f /root/hostfile
>>> -env MV2_USE_RDMAOE 1 -env MV2_USE_RDMA_CM 1 -env MV2_DEFAULT_MAX_CQ_SIZE
>>> 4096 -env MV2_USE_SRQ 0 -env MV2_MX_SEND_WR 1024 -env MV2_MAX_RECV_WR 1024
>>> -env MV2_USE_SHARED_MEM 1 -env MV2_USE_BLOCKING 0  -env MV2_USE_UD_HYBRID 0
>>> -env MV2_VBUF_TOTAL_SIZE 65536 -env MV2_IBA_EAGER_THRESHOLD 65536 -env
>>> MV2_IBA_HCA ocrdma2 /usr/local/imb/mvapich2/IMB-MPI1
>>>
>>> I have also tried the workaround specified in Section 9.1.3 of the user
>>> guide, but it did not help. The same command with the same MVAPICH2 version
>>> runs fine on another 2-node cluster. Any pointers to resolve this issue
>>> would be a great help.
>>>
>>>
>>> --
>>> Please don't print this E-mail unless you really need to - this will
>>> preserve trees on planet earth.
>>>
>>
>
>
> --
> Please don't print this E-mail unless you really need to - this will
> preserve trees on planet earth.
>