[mvapich-discuss] TensorFlow: Segmentation Fault (mv2-gdr)

Jain, Arpan jain.575 at buckeyemail.osu.edu
Mon Sep 7 19:21:10 EDT 2020


Hello,

We had an offline discussion with the reporter and found that "LD_PRELOAD=$MV2_PATH/lib/libmpi.so" was set. The issue was solved by unsetting the LD_PRELOAD variable (via the "unset LD_PRELOAD" command). Currently, we do not recommend setting "LD_PRELOAD=$MV2_PATH/lib/libmpi.so" for TensorFlow.
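For reference, a minimal sketch of the workaround in a job script, to run before launching TensorFlow (MV2_PATH is shown only as an illustration, matching the setting above):

```shell
# If the MVAPICH2 library was preloaded, e.g.:
#   export LD_PRELOAD=$MV2_PATH/lib/libmpi.so
# remove it before starting python/TensorFlow:
unset LD_PRELOAD
# Verify the variable is no longer set (prints "LD_PRELOAD=<unset>"):
echo "LD_PRELOAD=${LD_PRELOAD:-<unset>}"
```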

Thus, we are closing this report.

Regards,
Arpan

________________________________
From: Le, Viet Duc <vdle at moasys.com>
Sent: Wednesday, September 2, 2020 9:10 PM
To: Jain, Arpan <jain.575 at buckeyemail.osu.edu>
Cc: mvapich-discuss at cse.ohio-state.edu <mvapich-discuss at mailman.cse.ohio-state.edu>
Subject: Re: [mvapich-discuss] TensorFlow: Segmentation Fault (mv2-gdr)

Hi Arpan,

Thanks for the suggestion.
However, MV2_USE_CUDA=1 is always propagated via the module environment; otherwise, a warning message regarding performance would appear.
Thus, we are sure that MV2_USE_CUDA was properly set during all our tests.
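For completeness, one quick way to double-check that the flag is actually exported in the job environment (the value 1 is taken from the discussion above):

```shell
# Export the flag as the module environment does, then confirm it is visible
# to child processes via env:
export MV2_USE_CUDA=1
env | grep '^MV2_USE_CUDA='
```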

I would like to report some progress from debugging on our end:
1. The user guide does not explicitly mention which numpy/horovod versions to use. We downgraded to numpy/1.16.2 and horovod/0.18.2, which we consider a conservative choice for tf-1.13.
2. From the backtrace, the error is generated via session.py. We confirmed that the segmentation fault is triggered simply by importing TensorFlow and creating a session from the python shell:
     $ python
     >>> import tensorflow as tf
     >>> session = tf.Session()
     2020-09-03 09:48:57.218106: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX
     2020-09-03 09:48:57.607414: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2493830000 Hz
     Segmentation fault
     >>>
     The above was not observed with a generic mvapich2-2.3.4 build that we compiled from source.
3. From gdb:
    Program received signal SIGSEGV, Segmentation fault.
    #0  0x00007ffff7711fd9 in _int_free () from /apps/compiler/gcc/8.3.0/cudampi/10.1/mvapich2-gdr/2.3.4/lib64/libmpi.so
    #1  0x00007ffff771251f in free () from /apps/compiler/gcc/8.3.0/cudampi/10.1/mvapich2-gdr/2.3.4/lib64/libmpi.so
    #2  0x00007ffff6e1e30a in pthread_create@@GLIBC_2.2.5 () from /usr/lib64/libpthread.so.0
    #3  0x00007ffff5e3c514 in __gthread_create (__args=0x555557e57ff0, __func=0x7ffff5e3c3e0 <execute_native_thread_routine_compat>, __threadid=0x555557e5ef

    It seems that the RPM build causes some issues with the system's libpthread.

If you have any further ideas on how to identify the issue, please let us know.

Regards,
Viet-Duc