[mvapich-discuss] TensorFlow: Segmentation Fault (mv2-gdr)

Le, Viet Duc vdle at moasys.com
Wed Sep 2 21:10:03 EDT 2020


Hi, Arpain

Thanks for the suggestion.
However, MV2_USE_CUDA=1 was always propagated via the module
environment. Otherwise, there would have been a warning message regarding
performance.
Thus, we are sure that MV2_USE_CUDA was properly set during all our
tests.
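
For reference, a minimal sketch of how the setting can be double-checked from inside the Python process itself, since that is where libmpi.so ends up being loaded. The helper name report_mv2_env is hypothetical. Only MV2_USE_CUDA comes from our setup; LD_PRELOAD is listed merely as an assumption that it may be relevant, because it changes which libraries are loaded first.

```python
import os

def report_mv2_env(environ=None, names=("MV2_USE_CUDA", "LD_PRELOAD")):
    """Return {name: value or None} for the variables of interest.

    `environ` defaults to the current process environment; passing a
    plain dict makes the helper easy to try out in isolation.
    """
    env = os.environ if environ is None else environ
    return {name: env.get(name) for name in names}

if __name__ == "__main__":
    # Print what the interpreter actually sees, not what the shell claims.
    for name, value in report_mv2_env().items():
        print(f"{name} = {value if value is not None else '<unset>'}")
```

Running this in the same shell (and with the same modules loaded) as the failing job shows whether the variable really reaches the interpreter.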

I would like to share our debugging progress so far:
1.  The user guide does not explicitly mention numpy/horovod versions. We
downgraded to numpy/1.16.2 and horovod/0.18.2, which we consider a
conservative choice for tf-1.13.
2.  From the backtrace, the error originates in session.py. We confirm that
the segmentation fault can be reproduced simply by importing TensorFlow and
creating a session from the Python shell:
    <<<
     $ python
     >>> import tensorflow as tf
     >>> session = tf.Session()
     2020-09-03 09:48:57.218106: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX
     2020-09-03 09:48:57.607414: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2493830000 Hz
     Segmentation fault
     >>>
     The above was not observed with the generic mvapich2-2.3.4 that we
built from source.
3. From gdb:
    Program received signal SIGSEGV, Segmentation fault.
    #0  0x00007ffff7711fd9 in _int_free () from /apps/compiler/gcc/8.3.0/cudampi/10.1/mvapich2-gdr/2.3.4/lib64/libmpi.so
    #1  0x00007ffff771251f in free () from /apps/compiler/gcc/8.3.0/cudampi/10.1/mvapich2-gdr/2.3.4/lib64/libmpi.so
    #2  0x00007ffff6e1e30a in pthread_create@@GLIBC_2.2.5 () from /usr/lib64/libpthread.so.0
    #3  0x00007ffff5e3c514 in __gthread_create (__args=0x555557e57ff0, __func=0x7ffff5e3c3e0 <execute_native_thread_routine_compat>, __threadid=0x555557e5ef

    It seems that the RPM build causes some issues with the system's libpthread.
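
To make this reading of the backtrace easier to check, here is a small sketch that maps each gdb frame to the shared object it was resolved from. The function name frames_by_library and the regular expression are our own assumptions about the "#N addr in func () from path" layout of the gdb output. The sample frames are copied from the trace, and the truncated frame #3 is left out.

```python
import re

# Matches gdb frames of the form: "#0  0x... in func () from /path/lib.so"
FRAME_RE = re.compile(r"#(\d+)\s+0x[0-9a-f]+ in (\S+) \(.*?\) from (\S+)")

def frames_by_library(backtrace):
    """Return a list of (frame number, function, library path) tuples."""
    # The mail client wrapped the gdb output, so collapse all whitespace
    # (including newlines) into single spaces before matching.
    text = " ".join(backtrace.split())
    return [(int(n), func, lib) for n, func, lib in FRAME_RE.findall(text)]

SAMPLE = """
#0  0x00007ffff7711fd9 in _int_free () from
/apps/compiler/gcc/8.3.0/cudampi/10.1/mvapich2-gdr/2.3.4/lib64/libmpi.so
#1  0x00007ffff771251f in free () from
/apps/compiler/gcc/8.3.0/cudampi/10.1/mvapich2-gdr/2.3.4/lib64/libmpi.so
#2  0x00007ffff6e1e30a in pthread_create@@GLIBC_2.2.5 () from
/usr/lib64/libpthread.so.0
"""

if __name__ == "__main__":
    for number, func, lib in frames_by_library(SAMPLE):
        print(f"#{number} {func:<28} {lib}")
```

In these three frames, free is resolved from the MVAPICH2-GDR libmpi.so, while pthread_create comes from the system libpthread.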

 If you have more ideas on how to identify the issue, please let us know.

Regards,
Viet-Duc