[mvapich-discuss] TensorFlow: Segmentation Fault (mv2-gdr)

Le, Viet Duc vdle at moasys.com
Fri Aug 28 03:29:00 EDT 2020


Hello,

When testing the latest version of mvapich2-gdr (2.3.4), we encounter a
segmentation fault in Python during TensorFlow session creation.
The hardware setup is the same as in our previous inquiry:
http://mailman.cse.ohio-state.edu/pipermail/mvapich-discuss/2020-August/007120.html
TensorFlow-GPU was installed strictly following MVAPICH2's official guide
at: http://hidl.cse.ohio-state.edu/userguide/horovod/

[job script]
>>> Begin of job script
#!/usr/bin/env bash

#SBATCH --job-name=test
#SBATCH --partition=skl_v100_2
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --gres=gpu:2
#SBATCH --error=%j.stderr
#SBATCH --output=%j.stdout
#SBATCH --time=24:00:00
#SBATCH --comment=tensorflow

module load gcc/8.3.0 cuda/10.1 load cudampi/mvapich2-gdr-2.3.4
conda activate horovod_mv2
export MV2_SUPPORT_DL=1

srun python tf_cnn_benchmarks.py --model resnet50 --batch_size 64 \
    --variable_update horovod
<<< End of job script

[error message]
>>> Begin of error message
Fatal Python error: Segmentation fault

Current thread 0x00002b16ba052700 (most recent call first):
  File "/home01/optpar01/.conda/envs/horovod_mv2/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 676 in __init__
  File "/home01/optpar01/.conda/envs/horovod_mv2/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1551 in __init__
  File "/scratch/optpar01/apps/tf_benchmark/cnn_tf_v1.13/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 3503 in setup
  File "/scratch/optpar01/apps/tf_benchmark/cnn_tf_v1.13/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 61 in main
  File "/home01/optpar01/.conda/envs/horovod_mv2/lib/python3.6/site-packages/absl/app.py", line 251 in _run_main
  File "/home01/optpar01/.conda/envs/horovod_mv2/lib/python3.6/site-packages/absl/app.py", line 300 in run
  File "/scratch/optpar01/apps/tf_benchmark/cnn_tf_v1.13/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 72 in <module>
<<< End of error message

As you can see, TensorFlow and the benchmark scripts are both synced to
version 1.13, following your official guide.
We do not observe this error when using the base mvapich2 version (built
with --enable-cuda).
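
To check that the settings from the job script actually reach each rank
started by srun, a small diagnostic could be run in place of the benchmark.
The following is a minimal sketch and was not part of the original run: the
helper name collect_mpi_env and the choice of variable prefixes are
assumptions made for illustration, and it uses only the Python standard
library.

```python
import os
import sys


def collect_mpi_env(environ=None):
    """Return MPI-, Slurm-, and CUDA-related variables seen by this process.

    The prefix list is an illustrative assumption, not an official list.
    """
    environ = os.environ if environ is None else environ
    prefixes = ("MV2_", "SLURM_", "CUDA_", "LD_PRELOAD", "LD_LIBRARY_PATH")
    # str.startswith accepts a tuple, so one call covers every prefix.
    return {k: v for k, v in sorted(environ.items()) if k.startswith(prefixes)}


if __name__ == "__main__":
    # Write to stderr so the output lands in the job's %j.stderr file.
    for key, value in collect_mpi_env().items():
        print(f"{key}={value}", file=sys.stderr)
```

Saved under a hypothetical name such as check_env.py and launched with
"srun python check_env.py" from the same job script, it should print one
set of variables per task, which makes it easy to see whether
MV2_SUPPORT_DL=1 is visible to both ranks.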

Regards,
Viet-Duc