[mvapich-discuss] TensorFlow: Segmentation Fault (mv2-gdr)
Le, Viet Duc
vdle at moasys.com
Fri Aug 28 03:29:00 EDT 2020
Hello,
When testing the latest version of mvapich2-gdr (2.3.4), we encounter a
segmentation fault related to Python.
The hardware setup is the same as in our previous inquiry:
http://mailman.cse.ohio-state.edu/pipermail/mvapich-discuss/2020-August/007120.html
TensorFlow-GPU was installed strictly following mvapich2's official guide
at: http://hidl.cse.ohio-state.edu/userguide/horovod/
[job script]
>>> Begin of job script
#!/usr/bin/env bash
#SBATCH --job-name=test
#SBATCH --partition=skl_v100_2
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --gres=gpu:2
#SBATCH --error=%j.stderr
#SBATCH --output=%j.stdout
#SBATCH --time=24:00:00
#SBATCH --comment=tensorflow
module load gcc/8.3.0 cuda/10.1 cudampi/mvapich2-gdr-2.3.4
conda activate horovod_mv2
export MV2_SUPPORT_DL=1
srun python tf_cnn_benchmarks.py --model resnet50 --batch_size 64 \
    --variable_update horovod
<<< End of job script
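A quick way to confirm that settings such as MV2_SUPPORT_DL=1 actually reach
each task launched by srun is to print the relevant environment from inside
the task. The following is only a minimal sketch and was not part of the
original job: the helper name mpi_env_snapshot, the file name print_env.py,
and the choice of prefixes are illustrative assumptions. SLURM_PROCID and
SLURM_LOCALID are standard Slurm task variables.

```python
import os

# Name prefixes worth recording in a report like this one (an assumed,
# illustrative selection; extend as needed).
PREFIXES = ("MV2_", "SLURM_PROCID", "SLURM_LOCALID", "CUDA_VISIBLE_DEVICES")


def mpi_env_snapshot(environ=os.environ, prefixes=PREFIXES):
    """Return the variables in `environ` whose names start with any prefix,
    ordered by name."""
    return {
        name: value
        for name, value in sorted(environ.items())
        if name.startswith(prefixes)
    }


if __name__ == "__main__":
    # Each task prints its own view of the environment.
    for name, value in mpi_env_snapshot().items():
        print(f"{name}={value}")
```

Saved as, say, print_env.py and launched with "srun python print_env.py" in
place of the benchmark line, each of the two tasks prints the MV2_* and GPU
settings it actually sees.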
[error message]
>>> Begin of error message
Fatal Python error: Segmentation fault
Current thread 0x00002b16ba052700 (most recent call first):
  File "/home01/optpar01/.conda/envs/horovod_mv2/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 676 in __init__
  File "/home01/optpar01/.conda/envs/horovod_mv2/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1551 in __init__
  File "/scratch/optpar01/apps/tf_benchmark/cnn_tf_v1.13/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 3503 in setup
  File "/scratch/optpar01/apps/tf_benchmark/cnn_tf_v1.13/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 61 in main
  File "/home01/optpar01/.conda/envs/horovod_mv2/lib/python3.6/site-packages/absl/app.py", line 251 in _run_main
  File "/home01/optpar01/.conda/envs/horovod_mv2/lib/python3.6/site-packages/absl/app.py", line 300 in run
  File "/scratch/optpar01/apps/tf_benchmark/cnn_tf_v1.13/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 72 in <module>
<<< End of error message
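For context, a dump of the form "Fatal Python error: Segmentation fault"
followed by "Current thread ... (most recent call first)" is what Python's
standard faulthandler module writes when the process receives SIGSEGV. It
lists Python frames only, so the first entry (session.py, in __init__) is
the last Python frame before control entered native code, not the native
function that faulted. The sketch below is not part of the original job; it
merely reproduces the same kind of dump in a throwaway child process,
assuming a POSIX system. The function name run_crashing_child is
illustrative.

```python
import subprocess
import sys

# Code run in the child: enable faulthandler, then deliver SIGSEGV to itself.
CHILD = (
    "import faulthandler, os, signal\n"
    "faulthandler.enable()\n"
    "os.kill(os.getpid(), signal.SIGSEGV)\n"
)


def run_crashing_child():
    """Run the crashing snippet in a separate interpreter and return
    (returncode, stderr_text)."""
    proc = subprocess.run(
        [sys.executable, "-c", CHILD],
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    return proc.returncode, proc.stderr


if __name__ == "__main__":
    code, err = run_crashing_child()
    print("child exit status:", code)
    print(err)
```

Because the dump stops at the Python boundary, a native backtrace (for
example from a core file) would be needed to see which library the fault
occurred in.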
As you can see, TensorFlow and the benchmark script are both synced to
version 1.13, following your official guide.
We do not observe this error when using the base mvapich2 version (built
with --enable-cuda).
Regards,
Viet-Duc