[mvapich-discuss] TensorFlow: Segmentation Fault (mv2-gdr)
Jain, Arpan
jain.575 at buckeyemail.osu.edu
Mon Aug 31 20:04:08 EDT 2020
Hello Viet-Duc,
Please set the "MV2_USE_CUDA=1" flag, as indicated in the MVAPICH2-GDR user guide (Section 7.2).
MVAPICH2-GDR user guide: https://mvapich.cse.ohio-state.edu/userguide/gdr/
We have a complete user guide for running Deep Learning frameworks with MVAPICH2-GDR (Horovod with MVAPICH2-GDR). Instructions for running TensorFlow are in Section 4.1.
Horovod with MVAPICH2 user guide: http://hidl.cse.ohio-state.edu/userguide/horovod/
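For reference, a minimal sketch of how the flag could be added to the environment section of the job script quoted below. It assumes the rest of the script (modules, conda environment, launch line) stays unchanged. The only new line is the MV2_USE_CUDA export; whether further runtime parameters are needed on a given system is not covered here.

```shell
# Enable CUDA-aware communication in MVAPICH2-GDR (user guide, Section 7.2).
export MV2_USE_CUDA=1

# Already present in the original job script.
export MV2_SUPPORT_DL=1

# The launch line is assumed to stay exactly as in the original script:
# srun python tf_cnn_benchmarks.py --model resnet50 --batch_size 64 --variable_update horovod
```

Both variables are exported before the srun line so that every launched rank inherits them.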
Regards,
Arpan Jain
________________________________________
From: mvapich-discuss-bounces at cse.ohio-state.edu <mvapich-discuss-bounces at mailman.cse.ohio-state.edu> on behalf of Le, Viet Duc <vdle at moasys.com>
Sent: Friday, August 28, 2020 3:29 AM
To: mvapich-discuss at cse.ohio-state.edu
Subject: [mvapich-discuss] TensorFlow: Segmentation Fault (mv2-gdr)
Hello,
When testing the latest version of mvapich2-gdr (2.3.4), we encounter a segmentation fault reported by Python.
The hardware setup is the same as in our previous inquiry: http://mailman.cse.ohio-state.edu/pipermail/mvapich-discuss/2020-August/007120.html
TensorFlow-GPU was installed strictly following mvapich2's official guide at: http://hidl.cse.ohio-state.edu/userguide/horovod/
[job script]
>>> Begin of job script
#!/usr/bin/env bash
#SBATCH --job-name=test
#SBATCH --partition=skl_v100_2
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --gres=gpu:2
#SBATCH --error=%j.stderr
#SBATCH --output=%j.stdout
#SBATCH --time=24:00:00
#SBATCH --comment=tensorflow
module load gcc/8.3.0 cuda/10.1 cudampi/mvapich2-gdr-2.3.4
conda activate horovod_mv2
export MV2_SUPPORT_DL=1
srun python tf_cnn_benchmarks.py --model resnet50 --batch_size 64 --variable_update horovod
<<< End of job script
[error message]
>>> Begin of error message
Fatal Python error: Segmentation fault
Current thread 0x00002b16ba052700 (most recent call first):
File "/home01/optpar01/.conda/envs/horovod_mv2/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 676 in __init__
File "/home01/optpar01/.conda/envs/horovod_mv2/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1551 in __init__
File "/scratch/optpar01/apps/tf_benchmark/cnn_tf_v1.13/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 3503 in setup
File "/scratch/optpar01/apps/tf_benchmark/cnn_tf_v1.13/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 61 in main
File "/home01/optpar01/.conda/envs/horovod_mv2/lib/python3.6/site-packages/absl/app.py", line 251 in _run_main
File "/home01/optpar01/.conda/envs/horovod_mv2/lib/python3.6/site-packages/absl/app.py", line 300 in run
File "/scratch/optpar01/apps/tf_benchmark/cnn_tf_v1.13/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 72 in <module>
<<< End of error message
As you can see, TensorFlow and the benchmark script are both at version 1.13, as your official guide specifies.
We do not observe this error when using the base mvapich2 version (built with --enable-cuda).
Regards,
Viet-Duc