[mvapich-discuss] TensorFlow: Segmentation Fault (mv2-gdr)

Jain, Arpan jain.575 at buckeyemail.osu.edu
Mon Aug 31 20:04:08 EDT 2020


Hello Viet-Duc,

Please set the "MV2_USE_CUDA=1" flag, as indicated in the MVAPICH2-GDR user guide (Section 7.2).
MVAPICH2-GDR user guide: https://mvapich.cse.ohio-state.edu/userguide/gdr/
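
As a minimal sketch of how this could look in the job script quoted below: the flag is exported next to the existing MV2_SUPPORT_DL=1 setting, before the srun line. Placing it there, and the echo used to record the values in the job output, are assumptions for illustration and are not taken from the user guide.

```shell
# Sketch only: environment settings placed before the srun line of the job script.
# MV2_SUPPORT_DL=1 comes from the original job script;
# MV2_USE_CUDA=1 is the flag suggested in this reply.
export MV2_USE_CUDA=1
export MV2_SUPPORT_DL=1

# Record the settings in the job's stdout so they can be checked afterwards.
echo "MV2_USE_CUDA=${MV2_USE_CUDA} MV2_SUPPORT_DL=${MV2_SUPPORT_DL}"
```

The srun command from the original script would then follow unchanged.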

We also have a complete user guide for running all Deep Learning frameworks with MVAPICH2-GDR (Horovod with MVAPICH2-GDR). Instructions for running TensorFlow are in Section 4.1.
Horovod with MVAPICH2 user guide: http://hidl.cse.ohio-state.edu/userguide/horovod/

Regards,
Arpan Jain

________________________________________
From: mvapich-discuss-bounces at cse.ohio-state.edu <mvapich-discuss-bounces at mailman.cse.ohio-state.edu> on behalf of Le, Viet Duc <vdle at moasys.com>
Sent: Friday, August 28, 2020 3:29 AM
To: mvapich-discuss at cse.ohio-state.edu
Subject: [mvapich-discuss] TensorFlow: Segmentation Fault (mv2-gdr)

Hello,

When testing the latest version of mvapich2-gdr (2.3.4), we encounter a segmentation fault related to Python.
The hardware setup is the same as in our previous inquiry: http://mailman.cse.ohio-state.edu/pipermail/mvapich-discuss/2020-August/007120.html
TensorFlow-GPU was installed strictly following mvapich2's official guide at: http://hidl.cse.ohio-state.edu/userguide/horovod/

[job script]
>>> Begin of job script
#!/usr/bin/env bash

#SBATCH --job-name=test
#SBATCH --partition=skl_v100_2
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --gres=gpu:2
#SBATCH --error=%j.stderr
#SBATCH --output=%j.stdout
#SBATCH --time=24:00:00
#SBATCH --comment=tensorflow

module load gcc/8.3.0 cuda/10.1 cudampi/mvapich2-gdr-2.3.4
conda activate horovod_mv2
export MV2_SUPPORT_DL=1

srun  python tf_cnn_benchmarks.py --model resnet50 --batch_size 64 --variable_update horovod
<<< End of job script

[error message]
>>> Begin of error message
Fatal Python error: Segmentation fault

Current thread 0x00002b16ba052700 (most recent call first):
  File "/home01/optpar01/.conda/envs/horovod_mv2/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 676 in __init__
  File "/home01/optpar01/.conda/envs/horovod_mv2/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1551 in __init__
  File "/scratch/optpar01/apps/tf_benchmark/cnn_tf_v1.13/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 3503 in setup
  File "/scratch/optpar01/apps/tf_benchmark/cnn_tf_v1.13/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 61 in main
  File "/home01/optpar01/.conda/envs/horovod_mv2/lib/python3.6/site-packages/absl/app.py", line 251 in _run_main
  File "/home01/optpar01/.conda/envs/horovod_mv2/lib/python3.6/site-packages/absl/app.py", line 300 in run
  File "/scratch/optpar01/apps/tf_benchmark/cnn_tf_v1.13/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 72 in <module>
<<< End of error message

As you can see, TensorFlow and the benchmark script are both synced to version 1.13, following your official guide.
We do not observe this error when using the base mvapich2 version (built with --enable-cuda).

Regards,
Viet-Duc