[mvapich-discuss] MVAPICH2-GDR LD_PRELOAD Bug with Tensorflow

Strube, Alexandre a.strube at fz-juelich.de
Mon Apr 20 14:15:09 EDT 2020


Hi Ammar, hi Andreas,

The solution is in mpi-settings/tensorflow, which unsets the variables.

As Andreas said, there is no “good” solution: you need LD_PRELOAD for a bunch of other things, so it is set by default.

One COULD set this in Horovod, but there are edge cases where this is not desirable.

Anyway, it’s installed on our systems.
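
For readers without access to our systems, here is a minimal sketch of what such a settings script might do. It is not the actual content of mpi-settings/tensorflow; the function names prepare_tf_env and restore_mpi_env and the variable SAVED_LD_PRELOAD are hypothetical. Only MV2_USE_CUDA, MV2_SUPPORT_TENSOR_FLOW and the advice to drop LD_PRELOAD come from the user guide section quoted further down in this thread.

```shell
#!/bin/sh
# Hypothetical helpers, not the real mpi-settings/tensorflow module.

# Prepare the environment for a TensorFlow/Horovod run with MVAPICH2-GDR.
prepare_tf_env() {
    # Remember the current value (empty if unset) so it can be put back later.
    SAVED_LD_PRELOAD="${LD_PRELOAD-}"
    # The user guide says not to use LD_PRELOAD together with TensorFlow.
    unset LD_PRELOAD
    # Runtime variables named in section 7.2 of the MVAPICH2-GDR user guide.
    export MV2_USE_CUDA=1
    export MV2_SUPPORT_TENSOR_FLOW=1
}

# Undo prepare_tf_env for jobs that do need the preloaded library.
restore_mpi_env() {
    unset MV2_SUPPORT_TENSOR_FLOW
    if [ -n "${SAVED_LD_PRELOAD-}" ]; then
        export LD_PRELOAD="$SAVED_LD_PRELOAD"
    fi
    unset SAVED_LD_PRELOAD
}
```

Calling prepare_tf_env right before the mpirun_rsh line, and restore_mpi_env afterwards, keeps the change local to the TensorFlow job regardless of the order in which the modules were loaded.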


Dr. Alexandre Strube
a.strube at fz-juelich.de
Jülich Supercomputing Centre
Institute for Advanced Simulation
Forschungszentrum Juelich GmbH
52425 Jülich, Germany
Phone: +49 2461 61-3866
Fax: +49 2461 61-6656


JSC is the coordinator of the
John von Neumann Institute for Computing (NIC)
and member of the
Gauss Centre for Supercomputing (GCS)

> On 20. Apr 2020, at 11:34, Herten, Andreas <a.herten at fz-juelich.de> wrote:
> 
> Resending to the correct address of the list.
> 
> Dear Ammar,
> 
> Thanks for pointing out the part about unsetting LD_PRELOAD. Even though it’s bold in the documentation, I overlooked it.
> 
> That means that the behaviour is known. As stated in the bug report on the Gist, unsetting LD_PRELOAD does indeed solve the issue.
> 
> For us, operating a large HPC installation, this is quite inconvenient.
> There is one set of environment variables to be set when MVAPICH is loaded into the environment, and a different set when TensorFlow is loaded in addition (that is, if TensorFlow is loaded and intended to be run with MPI). And it needs to work regardless of whether TensorFlow or MVAPICH is loaded first.
> 
> We’ve thought hard about this today and found no fully satisfying solution. It seems we will need to talk to individual users on a case-by-case basis if they run into the problem.
> 
> Best,
> 
> -Andreas
> NVIDIA Application Lab // POWER Acceleration and Design Centre
> Jülich Supercomputing Centre
> Forschungszentrum Jülich, Germany
> +49 2461 61 1825
> 
> ##########
> 
> Forschungszentrum Jülich GmbH
> 52425 Jülich
> Registered office: Jülich
> Registered in the commercial register of the Düren District Court, No. HR B 3498
> Chairman of the Supervisory Board: MinDir Volker Rieke
> Board of Directors: Prof. Dr.-Ing. Wolfgang Marquardt (Chairman),
> Karsten Beneke (Deputy Chairman), Prof. Dr.-Ing. Harald Bolt
> 
>> On 15. Apr 2020, at 18:13, Awan, Ammar Ahmad <awan.10 at buckeyemail.osu.edu> wrote:
>> 
>> Dear Andreas,
>> 
>> Thank you for your question regarding TensorFlow.
>> 
>> Please refer to the MVAPICH2-GDR User guide section 7.2 from the link below.
>> 
>> http://mvapich.cse.ohio-state.edu/userguide/gdr/#_example_running_tensorflow_tf_cnn_benchmarks_with_mvapich2_gdr
>> 
>> 7.2. Example running TensorFlow (tf_cnn_benchmarks) with MVAPICH2-GDR
>> 
>> MVAPICH2-GDR supports TensorFlow with Horovod/MPI design but a special flag is needed to run the jobs properly. Please use the MV2_SUPPORT_TENSOR_FLOW=1 runtime variable but do not use the LD_PRELOAD option.
>> 
>> Example:
>> 
>>    $ export MV2_PATH=/opt/mvapich2/gdr/2.3.3/gnu
>>    $ export MV2_USE_CUDA=1
>>    $ export MV2_SUPPORT_TENSOR_FLOW=1
>>
>>    $ $MV2_PATH/bin/mpirun_rsh -export -np 2 hostA hostB \
>>            python tf_cnn_benchmarks.py --model=resnet50 \
>>                               --variable_update=horovod
>> 
>> Please let us know if this resolves your issue.
>> 
>> Regards,
>> Ammar
>> 
>> ________________________________________
>> From: mvapich-discuss <mvapich-discuss-bounces at cse.ohio-state.edu> on behalf of Herten, Andreas <a.herten at fz-juelich.de>
>> Sent: Wednesday, April 15, 2020 12:04 PM
>> To: mvapich-discuss at cse.ohio-state.edu
>> Subject: [mvapich-discuss] MVAPICH2-GDR LD_PRELOAD Bug with Tensorflow
>> 
>> Dear all,
>> 
>> On our HPC system JUWELS we see another bug with MVAPICH 2.3.3-GDR.
>> As soon as MVAPICH2 is introduced to the environment (and with it, the recommended LD_PRELOAD variable), even a simple TensorFlow program segfaults.
>> 
>> Please see here for some more description:
>> https://gist.github.com/AndiH/4f29c4b2d1a21a115580086223bbb2d5
>> 
>> What do you recommend to debug this further? Any ideas?
>> 
>> Best,
>> 
>> -Andreas
>> 
>> NVIDIA Application Lab // POWER Acceleration and Design Centre
>> Jülich Supercomputing Centre
>> Forschungszentrum Jülich, Germany
>> +49 2461 61 1825
>> 
> 
