[mvapich-discuss] MVAPICH 2.1a GDR cuda-aware/GDR data corrupted?

Filippo Spiga spiga.filippo at gmail.com
Mon Apr 13 15:59:03 EDT 2015


Hi Khaled, no luck, I am afraid.

The files are attached (including the modified osu_bw.c). FYI, we recently upgraded our compute nodes to MOFED 2.4; do I need to recompile gdrdrv.ko and libgdrapi.so? I have added Davide to the conversation as well.

Here are the various mpirun commands. Please let me know if there is something I stupidly did wrong (the only way to get rid of the binding warning message is to specify the binding manually via CPUSET...).


mpirun -np $SLURM_NTASKS -ppn 2  -genvall \
-genv MV2_RAIL_SHARING_POLICY FIXED_MAPPING -genv MV2_PROCESS_TO_RAIL_MAPPING 0:1 \
-genv MV2_RAIL_SHARING_LARGE_MSG_THRESHOLD 1G \
-genv MV2_ENABLE_AFFINITY 1 -genv MV2_CPU_BINDING_LEVEL SOCKET -genv MV2_CPU_BINDING_POLICY SCATTER \
-genv MV2_USE_SHARED_MEM 0 \
-genv MV2_USE_CUDA 1 -genv MV2_CUDA_NONBLOCKING_STREAMS 0 -genv MV2_USE_GPUDIRECT 0 -genv MV2_GPUDIRECT_GDRCOPY_LIB ${MV2_GPUDIRECT_GDRCOPY_LIB}/libgdrapi.so -genv MV2_CUDA_IPC 0 \
get_local_rank ./a.out D D 2>&1 | tee out.A

mpirun -np $SLURM_NTASKS -ppn 2  -genvall \
-genv MV2_RAIL_SHARING_POLICY FIXED_MAPPING -genv MV2_PROCESS_TO_RAIL_MAPPING 0:1 \
-genv MV2_RAIL_SHARING_LARGE_MSG_THRESHOLD 1G \
-genv MV2_ENABLE_AFFINITY 1 -genv MV2_CPU_BINDING_LEVEL SOCKET -genv MV2_CPU_BINDING_POLICY SCATTER \
-genv MV2_USE_SHARED_MEM 0 \
-genv MV2_USE_CUDA 1 -genv MV2_CUDA_NONBLOCKING_STREAMS 0 -genv MV2_USE_GPUDIRECT 1 -genv MV2_GPUDIRECT_GDRCOPY_LIB ${MV2_GPUDIRECT_GDRCOPY_LIB}/libgdrapi.so -genv MV2_CUDA_IPC 0 \
get_local_rank ./a.out D D 2>&1 | tee out.B

mpirun -np $SLURM_NTASKS -ppn 2  -genvall \
-genv MV2_RAIL_SHARING_POLICY FIXED_MAPPING -genv MV2_PROCESS_TO_RAIL_MAPPING 0:1 \
-genv MV2_RAIL_SHARING_LARGE_MSG_THRESHOLD 1G \
-genv MV2_ENABLE_AFFINITY 1 -genv MV2_CPU_BINDING_LEVEL SOCKET -genv MV2_CPU_BINDING_POLICY SCATTER \
-genv MV2_USE_SHARED_MEM 0 \
-genv MV2_USE_CUDA 1 -genv MV2_CUDA_NONBLOCKING_STREAMS 0 -genv MV2_USE_GPUDIRECT 1 -genv MV2_GPUDIRECT_GDRCOPY_LIB ${MV2_GPUDIRECT_GDRCOPY_LIB}/libgdrapi.so -genv MV2_CUDA_IPC 0 \
-genv MV2_USE_GPUDIRECT_GDRCOPY_LIMIT 32768 \
get_local_rank ./a.out D D 2>&1 | tee out.C

mpirun -np $SLURM_NTASKS -ppn 2  -genvall \
-genv MV2_RAIL_SHARING_POLICY FIXED_MAPPING -genv MV2_PROCESS_TO_RAIL_MAPPING 0:1 \
-genv MV2_RAIL_SHARING_LARGE_MSG_THRESHOLD 1G \
-genv MV2_ENABLE_AFFINITY 1 -genv MV2_CPU_BINDING_LEVEL SOCKET -genv MV2_CPU_BINDING_POLICY SCATTER \
-genv MV2_USE_SHARED_MEM 0 \
-genv MV2_USE_CUDA 1 -genv MV2_CUDA_NONBLOCKING_STREAMS 0 -genv MV2_USE_GPUDIRECT 1 -genv MV2_GPUDIRECT_GDRCOPY_LIB ${MV2_GPUDIRECT_GDRCOPY_LIB}/libgdrapi.so -genv MV2_CUDA_IPC 0 \
-genv MV2_USE_GPUDIRECT_GDRCOPY_LIMIT 65536 \
get_local_rank ./a.out D D 2>&1 | tee out.D
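
The four commands above differ only in the value of MV2_USE_GPUDIRECT (0 for run A, 1 for B, C and D) and in whether MV2_USE_GPUDIRECT_GDRCOPY_LIMIT is set (32768 for C, 65536 for D). A minimal sketch, assuming a POSIX shell, of how they could be generated from one template so that the shared flags cannot drift between runs. The helper name `print_run` is hypothetical and not part of MVAPICH2; the script only prints the command lines and does not execute anything.

```shell
#!/bin/sh
# Sketch: print the four mpirun command lines from a single template.
# Nothing is executed here. Variables such as $SLURM_NTASKS are emitted
# literally (single-quoted) so they expand only when the lines are run.

# print_run LABEL GPUDIRECT [GDRCOPY_LIMIT]   (hypothetical helper)
print_run() {
    label=$1
    gdr=$2
    limit=${3:-}

    # Flags shared by every run, copied from the commands above.
    printf '%s' 'mpirun -np $SLURM_NTASKS -ppn 2 -genvall'
    printf '%s' ' -genv MV2_RAIL_SHARING_POLICY FIXED_MAPPING'
    printf '%s' ' -genv MV2_PROCESS_TO_RAIL_MAPPING 0:1'
    printf '%s' ' -genv MV2_RAIL_SHARING_LARGE_MSG_THRESHOLD 1G'
    printf '%s' ' -genv MV2_ENABLE_AFFINITY 1'
    printf '%s' ' -genv MV2_CPU_BINDING_LEVEL SOCKET'
    printf '%s' ' -genv MV2_CPU_BINDING_POLICY SCATTER'
    printf '%s' ' -genv MV2_USE_SHARED_MEM 0'
    printf '%s' ' -genv MV2_USE_CUDA 1'
    printf '%s' ' -genv MV2_CUDA_NONBLOCKING_STREAMS 0'

    # The only settings that change from run to run.
    printf ' -genv MV2_USE_GPUDIRECT %s' "$gdr"
    printf '%s' ' -genv MV2_GPUDIRECT_GDRCOPY_LIB ${MV2_GPUDIRECT_GDRCOPY_LIB}/libgdrapi.so'
    printf '%s' ' -genv MV2_CUDA_IPC 0'
    if [ -n "$limit" ]; then
        printf ' -genv MV2_USE_GPUDIRECT_GDRCOPY_LIMIT %s' "$limit"
    fi

    # Each run writes its output to out.<LABEL>, as in the original commands.
    printf ' get_local_rank ./a.out D D 2>&1 | tee out.%s\n' "$label"
}

print_run A 0                      # GPUDirect disabled
print_run B 1                      # GPUDirect enabled, default GDRCOPY limit
print_run C 1 $((32 * 1024))       # 32K = 32768
print_run D 1 $((64 * 1024))       # 64K = 65536
```

Saved under a name of your choosing, the output can be inspected first and then piped to `sh` inside the SLURM job, provided SLURM_NTASKS and MV2_GPUDIRECT_GDRCOPY_LIB are exported in that environment.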

F

On Apr 12, 2015, at 11:02 PM, khaled hamidouche <hamidouc at cse.ohio-state.edu> wrote:
> Hi Filippo, 
> Sorry for the delay, 
> 
> As I mentioned to Jens earlier in this thread, disabling non-blocking streams will fix the non-GDR issue. 
> For the GDR-related issue, can you please try setting MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=32768 or 64K? Please try 32K first and let us know if this fixes your issue. 
> 
> Thanks a lot 
> 
> On Sun, Apr 12, 2015 at 11:14 PM, Filippo SPIGA <fs395 at cam.ac.uk> wrote:
> Dear MVAPICH-2 developers,
> 
> any news about this issue?
> 
> F

--
Mr. Filippo SPIGA, M.Sc.
http://filippospiga.info ~ skype: filippo.spiga

«Nobody will drive us out of Cantor's paradise.» ~ David Hilbert


-------------- next part --------------
A non-text attachment was scrubbed...
Name: out.D
Type: application/octet-stream
Size: 3631 bytes
Desc: not available
URL: <http://mailman.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20150413/1030964f/attachment-0007.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: slurm-1254345.out
Type: application/octet-stream
Size: 12155 bytes
Desc: not available
URL: <http://mailman.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20150413/1030964f/attachment-0008.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: out.C
Type: application/octet-stream
Size: 3631 bytes
Desc: not available
URL: <http://mailman.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20150413/1030964f/attachment-0009.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: out.B
Type: application/octet-stream
Size: 3631 bytes
Desc: not available
URL: <http://mailman.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20150413/1030964f/attachment-0010.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: out.A
Type: application/octet-stream
Size: 1262 bytes
Desc: not available
URL: <http://mailman.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20150413/1030964f/attachment-0011.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: submit.slurm_osubw_mine
Type: application/octet-stream
Size: 2811 bytes
Desc: not available
URL: <http://mailman.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20150413/1030964f/attachment-0012.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: osu_bw.c
Type: application/octet-stream
Size: 16057 bytes
Desc: not available
URL: <http://mailman.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20150413/1030964f/attachment-0013.obj>

