[mvapich-discuss] Having problems installing mvapich2-gdr/2.2-4 for the PGI compiler

Raghu Reddy raghu.reddy at noaa.gov
Mon Jul 10 13:01:48 EDT 2017


Hi team,

 

Here is our environment:

 

- Intel Haswell processors
- P100 GPUs (8 per node)
- Mellanox QDR IB
- RHEL 7.3
- CUDA 8.0
- Running stock OFED

 

We have already installed the GNU version of the mvapich2-gdr library and are
using it with the Intel compiler (because the Intel versions are not yet
available), and it is working fine. Thank you!

 

Our users are now requesting the version that works with the PGI compiler.
We have tried a couple of different downloads but have not been successful in
getting it to work.

 

For our initial testing, we are trying to test a simple MPI hello world
program without involving the GPUs.

 

Just for testing purposes, we do not want to install it in the standard
location, so instead of using the RPM commands to install, we are using
rpm2cpio/cpio to extract it into a known location of our choosing.
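
(If the RPM is relocatable, I believe an install along the lines below would
also work, though I have not verified it on our system; the target prefix is
just a placeholder.  We went with rpm2cpio/cpio mainly to avoid touching the
system RPM database.)

# hypothetical alternative, not verified here: relocate the RPM install into
# a prefix of our choosing instead of extracting it with cpio
rpm -ivh --nodeps --prefix /path/of/our/choosing \
    mvapich2-gdr-2.2-4.cuda8.0.stock.pgi16.10.el7.centos.x86_64.rpm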

 

When we install the GNU version of the library, it works fine with the PGI
compiler.

When we install the PGI version of the library, we are unable to get MPI
hello world code working with the PGI compiler.

 

Installing the GNU version of mvapich2-gdr:

 

rpm2cpio /home/admin/theia_software/mvapich2-gdr/mvapich2-gdr-2.2-4.cuda8.0.stock.gnu4.8.5.el7.centos.x86_64.rpm | cpio -i -v -d -m

mv opt opt-gnu-nomcast-nopbs

 

Installing the PGI version of mvapich2-gdr:

 

rpm2cpio /home/admin/theia_software/mvapich2-gdr/mvapich2-gdr-2.2-4.cuda8.0.stock.pgi16.10.el7.centos.x86_64.rpm | cpio -i -v -d -m

mv opt opt-pgi-nomcast-nopbs

 

Having installed these two versions, I did a quick check with an MPI hello
world code, compiling it with the PGI compiler against each installation and
linking directly against the libraries (without involving the compiler
wrappers).
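
For reference, I expect the wrapper-based equivalent would be roughly the
following, though I have not gone through the wrappers for these tests
(assuming the install provides mpicc under $MPIROOT/bin):

$MPIROOT/bin/mpicc -show                     # print the underlying compile/link line
$MPIROOT/bin/mpicc hello_mpi_c.c -o hello    # build through the wrapper instead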

 

When I did that, the build against the GNU version worked fine, even though
the code was compiled with PGI.

But when I used the PGI version, it failed as shown below:

 

Using GNU version of mvapich2-gdr with the PGI compiler:

 

sg001% module purge

sg001% module load pgi/17.5 cuda/8.0

sg001% module load mvapich2-gdr/2.2-4-gnu-mcast-nopbs-rr


 

sg001% echo $MPIROOT

/tds_scratch3/SYSADMIN/nesccmgmt/Raghu.Reddy/apps/mvapich2/opt-gnu-mcast-nopbs/mvapich2/gdr/mcast/2.2/cuda8.0/mpirun/gnu4.8.5

sg001% 

 

sg001% pgcc -I$MPIROOT/include -L$MPIROOT/lib64 -lmpich hello_mpi_c.c -L$CUDALIBDIR -lcuda -lcudart

sg001% 

 

sg001% env LD_PRELOAD=$MPIROOT/lib64/libmpi.so /apps/mvapich2-gdr/2.2-3/cuda8.0-intel/bin/mpirun -np 4 ./a.out

Hello from rank 0 out of 4; procname = sg001

Hello from rank 1 out of 4; procname = sg001

Hello from rank 2 out of 4; procname = sg001

Hello from rank 3 out of 4; procname = sg001

sg001% 

sg001%

 

Using the PGI version of mvapich2-gdr with the PGI compiler:

 

sg001% module purge

sg001% module load pgi/17.5 cuda/8.0

sg001% module load mvapich2-gdr/2.2-4-pgi-mcast-nopbs-rr


sg001% 

 

sg001% echo $MPIROOT

/tds_scratch3/SYSADMIN/nesccmgmt/Raghu.Reddy/apps/mvapich2/opt-pgi-mcast-nopbs/mvapich2/gdr/mcast/2.2/cuda8.0/mpirun/pgi16.10

sg001% 

 

sg001% pgcc -I$MPIROOT/include -L$MPIROOT/lib64 -lmpich hello_mpi_c.c -L$CUDALIBDIR -lcuda -lcudart

sg001% 

 

sg001% env LD_PRELOAD=$MPIROOT/lib64/libmpi.so /apps/mvapich2-gdr/2.2-3/cuda8.0-intel/bin/mpirun -np 4 ./a.out

[sg001:mpi_rank_2][error_sighandler] Caught error: Segmentation fault (signal 11)

[sg001:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11)

[sg001:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11)

[sg001:mpi_rank_3][error_sighandler] Caught error: Segmentation fault (signal 11)

 

===================================================================================

=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES

=   PID 86087 RUNNING AT sg001

=   EXIT CODE: 139

=   CLEANING UP REMAINING PROCESSES

=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES

===================================================================================

YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)

This typically refers to a problem with your application.

Please see the FAQ page for debugging suggestions

sg001%
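
In case it is relevant, I was also planning to double-check which libraries
the binary resolves at run time, and to try the launcher from the PGI install
itself rather than the Intel-build mpirun.  Roughly this (assuming
$MPIROOT/bin contains mpirun_rsh, which I have not confirmed yet):

# show which MPI/CUDA libraries get picked up, including the preloaded one
env LD_PRELOAD=$MPIROOT/lib64/libmpi.so ldd ./a.out | egrep 'mpi|cuda'

# launch with the PGI build's own launcher instead of the Intel-build mpirun
$MPIROOT/bin/mpirun_rsh -np 4 sg001 sg001 sg001 sg001 ./a.out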

 

For completeness, here is the program that was used:

 

sfe01% cat hello_mpi_c.c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
   int ierr, myid, npes;
   int len;
   char name[MPI_MAX_PROCESSOR_NAME];

   /* Initialize MPI, then query rank, communicator size, and host name */
   ierr = MPI_Init(&argc, &argv);

   ierr = MPI_Comm_rank(MPI_COMM_WORLD, &myid);
   ierr = MPI_Comm_size(MPI_COMM_WORLD, &npes);
   ierr = MPI_Get_processor_name(name, &len);

   printf("Hello from rank %d out of %d; procname = %s\n", myid, npes, name);

   ierr = MPI_Finalize();

   return 0;
}
sfe01%

 

Any suggestions on how to fix this problem? 

 

Also, I was wondering if there is an ETA for the Intel version?  As I have
mentioned above, for the time being we are using the GNU version of the
library with the Intel compiler and it is working fine.  We just used the
module files from an earlier version of the library (for which we had an
Intel download available) as a workaround for the FORTRAN 90 modules.
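
In case it clarifies what I mean, that workaround is roughly along the lines
below; the Fortran source name is just illustrative, and the include path of
the older 2.2-3 Intel install is assumed from the mpirun path shown above:

# illustrative only: pick up the Intel-built mpi.mod from the older 2.2-3
# install, while linking against the current GNU-built library
ifort -I/apps/mvapich2-gdr/2.2-3/cuda8.0-intel/include \
      -I$MPIROOT/include -L$MPIROOT/lib64 -lmpich hello_mpi.f90 -o hello_f90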

 

Thanks,

Raghu

 

 
