[Mvapich-discuss] Building mvapich2-2.3.7 ch3:nemesis / osu_bibw fails on same host

Tscheuschner Joachim Joachim.Tscheuschner at dwd.de
Wed Jul 13 08:16:08 EDT 2022



Hi MVAPICH-Team,

I am having difficulties building and running mvapich2-2.3.7 in a container using
--with-device=ch3:nemesis
During the build I get compile errors, which I can work around with e.g.
sed -i '154,156{s/^/\/\//}' src/mpid/ch3/channels/nemesis/netmod/tcp/tcp_getip.c
sed -i '155{s/^/fprintf(stdout, \"IPv4 adress = \%08x (\%s)\\n\", addr.s_addr,/}' src/mpid/ch3/channels/nemesis/netmod/tcp/tcp_getip.c
sed -i '156{s/^/inet_ntoa( addr ) );/}' src/mpid/ch3/channels/nemesis/netmod/tcp/tcp_getip.c
sed -i 's/mpierrno/mpi_errno/g' src/mpid/ch3/channels/nemesis/netmod/tcp/tcp_init.c   (a typo in the source?)
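For readability, the same workaround as a single sed invocation (a sketch; the line numbers 154-156 refer to the unmodified mvapich2-2.3.7 tarball):

SRC=src/mpid/ch3/channels/nemesis/netmod/tcp/tcp_getip.c
# comment out the broken lines, then prepend a correctly quoted debug print
# that is split across lines 155 and 156
sed -i \
    -e '154,156s|^|//|' \
    -e '155s|^|fprintf(stdout, "IPv4 adress = %08x (%s)\\n", addr.s_addr,|' \
    -e '156s|^|inet_ntoa( addr ) );|' \
    "$SRC"
# rename what looks like a typo in tcp_init.c
sed -i 's/mpierrno/mpi_errno/g' src/mpid/ch3/channels/nemesis/netmod/tcp/tcp_init.c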
With these changes applied, the osu_bibw test
MPICH_NEMESIS_NETMOD=tcp mpiexec.hydra -verbose -n 2 -ppn 1 singularity exec test.sif /home/user/software/mvapich_tcp/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bibw 
fails with:

[mpiexec at node101] Launch arguments: /hpc/sw/intel/compilers_and_libraries_2020.0.166/linux/mpi/intel64/bin//hydra_bstrap_proxy --upstream-host node101 --upstream-port 34308 --pgid 0 --launcher ssh --launcher-number 0 --base-path /hpc/sw/intel/compilers_and_libraries_2020.0.166/linux/mpi/intel64/bin/ --tree-width 16 --tree-level 1 --time-left -1 --collective-launch 1 --debug --proxy-id 0 --node-id 0 --subtree-size 1 --upstream-fd 7 /hpc/sw/intel/compilers_and_libraries_2020.0.166/linux/mpi/intel64/bin//hydra_pmi_proxy --usize -1 --auto-cleanup 1 --abort-signal 9
[proxy:0:0 at node101] pmi cmd from fd 9: cmd=init pmi_version=1 pmi_subversion=1
[proxy:0:0 at node101] PMI response: cmd=response_to_init pmi_version=1 pmi_subversion=1 rc=0
[proxy:0:0 at node101] pmi cmd from fd 9: cmd=get_maxes
[proxy:0:0 at node101] PMI response: cmd=maxes kvsname_max=256 keylen_max=64 vallen_max=4096
[proxy:0:0 at node101] pmi cmd from fd 9: cmd=get_appnum
[proxy:0:0 at node101] PMI response: cmd=appnum appnum=0
[proxy:0:0 at node101] pmi cmd from fd 6: cmd=init pmi_version=1 pmi_subversion=1
[proxy:0:0 at node101] PMI response: cmd=response_to_init pmi_version=1 pmi_subversion=1 rc=0
[proxy:0:0 at node101] pmi cmd from fd 6: cmd=get_maxes
[proxy:0:0 at node101] PMI response: cmd=maxes kvsname_max=256 keylen_max=64 vallen_max=4096
[proxy:0:0 at node101] pmi cmd from fd 6: cmd=get_appnum
[proxy:0:0 at node101] PMI response: cmd=appnum appnum=0
[proxy:0:0 at node101] pmi cmd from fd 6: cmd=get_my_kvsname
[proxy:0:0 at node101] PMI response: cmd=my_kvsname kvsname=kvs_1205283_0
[proxy:0:0 at node101] pmi cmd from fd 9: cmd=get_my_kvsname
[proxy:0:0 at node101] PMI response: cmd=my_kvsname kvsname=kvs_1205283_0
[proxy:0:0 at node101] pmi cmd from fd 6: cmd=get_my_kvsname
[proxy:0:0 at node101] PMI response: cmd=my_kvsname kvsname=kvs_1205283_0
[proxy:0:0 at node101] pmi cmd from fd 9: cmd=get_my_kvsname
[proxy:0:0 at node101] PMI response: cmd=my_kvsname kvsname=kvs_1205283_0
[proxy:0:0 at node101] pmi cmd from fd 6: cmd=get kvsname=kvs_1205283_0 key=PMI_process_mapping
[proxy:0:0 at node101] PMI response: cmd=get_result rc=0 msg=success value=(vector,(0,1,2))
[proxy:0:0 at node101] pmi cmd from fd 9: cmd=get kvsname=kvs_1205283_0 key=PMI_process_mapping
[proxy:0:0 at node101] PMI response: cmd=get_result rc=0 msg=success value=(vector,(0,1,2))
[proxy:0:0 at node101] pmi cmd from fd 6: cmd=put kvsname=kvs_1205283_0 key=sharedFilename-0 value=/dev/shm/mpich_shar_tmpEmz9dR
[proxy:0:0 at node101] PMI response: cmd=put_result rc=0 msg=success
[proxy:0:0 at node101] pmi cmd from fd 9: cmd=barrier_in
[proxy:0:0 at node101] pmi cmd from fd 6: cmd=barrier_in
[proxy:0:0 at node101] PMI response: cmd=barrier_out
[proxy:0:0 at node101] PMI response: cmd=barrier_out
[proxy:0:0 at node101] pmi cmd from fd 9: cmd=get kvsname=kvs_1205283_0 key=sharedFilename-0
[proxy:0:0 at node101] PMI response: cmd=get_result rc=0 msg=success value=/dev/shm/mpich_shar_tmpEmz9dR
[proxy:0:0 at node101] pmi cmd from fd 6: cmd=put kvsname=kvs_1205283_0 key=Pbusinesscard-0 value=description#node101$port#40866$ifname#171.13.115.9$
[proxy:0:0 at node101] PMI response: cmd=put_result rc=0 msg=success
[proxy:0:0 at node101] pmi cmd from fd 9: cmd=put kvsname=kvs_1205283_0 key=Pbusinesscard-1 value=description#node101$port#57570$ifname#171.13.115.9$
[proxy:0:0 at node101] PMI response: cmd=put_result rc=0 msg=success
[proxy:0:0 at node101] pmi cmd from fd 6: cmd=barrier_in
[proxy:0:0 at node101] pmi cmd from fd 9: cmd=barrier_in
[proxy:0:0 at node101] PMI response: cmd=barrier_out
[proxy:0:0 at node101] PMI response: cmd=barrier_out
# OSU MPI Bi-Directional Bandwidth Test v5.5
# Size      Bandwidth (MB/s)

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 0 PID 994293 RUNNING AT node101
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 1 PID 994294 RUNNING AT node101
=   KILLED BY SIGNAL: 11 (Segmentation fault)
===================================================================================

Sometimes a few messages pass through. When I specify the hosts explicitly, the test runs fine as long as the two processes are on different hosts.
Note that osu_bw does not show any problems.
Copying the file libmpi.so.12.1.1 from either MPICH version works, but that seems like a fragile solution at best.
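For reference, the fragile workaround looks roughly like this (the MPICH install path is a hypothetical example; the MVAPICH2 prefix is the one used in the script below):

cp /path/to/mpich-install/lib/libmpi.so.12.1.1 \
   /home/user/software/mvapich_tcp/lib/libmpi.so.12.1.1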

My questions are:
Are these errors caused by wrong or missing OS packages?
How can the conflict be resolved?
I have attached a minimal example below:
1. Build the image with docker
2. Build the sif image with singularity
3. Run using Intel MPI on the host system (see the command sketch after this list)
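Roughly, the three steps are (the docker image tag is an assumption on my side; test.sif and the benchmark path are as above):

docker build -t mvapich-tcp .
singularity build test.sif docker-daemon://mvapich-tcp:latest
MPICH_NEMESIS_NETMOD=tcp mpiexec.hydra -verbose -n 2 -ppn 1 \
    singularity exec test.sif \
    /home/user/software/mvapich_tcp/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bibw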

Sincerely
Joachim Tscheuschner

Dockerfile:

FROM debian:10 AS mvapich
USER root
# Set environment
ENV mvapich_tcp=mvapich2-2.3.7 \
    PATH=/home/user/software/bin:$PATH:/home/user/.local/bin \
    LD_LIBRARY_PATH=/usr/lib64/libibverbs:$LD_LIBRARY_PATH \ 
    prefix_build=/home/user/software 
COPY install.mvapich.sh .
RUN bash ./install.mvapich.sh

Script (install.mvapich.sh):

set -x 
# Note: the separate OMB build below could be removed; osu-micro-benchmarks is already bundled with the package
export OSU_VERSION=5.5
export FFLAGS="-O3"
apt-get update
apt-get upgrade -y
apt-get install -y \
      autoconf \
      libtool \
      automake \
      gettext \
      make \
      build-essential \
      gfortran \
      bison \
      libibverbs1 \
      librdma* \
      libnuma-dev \
      wget
apt-get clean 
rm -rf /var/lib/apt/lists/* /var/tmp/* 
mkdir -p ${prefix_build}
mkdir /temp
cd /temp
#Download and unpack files
counter=0
until wget --no-check-certificate https://mvapich.cse.ohio-state.edu/download/mvapich/mv2/${mvapich_tcp}.tar.gz || [ $counter -gt 5 ] ; do sleep 30; ((counter++)); done
tar xzf ${mvapich_tcp}.tar.gz
counter=0
until wget --no-check-certificate http://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-${OSU_VERSION}.tar.gz || [ $counter -gt 5 ]; do sleep 30; ((counter++)); done
tar xzf osu-micro-benchmarks-${OSU_VERSION}.tar.gz
#Compile mvapich with nemesis (tcp)
cd ${mvapich_tcp}
/temp/${mvapich_tcp}/configure --prefix=${prefix_build}/mvapich_tcp \
 --with-device=ch3:nemesis 
make -j4 && make install
# compile OMB against the mvapich-tcp build
cd /temp/osu-micro-benchmarks-${OSU_VERSION}
LD_LIBRARY_PATH=${prefix_build}/mvapich_tcp/lib:$LD_LIBRARY_PATH \
/temp/osu-micro-benchmarks-${OSU_VERSION}/configure            \
    CC=${prefix_build}/mvapich_tcp/bin/mpicc  CXX=${prefix_build}/mvapich_tcp/bin/mpicxx  \
    CFLAGS=-I$(pwd)/../osu-micro-benchmarks-${OSU_VERSION}/util \
    --prefix=${prefix_build}/mvapich_tcp
echo "make obm"
make && make install
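As a quick sanity check at the end of the script (both tools should be installed into the MVAPICH2 bin directory, if I am not mistaken):

${prefix_build}/mvapich_tcp/bin/mpiname -a       # MVAPICH2 version and configure options
${prefix_build}/mvapich_tcp/bin/mpichversion     # MPICH-level version information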



