[Mvapich-discuss] Building mvapich2-2.3.7 ch3:nemesis / osu_bibw fails on same host
Tscheuschner Joachim
Joachim.Tscheuschner at dwd.de
Wed Jul 13 08:16:08 EDT 2022
Hi MVAPICH-Team,
I have difficulties installing and running mvapich2-2.3.7 in a container using
--with-device=ch3:nemesis
During the build I get error messages, which can be eliminated with, e.g.:
sed -i '154,156{s/^/\/\//}' src/mpid/ch3/channels/nemesis/netmod/tcp/tcp_getip.c
sed -i '155{s/^/fprintf(stdout, \"IPv4 adress = \%08x (\%s)\\n\", addr.s_addr,/}' src/mpid/ch3/channels/nemesis/netmod/tcp/tcp_getip.c
sed -i '156{s/^/inet_ntoa( addr ) );/}' src/mpid/ch3/channels/nemesis/netmod/tcp/tcp_getip.c
sed -i 's/mpierrno/mpi_errno/g' src/mpid/ch3/channels/nemesis/netmod/tcp/tcp_init.c (a typo in the source?)
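For reference, the four edits above can be collected into one helper function. This is only a sketch: the line numbers (154-156), the relative paths, and the inserted fprintf text are taken verbatim from the commands above and are assumed to match the unpacked mvapich2-2.3.7 tarball.

```shell
# Hypothetical helper collecting the four workarounds above; line numbers and
# paths are assumptions taken from the sed commands in this report.
apply_tcp_fixes() {
  # $1: root of the unpacked mvapich2-2.3.7 source tree
  local netmod="$1/src/mpid/ch3/channels/nemesis/netmod/tcp"
  # Comment out the offending debug lines 154-156 of tcp_getip.c ...
  sed -i '154,156{s/^/\/\//}' "$netmod/tcp_getip.c"
  # ... then prepend a fprintf whose arguments match the format string
  # (the original tails stay commented out behind the // markers).
  sed -i '155{s/^/fprintf(stdout, "IPv4 adress = %08x (%s)\\n", addr.s_addr,/}' "$netmod/tcp_getip.c"
  sed -i '156{s/^/inet_ntoa( addr ) );/}' "$netmod/tcp_getip.c"
  # Rename the apparent mpierrno typo to mpi_errno in tcp_init.c.
  sed -i 's/mpierrno/mpi_errno/g' "$netmod/tcp_init.c"
}
# Usage: apply_tcp_fixes /temp/mvapich2-2.3.7
```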
With these changes the build succeeds, but the osu_bibw test
MPICH_NEMESIS_NETMOD=tcp mpiexec.hydra -verbose -n 2 -ppn 1 singularity exec test.sif /home/user/software/mvapich_tcp/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bibw
fails with:
[mpiexec at node101] Launch arguments: /hpc/sw/intel/compilers_and_libraries_2020.0.166/linux/mpi/intel64/bin//hydra_bstrap_proxy --upstream-host node101 --upstream-port 34308 --pgid 0 --launcher ssh --launcher-number 0 --base-path /hpc/sw/intel/compilers_and_libraries_2020.0.166/linux/mpi/intel64/bin/ --tree-width 16 --tree-level 1 --time-left -1 --collective-launch 1 --debug --proxy-id 0 --node-id 0 --subtree-size 1 --upstream-fd 7 /hpc/sw/intel/compilers_and_libraries_2020.0.166/linux/mpi/intel64/bin//hydra_pmi_proxy --usize -1 --auto-cleanup 1 --abort-signal 9
[proxy:0:0 at node101] pmi cmd from fd 9: cmd=init pmi_version=1 pmi_subversion=1
[proxy:0:0 at node101] PMI response: cmd=response_to_init pmi_version=1 pmi_subversion=1 rc=0
[proxy:0:0 at node101] pmi cmd from fd 9: cmd=get_maxes
[proxy:0:0 at node101] PMI response: cmd=maxes kvsname_max=256 keylen_max=64 vallen_max=4096
[proxy:0:0 at node101] pmi cmd from fd 9: cmd=get_appnum
[proxy:0:0 at node101] PMI response: cmd=appnum appnum=0
[proxy:0:0 at node101] pmi cmd from fd 6: cmd=init pmi_version=1 pmi_subversion=1
[proxy:0:0 at node101] PMI response: cmd=response_to_init pmi_version=1 pmi_subversion=1 rc=0
[proxy:0:0 at node101] pmi cmd from fd 6: cmd=get_maxes
[proxy:0:0 at node101] PMI response: cmd=maxes kvsname_max=256 keylen_max=64 vallen_max=4096
[proxy:0:0 at node101] pmi cmd from fd 6: cmd=get_appnum
[proxy:0:0 at node101] PMI response: cmd=appnum appnum=0
[proxy:0:0 at node101] pmi cmd from fd 6: cmd=get_my_kvsname
[proxy:0:0 at node101] PMI response: cmd=my_kvsname kvsname=kvs_1205283_0
[proxy:0:0 at node101] pmi cmd from fd 9: cmd=get_my_kvsname
[proxy:0:0 at node101] PMI response: cmd=my_kvsname kvsname=kvs_1205283_0
[proxy:0:0 at node101] pmi cmd from fd 6: cmd=get_my_kvsname
[proxy:0:0 at node101] PMI response: cmd=my_kvsname kvsname=kvs_1205283_0
[proxy:0:0 at node101] pmi cmd from fd 9: cmd=get_my_kvsname
[proxy:0:0 at node101] PMI response: cmd=my_kvsname kvsname=kvs_1205283_0
[proxy:0:0 at node101] pmi cmd from fd 6: cmd=get kvsname=kvs_1205283_0 key=PMI_process_mapping
[proxy:0:0 at node101] PMI response: cmd=get_result rc=0 msg=success value=(vector,(0,1,2))
[proxy:0:0 at node101] pmi cmd from fd 9: cmd=get kvsname=kvs_1205283_0 key=PMI_process_mapping
[proxy:0:0 at node101] PMI response: cmd=get_result rc=0 msg=success value=(vector,(0,1,2))
[proxy:0:0 at node101] pmi cmd from fd 6: cmd=put kvsname=kvs_1205283_0 key=sharedFilename-0 value=/dev/shm/mpich_shar_tmpEmz9dR
[proxy:0:0 at node101] PMI response: cmd=put_result rc=0 msg=success
[proxy:0:0 at node101] pmi cmd from fd 9: cmd=barrier_in
[proxy:0:0 at node101] pmi cmd from fd 6: cmd=barrier_in
[proxy:0:0 at node101] PMI response: cmd=barrier_out
[proxy:0:0 at node101] PMI response: cmd=barrier_out
[proxy:0:0 at node101] pmi cmd from fd 9: cmd=get kvsname=kvs_1205283_0 key=sharedFilename-0
[proxy:0:0 at node101] PMI response: cmd=get_result rc=0 msg=success value=/dev/shm/mpich_shar_tmpEmz9dR
[proxy:0:0 at node101] pmi cmd from fd 6: cmd=put kvsname=kvs_1205283_0 key=Pbusinesscard-0 value=description#node101$port#40866$ifname#171.13.115.9$
[proxy:0:0 at node101] PMI response: cmd=put_result rc=0 msg=success
[proxy:0:0 at node101] pmi cmd from fd 9: cmd=put kvsname=kvs_1205283_0 key=Pbusinesscard-1 value=description#node101$port#57570$ifname#171.13.115.9$
[proxy:0:0 at node101] PMI response: cmd=put_result rc=0 msg=success
[proxy:0:0 at node101] pmi cmd from fd 6: cmd=barrier_in
[proxy:0:0 at node101] pmi cmd from fd 9: cmd=barrier_in
[proxy:0:0 at node101] PMI response: cmd=barrier_out
[proxy:0:0 at node101] PMI response: cmd=barrier_out
# OSU MPI Bi-Directional Bandwidth Test v5.5
# Size Bandwidth (MB/s)
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 0 PID 994293 RUNNING AT node101
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 1 PID 994294 RUNNING AT node101
= KILLED BY SIGNAL: 11 (Segmentation fault)
===================================================================================
Sometimes a few messages get through. When I specify the hosts explicitly, the test runs fine as long as the two processes are placed on different hosts.
Note that osu_bw does not show any problems.
Copying the file libmpi.so.12.1.1 from either MPICH version works, but that seems to be a fragile solution at best.
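One way to see which libmpi the benchmark binary actually resolves at run time, and hence whether the libmpi.so.12.1.1 copy changes the picture, is to filter `ldd` output. The helper below is a sketch, not part of the report; the binary path in the usage comment is the one from the test command above.

```shell
# Small helper (a sketch): filter `ldd` output down to the libmpi lines so the
# resolved MPI library path is easy to spot.
filter_libmpi() {
  # Print the soname and the resolved path of every libmpi entry on stdin.
  awk '/libmpi/ {print $1, $3}'
}
# Intended use (binary path taken from the report):
#   singularity exec test.sif ldd \
#     /home/user/software/mvapich_tcp/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bibw \
#     | filter_libmpi
```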
My questions are:
Do the errors stem from wrong or missing OS packages?
How can the conflict be resolved?
I have attached a minimal reproducer:
1. Build the image with docker
2. Make the sif-image with singularity
3. Run using intelmpi on the host-system
Sincerely
Joachim Tscheuschner
Dockerfile:
FROM debian:10 as mvapich
USER root
# Set environment
ENV mvapich_tcp=mvapich2-2.3.7 \
PATH=/home/user/software/bin:$PATH:/home/user/.local/bin \
LD_LIBRARY_PATH=/usr/lib64/libibverbs:$LD_LIBRARY_PATH \
prefix_build=/home/user/software
COPY install.mvapich.sh .
RUN bash ./install.mvapich.sh
Script:
set -x
# Remove OMB if it is already in the package
export OSU_VERSION=5.5
export FFLAGS="-O3"
apt-get update
apt-get upgrade -y
apt-get install -y \
autoconf \
libtool \
automake \
gettext \
make \
build-essential \
gfortran \
bison \
libibverbs1 \
librdma* \
libnuma-dev \
wget
apt-get clean
rm -rf /var/lib/apt/lists/* /var/tmp/*
mkdir -p ${prefix_build}
mkdir /temp
cd /temp
#Download and unpack files
counter=0
until wget --no-check-certificate https://mvapich.cse.ohio-state.edu/download/mvapich/mv2/${mvapich_tcp}.tar.gz || [ $counter -gt 5 ] ; do sleep 30; ((counter++)); done
tar xzf ${mvapich_tcp}.tar.gz
counter=0
until wget --no-check-certificate http://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-${OSU_VERSION}.tar.gz || [ $counter -gt 5 ]; do sleep 30; ((counter++)); done
tar xzf osu-micro-benchmarks-${OSU_VERSION}.tar.gz
#Compile mvapich with nemesis (tcp)
cd ${mvapich_tcp}
/temp/${mvapich_tcp}/configure --prefix=${prefix_build}/mvapich_tcp \
--with-device=ch3:nemesis
make -j4 && make install
# compile obm with the mvapich-tcp
cd /temp/osu-micro-benchmarks-${OSU_VERSION}
LD_LIBRARY_PATH=${prefix_build}/mvapich_tcp/lib:$LD_LIBRARY_PATH \
/temp/osu-micro-benchmarks-${OSU_VERSION}/configure \
CC=${prefix_build}/mvapich_tcp/bin/mpicc CXX=${prefix_build}/mvapich_tcp/bin/mpicxx \
CFLAGS=-I$(pwd)/../osu-micro-benchmarks-${OSU_VERSION}/util \
--prefix=${prefix_build}/mvapich_tcp
echo "make obm"
make && make install
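Incidentally, the two `until wget` retry loops in the script could be factored into a single helper. This is just a sketch mirroring the loops above (up to 6 attempts, 30 s apart); nothing else in the script depends on it.

```shell
# Sketch of a retry helper mirroring the two download loops in the script
# above: retry until success or until 6 attempts have failed.
fetch_with_retry() {
  # $1: URL to download
  local counter=0
  until wget --no-check-certificate "$1" || [ "$counter" -gt 5 ]; do
    sleep 30
    counter=$((counter + 1))
  done
}
# Usage:
#   fetch_with_retry https://mvapich.cse.ohio-state.edu/download/mvapich/mv2/${mvapich_tcp}.tar.gz
```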