[mvapich-discuss] Error in initializing MVAPICH2 ptmalloc library with mpi4py

Jonatan Martín martin at ice.csic.es
Tue Dec 15 11:51:43 EST 2020


Hi Hari,

I've tried the following script and could finally run it without warnings.

#!/bin/bash
#SBATCH -N 2
LD_PRELOAD=<full path to libmpi.so> mpiexec -f hosts -n 40 \
    -env MV2_HOMOGENEOUS_CLUSTER=1 python myprogram.py

Thanks for your help! As a final question, what's the impact of setting
MV2_HOMOGENEOUS_CLUSTER=1?

Regards,

Jonatan

Message from Subramoni, Hari <subramoni.1 at osu.edu> on Tue, 15 Dec 2020 at
16:15:

> Hi, Jonatan.
>
>
>
> There could be some firewall issue preventing mpirun_rsh from launching on
> multiple nodes. Can you please try to launch with mpiexec instead to see if
> it works?
>
>
>
> mpiexec -n <num_procs> -f <hostfile> -env <env vars if any> <executable>
>
>
>
> If you would like to launch with srun, then yes, you should reconfigure
> MVAPICH2 with --with-pm=slurm and --with-pmi=pmix and pass the PMIx
> option you mentioned when launching.
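>
> A minimal sketch of what that rebuild and launch could look like (the
> install prefix and the process count below are assumptions, not taken
> from your setup):
>
> # Reconfigure and rebuild MVAPICH2 against Slurm and PMIx
> ./configure --prefix=/opt/mvapich2-2.3.5-slurm \
>     --with-pm=slurm --with-pmi=pmix
> make -j 8 && make install
>
> # Then, inside the batch job, launch the ranks directly with srun
> srun --mpi=pmix -n 40 python myprogram.py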
>
>
>
> Thx,
>
> Hari.
>
>
>
> *From:* Jonatan Martín <martin at ice.csic.es>
> *Sent:* Tuesday, December 15, 2020 6:30 AM
> *To:* Subramoni, Hari <subramoni.1 at osu.edu>
> *Cc:* mvapich-discuss at cse.ohio-state.edu <
> mvapich-discuss at mailman.cse.ohio-state.edu>
> *Subject:* Re: [mvapich-discuss] Error in initializing MVAPICH2 ptmalloc
> library with mpi4py
>
>
>
> Hi Hari,
>
>
>
> Thank you very much for your fast reply. I ran both tests, but neither
> solved the problem. Actually, I realized that the code gets stuck at some
> point in a loop where information is exchanged between CPUs. It's really
> weird. Let me show you some examples.
>
>
>
> We use the queue system Slurm to run our jobs. The process manager
> interface is PMIx. If I use the script:
>
>
>
> #!/bin/bash
>
> #SBATCH -n 40
>
> #SBATCH -N 1
>
> mpirun python myprogram.py
>
>
>
> The program starts and finishes well, but with the following warnings:
>
>
>
> - WARNING: Error in initializing MVAPICH2 ptmalloc library. Continuing
> without InfiniBand registration cache support.
>
> - [hidra1:mpi_rank_0][rdma_param_handle_heterogeneity] All nodes involved
> in the job were detected to be homogeneous in terms of processors and
> interconnects. Setting MV2_HOMOGENEOUS_CLUSTER=1 can improve job startup
> performance on such systems.
>
>
>
> The first warning seems to be resolved by setting "LD_PRELOAD=<full path
> to libmpi.so>" and the second by setting "MV2_HOMOGENEOUS_CLUSTER=1".
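>
> For reference, a sketch of the single-node script with both settings
> applied (the libmpi.so path is left as a placeholder):
>
> #!/bin/bash
> #SBATCH -n 40
> #SBATCH -N 1
> # LD_PRELOAD lets MVAPICH2 set up its ptmalloc hooks before Python loads
> # the MPI library (this is what removed the first warning), and
> # MV2_HOMOGENEOUS_CLUSTER=1 skips heterogeneity detection at startup
> # (which removes the second warning).
> LD_PRELOAD=<full path to libmpi.so> mpirun -env MV2_HOMOGENEOUS_CLUSTER=1 \
>     python myprogram.py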
>
>
>
> But when I want to use more than one node (let's say 2),
>
>
>
> #!/bin/bash
>
> #SBATCH -N 2
>
> mpirun_rsh -hostfile hosts -n 40 python myprogram.py
>
>
>
> Then the program gets stuck. I'm wondering if we should run the code with
> srun and install mvapich2 with a configuration different from the default
> one (maybe with --with-pm=slurm and --with-pmi=pmix). We tried that
> before, but we couldn't manage to run the code. Do you know whether, if we
> use srun, we must explicitly give the option --mpi=pmix?
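>
> For reference, the kind of srun-based job script in question would look
> roughly like this (a sketch; whether --mpi=pmix must be passed explicitly
> depends on the cluster's MpiDefault setting in slurm.conf):
>
> #!/bin/bash
> #SBATCH -N 2
> #SBATCH -n 40
> # Let Slurm start the MPI ranks itself through its PMIx interface
> srun --mpi=pmix python myprogram.py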
>
>
>
> Thank you again and regards,
>
>
>
> Jonatan
>
>
>
>
>
> Message from Subramoni, Hari <subramoni.1 at osu.edu> on Mon, 14 Dec 2020 at
> 17:37:
>
> Hi, Jonatan.
>
>
>
> Can you please try the following options one by one and let us know what
> performance you observe?
>
>
>
>    1. Add "LD_PRELOAD=<full path to libmpi.so>" before your launch
>       command, e.g.:
>
>       LD_PRELOAD=/home/subramon/MVAPICH2/git/install-dir/lib/libmpi.so \
>       mpirun_rsh -np xx -hostfile <hostfile> <path_to_executable>
>
>    2. Set MV2_VBUF_TOTAL_SIZE & MV2_IBA_EAGER_THRESHOLD to 128 KB (see
>       the sketch after this list)
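>
> A minimal command-line sketch of both suggestions combined (the process
> count, hostfile name, and byte values are assumptions; 128 KB = 131072
> bytes):
>
> LD_PRELOAD=<full path to libmpi.so> mpirun_rsh -np 40 -hostfile hosts \
>     MV2_VBUF_TOTAL_SIZE=131072 MV2_IBA_EAGER_THRESHOLD=131072 \
>     python myprogram.py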
>
>
>
> Thx,
>
> Hari.
>
>
>
> *From:* mvapich-discuss-bounces at cse.ohio-state.edu <
> mvapich-discuss-bounces at mailman.cse.ohio-state.edu> *On Behalf Of *Jonatan
> Martín
> *Sent:* Monday, December 14, 2020 10:59 AM
> *To:* mvapich-discuss at cse.ohio-state.edu <
> mvapich-discuss at mailman.cse.ohio-state.edu>
> *Subject:* [mvapich-discuss] Error in initializing MVAPICH2 ptmalloc
> library with mpi4py
>
>
>
> Hi,
>
>
>
> I'm trying to run a Python code with mvapich2 2.3.5 on multiple nodes,
> but I get the following warning:
>
>
>
> WARNING: Error in initializing MVAPICH2 ptmalloc library. Continuing
> without InfiniBand registration cache support.
>
>
>
> The execution goes on, but with poor performance. I've read in another
> discussion (
> http://mailman.cse.ohio-state.edu/pipermail/mvapich-discuss/2015-April/005532.html)
> that this issue was supposed to be solved in version 2.2. I wanted to ask
> whether this issue has already been fixed or whether it is really a
> problem with the configuration of mvapich2 on the cluster I'm using. Can
> you confirm this?
>
>
>
> It is also stated that you can modify the variables "MV2_VBUF_TOTAL_SIZE"
> and "MV2_IBA_EAGER_THRESHOLD" to get better performance, but I'm not sure
> how to tune them. What is the maximum recommended size you can use there,
> and how do you determine the optimal one?
>
>
>
> Thank you very much and regards,
>
>
>
> Jonatan
>
>

