[mvapich-discuss] Error in initializing MVAPICH2 ptmalloc library with mpi4py

Jonatan Martín martin at ice.csic.es
Tue Dec 15 06:30:07 EST 2020


Hi Hari,

Thank you very much for your fast reply. I tried both tests, but I couldn't
solve the problem. In fact, I realized that the code gets stuck at some
point in a loop where the processes exchange information. It's really
strange. Let me show you some examples.

We use the Slurm queueing system to run our jobs, and the process
management interface is PMIx. If I use the script:

#!/bin/bash
#SBATCH -n 40
#SBATCH -N 1
mpirun python myprogram.py

The program starts and finishes correctly, but with the following warnings:

- WARNING: Error in initializing MVAPICH2 ptmalloc library. Continuing
without InfiniBand registration cache support.
- [hidra1:mpi_rank_0][rdma_param_handle_heterogeneity] All nodes involved
in the job were detected to be homogeneous in terms of processors and
interconnects. Setting MV2_HOMOGENEOUS_CLUSTER=1 can improve job startup
performance on such systems.

The first warning goes away when we set "LD_PRELOAD=<full path to
libmpi.so>" and the second when we set "MV2_HOMOGENEOUS_CLUSTER=1".
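
For reference, a minimal sketch of how we apply those two settings in the
single-node batch script (the library path below is only a placeholder for
our local MVAPICH2 installation):

#!/bin/bash
#SBATCH -n 40
#SBATCH -N 1
# Placeholder: replace with the actual path to libmpi.so of the MVAPICH2 build
export LD_PRELOAD=/path/to/mvapich2-2.3.5/lib/libmpi.so
# All nodes are identical, so skip the heterogeneity detection at startup
export MV2_HOMOGENEOUS_CLUSTER=1
mpirun python myprogram.py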

But when I want to use more than one node (let's say 2),

#!/bin/bash
#SBATCH -N 2
mpirun_rsh -hostfile hosts -n 40 python myprogram.py

Then the program gets stuck. I'm wondering whether we should run the code
with srun and rebuild mvapich2 with a configuration different from the
default one (maybe with --with-pm=slurm and --with-pmi=pmix). We tried that
before, but we couldn't get the code to run. Do you know whether, when
using srun, we must explicitly pass the option --mpi=pmix?
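
If rebuilding is the right way to go, this is a sketch of the submission
script we would try, assuming mvapich2 is reconfigured with --with-pm=slurm
and --with-pmi=pmix (the library path is again only a placeholder):

#!/bin/bash
#SBATCH -N 2
#SBATCH -n 40
# Placeholder path to the PMIx-enabled MVAPICH2 build
export LD_PRELOAD=/path/to/mvapich2-pmix/lib/libmpi.so
export MV2_HOMOGENEOUS_CLUSTER=1
# Launch through Slurm's PMIx plugin instead of mpirun_rsh
srun --mpi=pmix python myprogram.py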

Thank you again and regards,

Jonatan


Message from Subramoni, Hari <subramoni.1 at osu.edu> on Mon, Dec 14,
2020 at 17:37:

> Hi, Jonatan.
>
>
>
> Can you please try the following options one by one and let us know what
> performance you observe?
>
>
>
>    1. Add “LD_PRELOAD=<full path to libmpi.so>” before your launch command,
>       e.g. LD_PRELOAD=/home/subramon/MVAPICH2/git/install-dir/lib/libmpi.so
>       mpirun_rsh -np xx -hostfile <hostfile> <path_to_executable>
>    2. Set MV2_VBUF_TOTAL_SIZE & MV2_IBA_EAGER_THRESHOLD to 128 KB
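>
> For example, combining both suggestions on a single launch line (the
> install path, process count, host file, and executable are placeholders;
> 128 KB = 131072 bytes):
>
> LD_PRELOAD=/path/to/mvapich2/lib/libmpi.so mpirun_rsh -np xx -hostfile
> <hostfile> MV2_VBUF_TOTAL_SIZE=131072 MV2_IBA_EAGER_THRESHOLD=131072
> <path_to_executable>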
>
>
>
> Thx,
>
> Hari.
>
>
>
> *From:* mvapich-discuss-bounces at cse.ohio-state.edu <mvapich-discuss-bounces at mailman.cse.ohio-state.edu> *On Behalf Of* Jonatan Martín
> *Sent:* Monday, December 14, 2020 10:59 AM
> *To:* mvapich-discuss at cse.ohio-state.edu <mvapich-discuss at mailman.cse.ohio-state.edu>
> *Subject:* [mvapich-discuss] Error in initializing MVAPICH2 ptmalloc library with mpi4py
>
>
>
> Hi,
>
>
>
> I'm trying to run a Python code with mvapich2 2.3.5 on multiple nodes,
> but I get the following warning:
>
>
>
> WARNING: Error in initializing MVAPICH2 ptmalloc library. Continuing
> without InfiniBand registration cache support.
>
>
>
> The execution continues, but with poor performance. I've read in another
> discussion (
> http://mailman.cse.ohio-state.edu/pipermail/mvapich-discuss/2015-April/005532.html)
> that this issue was supposed to be fixed in version 2.2. I wanted to ask
> whether it has indeed been fixed, or whether it is really a problem with
> the configuration of mvapich2 on the cluster I'm using. Can you confirm this?
>
>
>
> It is also stated that you can modify the variables "MV2_VBUF_TOTAL_SIZE"
> and "MV2_IBA_EAGER_THRESHOLD" to get better performance, but I'm not sure
> how to tune them. What is the maximum recommended size to use there, and
> how do you determine the optimal size?
>
>
>
> Thank you very much and regards,
>
>
>
> Jonatan
>