[mvapich-discuss] EOFError with mpi4py

Sourav Chakraborty chakraborty.52 at buckeyemail.osu.edu
Tue May 29 13:41:52 EDT 2018


Hi Ankur,

Can you try setting MV2_USE_LAZY_MEM_UNREGISTER=0 and see if it fixes the
issue?
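
How you set it depends on your launcher: mpirun_rsh takes VAR=value on
the command line, while mpiexec/hydra inherits it from the environment.
If it is more convenient, it can also be set from the script itself, as
long as that happens before mpi4py is imported (MVAPICH2 reads its MV2_*
variables at MPI_Init time). A minimal sketch, with the printout only a
placeholder:

    import os
    # Must run before "from mpi4py import MPI": importing mpi4py calls
    # MPI_Init, which is when MVAPICH2 reads its MV2_* variables.
    os.environ["MV2_USE_LAZY_MEM_UNREGISTER"] = "0"

    from mpi4py import MPI
    comm = MPI.COMM_WORLD
    print(comm.Get_rank(), comm.Get_size())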

Also, can you please post the exact hwloc error you are getting?

Please note that MVAPICH2 2.1a is quite old; the latest version (2.3rc2)
has several bug fixes and performance improvements. Can you try the
latest version and see if the issue persists?

Thanks,
Sourav


On Mon, May 28, 2018 at 7:00 AM Ankur Sinha <sanjay.ankur at gmail.com> wrote:

> Hello,
>
> With some of my simulations (using the NEST simulator[0] and mpi4py), I
> get this error after an mpi4py `allgather` call (the NEST bits seem to
> work fine):
>
> >  ranksets = self.comm.allgather(lneurons)
> >  File "mpi4py/MPI/Comm.pyx", line 1272, in mpi4py.MPI.Comm.allgather
> >  File "mpi4py/MPI/msgpickle.pxi", line 781, in mpi4py.MPI.PyMPI_allgather
> >  File "mpi4py/MPI/msgpickle.pxi", line 136, in mpi4py.MPI.Pickle.loadv
> >  File "mpi4py/MPI/msgpickle.pxi", line 111, in mpi4py.MPI.Pickle.load
> >  File "mpi4py/MPI/msgpickle.pxi", line 101, in mpi4py.MPI.Pickle.cloads
> >  EOFError
>
>
> It does not happen all the time, which makes it harder to reproduce and
> debug.
>
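> A minimal sketch of the call in question, with a stand-in for the real
> per-rank data (the actual code gathers per-rank neuron lists from NEST):
>
>     from mpi4py import MPI
>
>     comm = MPI.COMM_WORLD
>     # stand-in for the per-rank list of neuron IDs
>     lneurons = list(range(comm.Get_rank() * 10, (comm.Get_rank() + 1) * 10))
>     # pickle-based allgather: each rank should get back a list holding
>     # every rank's object; the EOFError is raised while unpickling these
>     ranksets = comm.allgather(lneurons)
>     assert len(ranksets) == comm.Get_size()
>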
> The docs already mention that one should use
> `LD_PRELOAD=/path/to/libmpi.so` to work around issues with Python, and
> I've done that, but the error still occurs. There's a bug filed upstream
> with mpi4py, but mpi4py upstream says it isn't an issue there[1].
> Unfortunately, no workaround for mvapich was suggested there either.
> Would anyone know how I can correct or work around this? I've already
> tried using a `barrier` before the allgather call, hoping that would
> force the processes to synchronise, but that doesn't seem to have worked.
>
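> (As I understand it, the same effect as LD_PRELOAD can also be had from
> inside the script by loading libmpi with global symbol visibility before
> importing mpi4py; the library name/path below is an assumption, ours
> lives under the /usr/mpi/gcc/mvapich2-2.1a prefix:)
>
>     import ctypes
>     # Load the MPI library into the global symbol namespace first; this
>     # is what LD_PRELOAD=/path/to/libmpi.so does from outside the process.
>     ctypes.CDLL("libmpi.so", mode=ctypes.RTLD_GLOBAL)
>
>     from mpi4py import MPI
>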
> From mpi4py:
> > MPI.Get_library_version()
> >  Out[2]:
> >  MVAPICH2 Version      : 2.1a
> >  MVAPICH2 Release date : Sun Sep 21 12:00:00 EDT 2014
> >  MVAPICH2 Device       : ch3:mrail
> >  MVAPICH2 configure    : --prefix=/usr/mpi/gcc/mvapich2-2.1a
> >  MVAPICH2 CC           : gcc -DNDEBUG -DNVALGRIND -O2
> >  MVAPICH2 CXX          : g++ -DNDEBUG -DNVALGRIND -O2
> >  MVAPICH2 F77          : gfortran -L/lib -L/lib -O2
> >  MVAPICH2 FC           : gfortran -O2
>
> (It's an older version of mvapich that is installed on our cluster. I
> have built the newer version, but with that, simulations fail with a
> rather cryptic hwloc error right at the start. I'll discuss that in a
> different thread.)
>
> The docs also mention increasing the size of the internal communication
> buffer used by mvapich, as well as the "switch point between eager and
> rendezvous protocol", but I'm afraid I don't know what values to set
> these to.
>
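> (If I am reading the MVAPICH2 user guide correctly, the relevant knobs
> appear to be MV2_IBA_EAGER_THRESHOLD for the eager/rendezvous switch
> point and MV2_VBUF_TOTAL_SIZE for the internal buffer, set to the same
> value; the 128 KiB below is only a guess, not a recommendation:)
>
>     import os
>     # Assumed parameter names from the MVAPICH2 user guide; both must be
>     # in the environment before MPI_Init, i.e. before mpi4py is imported.
>     os.environ["MV2_IBA_EAGER_THRESHOLD"] = "131072"  # eager/rendezvous switch
>     os.environ["MV2_VBUF_TOTAL_SIZE"] = "131072"      # internal comm buffer
>
>     from mpi4py import MPI
>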
> [0] http://nest-simulator.org/
> [1]
> https://bitbucket.org/mpi4py/mpi4py/issues/39/mpi-msgpicklepxi-eoferror
>
> --
> Thanks,
> Regards,
>
> Ankur Sinha
>
> Ph.D. candidate - UH Biocomputation
> Visiting lecturer - School of Computer Science
> University of Hertfordshire,
> Hatfield, UK
>
> http://biocomputation.herts.ac.uk
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>