[mvapich-discuss] EOFError with mpi4py

Ankur Sinha sanjay.ankur at gmail.com
Mon May 28 07:00:01 EDT 2018


Hello,

With some of my simulations (using the NEST simulator[0] and mpi4py), I get
this error after an mpi4py `allgather` command (the NEST bits seem to
work fine):

>  ranksets = self.comm.allgather(lneurons)
>  File "mpi4py/MPI/Comm.pyx", line 1272, in mpi4py.MPI.Comm.allgather
>  File "mpi4py/MPI/msgpickle.pxi", line 781, in mpi4py.MPI.PyMPI_allgather
>  File "mpi4py/MPI/msgpickle.pxi", line 136, in mpi4py.MPI.Pickle.loadv
>  File "mpi4py/MPI/msgpickle.pxi", line 111, in mpi4py.MPI.Pickle.load
>  File "mpi4py/MPI/msgpickle.pxi", line 101, in mpi4py.MPI.Pickle.cloads
>  EOFError


It does not happen all the time, which makes it harder to reproduce and
debug.

The docs already mention that one should use
`LD_PRELOAD=/path/to/libmpi.so` to work around issues with Python and I've
done that, but this still occurs. There's a bug filed upstream with
mpi4py, but mpi4py upstream says it isn't an issue there[1]. Unfortunately,
no workaround for mvapich was suggested there. Would anyone know how I
can correct or work around this? I've already tried calling `barrier`
before the allgather call, hoping that would force the processes to
synchronise, but that doesn't seem to have worked.
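
A minimal diagnostic sketch, not a confirmed fix: it does the pickling by
hand and exchanges the payloads with the buffer-based `Allgather` and
`Allgatherv` calls, so the per-rank byte counts are visible before anything
is unpickled. The helper names (`pack`, `displacements`, `unpack_all`,
`allgather_checked`) are hypothetical and are not part of mpi4py; the sketch
assumes each pickled payload fits in a C int count.

```python
import pickle
from array import array


def pack(obj):
    """Pickle obj into a byte buffer that MPI can send."""
    return bytearray(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))


def displacements(counts):
    """Byte offset of each rank's payload in the concatenated buffer."""
    offsets, total = [], 0
    for n in counts:
        offsets.append(total)
        total += n
    return offsets


def unpack_all(recvbuf, counts):
    """Split the receive buffer by counts and unpickle each slice."""
    view = memoryview(recvbuf)
    return [pickle.loads(bytes(view[off:off + n]))
            for off, n in zip(displacements(counts), counts)]


def allgather_checked(comm, obj):
    """Hypothetical stand-in for comm.allgather(obj) with explicit counts."""
    from mpi4py import MPI  # imported here so the helpers above work without MPI

    payload = pack(obj)

    # Step 1: every rank learns how many bytes every other rank will send.
    sendcount = array("i", [len(payload)])
    counts = array("i", [0] * comm.Get_size())
    comm.Allgather([sendcount, MPI.INT], [counts, MPI.INT])
    counts = list(counts)

    # Step 2: exchange the pickled bytes using those counts.
    recvbuf = bytearray(sum(counts))
    comm.Allgatherv([payload, MPI.BYTE],
                    [recvbuf, counts, displacements(counts), MPI.BYTE])

    # Step 3: unpickle each rank's slice.
    return unpack_all(recvbuf, counts)
```

Replacing `self.comm.allgather(lneurons)` with
`allgather_checked(self.comm, lneurons)` and logging `counts` on each rank
should show whether the sizes agree across ranks when the EOFError appears,
which may help narrow down whether the byte exchange or the unpickling is
at fault.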

From mpi4py:
> MPI.Get_library_version()
>  Out[2]: 'MVAPICH2 Version      :\t2.1a\nMVAPICH2 Release date :\tSun Sep 21 12:00:00 EDT 2014\nMVAPICH2 Device       :\tch3:mrail\nMVAPICH2 configure    :\t--prefix=/usr/mpi/gcc/mvapich2-2.1a\nMVAPICH2 CC :\tgcc    -DNDEBUG -DNVALGRIND -O2\nMVAPICH2 CXX          :\tg++ -DNDEBUG -DNVALGRIND -O2\nMVAPICH2 F77          :\tgfortran -L/lib -L/lib   -O2\nMVAPICH2 FC           :\tgfortran   -O2\n'

(It's an older version of mvapich that is installed on our cluster. I
have built the newer version, but with that, simulations fail with a
rather cryptic hwloc error right at the start. I'll discuss that in a
different thread.)

The docs also mention increasing the size of the internal communication
buffer used by mvapich, and the "switch point between eager and
rendezvous protocol" too, but I'm afraid I don't know what values I
should set these to.

[0] http://nest-simulator.org/
[1] https://bitbucket.org/mpi4py/mpi4py/issues/39/mpi-msgpicklepxi-eoferror

-- 
Thanks,
Regards,

Ankur Sinha

Ph.D. candidate - UH Biocomputation
Visiting lecturer - School of Computer Science
University of Hertfordshire,
Hatfield, UK

http://biocomputation.herts.ac.uk