[mvapich-discuss] VBUF Abort reached in job
Thompson, Matt (GSFC-610.1)[SCIENCE SYSTEMS AND APPLICATIONS INC]
matthew.thompson at nasa.gov
Thu May 9 11:59:21 EDT 2013
MVAPICH List,
I recently encountered an issue that is obscure enough that I need to
ask an expert. The job uses Intel 13.1 and MVAPICH2 1.8.1 and runs on
1536 processors (128 Westmere nodes). It dies with this error:
> [vbuf.c 963] Cannot register vbuf region
> [0] Abort: UD VBUF reagion allocation failed. Pool size 1024
> at line 1016 in file vbuf.c
> [borg01r001:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 14. MPI process died?
> [borg01r001:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI process died?
> [borg01r001:mpispawn_0][child_handler] MPI process (rank: 0, pid: 3892) exited with status 255
<snip>
Of course, there are lots of those mpispawn messages; I see them all
the time when an MPI job dies. But the first three lines are new to
me. A look around the list turns up threads like this:
http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/2013-February/004263.html
which seem to involve vbuf, but they might not be related to the same
issue. Is it InfiniBand related (the cluster, Discover at NCCS at NASA
Goddard, uses InfiniBand)? Or something else?
Any ideas?
Thanks,
Matt Thompson
--
Matt Thompson, PhD SSAI, Sr Software Test Engr
NASA GSFC, Global Modeling and Assimilation Office
Code 610.1, 8800 Greenbelt Rd, Greenbelt, MD 20771
Phone: 301-614-6712 Fax: 301-614-6246