[mvapich-discuss] VBUF Abort reached in job

Thompson, Matt (GSFC-610.1)[SCIENCE SYSTEMS AND APPLICATIONS INC] matthew.thompson at nasa.gov
Thu May 9 11:59:21 EDT 2013


MVAPICH List,

I recently encountered an issue that is obscure enough that I need to 
ask an expert. The job was built with Intel 13.1 and MVAPICH2 1.8.1 and 
runs on 1536 processors (128 Westmere nodes). It dies with this error:

> [vbuf.c 963] Cannot register vbuf region
> [0] Abort: UD VBUF reagion allocation failed. Pool size 1024
>  at line 1016 in file vbuf.c
> [borg01r001:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 14. MPI process died?
> [borg01r001:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI process died?
> [borg01r001:mpispawn_0][child_handler] MPI process (rank: 0, pid: 3892) exited with status 255
<snip>

There are lots of those mpispawn messages, of course; I see them all 
the time when an MPI job dies. But the first three lines are new to 
me. A search of the list archives turns up threads like this:

http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/2013-February/004263.html

which seem to involve vbuf, but they might not be related to the same 
issue. Is it InfiniBand related (the cluster, Discover at NCCS at NASA 
Goddard, uses InfiniBand)? Or something else?

Any ideas?

Thanks,
Matt Thompson
-- 
Matt Thompson, PhD     SSAI, Sr Software Test Engr
NASA GSFC, Global Modeling and Assimilation Office
Code 610.1, 8800 Greenbelt Rd, Greenbelt, MD 20771
Phone: 301-614-6712              Fax: 301-614-6246

