[mvapich-discuss] ENOMEM when writing to /dev/infiniband/uverbs0

Paul Howard paulhoward at microway.com
Tue Jul 31 12:27:08 EDT 2007


I have an issue with an MPI application.

The version of MVAPICH is 0.9.8, compiled with PGI 6.2.

The program, also compiled with PGI 6.2, is running on an 8-node
cluster, with two dual-core Opteron 2218s in each node. Each node has
4GB of memory. The nodes are named node10, node11, ..., node17. I
start the MPI job on node10: "mpirun -np 32 ./wrf.exe". The machines
list names the 8 nodes on the first 8 lines, then repeats those 8
lines 3 more times, for a total of 32 lines.
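
For reference, the layout of the machines list described above can be
sketched as follows. This is only an illustration: the function name
build_machines_list and its parameters are hypothetical, and only the
node names and the repeat count come from the description.

```python
def build_machines_list(first=10, last=17, repeats=4):
    """Return node10..node17 in order, with the whole block repeated."""
    nodes = ["node%d" % n for n in range(first, last + 1)]
    return nodes * repeats


machines = build_machines_list()
# One host name per line, as in the machines list.
print("\n".join(machines))
```

With the defaults this gives 32 entries, so consecutive ranks are
spread round-robin over the 8 nodes and each node ends up with 4
processes.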

The program runs successfully as root with np=32. (It takes hours to
run.) When run as an ordinary user, it fails almost immediately
(within 5 seconds or so) with a segmentation fault.

It also fails when I remove the last 3 occurrences of node10 from the
machines list and run with np=29 as an ordinary user (and as expected,
it does not fail immediately as root with np=29). Doing it this way
lets me run strace on the single process on node10.
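
The trimmed list used for the np=29 run can be derived from the full
one as sketched below. The helper name drop_last_occurrences is
hypothetical; the sketch only assumes the list layout described
earlier.

```python
def drop_last_occurrences(machines, host, count):
    """Return a copy of machines without the last `count` entries for host."""
    result = list(machines)
    removed = 0
    # Walk backwards so the earliest occurrence of host is kept.
    for i in range(len(result) - 1, -1, -1):
        if removed == count:
            break
        if result[i] == host:
            del result[i]
            removed += 1
    return result


full = ["node%d" % n for n in range(10, 18)] * 4
trimmed = drop_last_occurrences(full, "node10", 3)
print(len(trimmed), trimmed.count("node10"))
```

This leaves 29 entries with node10 appearing once, which is what makes
it possible to trace a single process on that node.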

Some, but not all, of its writes to /dev/infiniband/uverbs0 fail with
ENOMEM. The error is reported a number of times; the segmentation
fault came on the 38th ENOMEM. (When run in a similar way as root,
with np=29 and running strace on the only process on node10, there
are no ENOMEM errors.) I couldn't find anything relevant with
Google.
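
A count like the one above can be reproduced from an strace log with a
short script. This is a minimal sketch, assuming the log uses the same
line format as the excerpt in this message; the function name
count_enomem_writes is hypothetical.

```python
import re


def count_enomem_writes(trace_text, fd=3):
    """Return (failed_with_ENOMEM, succeeded) for write(fd, ...) calls."""
    # DOTALL lets a call span a line break, as in a mail-wrapped excerpt.
    pattern = re.compile(
        r"write\(%d, .*?\)\s*=\s*(-?\d+)(?: (E[A-Z]+))?" % fd, re.DOTALL)
    failed = succeeded = 0
    for match in pattern.finditer(trace_text):
        result, errno_name = match.group(1), match.group(2)
        if errno_name == "ENOMEM":
            failed += 1
        elif not result.startswith("-"):
            succeeded += 1
    return failed, succeeded
```

Running it on the root trace and the ordinary-user trace gives a quick
side-by-side comparison of how many writes to fd 3 failed in each run.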

Excerpts from the strace output follow (I've added some blank lines
to make things stand out). I can provide the whole 7MB strace log if
it would be useful.

=============== START OF strace SNIPPETS ===============================

[280 lines deleted]

open("/sys/class/infiniband_verbs/uverbs0/abi_version", O_RDONLY) = 3
read(3, "1\n", 8)                       = 2
close(3)                                = 0
open("/sys/class/infiniband_verbs/uverbs0/device/vendor", O_RDONLY) = 3
read(3, "0x15b3\n", 8)                  = 7
close(3)                                = 0
open("/sys/class/infiniband_verbs/uverbs0/device/device", O_RDONLY) = 3
read(3, "0x6274\n", 8)                  = 7
close(3)                                = 0


open("/dev/infiniband/uverbs0", O_RDWR) = 3


write(3, "\0\0\0\0\4\0\4\0000\223\336\356\377\177\0\0", 16) = 16
mmap(NULL, 4096, PROT_WRITE, MAP_SHARED, 3, 0) = 0x2ac1bc8e2000
write(3, "\3\0\0\0\4\0\3\0\0\223\336\356\377\177\0\0", 16) = 16
write(3, "\3\0\0\0\4\0\3\0`\223\336\356\377\177\0\0", 16) = 16
write(3, "\2\0\0\0\6\0\n\0\20\223\336\356\377\177\0\0\1\0\0\0\0\0"..., 
24) = 24


[about 157000 lines deleted, none of them involving opening or closing
fd=3, but about 200 of them involving write(3,...)]

write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\220n/\0\0\0\0"..., 
48) = 48
write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0000\222/\0\0\0"..., 
48) = 48
write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\300\265/\0\0\0"..., 
48) = 48
write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\260\332/\0\0\0"..., 
48) = -1 ENOMEM (Cannot allocate memory)
write(3, "\r\0\0\0\3\0\0\0!\0\0\0", 12) = 12
write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\260\332/\0\0\0"..., 
48) = -1 ENOMEM (Cannot allocate memory)
write(3, "\r\0\0\0\3\0\0\0\"\0\0\0", 12) = 12
write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\260\332/\0\0\0"..., 
48) = -1 ENOMEM (Cannot allocate memory)
write(3, "\r\0\0\0\3\0\0\0-\0\0\0", 12) = 12
write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\260\332/\0\0\0"..., 
48) = -1 ENOMEM (Cannot allocate memory)
write(3, "\r\0\0\0\3\0\0\0.\0\0\0", 12) = 12
write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\260\332/\0\0\0"..., 
48) = -1 ENOMEM (Cannot allocate memory)
write(3, "\r\0\0\0\3\0\0\0/\0\0\0", 12) = 12
write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\260\332/\0\0\0"..., 
48) = -1 ENOMEM (Cannot allocate memory)
write(3, "\r\0\0\0\3\0\0\0000\0\0\0", 12) = 12
write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\260\332/\0\0\0"..., 
48) = -1 ENOMEM (Cannot allocate memory)
write(3, "\r\0\0\0\3\0\0\0001\0\0\0", 12) = 12
write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\260\332/\0\0\0"..., 
48) = -1 ENOMEM (Cannot allocate memory)
write(3, "\r\0\0\0\3\0\0\0002\0\0\0", 12) = 12
write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\260\332/\0\0\0"..., 
48) = -1 ENOMEM (Cannot allocate memory)
write(3, "\r\0\0\0\3\0\0\0003\0\0\0", 12) = 12
write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\260\332/\0\0\0"..., 
48) = -1 ENOMEM (Cannot allocate memory)
write(3, "\r\0\0\0\3\0\0\0004\0\0\0", 12) = 12
write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\260\332/\0\0\0"..., 
48) = -1 ENOMEM (Cannot allocate memory)
write(3, "\r\0\0\0\3\0\0\0005\0\0\0", 12) = 12
write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\260\332/\0\0\0"..., 
48) = -1 ENOMEM (Cannot allocate memory)
write(3, "\r\0\0\0\3\0\0\0006\0\0\0", 12) = 12
write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\260\332/\0\0\0"..., 
48) = -1 ENOMEM (Cannot allocate memory)
write(3, "\r\0\0\0\3\0\0\0007\0\0\0", 12) = 12
write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\260\332/\0\0\0"..., 
48) = -1 ENOMEM (Cannot allocate memory)
write(3, "\r\0\0\0\3\0\0\0008\0\0\0", 12) = 12
write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\260\332/\0\0\0"..., 
48) = -1 ENOMEM (Cannot allocate memory)
write(3, "\r\0\0\0\3\0\0\0009\0\0\0", 12) = 12
write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\260\332/\0\0\0"..., 
48) = -1 ENOMEM (Cannot allocate memory)
write(3, "\r\0\0\0\3\0\0\0:\0\0\0", 12) = 12
write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\260\332/\0\0\0"..., 
48) = -1 ENOMEM (Cannot allocate memory)
write(3, "\r\0\0\0\3\0\0\0;\0\0\0", 12) = 12
write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\260\332/\0\0\0"..., 
48) = -1 ENOMEM (Cannot allocate memory)
write(3, "\r\0\0\0\3\0\0\0<\0\0\0", 12) = 12
write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\260\332/\0\0\0"..., 
48) = -1 ENOMEM (Cannot allocate memory)
write(3, "\r\0\0\0\3\0\0\0=\0\0\0", 12) = 12
write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\260\332/\0\0\0"..., 
48) = -1 ENOMEM (Cannot allocate memory)
write(3, "\r\0\0\0\3\0\0\0>\0\0\0", 12) = 12
write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\260\332/\0\0\0"..., 
48) = 48
write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\220\377/\0\0\0"..., 
48) = -1 ENOMEM (Cannot allocate memory)
write(3, "\r\0\0\0\3\0\0\0?\0\0\0", 12) = 12
write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\220\377/\0\0\0"..., 
48) = -1 ENOMEM (Cannot allocate memory)
write(3, "\r\0\0\0\3\0\0\0@\0\0\0", 12) = 12
write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\220\377/\0\0\0"..., 
48) = -1 ENOMEM (Cannot allocate memory)
write(3, "\r\0\0\0\3\0\0\0A\0\0\0", 12) = 12
write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\220\377/\0\0\0"..., 
48) = -1 ENOMEM (Cannot allocate memory)
write(3, "\r\0\0\0\3\0\0\0B\0\0\0", 12) = 12
write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\220\377/\0\0\0"..., 
48) = -1 ENOMEM (Cannot allocate memory)
write(3, "\r\0\0\0\3\0\0\0C\0\0\0", 12) = 12
write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\220\377/\0\0\0"..., 
48) = -1 ENOMEM (Cannot allocate memory)
write(3, "\r\0\0\0\3\0\0\0D\0\0\0", 12) = 12
write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\220\377/\0\0\0"..., 
48) = -1 ENOMEM (Cannot allocate memory)
write(3, "\r\0\0\0\3\0\0\0E\0\0\0", 12) = 12
write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\220\377/\0\0\0"..., 
48) = -1 ENOMEM (Cannot allocate memory)
write(3, "\r\0\0\0\3\0\0\0F\0\0\0", 12) = 12
write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\220\377/\0\0\0"..., 
48) = -1 ENOMEM (Cannot allocate memory)
write(3, "\r\0\0\0\3\0\0\0G\0\0\0", 12) = 12
write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\220\377/\0\0\0"..., 
48) = -1 ENOMEM (Cannot allocate memory)
write(3, "\r\0\0\0\3\0\0\0H\0\0\0", 12) = 12
write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\220\377/\0\0\0"..., 
48) = -1 ENOMEM (Cannot allocate memory)
write(3, "\r\0\0\0\3\0\0\0J\0\0\0", 12) = 12
write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\220\377/\0\0\0"..., 
48) = 48
write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\200$0\0\0\0\0"..., 
48) = -1 ENOMEM (Cannot allocate memory)
write(3, "\r\0\0\0\3\0\0\0K\0\0\0", 12) = 12
write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\200$0\0\0\0\0"..., 
48) = 48
write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0pI0\0\0\0\0\0\360"..., 
48) = -1 ENOMEM (Cannot allocate memory)
write(3, "\r\0\0\0\3\0\0\0L\0\0\0", 12) = 12
write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0pI0\0\0\0\0\0\360"..., 
48) = 48
write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0Pn0\0\0\0\0\0\0"..., 
48) = -1 ENOMEM (Cannot allocate memory)
write(3, "\r\0\0\0\3\0\0\0M\0\0\0", 12) = 12
write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0Pn0\0\0\0\0\0\0"..., 
48) = 48
write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0@\2230\0\0\0\0"..., 
48) = -1 ENOMEM (Cannot allocate memory)
write(3, "\r\0\0\0\3\0\0\0N\0\0\0", 12) = 12
write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0@\2230\0\0\0\0"..., 
48) = 48
write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0000\2700\0\0\0"..., 
48) = -1 ENOMEM (Cannot allocate memory)
write(3, "\r\0\0\0\3\0\0\0O\0\0\0", 12) = 12
write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0000\2700\0\0\0"..., 
48) = 48
write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\20\3350\0\0\0"..., 
48) = -1 ENOMEM (Cannot allocate memory)
write(3, "\r\0\0\0\3\0\0\0P\0\0\0", 12) = 12
write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\20\3350\0\0\0"..., 
48) = 48
lseek(8, 0, SEEK_CUR)                   = 71696384
read(8, "\276\22\335r\275\367\315A\275\312\337p\275\237\22\303\275"..., 
131072) = 131072
lseek(8, 0, SEEK_CUR)                   = 71827456



[another 1000 lines or so not involving fd=3]



write(3, "\t\0\0\0\f\0\3\0@f\336\356\377\177\0\0\0\240\353\'\0\0"..., 
48) = 48
write(3, "\t\0\0\0\f\0\3\0@f\336\356\377\177\0\0\0@\360\'\0\0\0\0"..., 
48) = 48
write(3, "\t\0\0\0\f\0\3\0@f\336\356\377\177\0\0\0\340\364\'\0\0"..., 
48) = 48
write(3, "\t\0\0\0\f\0\3\0@f\336\356\377\177\0\0\0\200\371\'\0\0"..., 
48) = -1 ENOMEM (Cannot allocate memory)
--- SIGSEGV (Segmentation fault) @ 0 (0) ---
+++ killed by SIGSEGV +++
Process 12024 detached

=============== END OF strace SNIPPETS ===============================
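
The first 8 bytes of each write to fd 3 can be decoded to see which
request is failing. The sketch below assumes the little-endian
command header layout from the kernel's ib_user_verbs.h (a 32-bit
command followed by 16-bit in_words and out_words counts, both in
4-byte units). The helper names are hypothetical, and the unescaping
only covers the escape forms that appear in the excerpt.

```python
import re
import struct

# Single-character escapes as printed by strace, mapped to byte values.
_SIMPLE = {"t": 9, "n": 10, "v": 11, "f": 12, "r": 13,
           '"': 34, "'": 39, "\\": 92}
_TOKEN = re.compile(r"\\([0-7]{1,3})|\\(.)|(.)", re.DOTALL)


def unescape_strace(text):
    """Turn the quoted part of an strace string back into raw bytes."""
    out = bytearray()
    for octal, simple, plain in _TOKEN.findall(text):
        if octal:
            out.append(int(octal, 8))
        elif simple:
            out.append(_SIMPLE[simple])
        else:
            out.append(ord(plain))
    return bytes(out)


def decode_header(text):
    """Return (command, in_words, out_words) from the first 8 bytes."""
    return struct.unpack_from("<IHH", unescape_strace(text))


print(decode_header(r"\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0"))  # (9, 12, 3)
print(decode_header(r"\r\0\0\0\3\0\0\0!\0\0\0"))                 # (13, 3, 0)
```

The in_words values match the write sizes in the trace (12 words for
the 48-byte writes, 3 words for the 12-byte ones), which supports the
assumed layout. If the command numbering in that header applies here,
9 and 13 would be IB_USER_VERBS_CMD_REG_MR and
IB_USER_VERBS_CMD_DEREG_MR, so the failing writes would be memory
registration requests, each followed by a deregistration and a retry.
That mapping is an assumption and should be checked against the
headers for the kernel in use.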




I'd appreciate any insight into this problem. Let me know if you need 
more information, or the full log file.

Thanks,
Paul

-- 
Paul Howard
Chief Scientist
Microway, Inc.

paulhoward at microway.com
1-508-732-5521


