[mvapich-discuss] ENOMEM when writing to /dev/infiniband/uverbs0

Dhabaleswar Panda panda at cse.ohio-state.edu
Tue Jul 31 22:46:59 EDT 2007


Hi Paul, 

Thanks for reporting this problem. Have you tried this application
with the latest MVAPICH 0.9.9 release? MVAPICH 0.9.8 is already one
year old. Many new features, enhancements, and bug fixes have gone into
the 0.9.9 version, some of them related to memory allocation. Can
you try this with MVAPICH 0.9.9 and let us know the outcome? If the
problem persists with 0.9.9, it will be easier to debug.

Also, do you see this problem with MVAPICH2 0.9.8p3 (or the latest
released MVAPICH2 1.0-beta)? This will also help us narrow down the
problem.

Best Regards, 

DK

> I have an issue with an MPI application.
> 
> The version of MVAPICH is 0.9.8, compiled with PGI 6.2.
> 
> The program, also compiled with PGI 6.2, is running on an 8-node
> cluster, with 2 dual-core Opteron 2218's on each node. Each node has
> 4GB of memory. The nodes are named node10, node11, ..., node17. I
> start the MPI job on node10: "mpirun -np 32 ./wrf.exe". The machines
> list lists the 8 nodes on the first 8 lines, then repeats those 8
> lines 3 more times, for a total of 32 lines.
> 
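The machines list described above (8 nodes, repeated 4 times, and the np=29 variant with the last 3 occurrences of node10 removed) can be generated rather than edited by hand. This is a minimal sketch; the helper names are made up for illustration, and it only builds the list of lines without assuming anything about where mpirun reads the file from.

```python
def machines_list(nodes, repeats):
    """Return the node names in round-robin order: the whole node list, repeated."""
    return [node for _ in range(repeats) for node in nodes]


def drop_last_occurrences(entries, name, count):
    """Remove the last `count` occurrences of `name`, keeping the rest in order."""
    result = []
    remaining = count
    for entry in reversed(entries):
        if entry == name and remaining > 0:
            remaining -= 1          # skip this occurrence
            continue
        result.append(entry)
    result.reverse()
    return result


nodes = [f"node{n}" for n in range(10, 18)]          # node10 .. node17
full = machines_list(nodes, 4)                        # 32 lines, for -np 32
reduced = drop_last_occurrences(full, "node10", 3)    # 29 lines, for -np 29
```

Writing "\n".join(reduced) to the machines file leaves exactly one process on node10, which is the layout used for the strace run.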
> The program runs successfully as root with np=32. (It takes hours to
> run.) When run as an ordinary user, it fails almost immediately
> (within 5 seconds or so) with a segmentation fault.
> 
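A job that runs as root but fails as an ordinary user points toward a per-user limit or permission. One candidate worth ruling out is the locked-memory limit (RLIMIT_MEMLOCK): InfiniBand memory registration pins pages, the kernel checks pinned pages against this limit, and root is exempt from the check. That this limit is the cause here is an assumption, not something the thread confirms. A minimal sketch for reading the limit from inside a process:

```python
import resource


def memlock_limit():
    """Return the (soft, hard) locked-memory limits in bytes; None means unlimited."""
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    unlimited = resource.RLIM_INFINITY
    return (
        None if soft == unlimited else soft,
        None if hard == unlimited else hard,
    )


if __name__ == "__main__":
    soft, hard = memlock_limit()
    print("soft:", "unlimited" if soft is None else soft)
    print("hard:", "unlimited" if hard is None else hard)
```

The value that matters is the one seen by the MPI processes on the compute nodes, so the check is most useful when launched the same way the job is (a remote shell may not apply the same limits as an interactive login).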
> It also fails when I remove the last 3 occurrences of node10 from the
> machines list and run with np=29 as an ordinary user (and as expected,
> it does not fail immediately as root with np=29). Doing it this way
> lets me run strace on the single process on node10.
> 
> It seems to fail with error ENOMEM on some, but not all, of its
> writes to /dev/infiniband/uverbs0. It reports ENOMEM a number of
> times; the segmentation fault came on the 38th ENOMEM. (When run in a
> similar way as root, with np=29 and running strace on the only process
> on node10, there are no ENOMEM errors.) I couldn't find anything with
> Google.
> 
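Counting the ENOMEM results by hand in a 7MB log is error-prone. A minimal sketch for tallying the results of write() calls on one descriptor follows. It assumes each strace record sits on a single line (the snippets quoted in this message are wrapped by the mail client, so they would need rejoining first), and the function name is made up for illustration.

```python
import re
from collections import Counter

# Matches one complete strace record such as
#   write(3, "..."..., 48) = -1 ENOMEM (Cannot allocate memory)
WRITE_RE = re.compile(r'^write\((\d+), .*\)\s+= (-?\d+)(?: (E[A-Z0-9]+))?')


def tally_writes(lines, fd=3):
    """Count write() results on `fd`: "ok" for successes, errno name for failures."""
    counts = Counter()
    for line in lines:
        m = WRITE_RE.match(line)
        if m is None or int(m.group(1)) != fd:
            continue                      # not a write() on the descriptor of interest
        if int(m.group(2)) < 0:
            counts[m.group(3) or "unknown error"] += 1
        else:
            counts["ok"] += 1
    return counts
```

Passing an open log file to tally_writes() gives the number of successful and failed writes to the uverbs device in one pass, which makes the root and non-root runs easy to compare.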
> The output of strace is like this (I've added some blank lines to make
> things stand out). I can provide the whole 7MB strace log if
> it would be useful.
> 
> =============== START OF strace SNIPPETS ===============================
> 
> [280 lines deleted]
> 
> open("/sys/class/infiniband_verbs/uverbs0/abi_version", O_RDONLY) = 3
> read(3, "1\n", 8)                       = 2
> close(3)                                = 0
> open("/sys/class/infiniband_verbs/uverbs0/device/vendor", O_RDONLY) = 3
> read(3, "0x15b3\n", 8)                  = 7
> close(3)                                = 0
> open("/sys/class/infiniband_verbs/uverbs0/device/device", O_RDONLY) = 3
> read(3, "0x6274\n", 8)                  = 7
> close(3)                                = 0
> 
> 
> open("/dev/infiniband/uverbs0", O_RDWR) = 3
> 
> 
> write(3, "\0\0\0\0\4\0\4\0000\223\336\356\377\177\0\0", 16) = 16
> mmap(NULL, 4096, PROT_WRITE, MAP_SHARED, 3, 0) = 0x2ac1bc8e2000
> write(3, "\3\0\0\0\4\0\3\0\0\223\336\356\377\177\0\0", 16) = 16
> write(3, "\3\0\0\0\4\0\3\0`\223\336\356\377\177\0\0", 16) = 16
> write(3, "\2\0\0\0\6\0\n\0\20\223\336\356\377\177\0\0\1\0\0\0\0\0"..., 
> 24) = 24
> 
> 
> [about 157000 lines deleted, none of them involving opening or closing
> fd=3, but about 200 of them involving write(3,...)]
> 
> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\220n/\0\0\0\0"..., 
> 48) = 48
> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0000\222/\0\0\0"..., 
> 48) = 48
> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\300\265/\0\0\0"..., 
> 48) = 48
> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\260\332/\0\0\0"..., 
> 48) = -1 ENOMEM (Cannot allocate memory)
> write(3, "\r\0\0\0\3\0\0\0!\0\0\0", 12) = 12
> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\260\332/\0\0\0"..., 
> 48) = -1 ENOMEM (Cannot allocate memory)
> write(3, "\r\0\0\0\3\0\0\0\"\0\0\0", 12) = 12
> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\260\332/\0\0\0"..., 
> 48) = -1 ENOMEM (Cannot allocate memory)
> write(3, "\r\0\0\0\3\0\0\0-\0\0\0", 12) = 12
> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\260\332/\0\0\0"..., 
> 48) = -1 ENOMEM (Cannot allocate memory)
> write(3, "\r\0\0\0\3\0\0\0.\0\0\0", 12) = 12
> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\260\332/\0\0\0"..., 
> 48) = -1 ENOMEM (Cannot allocate memory)
> write(3, "\r\0\0\0\3\0\0\0/\0\0\0", 12) = 12
> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\260\332/\0\0\0"..., 
> 48) = -1 ENOMEM (Cannot allocate memory)
> write(3, "\r\0\0\0\3\0\0\0000\0\0\0", 12) = 12
> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\260\332/\0\0\0"..., 
> 48) = -1 ENOMEM (Cannot allocate memory)
> write(3, "\r\0\0\0\3\0\0\0001\0\0\0", 12) = 12
> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\260\332/\0\0\0"..., 
> 48) = -1 ENOMEM (Cannot allocate memory)
> write(3, "\r\0\0\0\3\0\0\0002\0\0\0", 12) = 12
> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\260\332/\0\0\0"..., 
> 48) = -1 ENOMEM (Cannot allocate memory)
> write(3, "\r\0\0\0\3\0\0\0003\0\0\0", 12) = 12
> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\260\332/\0\0\0"..., 
> 48) = -1 ENOMEM (Cannot allocate memory)
> write(3, "\r\0\0\0\3\0\0\0004\0\0\0", 12) = 12
> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\260\332/\0\0\0"..., 
> 48) = -1 ENOMEM (Cannot allocate memory)
> write(3, "\r\0\0\0\3\0\0\0005\0\0\0", 12) = 12
> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\260\332/\0\0\0"..., 
> 48) = -1 ENOMEM (Cannot allocate memory)
> write(3, "\r\0\0\0\3\0\0\0006\0\0\0", 12) = 12
> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\260\332/\0\0\0"..., 
> 48) = -1 ENOMEM (Cannot allocate memory)
> write(3, "\r\0\0\0\3\0\0\0007\0\0\0", 12) = 12
> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\260\332/\0\0\0"..., 
> 48) = -1 ENOMEM (Cannot allocate memory)
> write(3, "\r\0\0\0\3\0\0\0008\0\0\0", 12) = 12
> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\260\332/\0\0\0"..., 
> 48) = -1 ENOMEM (Cannot allocate memory)
> write(3, "\r\0\0\0\3\0\0\0009\0\0\0", 12) = 12
> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\260\332/\0\0\0"..., 
> 48) = -1 ENOMEM (Cannot allocate memory)
> write(3, "\r\0\0\0\3\0\0\0:\0\0\0", 12) = 12
> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\260\332/\0\0\0"..., 
> 48) = -1 ENOMEM (Cannot allocate memory)
> write(3, "\r\0\0\0\3\0\0\0;\0\0\0", 12) = 12
> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\260\332/\0\0\0"..., 
> 48) = -1 ENOMEM (Cannot allocate memory)
> write(3, "\r\0\0\0\3\0\0\0<\0\0\0", 12) = 12
> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\260\332/\0\0\0"..., 
> 48) = -1 ENOMEM (Cannot allocate memory)
> write(3, "\r\0\0\0\3\0\0\0=\0\0\0", 12) = 12
> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\260\332/\0\0\0"..., 
> 48) = -1 ENOMEM (Cannot allocate memory)
> write(3, "\r\0\0\0\3\0\0\0>\0\0\0", 12) = 12
> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\260\332/\0\0\0"..., 
> 48) = 48
> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\220\377/\0\0\0"..., 
> 48) = -1 ENOMEM (Cannot allocate memory)
> write(3, "\r\0\0\0\3\0\0\0?\0\0\0", 12) = 12
> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\220\377/\0\0\0"..., 
> 48) = -1 ENOMEM (Cannot allocate memory)
> write(3, "\r\0\0\0\3\0\0\0@\0\0\0", 12) = 12
> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\220\377/\0\0\0"..., 
> 48) = -1 ENOMEM (Cannot allocate memory)
> write(3, "\r\0\0\0\3\0\0\0A\0\0\0", 12) = 12
> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\220\377/\0\0\0"..., 
> 48) = -1 ENOMEM (Cannot allocate memory)
> write(3, "\r\0\0\0\3\0\0\0B\0\0\0", 12) = 12
> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\220\377/\0\0\0"..., 
> 48) = -1 ENOMEM (Cannot allocate memory)
> write(3, "\r\0\0\0\3\0\0\0C\0\0\0", 12) = 12
> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\220\377/\0\0\0"..., 
> 48) = -1 ENOMEM (Cannot allocate memory)
> write(3, "\r\0\0\0\3\0\0\0D\0\0\0", 12) = 12
> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\220\377/\0\0\0"..., 
> 48) = -1 ENOMEM (Cannot allocate memory)
> write(3, "\r\0\0\0\3\0\0\0E\0\0\0", 12) = 12
> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\220\377/\0\0\0"..., 
> 48) = -1 ENOMEM (Cannot allocate memory)
> write(3, "\r\0\0\0\3\0\0\0F\0\0\0", 12) = 12
> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\220\377/\0\0\0"..., 
> 48) = -1 ENOMEM (Cannot allocate memory)
> write(3, "\r\0\0\0\3\0\0\0G\0\0\0", 12) = 12
> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\220\377/\0\0\0"..., 
> 48) = -1 ENOMEM (Cannot allocate memory)
> write(3, "\r\0\0\0\3\0\0\0H\0\0\0", 12) = 12
> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\220\377/\0\0\0"..., 
> 48) = -1 ENOMEM (Cannot allocate memory)
> write(3, "\r\0\0\0\3\0\0\0J\0\0\0", 12) = 12
> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\220\377/\0\0\0"..., 
> 48) = 48
> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\200$0\0\0\0\0"..., 
> 48) = -1 ENOMEM (Cannot allocate memory)
> write(3, "\r\0\0\0\3\0\0\0K\0\0\0", 12) = 12
> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\200$0\0\0\0\0"..., 
> 48) = 48
> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0pI0\0\0\0\0\0\360"..., 
> 48) = -1 ENOMEM (Cannot allocate memory)
> write(3, "\r\0\0\0\3\0\0\0L\0\0\0", 12) = 12
> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0pI0\0\0\0\0\0\360"..., 
> 48) = 48
> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0Pn0\0\0\0\0\0\0"..., 
> 48) = -1 ENOMEM (Cannot allocate memory)
> write(3, "\r\0\0\0\3\0\0\0M\0\0\0", 12) = 12
> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0Pn0\0\0\0\0\0\0"..., 
> 48) = 48
> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0@\2230\0\0\0\0"..., 
> 48) = -1 ENOMEM (Cannot allocate memory)
> write(3, "\r\0\0\0\3\0\0\0N\0\0\0", 12) = 12
> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0@\2230\0\0\0\0"..., 
> 48) = 48
> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0000\2700\0\0\0"..., 
> 48) = -1 ENOMEM (Cannot allocate memory)
> write(3, "\r\0\0\0\3\0\0\0O\0\0\0", 12) = 12
> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0000\2700\0\0\0"..., 
> 48) = 48
> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\20\3350\0\0\0"..., 
> 48) = -1 ENOMEM (Cannot allocate memory)
> write(3, "\r\0\0\0\3\0\0\0P\0\0\0", 12) = 12
> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\20\3350\0\0\0"..., 
> 48) = 48
> lseek(8, 0, SEEK_CUR)                   = 71696384
> read(8, "\276\22\335r\275\367\315A\275\312\337p\275\237\22\303\275"..., 
> 131072) = 131072
> lseek(8, 0, SEEK_CUR)                   = 71827456
> 
> 
> 
> [another 1000 lines or so not involving fd=3]
> 
> 
> 
> write(3, "\t\0\0\0\f\0\3\0@f\336\356\377\177\0\0\0\240\353\'\0\0"..., 
> 48) = 48
> write(3, "\t\0\0\0\f\0\3\0@f\336\356\377\177\0\0\0@\360\'\0\0\0\0"..., 
> 48) = 48
> write(3, "\t\0\0\0\f\0\3\0@f\336\356\377\177\0\0\0\340\364\'\0\0"..., 
> 48) = 48
> write(3, "\t\0\0\0\f\0\3\0@f\336\356\377\177\0\0\0\200\371\'\0\0"..., 
> 48) = -1 ENOMEM (Cannot allocate memory)
> --- SIGSEGV (Segmentation fault) @ 0 (0) ---
> +++ killed by SIGSEGV +++
> Process 12024 detached
> 
> =============== END OF strace SNIPPETS ===============================
> 
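The buffers written to the uverbs device are easier to read once the leading 8 bytes are decoded. The sketch below assumes the write()-based uverbs command header layout (a 32-bit command number followed by 16-bit input and output lengths counted in 4-byte words) and little-endian byte order, as on these Opteron nodes. The command names are given from memory of the kernel's ib_user_verbs.h and should be checked against the header installed on the system; only numbers that appear in the trace are listed.

```python
import struct

# Assumed mapping, limited to the command numbers seen in the trace.
COMMANDS = {
    0: "GET_CONTEXT",
    2: "QUERY_PORT",
    3: "ALLOC_PD",
    9: "REG_MR",
    13: "DEREG_MR",
}


def strace_to_bytes(escaped):
    """Turn strace's C-style escaped string (without the quotes) into raw bytes."""
    return escaped.encode("latin-1").decode("unicode_escape").encode("latin-1")


def decode_header(data):
    """Decode the 8-byte header: u32 command, u16 in_words, u16 out_words."""
    command, in_words, out_words = struct.unpack("<IHH", data[:8])
    return {
        "command": COMMANDS.get(command, f"unknown({command})"),
        "in_bytes": in_words * 4,     # should equal the length passed to write()
        "out_bytes": out_words * 4,   # size of the response buffer
    }
```

Under these assumptions, the 48-byte writes that return ENOMEM decode as REG_MR (memory registration) and the 12-byte write after each failure decodes as DEREG_MR. That pattern would be consistent with the library releasing an existing registration and retrying, though the trace alone does not prove it.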
> 
> 
> 
> I'd appreciate any insight into this problem. Let me know if you need 
> more information, or the full log file.
> 
> Thanks,
> Paul
> 
> -- 
> Paul Howard
> Chief Scientist
> Microway, Inc.
> 
> paulhoward at microway.com
> 1-508-732-5521
> 
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> 


