[mvapich-discuss] ENOMEM when writing to /dev/infiniband/uverbs0

Paul Howard paulhoward at microway.com
Thu Aug 2 09:44:05 EDT 2007


Dr. Panda,

You wrote:
> Thanks for reporting this problem. Have you tried this application
> with the latest MVAPICH 0.9.9 release? MVAPICH 0.9.8 is already one
> year old. Many new features, enhancements, and bug fixes have gone into
> the 0.9.9 version. Some of these are related to memory allocation. Can
> you try this with MVAPICH 0.9.9 and let us know the outcome? If the
> problem persists with 0.9.9, it will be easier to debug.
>   
The problem does not seem to occur with the latest MVAPICH 0.9.9. Thanks 
for the suggestion.

> Also, do you see this problem with MVAPICH2 0.9.8p3 (or the latest
> released MVAPICH2 1.0-beta)? This will also help us to narrow down the
> problem.
>   

We did not try MVAPICH2.

I guess no further action is required on your part or mine, unless you 
need more information from me.

Thanks,
Paul

Original problem report:
>   
>> I have an issue with an MPI application.
>>
>> The version of MVAPICH is 0.9.8, compiled with PGI 6.2.
>>
>> The program, also compiled with PGI 6.2, is running on an 8-node
>> cluster, with 2 dual-core Opteron 2218's on each node. Each node has
>> 4GB of memory. The nodes are named node10, node11, ..., node17. I
>> start the MPI job on node10: "mpirun -np 32 ./wrf.exe". The machines
>> list lists the 8 nodes on the first 8 lines, then repeats those 8
>> lines 3 more times, for a total of 32 lines.
>>
>> The program runs successfully as root with np=32. (It takes hours to
>> run.) When run as an ordinary user, it fails almost immediately
>> (within 5 seconds or so) with a segmentation fault.
>>
>> It also fails when I remove the last 3 occurrences of node10 from the
>> machines list and run with np=29 as an ordinary user (and as expected,
>> it does not fail immediately as root with np=29). Doing it this way
>> lets me run strace on the single process on node10.
>>
>> It seems to fail with error ENOMEM sometimes, but not every time,
>> when it writes to /dev/infiniband/uverbs0. It reports ENOMEM a number
>> of times; the segmentation fault came on the 38th ENOMEM. (When run in
>> a similar way as root, with np=29 and running strace on the only
>> process on node10, there are no ENOMEM errors.) I couldn't find
>> anything with Google.
>>
>> The output of strace is like this (I've added some blank lines to make
>> things stand out). I can provide the whole 7MB strace log if
>> it would be useful.
>>
>> =============== START OF strace SNIPPETS ===============================
>>
>> [280 lines deleted]
>>
>> open("/sys/class/infiniband_verbs/uverbs0/abi_version", O_RDONLY) = 3
>> read(3, "1\n", 8)                       = 2
>> close(3)                                = 0
>> open("/sys/class/infiniband_verbs/uverbs0/device/vendor", O_RDONLY) = 3
>> read(3, "0x15b3\n", 8)                  = 7
>> close(3)                                = 0
>> open("/sys/class/infiniband_verbs/uverbs0/device/device", O_RDONLY) = 3
>> read(3, "0x6274\n", 8)                  = 7
>> close(3)                                = 0
>>
>>
>> open("/dev/infiniband/uverbs0", O_RDWR) = 3
>>
>>
>> write(3, "\0\0\0\0\4\0\4\0000\223\336\356\377\177\0\0", 16) = 16
>> mmap(NULL, 4096, PROT_WRITE, MAP_SHARED, 3, 0) = 0x2ac1bc8e2000
>> write(3, "\3\0\0\0\4\0\3\0\0\223\336\356\377\177\0\0", 16) = 16
>> write(3, "\3\0\0\0\4\0\3\0`\223\336\356\377\177\0\0", 16) = 16
>> write(3, "\2\0\0\0\6\0\n\0\20\223\336\356\377\177\0\0\1\0\0\0\0\0"..., 
>> 24) = 24
>>
>>
>> [about 157000 lines deleted, none of them involving opening or closing
>> fd=3, but about 200 of them involving write(3,...)]
>>
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\220n/\0\0\0\0"..., 
>> 48) = 48
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0000\222/\0\0\0"..., 
>> 48) = 48
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\300\265/\0\0\0"..., 
>> 48) = 48
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\260\332/\0\0\0"..., 
>> 48) = -1 ENOMEM (Cannot allocate memory)
>> write(3, "\r\0\0\0\3\0\0\0!\0\0\0", 12) = 12
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\260\332/\0\0\0"..., 
>> 48) = -1 ENOMEM (Cannot allocate memory)
>> write(3, "\r\0\0\0\3\0\0\0\"\0\0\0", 12) = 12
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\260\332/\0\0\0"..., 
>> 48) = -1 ENOMEM (Cannot allocate memory)
>> write(3, "\r\0\0\0\3\0\0\0-\0\0\0", 12) = 12
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\260\332/\0\0\0"..., 
>> 48) = -1 ENOMEM (Cannot allocate memory)
>> write(3, "\r\0\0\0\3\0\0\0.\0\0\0", 12) = 12
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\260\332/\0\0\0"..., 
>> 48) = -1 ENOMEM (Cannot allocate memory)
>> write(3, "\r\0\0\0\3\0\0\0/\0\0\0", 12) = 12
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\260\332/\0\0\0"..., 
>> 48) = -1 ENOMEM (Cannot allocate memory)
>> write(3, "\r\0\0\0\3\0\0\0000\0\0\0", 12) = 12
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\260\332/\0\0\0"..., 
>> 48) = -1 ENOMEM (Cannot allocate memory)
>> write(3, "\r\0\0\0\3\0\0\0001\0\0\0", 12) = 12
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\260\332/\0\0\0"..., 
>> 48) = -1 ENOMEM (Cannot allocate memory)
>> write(3, "\r\0\0\0\3\0\0\0002\0\0\0", 12) = 12
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\260\332/\0\0\0"..., 
>> 48) = -1 ENOMEM (Cannot allocate memory)
>> write(3, "\r\0\0\0\3\0\0\0003\0\0\0", 12) = 12
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\260\332/\0\0\0"..., 
>> 48) = -1 ENOMEM (Cannot allocate memory)
>> write(3, "\r\0\0\0\3\0\0\0004\0\0\0", 12) = 12
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\260\332/\0\0\0"..., 
>> 48) = -1 ENOMEM (Cannot allocate memory)
>> write(3, "\r\0\0\0\3\0\0\0005\0\0\0", 12) = 12
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\260\332/\0\0\0"..., 
>> 48) = -1 ENOMEM (Cannot allocate memory)
>> write(3, "\r\0\0\0\3\0\0\0006\0\0\0", 12) = 12
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\260\332/\0\0\0"..., 
>> 48) = -1 ENOMEM (Cannot allocate memory)
>> write(3, "\r\0\0\0\3\0\0\0007\0\0\0", 12) = 12
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\260\332/\0\0\0"..., 
>> 48) = -1 ENOMEM (Cannot allocate memory)
>> write(3, "\r\0\0\0\3\0\0\0008\0\0\0", 12) = 12
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\260\332/\0\0\0"..., 
>> 48) = -1 ENOMEM (Cannot allocate memory)
>> write(3, "\r\0\0\0\3\0\0\0009\0\0\0", 12) = 12
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\260\332/\0\0\0"..., 
>> 48) = -1 ENOMEM (Cannot allocate memory)
>> write(3, "\r\0\0\0\3\0\0\0:\0\0\0", 12) = 12
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\260\332/\0\0\0"..., 
>> 48) = -1 ENOMEM (Cannot allocate memory)
>> write(3, "\r\0\0\0\3\0\0\0;\0\0\0", 12) = 12
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\260\332/\0\0\0"..., 
>> 48) = -1 ENOMEM (Cannot allocate memory)
>> write(3, "\r\0\0\0\3\0\0\0<\0\0\0", 12) = 12
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\260\332/\0\0\0"..., 
>> 48) = -1 ENOMEM (Cannot allocate memory)
>> write(3, "\r\0\0\0\3\0\0\0=\0\0\0", 12) = 12
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\260\332/\0\0\0"..., 
>> 48) = -1 ENOMEM (Cannot allocate memory)
>> write(3, "\r\0\0\0\3\0\0\0>\0\0\0", 12) = 12
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\260\332/\0\0\0"..., 
>> 48) = 48
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\220\377/\0\0\0"..., 
>> 48) = -1 ENOMEM (Cannot allocate memory)
>> write(3, "\r\0\0\0\3\0\0\0?\0\0\0", 12) = 12
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\220\377/\0\0\0"..., 
>> 48) = -1 ENOMEM (Cannot allocate memory)
>> write(3, "\r\0\0\0\3\0\0\0@\0\0\0", 12) = 12
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\220\377/\0\0\0"..., 
>> 48) = -1 ENOMEM (Cannot allocate memory)
>> write(3, "\r\0\0\0\3\0\0\0A\0\0\0", 12) = 12
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\220\377/\0\0\0"..., 
>> 48) = -1 ENOMEM (Cannot allocate memory)
>> write(3, "\r\0\0\0\3\0\0\0B\0\0\0", 12) = 12
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\220\377/\0\0\0"..., 
>> 48) = -1 ENOMEM (Cannot allocate memory)
>> write(3, "\r\0\0\0\3\0\0\0C\0\0\0", 12) = 12
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\220\377/\0\0\0"..., 
>> 48) = -1 ENOMEM (Cannot allocate memory)
>> write(3, "\r\0\0\0\3\0\0\0D\0\0\0", 12) = 12
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\220\377/\0\0\0"..., 
>> 48) = -1 ENOMEM (Cannot allocate memory)
>> write(3, "\r\0\0\0\3\0\0\0E\0\0\0", 12) = 12
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\220\377/\0\0\0"..., 
>> 48) = -1 ENOMEM (Cannot allocate memory)
>> write(3, "\r\0\0\0\3\0\0\0F\0\0\0", 12) = 12
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\220\377/\0\0\0"..., 
>> 48) = -1 ENOMEM (Cannot allocate memory)
>> write(3, "\r\0\0\0\3\0\0\0G\0\0\0", 12) = 12
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\220\377/\0\0\0"..., 
>> 48) = -1 ENOMEM (Cannot allocate memory)
>> write(3, "\r\0\0\0\3\0\0\0H\0\0\0", 12) = 12
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\220\377/\0\0\0"..., 
>> 48) = -1 ENOMEM (Cannot allocate memory)
>> write(3, "\r\0\0\0\3\0\0\0J\0\0\0", 12) = 12
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\220\377/\0\0\0"..., 
>> 48) = 48
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\200$0\0\0\0\0"..., 
>> 48) = -1 ENOMEM (Cannot allocate memory)
>> write(3, "\r\0\0\0\3\0\0\0K\0\0\0", 12) = 12
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\200$0\0\0\0\0"..., 
>> 48) = 48
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0pI0\0\0\0\0\0\360"..., 
>> 48) = -1 ENOMEM (Cannot allocate memory)
>> write(3, "\r\0\0\0\3\0\0\0L\0\0\0", 12) = 12
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0pI0\0\0\0\0\0\360"..., 
>> 48) = 48
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0Pn0\0\0\0\0\0\0"..., 
>> 48) = -1 ENOMEM (Cannot allocate memory)
>> write(3, "\r\0\0\0\3\0\0\0M\0\0\0", 12) = 12
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0Pn0\0\0\0\0\0\0"..., 
>> 48) = 48
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0@\2230\0\0\0\0"..., 
>> 48) = -1 ENOMEM (Cannot allocate memory)
>> write(3, "\r\0\0\0\3\0\0\0N\0\0\0", 12) = 12
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0@\2230\0\0\0\0"..., 
>> 48) = 48
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0000\2700\0\0\0"..., 
>> 48) = -1 ENOMEM (Cannot allocate memory)
>> write(3, "\r\0\0\0\3\0\0\0O\0\0\0", 12) = 12
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0000\2700\0\0\0"..., 
>> 48) = 48
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\20\3350\0\0\0"..., 
>> 48) = -1 ENOMEM (Cannot allocate memory)
>> write(3, "\r\0\0\0\3\0\0\0P\0\0\0", 12) = 12
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\20\3350\0\0\0"..., 
>> 48) = 48
>> lseek(8, 0, SEEK_CUR)                   = 71696384
>> read(8, "\276\22\335r\275\367\315A\275\312\337p\275\237\22\303\275"..., 
>> 131072) = 131072
>> lseek(8, 0, SEEK_CUR)                   = 71827456
>>
>>
>>
>> [another 1000 lines or so not involving fd=3]
>>
>>
>>
>> write(3, "\t\0\0\0\f\0\3\0@f\336\356\377\177\0\0\0\240\353\'\0\0"..., 
>> 48) = 48
>> write(3, "\t\0\0\0\f\0\3\0@f\336\356\377\177\0\0\0@\360\'\0\0\0\0"..., 
>> 48) = 48
>> write(3, "\t\0\0\0\f\0\3\0@f\336\356\377\177\0\0\0\340\364\'\0\0"..., 
>> 48) = 48
>> write(3, "\t\0\0\0\f\0\3\0@f\336\356\377\177\0\0\0\200\371\'\0\0"..., 
>> 48) = -1 ENOMEM (Cannot allocate memory)
>> --- SIGSEGV (Segmentation fault) @ 0 (0) ---
>> +++ killed by SIGSEGV +++
>> Process 12024 detached
>>
>> =============== END OF strace SNIPPETS ===============================
>>
>>
>>
>>
>> I'd appreciate any insight into this problem. Let me know if you need 
>> more information, or the full log file.
>>
>> Thanks,
>> Paul
>>
>> -- 
>> Paul Howard
>> Chief Scientist
>> Microway, Inc.
>>
>> paulhoward at microway.com
>> 1-508-732-5521

-- 
Paul Howard
Chief Scientist
Microway, Inc.

paulhoward at microway.com
1-508-732-5521
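
For anyone reading the strace snippets above, the write(3, ...) buffers
can be decoded a little further. The following is only a rough sketch,
not anything taken from MVAPICH or libibverbs. It assumes that each
write to /dev/infiniband/uverbs0 starts with an 8-byte little-endian
header (32-bit command, 16-bit in_words, 16-bit out_words, sizes counted
in 4-byte words) and that the command numbers follow the kernel's
ib_user_verbs.h as I remember it. The names unescape, decode_header and
COMMANDS are made up for this illustration.

```python
import re
import struct

# Assumed command numbers (from memory of the kernel's ib_user_verbs.h);
# only the ones that show up in the snippets are listed.
COMMANDS = {
    0: "GET_CONTEXT",
    2: "QUERY_PORT",
    3: "ALLOC_PD",
    9: "REG_MR",
    13: "DEREG_MR",
}

# Single-character C escapes that strace may print.
SIMPLE = {"n": 10, "t": 9, "r": 13, "f": 12, "v": 11, "a": 7, "b": 8}

# A backslash followed by 1-3 octal digits, a \xHH pair, or any one character.
ESCAPE = re.compile(r"\\([0-7]{1,3}|x[0-9a-fA-F]{2}|.)", re.S)


def unescape(text):
    """Turn the inside of a strace-quoted string back into raw bytes."""
    out = bytearray()
    pos = 0
    for match in ESCAPE.finditer(text):
        out += text[pos:match.start()].encode("latin-1")
        token = match.group(1)
        if token[0] in "01234567":
            out.append(int(token, 8) & 0xFF)      # octal escape such as \336
        elif token[0] == "x" and len(token) == 3:
            out.append(int(token[1:], 16))        # hex escape such as \x7f
        else:
            out.append(SIMPLE.get(token, ord(token)))  # \t, \", \\ and so on
        pos = match.end()
    out += text[pos:].encode("latin-1")
    return bytes(out)


def decode_header(text):
    """Return (command name, request bytes, response bytes) for one write."""
    raw = unescape(text)
    command, in_words, out_words = struct.unpack_from("<IHH", raw)
    name = COMMANDS.get(command, "cmd %d" % command)
    return name, in_words * 4, out_words * 4


if __name__ == "__main__":
    samples = [
        r"\0\0\0\0\4\0\4\0000\223\336\356\377\177\0\0",
        r"\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0",
        r"\r\0\0\0\3\0\0\0!\0\0\0",
    ]
    for sample in samples:
        print(decode_header(sample))
```

If those assumptions hold, the failing 48-byte writes would be memory
registration requests and the 12-byte writes that follow each ENOMEM
would be deregistrations. The request sizes decoded from the header do
match the lengths that strace reports (16, 48 and 12 bytes), which is
at least consistent with the assumed layout.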


