Hunting Memory Leaks
Memory leaks can be a big issue with gevent workers because several tasks share the same python process.
Thankfully, MRQ provides tools to track down such issues. Memory usage of each worker is graphed in the dashboard and makes it easy to see if memory leaks are happening.
When a worker has a steadily growing memory usage, here are the steps to find the leak:
- Check which jobs are running on this worker and try to isolate which of them is leaking and on which queue
- Start a dedicated worker with
--trace_memory --greenlets 1
on the same queue : This will start a worker doing one job at a time with memory profiling enabled. After each job you should see a report of leaked object types. - Find the most unique type in the list (usually not 'list' or 'dict') and restart the worker with
--trace_memory --greenlets 1 --trace_memory_type=XXX --trace_memory_output_dir=memdbg
(after creating the directory memdbg). - There you will find a graph for each task generated by objgraph which is incredibly helpful to track down the leak.
Using guppy
If you want to get an interactive debugging session to deal with high memory usage, you can use guppy. Here is how:
First, initialize a REPL with MRQ configured and guppy loaded:
$ pip install guppy
$ python
>>> from mrq.context import setup_context, run_task
>>> setup_context()
>>> from guppy import hpy
>>> hp = hpy()
Then, wrap your memory-intensive task or code around guppy calls
>>> hp.setrelheap() # Used as reference point for memory usage
>>> run_task("tasks.your.MemoryHungryTask", {"a": 1, "b": 2})
>>> h = hp.heap()
At this point h
should contain all the infos you need. You can view an extended debugging session here.
>>> h
Partition of a set of 300643 objects. Total size = 41626536 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 130043 43 15682088 38 15682088 38 str
1 76123 25 6978416 17 22660504 54 tuple
2 1015 0 2794024 7 25454528 61 dict of module
3 20181 7 2583168 6 28037696 67 types.CodeType
4 20610 7 2473200 6 30510896 73 function
5 2321 1 2095216 5 32606112 78 type
6 2319 1 2045160 5 34651272 83 dict of type
7 1277 0 1162808 3 35814080 86 dict (no owner)
8 2890 1 918352 2 36732432 88 unicode
9 494 0 440912 1 37173344 89 dict of class
So our task added 41.6M of RAM to the current process. Let's see where it comes from, starting by these 15M of strings:
>>> h[0].byvia
Partition of a set of 130043 objects. Total size = 15682088 bytes.
Index Count % Size % Cumulative % Referred Via:
0 8208 6 3890208 25 3890208 25 '.func_doc', '[0]'
1 20065 15 3239664 21 7129872 45 '.co_code'
2 16673 13 1706224 11 8836096 56 '.co_filename'
3 2398 2 1606864 10 10442960 67 "['__doc__']"
4 19810 15 1109640 7 11552600 74 '.co_lnotab'
5 419 0 308392 2 11860992 76 '.func_doc'
6 4311 3 285232 2 12146224 77 '[1]'
7 2788 2 167616 1 12313840 79 '[2]'
8 2153 2 129136 1 12442976 79 '[3]'
9 1006 1 109560 1 12552536 80 "['__file__']"
<21212 more rows. Type e.g. '_.more' to view.>
Here, it seems that surprisingly, most of the strings are actually docstrings. This can happen if you work with large Python modules like scipy or boto. One might consider stripping them manually or with Python's optimized mode.
Running a full worker
Sometimes memory leaks are complex and need a big number of jobs to be detected. To do that you can run a full worker, wrapped in calls to guppy to inspect memory usage after N jobs.
This real-life example also includes code that clears some Python caches to avoid polluting results.
$ pip install guppy
$ python
# Be as close to a Worker environment as possible
import os
if "GEVENT_RESOLVER" not in os.environ:
os.environ["GEVENT_RESOLVER"] = "ares"
from gevent import monkey
monkey.patch_all()
from mrq.context import setup_context, run_task, log, get_current_config
from mrq.queue import send_tasks
from mrq.utils import load_class_by_path
from guppy import hpy
import sys, urlparse, gc, re, linecache
hp = hpy()
setup_context(config_type="worker", extra={"greenlets": 10, "queues": ["default"], "mongodb_logs":0, "report_interval":10000})
worker_class = load_class_by_path(get_current_config()["worker_class"])
w = worker_class()
w.work_init()
def work_heap(max_jobs):
gc.collect()
hp.setrelheap()
w.work_loop(max_jobs=max_jobs)
# Clear some Python caches to clean the results
urlparse.clear_cache()
re.purge()
linecache.clearcache()
log.handler.reset()
gc.collect()
return hp.heap()
# Optional: Queue some jobs
send_tasks("tests.tasks.general.Leak", [{} for _ in range(110)])
# Run a few jobs to eliminate side-effects from loading code and modules for the first time
work_heap(10)
# Now we can run a larger number of jobs
h = work_heap(100)
Then you can debug the heap with Guppy:
>>> h
Partition of a set of 347 objects. Total size = 61320 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 215 62 18920 31 18920 31 __builtin__.weakref
1 4 1 8800 14 27720 45 dict of mongokit.document.DocumentProperties
2 17 5 8328 14 36048 59 list
3 4 1 5792 9 41840 68 mongokit.helpers.DotCollapsedDict
4 8 2 4544 7 46384 76 dict (no owner)
5 4 1 3616 6 50000 82 mongokit.document.DocumentProperties
6 5 1 1160 2 51160 83 __builtin__.set
7 4 1 1120 2 52280 85 dict of mongokit.helpers.DotCollapsedDict
8 1 0 1048 2 53328 87 dict of 0x279e1f0
9 1 0 1048 2 54376 89 dict of 0x2905040
<18 more rows. Type e.g. '_.more' to view.>
>>> h[0].byvia
Partition of a set of 215 objects. Total size = 18920 bytes.
Index Count % Size % Cumulative % Referred Via:
0 4 2 352 2 352 2 '[19]'
1 4 2 352 2 704 4 '[20]'
2 3 1 264 1 968 5 '[17]'
3 3 1 264 1 1232 7 '[18]'
4 3 1 264 1 1496 8 '[27]'
5 3 1 264 1 1760 9 '[28]'
6 3 1 264 1 2024 11 '[29]'
7 3 1 264 1 2288 12 '[33]'
8 3 1 264 1 2552 13 '[35]'
9 3 1 264 1 2816 15 '[5]'
<147 more rows. Type e.g. '_.more' to view.>
>>> h[0].byvia[0].referrers
Partition of a set of 4 objects. Total size = 2144 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 4 100 2144 100 2144 100 list
>>> h[0].byvia[0].referrers.byvia
Partition of a set of 4 objects. Total size = 2144 bytes.
Index Count % Size % Cumulative % Referred Via:
0 4 100 2144 100 2144 100 '->tp_subclasses'
In this case, it seems we are leaking references to subclasses, which is a known Python issue.