Hunting Memory Leaks

Memory leaks can be a serious issue with gevent workers, because many concurrent tasks share the same Python process.

Thankfully, MRQ provides tools to track down such issues. The memory usage of each worker is graphed in the dashboard, making it easy to see whether a worker is leaking.

When a worker's memory usage grows steadily, here are the steps to find the leak:

  • Check which jobs are running on this worker and try to isolate which of them is leaking, and on which queue.
  • Start a dedicated worker with --trace_memory --greenlets 1 on the same queue: this starts a worker doing one job at a time with memory profiling enabled. After each job you should see a report of leaked object types.
  • Find the most distinctive type in the list (usually not 'list' or 'dict') and restart the worker with --trace_memory --greenlets 1 --trace_memory_type=XXX --trace_memory_output_dir=memdbg (after creating the memdbg directory). See the command sketch after this list.
  • In that directory you will find, for each job, a graph generated by objgraph, which is incredibly helpful for tracking down the source of the leak.
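
A minimal sketch of those two runs, assuming a hypothetical queue named leakyqueue and a type name taken from the first leak report:

$ mkdir memdbg
# First pass: one job at a time, with a report of leaked object types after each job
$ mrq-worker leakyqueue --trace_memory --greenlets 1
# Second pass: focus on one suspicious type and write objgraph graphs into ./memdbg
$ mrq-worker leakyqueue --trace_memory --greenlets 1 --trace_memory_type=SomeLeakedType --trace_memory_output_dir=memdbg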

Using guppy

If you want an interactive debugging session to investigate high memory usage, you can use guppy. Here is how:

First, initialize a REPL with MRQ configured and guppy loaded:

$ pip install guppy
$ python
>>> from mrq.context import setup_context, run_task
>>> setup_context()
>>> from guppy import hpy
>>> hp = hpy()

Then, bracket your memory-intensive task or code with guppy calls:

>>> hp.setrelheap()  # Used as reference point for memory usage
>>> run_task("tasks.your.MemoryHungryTask", {"a": 1, "b": 2})
>>> h = hp.heap()

At this point h should contain all the info you need, as shown in the extended debugging session below.

>>> h
Partition of a set of 300643 objects. Total size = 41626536 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0 130043  43 15682088  38  15682088  38 str
     1  76123  25  6978416  17  22660504  54 tuple
     2   1015   0  2794024   7  25454528  61 dict of module
     3  20181   7  2583168   6  28037696  67 types.CodeType
     4  20610   7  2473200   6  30510896  73 function
     5   2321   1  2095216   5  32606112  78 type
     6   2319   1  2045160   5  34651272  83 dict of type
     7   1277   0  1162808   3  35814080  86 dict (no owner)
     8   2890   1   918352   2  36732432  88 unicode
     9    494   0   440912   1  37173344  89 dict of class

So our task added 41.6MB of RAM to the current process. Let's see where that comes from, starting with these 15.7MB of strings:

>>> h[0].byvia
Partition of a set of 130043 objects. Total size = 15682088 bytes.
 Index  Count   %     Size   % Cumulative  % Referred Via:
     0   8208   6  3890208  25   3890208  25 '.func_doc', '[0]'
     1  20065  15  3239664  21   7129872  45 '.co_code'
     2  16673  13  1706224  11   8836096  56 '.co_filename'
     3   2398   2  1606864  10  10442960  67 "['__doc__']"
     4  19810  15  1109640   7  11552600  74 '.co_lnotab'
     5    419   0   308392   2  11860992  76 '.func_doc'
     6   4311   3   285232   2  12146224  77 '[1]'
     7   2788   2   167616   1  12313840  79 '[2]'
     8   2153   2   129136   1  12442976  79 '[3]'
     9   1006   1   109560   1  12552536  80 "['__file__']"
<21212 more rows. Type e.g. '_.more' to view.>

Surprisingly, most of the strings here are actually docstrings (note the '.func_doc' and '__doc__' referrers). This can happen when you import large Python modules like scipy or boto. You might consider stripping them manually or running Python in optimized mode.
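
For instance, CPython's optimized mode strips docstrings at compile time; a sketch of both ways to enable it (the script name is just an example):

$ python -OO myscript.py
$ PYTHONOPTIMIZE=2 mrq-worker default   # same effect via the environment, e.g. for workers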

Running a full worker

Sometimes memory leaks are subtle and only show up after a large number of jobs. In that case you can run a full worker, wrapped in guppy calls, to inspect memory usage after every N jobs.

This real-life example also includes code that clears some of Python's internal caches to avoid polluting the results.

$ pip install guppy
$ python

# Be as close to a Worker environment as possible
import os
if "GEVENT_RESOLVER" not in os.environ:
    os.environ["GEVENT_RESOLVER"] = "ares"

from gevent import monkey
monkey.patch_all()

from mrq.context import setup_context, run_task, log, get_current_config
from mrq.queue import send_tasks
from mrq.utils import load_class_by_path
from guppy import hpy
import sys, urlparse, gc, re, linecache

hp = hpy()

setup_context(config_type="worker", extra={"greenlets": 10, "queues": ["default"], "mongodb_logs": 0, "report_interval": 10000})

# Instantiate and initialize the worker as a real worker process would
worker_class = load_class_by_path(get_current_config()["worker_class"])
w = worker_class()
w.work_init()

def work_heap(max_jobs):
    gc.collect()
    hp.setrelheap()  # measure relative to the heap state at this point
    w.work_loop(max_jobs=max_jobs)
    # Clear some internal Python caches so they don't pollute the results
    urlparse.clear_cache()
    re.purge()
    linecache.clearcache()
    log.handler.reset()
    gc.collect()
    return hp.heap()

# Optional: Queue some jobs
send_tasks("tests.tasks.general.Leak", [{} for _ in range(110)])

# Run a few jobs first to eliminate one-time side effects such as loading code and modules
work_heap(10)

# Now we can run a larger number of jobs
h = work_heap(100)

Then you can debug the heap with guppy:

>>> h
Partition of a set of 347 objects. Total size = 61320 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0    215  62    18920  31     18920  31 __builtin__.weakref
     1      4   1     8800  14     27720  45 dict of mongokit.document.DocumentProperties
     2     17   5     8328  14     36048  59 list
     3      4   1     5792   9     41840  68 mongokit.helpers.DotCollapsedDict
     4      8   2     4544   7     46384  76 dict (no owner)
     5      4   1     3616   6     50000  82 mongokit.document.DocumentProperties
     6      5   1     1160   2     51160  83 __builtin__.set
     7      4   1     1120   2     52280  85 dict of mongokit.helpers.DotCollapsedDict
     8      1   0     1048   2     53328  87 dict of 0x279e1f0
     9      1   0     1048   2     54376  89 dict of 0x2905040
<18 more rows. Type e.g. '_.more' to view.>
>>> h[0].byvia
Partition of a set of 215 objects. Total size = 18920 bytes.
 Index  Count   %     Size   % Cumulative  % Referred Via:
     0      4   2      352   2       352   2 '[19]'
     1      4   2      352   2       704   4 '[20]'
     2      3   1      264   1       968   5 '[17]'
     3      3   1      264   1      1232   7 '[18]'
     4      3   1      264   1      1496   8 '[27]'
     5      3   1      264   1      1760   9 '[28]'
     6      3   1      264   1      2024  11 '[29]'
     7      3   1      264   1      2288  12 '[33]'
     8      3   1      264   1      2552  13 '[35]'
     9      3   1      264   1      2816  15 '[5]'
<147 more rows. Type e.g. '_.more' to view.>
>>> h[0].byvia[0].referrers
Partition of a set of 4 objects. Total size = 2144 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0      4 100     2144 100      2144 100 list
>>> h[0].byvia[0].referrers.byvia
Partition of a set of 4 objects. Total size = 2144 bytes.
 Index  Count   %     Size   % Cumulative  % Referred Via:
     0      4 100     2144 100      2144 100 '->tp_subclasses'

In this case, it seems we are leaking references to subclasses (held via tp_subclasses), a known Python issue with dynamically created classes.
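
As a minimal sketch of that mechanism (independent of mongokit): every class keeps weak references to its subclasses in tp_subclasses, and dynamically created classes sit in reference cycles, so those entries linger until the cyclic garbage collector runs.

import gc

class Base(object):
    pass

gc.disable()  # prevent automatic collections so the effect is visible

# Dynamically create many throwaway subclasses, as an ODM like mongokit might
for i in range(1000):
    type("Sub%d" % i, (Base,), {})

# The subclasses are unreachable, but they live in reference cycles,
# so Base still lists them via tp_subclasses...
print(len(Base.__subclasses__()))  # 1000

# ...until a full GC pass actually frees them
gc.enable()
gc.collect()
print(len(Base.__subclasses__()))  # 0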